CN105989161A - Big data processing method and apparatus - Google Patents


Publication number
CN105989161A
Authority
CN
China
Prior art keywords
noise
query result
query
susceptibility
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510095692.7A
Other languages
Chinese (zh)
Inventor
欧阳军
范伟
何诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201510095692.7A
Publication of CN105989161A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a big data processing method and apparatus. The method comprises the steps of receiving a query instruction sent by a client and determining a query function K according to the query instruction; performing a query on a big data set D according to the query function to obtain a query result R, wherein the query result R equals {Rj}, j is greater than or equal to 1 and less than or equal to m, and m is a positive integer greater than or equal to 1; obtaining a sensitive degree S(K) of the query function K, wherein the sensitive degree S(K) represents the sensitiveness of the query function K; determining noise N required to be added in the query result R according to the query result R and the sensitive degree S(K), wherein the noise N equals {Nj} and noise components Nj of the noise N are in one-to-one correspondence with query result components Rj; and performing noise-adding processing on the query result components Rj according to the noise components Nj to obtain a noise-added query result R', wherein R' equals {R'j}. According to the method capable of performing noise-adding processing on big data, proposed by the embodiment of the invention, the data query result with differential privacy can be output.

Description

A method and apparatus for processing big data
Technical field
The present invention relates to the field of data processing, and more particularly, to a method and an apparatus for processing big data.
Background
The purpose of privacy-preserving data mining is to protect individuals' private data while still promoting data sharing between users. Differential privacy is a rigorous theoretical model for describing and analyzing data publication methods. Its goal is to provide effective methods that maximize the accuracy of statistical queries against a statistical database while minimizing the chance of identifying individual records.
Currently feasible data processing procedures with differential privacy can only be applied to small-scale data. For big data, however, each component of the query result vector has an independent coordinate, and each independent coordinate is a random variable with an exponential-scale distribution. Therefore, there is currently no effective way to implement differential privacy on large-scale data.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for processing big data, which make queries with differential privacy possible on large-scale data.
In a first aspect, an embodiment of the present invention provides a method for processing big data, comprising: receiving a query instruction sent by a client, and determining a query function K according to the query instruction; querying a big data set D according to the query function K to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; obtaining a sensitivity S(K) of the query function K, where the sensitivity S(K) characterizes how sensitive the query function K is; determining, according to the query result R and the sensitivity S(K), noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N are in one-to-one correspondence with the query result components Rj; and performing noise-adding processing on the query result components Rj according to the noise components Nj to obtain a noise-added query result R' = {R'j}.
With reference to the first aspect, in a first possible implementation of the first aspect, obtaining the sensitivity S(K) of the query function K comprises: obtaining a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2; and taking the minimum value of the difference between K(D1) and K(D2) in a metric space as the value of the sensitivity S(K), where the data set D1 and the data set D2 are two different subsets of the big data set D, and D1 and D2 differ in at most one record.
With reference to the first aspect or its first possible implementation, in a second possible implementation of the first aspect, determining the noise N according to the query result R and the sensitivity S(K) comprises: generating, according to the query result R, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrating the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component Nj follows the Laplace noise distribution determined by the sensitivity S(K).
With reference to the first aspect or its first or second possible implementation, in a third possible implementation of the first aspect, the query function K is a hash function F, and the method comprises: training on a training set of the big data set D to obtain the hash function F, where the training set is a subset of the big data set D and includes an attribute set X and classification labels Y, the attribute set X being the set of data that characterizes the attributes of the elements in the training set, and the classification labels Y being the set of data that characterizes the classification results of the elements in the training set.
In a second aspect, an embodiment of the present invention provides an apparatus for processing big data, comprising: a receiving module, configured to receive a query instruction sent by a client and determine a query function K according to the query instruction; a first determining module, configured to query a big data set D according to the query function K to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; an obtaining module, configured to obtain a sensitivity S(K) of the query function K determined by the first determining module, where the sensitivity S(K) characterizes how sensitive the query function K is; a second determining module, configured to determine, according to the query result R and the sensitivity S(K) obtained by the obtaining module, noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N are in one-to-one correspondence with the query result components Rj; and a noise-adding module, configured to add noise to the query result components Rj according to the noise components Nj determined by the second determining module, to obtain a noise-added query result R' = {R'j}.
With reference to the second aspect, in a first possible implementation of the second aspect, the obtaining module is specifically configured to: obtain a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2; and set the minimum value of the norm of the difference between K(D1) and K(D2) as the value of the sensitivity S(K), where the data set D1 and the data set D2 are two different subsets of the big data set D, and D1 and D2 differ in at most one record.
With reference to the second aspect or its first possible implementation, in a second possible implementation of the second aspect, the second determining module is specifically configured to: generate, according to the query result R, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrate the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component Nj of the noise N follows the Laplace noise distribution determined by the sensitivity S(K).
With reference to the second aspect or its first or second possible implementation, in a third possible implementation of the second aspect, the query function K is a hash function F, and the first determining module is further configured to: train on a training set of the big data set D to obtain the hash function F, where the training set is a subset of the big data set D and includes an attribute set X and classification labels Y, the attribute set X being the set of data that characterizes the attributes of the elements in the training set, and the classification labels Y being the set of data that characterizes the classification results of the elements in the training set.
In the embodiments of the present invention, the sensitivity of the query function on the big data set is determined, the noise to be added to the query result is determined based on that sensitivity, and the noise is added to the query result. In this way, noise-adding processing with differential privacy can be performed on the query result of the original big data set, and a query result with differential privacy is finally obtained. Therefore, embodiments of the present invention can perform noise-adding processing on large-scale data, avoid leakage of sensitive data as far as possible, and achieve the purpose of differentially private queries.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings required by the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an example system scenario to which embodiments of the present invention can be applied.
Fig. 2 is a flowchart of a method for processing big data according to an embodiment of the present invention.
Fig. 3 is a flowchart of a method for processing big data according to another embodiment of the present invention.
Fig. 4 is a flowchart of a method for processing big data according to another embodiment of the present invention.
Fig. 5 is a schematic block diagram of an apparatus for processing big data according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of an apparatus for processing big data according to another embodiment of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic diagram of an example system scenario to which embodiments of the present invention can be applied.
When noise processing with differential privacy is not performed, a client user submits an exact query request 1 directly to the database holding the original sensitive data, as shown by dashed arrow 1 in the figure, and the database returns the exact query result 2 to the client, as shown by dashed arrow 2. In this case, individuals' private data in the original sensitive data is easily leaked at the client.
When a query mechanism is added, the client user submits a statistical query request 3 through the query mechanism to the database holding the original sensitive data, as shown by solid arrow 3 in the figure, and the database returns the statistical query result 4 to the client through the query mechanism. Before it is returned to the client, the statistical query result undergoes sensitivity-based noise processing, and the noise-added statistical query result 5 is returned to the client, as shown by solid arrow 5. The query result is thus privacy-protected, and leakage of individuals' private data is avoided as far as possible.
Fig. 2 is a flowchart of a method for processing big data according to an embodiment of the present invention. As shown in Fig. 2, the method includes:
Step 210: receive a query instruction sent by a client, and determine a query function K according to the query instruction.
Step 220: query a big data set D according to the query function K to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
Step 230: obtain the sensitivity S(K) of the query function K, where the sensitivity S(K) characterizes how sensitive the query function K is.
Step 240: determine, according to the query result R and the sensitivity S(K), the noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N are in one-to-one correspondence with the query result components Rj.
Step 250: perform noise-adding processing on the query result components Rj according to the noise components Nj to obtain the noise-added query result R' = {R'j}.
In this embodiment of the present invention, the sensitivity of the query function on the big data set is determined and the noise is determined based on that sensitivity, so that noise-adding processing with differential privacy can be performed on the query result of the original big data set and a query result with differential privacy is finally obtained. Therefore, embodiments of the present invention can perform noise-adding processing on large-scale data, avoid leakage of sensitive data as far as possible, and achieve the purpose of differentially private queries.
Specifically, in step 210, after the query instruction sent by the client is received, a concrete statistical query function K is chosen according to the client's information query requirements on the relevant user data set. For example, the statistical query function may be a sum or an average function, or a hash function obtained by training on classification query results. The query result R is obtained by running a statistical query on the big data set D according to the query function K.
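Step 210 can be sketched as a small dispatch from a query instruction to a query function K. The instruction format (a dict with "aggregate" and "column" keys) and the name make_query_function are assumptions made for illustration; the description does not specify how the instruction is encoded.

```python
def make_query_function(instruction):
    """Map a (hypothetical) query instruction to a statistical query function K.

    K returns the query result R as a list of m components (here m = 1).
    """
    column = instruction["column"]
    kind = instruction["aggregate"]

    def values(records):
        return [record[column] for record in records]

    if kind == "sum":
        return lambda records: [float(sum(values(records)))]
    if kind == "average":
        return lambda records: [sum(values(records)) / len(records)]
    raise ValueError(f"unsupported aggregate: {kind!r}")


# Toy data standing in for the big data set D.
records = [{"age": 30}, {"age": 41}, {"age": 25}]
K = make_query_function({"aggregate": "sum", "column": "age"})
R = K(records)
print(R)
```

Returning a list even for a single aggregate keeps the shape R = {Rj} used throughout the description, so later noise-adding steps can treat every query uniformly.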
Specifically, in step 230, the sensitivity S(K) of the query function K is obtained; S(K) characterizes how sensitive the query function K is. For any function F, the sensitivity S(F) is defined as the minimum value that satisfies ||F(D1) - F(D2)||_M ≤ S(F), where the data sets D1 and D2 differ in at most one record and M denotes a metric space. It should be understood that D1 and D2 differing in one record means that, with D1 and D2 containing the same number of data elements, the numeric value or the value type of one element is different. Optionally, as an embodiment of the present invention, in step 230, obtaining the sensitivity S(K) of the query function K comprises: computing the query result K(D1) of the data set D1 and the query result K(D2) of the data set D2; and taking the minimum value of the difference between K(D1) and K(D2) in a metric space as the value of the sensitivity S(K), where D1 and D2 differ in at most one record and are two different subsets of the big data set D. It should be noted that the difference between K(D1) and K(D2) in a metric space here refers to the absolute value of the difference between K(D1) and K(D2).
Specifically, obtaining the sensitivity S(K) of the query function K includes computing it according to the following formula: S(K) = min ||K(D1) - K(D2)||_M, where the data sets D1 and D2 differ in at most one record and M denotes a metric space.
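A minimal sketch of a sensitivity calculation, assuming a scalar query, the absolute difference as the metric, and a small finite value domain. It follows the definition of S(F) as the smallest value that bounds the difference over neighbouring data sets, which for a finite set of neighbours is the largest observed difference. Note that it only enumerates neighbours of one given data set, so it yields an empirical (local) figure; the function names and the toy data are hypothetical.

```python
def neighbours(dataset, domain):
    """Yield every data set that differs from `dataset` in exactly one record."""
    for i, old in enumerate(dataset):
        for new in domain:
            if new != old:
                yield dataset[:i] + [new] + dataset[i + 1:]


def empirical_sensitivity(query_fn, dataset, domain):
    """Smallest S with |K(D1) - K(D2)| <= S over the enumerated neighbours D2 of D1."""
    base = query_fn(dataset)
    return max(abs(base - query_fn(d2)) for d2 in neighbours(dataset, domain))


ages = [30, 41, 25]
domain = range(0, 101)  # assumed value range of one record
print(empirical_sensitivity(sum, ages, domain))  # 75: replace 25 with 100
```

For a sum over values known to lie in [0, 100], the worst case over all possible data sets is 100, which is the figure one would normally use for calibration; the brute-force version is only meant to make the definition concrete.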
Optionally, as an embodiment of the present invention, determining the noise N according to the query result R and the sensitivity S(K) comprises: generating, according to the query result, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrating the noise N' according to the sensitivity S(K) to obtain the noise N, where each noise component Nj follows the Laplace noise distribution determined by the sensitivity S(K).
Specifically, in step 240, a noise-adding mechanism is selected according to the Laplace differential privacy theorem, and noise N' = [N'1, ..., N'j, ..., N'm] is generated, where the noise components of N' are mutually independent. The noise is calibrated according to the sensitivity S(K) of K to obtain the calibrated noise N = [N1, ..., Nj, ..., Nm], where the noise components of the calibrated noise N are mutually independent.
Specifically, performing noise-adding processing on the query result Rj according to the noise Nj to obtain the query result R'j with differential privacy means adding the calibrated noise to the statistical query result and outputting the privacy-protected query result R' = [R'1, ..., R'j, ..., R'm] = [R1, ..., Rj, ..., Rm] + [N1, ..., Nj, ..., Nm].
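The three operations above (generate independent N', calibrate to N, add to R) can be sketched as follows. The privacy parameter epsilon and the calibration factor S(K)/epsilon are assumptions taken from the Lap(S(F)/ε) form that appears later in the description; the function names are hypothetical. Python's random module is used only to keep the sketch self-contained and is not a cryptographically secure source of randomness.

```python
import random


def unit_laplace_noise(m, rng):
    """N': m independent draws from Laplace(0, 1)."""
    # The difference of two i.i.d. Exp(1) variables is Laplace(0, 1).
    return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(m)]


def calibrate(raw_noise, sensitivity, epsilon):
    """N: rescale N' so each component follows Laplace(0, sensitivity / epsilon)."""
    scale = sensitivity / epsilon
    return [scale * n for n in raw_noise]


def add_noise(result, noise):
    """R' = R + N, component by component."""
    if len(result) != len(noise):
        raise ValueError("result and noise must have the same number of components")
    return [r + n for r, n in zip(result, noise)]


rng = random.Random(42)  # seeded only to make the sketch repeatable
R = [120.0, 75.0, 310.0]
N_raw = unit_laplace_noise(len(R), rng)
N = calibrate(N_raw, sensitivity=1.0, epsilon=0.5)
R_noisy = add_noise(R, N)
print(R_noisy)
```

Scaling a Laplace(0, 1) draw by a positive factor b gives a Laplace(0, b) draw, which is why calibration can be a plain multiplication.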
Optionally, as an embodiment of the present invention, according to the client's query requirements on the big data set D, the query function K that determines the query result R is a hash function F, and the query result components of R are Rj, where 1 ≤ j ≤ m and m is a positive integer greater than or equal to 1.
Optionally, as an embodiment of the present invention, determining, according to the client's query requirements on the big data set D, that the query function K for the query result R is a hash function F comprises: training on a training set of the big data set D to obtain the hash function F, and generating a first hash classification table T of the hash function F, where the training set includes an attribute set X and classification labels Y, the attribute set X being the set of data that characterizes the attributes of the elements in the training set, and the classification labels Y being the set of data that characterizes the classification results of the elements in the training set.
Optionally, as an embodiment of the present invention, performing noise-adding processing on the query result components Rj according to the noise Nj to obtain the query result R' with differential privacy comprises: performing noise-adding processing on the first hash classification table T of the hash function F(x) according to the noise Nj, to obtain a second hash classification table T' with differential privacy that corresponds to the query result R'.
In this embodiment of the present invention, the sensitivity of the query function on the big data set is determined, the noise to be added to the query result is determined based on that sensitivity, and the noise is added to the query result. In this way, noise-adding processing with differential privacy can be performed on the query result of the original big data set, and a query result with differential privacy is finally obtained. Therefore, embodiments of the present invention can perform noise-adding processing on large-scale data, avoid leakage of sensitive data as far as possible, and achieve the purpose of differentially private queries.
Fig. 3 is a flowchart of a method for processing big data according to an embodiment of the present invention. As shown in Fig. 3, the method includes:
Step 310: receive a query instruction sent by a client, and determine a query function F according to the query instruction, where the query function F is a hash function.
Step 320: query a big data set D according to the query function F to obtain a query result R, where R = {Rj}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
Step 330: obtain the sensitivity S(F) of the query function F, where the sensitivity S(F) characterizes how sensitive the query function F is.
Step 340: determine, according to the query result R and the sensitivity S(F), the noise N to be added to the query result R, where N = {Nj} and the noise components Nj of the noise N are in one-to-one correspondence with the query result components Rj.
Step 350: perform noise-adding processing on the query result components Rj according to the noise components Nj to obtain the noise-added query result R' = {R'j}.
In this embodiment of the present invention, the sensitivity of the query function on the big data set is determined, the noise to be added to the query result is determined based on that sensitivity, and the noise is added to the query result. In this way, noise-adding processing with differential privacy can be performed on the query result of the original big data set, and a query result with differential privacy is finally obtained. Therefore, embodiments of the present invention can perform noise-adding processing on large-scale data, avoid leakage of sensitive data as far as possible, and achieve the purpose of differentially private queries.
It should be understood that, in step 310, the query requirement on the big data set D means, for example, that a certain attribute in a certain subset of D may be summed to obtain a statistical query function, where 1 ≤ j ≤ m and m is a positive integer greater than or equal to 1. According to the requirements of the query on the big data set D, the hash function F may be constructed by training a differentially private random decision hash (English: Differentially Private Random Decision Hashing, abbreviated: DPRDH).
Optionally, as an embodiment of the present invention, obtaining the hash function F for the big data set D comprises: constructing, according to a training set of the big data set D, a random decision hash with differential privacy, so as to train and obtain the hash function F and generate a first hash classification table T of the hash function F, where the training set includes an attribute set X and classification labels Y, the attribute set X being the set of data that characterizes the attributes of the elements in the training set, and the classification labels Y being the set of data that characterizes the classification results of the elements in the training set. It should be understood that, according to the requirements of the query on the big data set D, m hash classification tables may be obtained by training, and these m hash classification tables together form the first hash classification table T. It should also be understood that the first hash classification table T is a class of hash classification tables, obtained by training on the training set of the big data set D, that contains at least one hash classification table.
Specifically, the random decision hash with differential privacy is constructed as follows. Input: a training set {attribute set X, classification labels Y}; the first hash classification table contains m initial hash classification tables and L classes, where the attribute set X may be numerical, categorical, or binary, the classification labels Y are the label symbols obtained by classifying according to the attribute set X, Y corresponds to L classes, and L is a positive integer greater than or equal to 1. Output: m hash classification tables, whose collection is T = [T1, ..., Tj, ..., Tm], that is, the first hash classification table T, where any sub-table Tj = [bk_key1, bk_key2, ..., bk_keyL].
Specifically, based on the inputs and outputs above, the random decision hash with differential privacy is constructed by the following procedure:
1. Randomly generate m mask vectors (maskvector), any one of which is denoted maskvector_j;
2. Uniformly encode the attribute set X of the input training set into binary form, obtaining m binary codes Xbinary, any one of which is denoted Xbinary_j;
3. The training process for constructing the random hash with differential privacy is as follows:
for i = 1; i ≤ |X|; ++i do
    for j = 1; j ≤ m; ++j do
        compute the key: key = maskvector_j AND Xbinary_j;
        look up the bucket for that key in the hash classification table: bk_key = T_j[key];
        update the bucket count in the hash classification table: bk_key(Y) += 1;
    end
end
It should be understood that the outer loop in step 3 runs the inner loop once for every element of the attribute set X, so that each element is assigned to its corresponding key in the m hash classification tables. The inner loop uses the logical AND of the mask vector and the binary encoding as the key, assigns each key to a hash classification table Tj, and iterates m times to obtain the m hash classification tables T = [T1, ..., Tj, ..., Tm].
4. After step 3, m hash classification tables T = [T1, ..., Tj, ..., Tm] are obtained, where any sub-table Tj = [bk_key1, bk_key2, ..., bk_keyL] corresponds to Table 1. Each column vector Y = [Y1bi, ..., Yibi, ..., YLbi] of this table is referred to as a mask vector.
Table 1
Classification label (L classes)   Bucket 1   Bucket 2   ...   Bucket i   ...   Bucket n
Y1                                 Y1b1       Y1b2       ...   Y1bi       ...   Y1bn
Y2                                 Y2b1       Y2b2       ...   Y2bi       ...   Y2bn
Yi                                 Yib1       Yib2       ...   Yibi       ...   Yibn
YL                                 YLb1       YLb2       ...   YLbi       ...   YLbn
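The training procedure in steps 1 to 4 can be sketched as below. Several details are assumptions rather than statements of the described method: attributes are small non-negative integers packed into a fixed number of bits (the encode helper is hypothetical), labels are integers 0..L-1, each mask is a random bit string, and the binary code is taken per training record. Each table maps a key to a bucket holding one count per label, matching the rows and columns of Table 1. No noise is added at this stage.

```python
import random
from collections import defaultdict


def encode(record, bits_per_attr=4):
    """Pack small non-negative integer attributes into one binary code (assumed encoding)."""
    code = 0
    for value in record:
        code = (code << bits_per_attr) | (value & ((1 << bits_per_attr) - 1))
    return code


def train_tables(X, Y, m, num_labels, code_bits, rng):
    """Build m hash classification tables: key -> per-label counts."""
    masks = [rng.getrandbits(code_bits) for _ in range(m)]          # step 1
    tables = [defaultdict(lambda: [0] * num_labels) for _ in range(m)]
    for record, label in zip(X, Y):                                 # outer loop over elements
        xbinary = encode(record)                                    # step 2
        for mask, table in zip(masks, tables):                      # inner loop over m tables
            key = mask & xbinary                                    # key = maskvector AND Xbinary
            table[key][label] += 1                                  # bk_key(Y) += 1
    return masks, tables


X = [(1, 2), (1, 3), (7, 2), (7, 3)]   # toy attribute set
Y = [0, 0, 1, 1]                       # toy classification labels, L = 2
masks, tables = train_tables(X, Y, m=3, num_labels=2, code_bits=8, rng=random.Random(1))
for mask, table in zip(masks, tables):
    print(f"mask={mask:08b}", dict(table))
```

Because a key is the record's code restricted to the bits selected by the mask, records that agree on those bits fall into the same bucket, which is what lets different masks capture different attribute combinations.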
Optionally, as an embodiment of the present invention, computing the sensitivity S(F) of the query function F(x) comprises: computing the query result F(D1) of the data set D1 and the query result F(D2) of the data set D2; and taking the minimum value of the difference between F(D1) and F(D2) in a metric space as the value of the sensitivity S(F), where D1 and D2 differ in at most one record and are two different subsets of the big data set D.
Optionally, as an embodiment of the present invention, the sensitivity S(F) of the hash function F(x) is computed by the following formula: S(F) = min ||F(D1) - F(D2)||_M, where the data sets D1 and D2 differ in at most one record and M denotes a metric space.
Optionally, determining the noise N according to the query result R and the sensitivity S(F) comprises: generating, according to the query result, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and calibrating the noise N' according to the sensitivity S(F) to obtain the noise N, where each noise component Nj follows the Laplace noise distribution determined by the sensitivity S(F), that is, Nj follows Lap(S(F)/ε), so that the query result R' obtained after the noise is added has ε-differential privacy.
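The reason Lap(S(F)/ε) noise yields ε-differential privacy can be written out for a single result component. This is the standard argument for the Laplace mechanism, sketched under two assumptions: the metric is the absolute difference, and S(F) is an upper bound on |F_j(D1) - F_j(D2)| for every pair of data sets that differ in one record. The symbol p_b for the Laplace density is introduced here for illustration.

```latex
% Density of the noise, with scale b = S(F) / \varepsilon
p_b(x) = \frac{1}{2b}\, e^{-|x|/b}, \qquad b = \frac{S(F)}{\varepsilon}

% Ratio of the output densities at any value r for neighbouring D_1, D_2
\frac{p_b\bigl(r - F_j(D_1)\bigr)}{p_b\bigl(r - F_j(D_2)\bigr)}
  = \exp\!\left(\frac{|r - F_j(D_2)| - |r - F_j(D_1)|}{b}\right)
  \le \exp\!\left(\frac{|F_j(D_1) - F_j(D_2)|}{b}\right)
  \le \exp\!\left(\frac{S(F)}{b}\right)
  = e^{\varepsilon}
```

The first inequality is the triangle inequality; the second is where the bound S(F) is used, which is why the noise scale must grow with the sensitivity and shrink as ε grows.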
Optionally, as an embodiment of the present invention, determining, according to the client's query requirements on the big data set D, that the query function K(x) for the query result R is a hash function F(x) comprises: training on a training set of the big data set D to obtain the hash function F(x), and generating a first hash classification table T of the hash function F(x), where the training set includes the attribute set X and the classification labels Y of the big data set D.
Optionally, as an embodiment of the present invention, performing noise-adding processing on the query result components Rj according to the noise Nj to obtain the query result R' with differential privacy comprises:
performing noise-adding processing on the first hash classification table T of the hash function F(x) according to the noise Nj, to obtain a second hash classification table T' that corresponds to the query result R' with differential privacy.
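A minimal sketch of turning a first hash classification table T into a noise-added second table T', assuming each table is a mapping from key to per-label counts and that every count receives an independent Laplace draw with scale sensitivity/epsilon. The data layout, the function names, and the per-count noise allocation are illustrative assumptions; the description does not fix them.

```python
import random


def laplace(scale, rng):
    """One draw from Laplace(0, scale), as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def add_noise_to_tables(tables, sensitivity, epsilon, rng):
    """Return T': a copy of the tables with independent noise added to every count."""
    scale = sensitivity / epsilon
    return [
        {key: [count + laplace(scale, rng) for count in bucket]
         for key, bucket in table.items()}
        for table in tables
    ]


T = [{3: [2, 0], 5: [0, 1]}, {1: [1, 1], 4: [1, 0]}]   # toy first hash classification tables
T_noisy = add_noise_to_tables(T, sensitivity=1.0, epsilon=1.0, rng=random.Random(7))
print(T_noisy)
```

The sketch only perturbs buckets that already exist; whether the set of occupied keys itself needs protection is a separate design question that this example does not address.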
Optionally, as an embodiment of the present invention, the query result R' with differential privacy can be predicted and output by constructing a differentially private random decision hashing classifier (English: Differentially Private Random Decision Hashing Classifier, abbreviated: DPRDHC).
Specifically, the differentially private random hash classifier is constructed to predict and output the query result R' by the following procedure:
1. Input the set of m second hash classification tables, T' = [T'1, ..., T'j, ..., T'm], and the identity column X' to be classified;
2. Initialize the classification label vectors (label vectors) and the classification statistics (label count and label average);
3. Encode the column X' to be classified;
4. The prediction process of the random hash classifier with differential privacy is as follows:
for j = 1; j ≤ m; ++j do
    compute the key: key = maskvector_j AND Xbinary_j;
    look up the bucket for that key in the hash classification table: bk_key = T_j[key];
    accumulate the bucket counts: label count += bk_key;
end
5. Compute the arithmetic mean of the classification labels over the m second hash classification tables: label avg = label count / m;
6. Take the label with the maximum value as the classification label value: Y' = argmax(label avg);
7. Output the classification label value Y'.
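The prediction steps above can be sketched as follows, assuming the noise-added tables map a key to a list of per-label scores, the record to classify has already been encoded as an integer, and a key missing from a table contributes zero to every label. The masks and table contents below are hand-written toy values, not output of any described system.

```python
def predict(record_code, masks, noisy_tables, num_labels):
    """Average per-label scores over the m noisy tables and return the argmax label."""
    label_count = [0.0] * num_labels                      # step 2: initialise statistics
    for mask, table in zip(masks, noisy_tables):          # step 4: loop over the m tables
        key = mask & record_code                          # key = maskvector AND Xbinary
        bucket = table.get(key, [0.0] * num_labels)       # unseen key: assumed all-zero bucket
        for label, score in enumerate(bucket):
            label_count[label] += score                   # label count += bk_key
    m = len(noisy_tables)
    label_avg = [count / m for count in label_count]      # step 5: arithmetic mean
    return max(range(num_labels), key=label_avg.__getitem__)   # step 6: argmax


masks = [0b1100, 0b0011]
noisy_tables = [
    {0b0100: [5.2, 0.3], 0b1000: [0.1, 4.8]},
    {0b0001: [3.9, 1.2], 0b0010: [0.7, 4.1]},
]
print(predict(0b0101, masks, noisy_tables, num_labels=2))  # 0
print(predict(0b1010, masks, noisy_tables, num_labels=2))  # 1
```

Dividing by m does not change which label wins; it is kept so the code mirrors step 5 and so the averages stay comparable across classifiers with different numbers of tables.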
The embodiment of the present invention is determined by the susceptibility of the query function of large data sets, true based on this susceptibility Surely need the sound adding to Query Result and this noise is added Query Result such that it is able to original What the Query Result of large data sets carried out having differential privacy adds process of making an uproar, and finally gives and has differential privacy Query Result.Therefore, the present invention implements to carry out adding, to the big data of scale, process of making an uproar, maximum The possible leakage avoiding sensitive data, it is achieved the purpose of differential privacy inquiry.
The embodiments of the invention are described in more detail below with reference to specific steps.
Fig. 4 is a flowchart of a method for processing big data according to another embodiment of the invention. As shown in Fig. 4, the method 400 includes the following steps:
Step 401: obtain a statistical query function F.
Step 402: generate mutually independent noise N' = [N'_1, ..., N'_j, ..., N'_m].
Step 403: compute the standard deviation D = [D_1, ..., D_j, ..., D_m] of the noise N'.
Step 404: compute the sensitivity S(F) of the statistical query function.
Step 405: calibrate the standard deviation D of the noise N' to obtain the calibrated noise N = [N_1, ..., N_j, ..., N_m].
Step 406: obtain the statistical query result R = [R_1, ..., R_j, ..., R_m].
Step 407: add the calibrated noise N to the statistical query result and output the privacy-protected query result R' = [R'_1, ..., R'_j, ..., R'_m] = [R_1, ..., R_j, ..., R_m] + [N_1, ..., N_j, ..., N_m].
Optionally, in step 401, a specific statistical query function F is selected according to the client's aggregate-information query requirement on the relevant user data set, for example a sum or average function. F may also be a hash function obtained by training on classified query results.
Optionally, in step 402, a suitable noise mechanism is selected according to the query result of the statistical query function F to generate mutually independent noise N' = [N'_1, ..., N'_j, ..., N'_m]. The components of N' are mutually independent. For example, N' may follow a Laplace noise distribution, in which case each component of N' is independent and Laplace-distributed. It should be understood that selecting a suitable noise mechanism means selecting the noise-adding mechanism according to the Laplace differential privacy theorem.
Optionally, in step 403, the standard deviation of each independent component of the noise N' is computed separately, giving the standard deviation D = [D_1, ..., D_j, ..., D_m].
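Steps 402 and 403 can be illustrated with a short sketch. It assumes, beyond what the text states, that each component N'_j is represented by a batch of zero-mean Laplace samples with a hypothetical initial scale `b0`, and that D_j is estimated as the standard deviation of that batch. Only the Python standard library is used, and all function names are illustrative.

```python
import random
import statistics
from typing import List


def laplace_sample(rng: random.Random, scale: float) -> float:
    """Draw one Laplace(0, scale) value as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def generate_noise(
    rng: random.Random, m: int, samples: int, b0: float = 1.0
) -> List[List[float]]:
    """Step 402: m mutually independent components, each a batch of draws."""
    return [[laplace_sample(rng, b0) for _ in range(samples)] for _ in range(m)]


def component_std(noise: List[List[float]]) -> List[float]:
    """Step 403: standard deviation D_j of each component."""
    return [statistics.pstdev(batch) for batch in noise]
```

A Laplace distribution with scale b has standard deviation sqrt(2)·b, so with `b0 = 1` each estimated D_j should come out near 1.414 for a large batch.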
Optionally, in step 404, the query result F(D1) of data set D1 and the query result F(D2) of data set D2 are computed, and the minimum of the difference between F(D1) and F(D2) in a metric space is taken as the value of the sensitivity S(F), where the data sets D1 and D2 differ in at most one record. Specifically, computing the sensitivity S(F) of the statistical query function F includes computing it by the formula S(F) = min ||F(D1) - F(D2)||_M, where the data sets D1 and D2 differ in at most one record, M denotes a metric space, and D1 and D2 are two different subsets of the big data set D.
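The quantity ||F(D1) - F(D2)|| over neighbouring data sets can be explored numerically. The sketch below assumes scalar-valued queries, the absolute value as the metric, and neighbours formed by replacing one record with a value from a hypothetical candidate list `candidates`. It returns every neighbour distance and leaves the aggregation to the caller: the text above takes the minimum, whereas the standard definition of global sensitivity in the differential privacy literature takes the maximum over all neighbouring pairs.

```python
from typing import Callable, List, Sequence


def neighbour_distances(
    query: Callable[[Sequence[float]], float],
    data: Sequence[float],
    candidates: Sequence[float],
) -> List[float]:
    """|F(D1) - F(D2)| for every D2 that differs from D1 = data in one record."""
    base = query(data)
    distances = []
    for i, original in enumerate(data):
        for value in candidates:
            if value == original:
                continue  # identical data set, not a neighbour
            neighbour = list(data)
            neighbour[i] = value  # same size, one record changed
            distances.append(abs(base - query(neighbour)))
    return distances
```

For a sum query over `[1, 2, 3]` with candidate values `[0, 10]`, the six neighbour distances range from 1 to 9, so the choice between `min` and `max` changes the resulting noise scale considerably.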
Optionally, in step 405, the standard deviation D of the noise N' is calibrated to obtain the calibrated noise N = [N_1, ..., N_j, ..., N_m], such that each component N_j of the calibrated noise N follows Lap(S(F)/ε), so that the output query result has ε-differential privacy. The value of ε lies in the range [0, 1] and may be specified by the user.
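One way to read step 405 is as a rescaling. A zero-mean Laplace variable with scale b has standard deviation sqrt(2)·b, so multiplying a component whose standard deviation is D_j by sqrt(2)·(S(F)/ε)/D_j yields a Laplace variable with scale S(F)/ε. The sketch below is an assumption about how the calibration could be done, not a procedure given in the text, and `calibrate` and its parameters are hypothetical names. It rejects ε = 0 because the scale S(F)/ε is undefined there.

```python
import math
from typing import List, Sequence


def calibrate(
    noise: Sequence[float],
    std: Sequence[float],
    sensitivity: float,
    epsilon: float,
) -> List[float]:
    """Rescale each N'_j so that its standard deviation matches Lap(S/epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    target_std = math.sqrt(2) * sensitivity / epsilon  # std of Lap(S/epsilon)
    return [n * target_std / d for n, d in zip(noise, std)]
```

A smaller ε gives a larger target scale, so more noise is added and the privacy guarantee is stronger.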
Optionally, in step 406, the statistical query result R = [R_1, ..., R_j, ..., R_m] is obtained according to the statistical query function F. It should be understood that this step may also be performed before the noise N' is generated; the invention is not limited in this respect.
Optionally, in step 407, the calibrated noise N is added to the statistical query result, and the privacy-protected query result R' = [R'_1, ..., R'_j, ..., R'_m] = [R_1, ..., R_j, ..., R_m] + [N_1, ..., N_j, ..., N_m] is output. Because each component of the noise N is obtained by calibration against the sensitivity S(F) of the statistical function and follows the Lap(S(F)/ε) distribution, the output query result R' has ε-differential privacy.
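Step 407 amounts to a component-wise sum. The following sketch draws each N_j directly from Lap(S(F)/ε) instead of calibrating a pre-generated N', which is a simplification of the flow described above. The function names and the use of Python's `random` module are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def laplace_noise(rng: random.Random, scale: float) -> float:
    """Laplace(0, scale) draw: difference of two i.i.d. exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(
    result: Sequence[float],
    sensitivity: float,
    epsilon: float,
    rng: random.Random,
) -> List[float]:
    """Return R' = R + N with independent N_j ~ Lap(sensitivity / epsilon)."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon
    return [r + laplace_noise(rng, scale) for r in result]
```

Passing an explicit `random.Random` instance keeps runs reproducible in tests; a production system would need a cryptographically suitable noise source, which is outside the scope of this sketch.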
Figs. 1 to 4 describe in detail, from the method perspective, the process of processing big data. The apparatus for processing big data is described in detail below with reference to Figs. 5 and 6.
Fig. 5 is a schematic block diagram of an apparatus for processing big data according to an embodiment of the invention. As shown in Fig. 5, the apparatus 500 includes: a receiving module 510, a first determining module 520, an obtaining module 530, a second determining module 540 and a noise-adding module 550.
The receiving module 510 is configured to receive a query instruction sent by a client and to determine a query function K according to the query instruction.
The first determining module 520 is configured to query the big data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
The obtaining module 530 is configured to obtain the sensitivity S(K) of the query function K determined by the first determining module, where the sensitivity S(K) characterizes the sensitivity of the query function K.
The second determining module 540 is configured to determine, according to the query result R and the sensitivity S(K) obtained by the obtaining module, the noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one with the query result components R_j.
The noise-adding module 550 is configured to add noise to the query result components R_j according to the noise components N_j determined by the second determining module, obtaining the noise-added query result R' = {R'_j}.
Specifically, the receiving module 510 selects a specific statistical query function K according to the client's aggregate-information query requirement on the relevant user data set. For example, the statistical query function may be a sum or average function, or a hash function obtained by training on classified query results. The query result R is obtained by performing a statistical query on the big data set D according to the query function K.
Optionally, in one embodiment of the invention, the obtaining module 530 is specifically configured to: compute the query result K(D1) of data set D1 and the query result K(D2) of data set D2; and set the minimum of the difference between K(D1) and K(D2) as the value of the sensitivity S(K), where the data sets D1 and D2 differ in at most one record and are two different subsets of the big data set D. It should be understood that D1 and D2 differing in one record means that, with the same number of data elements in D1 and D2, the value or the value type of one element differs. It should also be noted that the difference between K(D1) and K(D2) here refers to the absolute value of the difference between the two.
Specifically, the obtaining module 530 is further configured to compute the sensitivity S(K) by the formula S(K) = min ||K(D1) - K(D2)||_M, where the data sets D1 and D2 differ in at most one record, D1 and D2 are two different subsets of the big data set D, and M denotes a metric space.
Optionally, in one embodiment of the invention, the second determining module 540 is specifically configured to: generate, according to the query result, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and obtain the noise N by correcting the noise N' according to the sensitivity S(K), where each noise component N_j follows the Laplace noise distribution of the sensitivity S(K).
Optionally, in one embodiment of the invention, the first determining module 520 is further configured to: determine, according to the client's query requirement on the big data set D, that the query function K(x) for the query result R is a hash function F(x), where the query result components of R are R_j, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1.
Optionally, in one embodiment of the invention, the first determining module 520 is further configured to: train on a training set of the big data set D to obtain the hash function F(x), and generate a first hash classification table T of the hash function F(x). The training set is a subset of the big data set D and includes an attribute set X and classification labels Y, where the attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
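As a rough illustration of a first hash classification table, the sketch below counts training labels per hash bucket. It assumes that each element of the attribute set X has already been encoded as an integer bit string, that each label in Y is an integer index, and that the bucket key is the bitwise AND of the record with a mask, mirroring the key computation in the prediction flow. The training procedure itself is not reproduced here, and `build_table` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def build_table(
    x_encoded: Sequence[int],
    y_labels: Sequence[int],
    mask: int,
    num_labels: int,
) -> Dict[int, List[float]]:
    """Count, per hash bucket, how many training records carry each label."""
    table: Dict[int, List[float]] = {}
    for x, y in zip(x_encoded, y_labels):
        key = mask & x  # bucket key for this record
        counts = table.setdefault(key, [0.0] * num_labels)
        counts[y] += 1.0
    return table
```

Building m such tables with m different masks would give a set of tables of the kind the classifier averages over.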
Optionally, in one embodiment of the invention, the first determining module 520 is further configured to perform noise-adding processing on the first hash classification table T of the hash function F(x) according to the noise N_j, obtaining a second hash classification table T' whose query result R'_j has differential privacy.
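Adding noise to a hash classification table can be sketched as follows. The code assumes a table stored as a dictionary from bucket key to a list of per-label counts, and assumes that every count receives an independent draw from Lap(S/ε). How the noise is allocated across buckets is not specified in this passage, so the names and the allocation are illustrative only.

```python
import random
from typing import Dict, List


def add_noise_to_table(
    table: Dict[int, List[float]],
    sensitivity: float,
    epsilon: float,
    rng: random.Random,
) -> Dict[int, List[float]]:
    """Return T': a copy of T with Laplace noise added to every count."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rate = epsilon / sensitivity  # 1 / scale of Lap(sensitivity / epsilon)
    return {
        # Difference of two exponential draws is a Laplace(0, 1/rate) draw.
        key: [c + rng.expovariate(rate) - rng.expovariate(rate) for c in counts]
        for key, counts in table.items()
    }
```

The input table is left unmodified, so the original counts never need to leave the trusted side once T' has been produced.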
Fig. 6 is a schematic block diagram of an apparatus for processing big data according to another embodiment of the invention. It should be noted that the device shown in Fig. 6 corresponds to the embodiments of Figs. 2 to 4 and can implement each process of the methods for processing big data in the embodiments of Figs. 1 to 4; detailed description is omitted where appropriate to avoid repetition. The apparatus for processing big data shown in Fig. 6 includes: a processor 610, a memory 620 and a bus 630. The processor 610 and the memory 620 are connected by the bus 630, the memory 620 is configured to store instructions, and the processor 610 is configured to execute the instructions stored in the memory 620. Specifically, the processor 610 is configured to: receive a query instruction sent by a client, and determine a query function K according to the query instruction; query the big data set D according to the query function K to obtain a query result R, where R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1; obtain the sensitivity S(K) of the query function K, where the sensitivity S(K) characterizes the sensitivity of the query function K; determine, according to the query result R and the sensitivity S(K), the noise N to be added to the query result R, where N = {N_j} and the noise components N_j of the noise N correspond one-to-one with the query result components R_j; and perform noise-adding processing on the query result components R_j according to the noise components N_j, obtaining the noise-added query result R' = {R'_j}.
Optionally, in one embodiment of the invention, the processor 610 is configured to: obtain the query result K(D1) of data set D1 and the query result K(D2) of data set D2; and take the minimum of the difference between K(D1) and K(D2) in a metric space as the value of the sensitivity S(K), where the data sets D1 and D2 differ in at most one record and are two different subsets of the big data set D.
Specifically, the processor 610 is configured to obtain the sensitivity S(K) by the formula S(K) = min ||K(D1) - K(D2)||_M, where the data sets D1 and D2 differ in at most one record, D1 and D2 are two different subsets of the big data set D, and M denotes a metric space.
Optionally, in one embodiment of the invention, the processor 610 is configured to: generate, according to the query result, noise N' that follows a Laplace noise distribution, where the noise components of N' are mutually independent; and obtain the noise N by correcting the noise N' according to the sensitivity S(K), where each noise component N_j follows the Laplace noise distribution of the sensitivity S(K).
Those of ordinary skill in the art will appreciate that the method steps and units described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and composition of each embodiment have been described above in general terms according to their function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the invention.
The methods or steps described in connection with the embodiments disclosed herein may be implemented in hardware, in a software program executed by a processor, or in a combination of the two. The software program may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Although the invention has been described in detail with reference to the accompanying drawings and in connection with preferred embodiments, the invention is not limited thereto. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various equivalent modifications or substitutions to the embodiments of the invention, and all such modifications or substitutions shall fall within the scope of the invention.

Claims (8)

1. A method for processing big data, characterized by comprising:
receiving a query instruction sent by a client, and determining a query function K according to the query instruction;
querying a big data set D according to the query function K to obtain a query result R, where the query result R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1;
obtaining a sensitivity S(K) of the query function K, where the sensitivity S(K) characterizes the sensitivity of the query function K;
determining, according to the query result R and the sensitivity S(K), noise N to be added to the query result R, where the noise N = {N_j}, and the noise components N_j of the noise N correspond one-to-one with the query result components R_j;
performing noise-adding processing on the query result components R_j according to the noise components N_j, to obtain a noise-added query result R' = {R'_j}.
2. The method according to claim 1, characterized in that obtaining the sensitivity S(K) of the query function K comprises:
obtaining a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2;
taking the minimum of the difference between the query result K(D1) and the query result K(D2) in a metric space as the value of the sensitivity S(K), where the data set D1 and the data set D2 are two different subsets of the big data set D, and the data set D1 and the data set D2 differ in at most one record.
3. The method according to claim 1 or 2, characterized in that determining the noise N according to the query result R and the sensitivity S(K) comprises:
generating, according to the query result R, noise N' that follows a Laplace noise distribution, where the noise components of the noise N' are mutually independent;
obtaining the noise N by correcting the noise N' according to the sensitivity S(K), where the noise components N_j of the noise N follow the Laplace noise distribution of the sensitivity S(K).
4. The method according to any one of claims 1 to 3, characterized in that the query function K is a hash function F, and the method comprises:
training on a training set of the big data set D to obtain the hash function F;
where the training set is a subset of the big data set D, the training set further includes an attribute set X and classification labels Y, the attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
5. An apparatus for processing big data, characterized by comprising:
a receiving module, configured to receive a query instruction sent by a client and to determine a query function K according to the query instruction;
a first determining module, configured to query a big data set D according to the query function K to obtain a query result R, where the query result R = {R_j}, 1 ≤ j ≤ m, and m is a positive integer greater than or equal to 1;
an obtaining module, configured to obtain a sensitivity S(K) of the query function K determined by the first determining module, where the sensitivity S(K) characterizes the sensitivity of the query function K;
a second determining module, configured to determine, according to the query result R and the sensitivity S(K) obtained by the obtaining module, noise N to be added to the query result R, where the noise N = {N_j}, and the noise components N_j of the noise N correspond one-to-one with the query result components R_j;
a noise-adding module, configured to add noise to the query result components R_j according to the noise components N_j determined by the second determining module, to obtain a noise-added query result R' = {R'_j}.
6. The apparatus according to claim 5, characterized in that the obtaining module is specifically configured to:
obtain a query result K(D1) of a data set D1 and a query result K(D2) of a data set D2;
set the minimum of the norm of the difference between the query result K(D1) and the query result K(D2) as the value of the sensitivity S(K), where the data set D1 and the data set D2 are two different subsets of the big data set D, and the data set D1 and the data set D2 differ in at most one record.
7. The apparatus according to claim 5 or 6, characterized in that the second determining module is specifically configured to:
generate, according to the query result R, noise N' that follows a Laplace noise distribution, where the noise components of the noise N' are mutually independent;
obtain the noise N by correcting the noise N' according to the sensitivity S(K), where the noise components N_j of the noise N follow the Laplace noise distribution of the sensitivity S(K).
8. The apparatus according to any one of claims 5 to 7, characterized in that the query function K is a hash function F, and the first determining module is further configured to:
train on a training set of the big data set D to obtain the hash function F;
where the training set is a subset of the big data set D, the training set includes an attribute set X and classification labels Y, the attribute set X is the set of data characterizing the attributes of the elements in the training set, and the classification labels Y are the set of data characterizing the classification results of the elements in the training set.
CN201510095692.7A 2015-03-04 2015-03-04 Big data processing method and apparatus Pending CN105989161A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510095692.7A CN105989161A (en) 2015-03-04 2015-03-04 Big data processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510095692.7A CN105989161A (en) 2015-03-04 2015-03-04 Big data processing method and apparatus

Publications (1)

Publication Number Publication Date
CN105989161A true CN105989161A (en) 2016-10-05

Family

ID=57038338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510095692.7A Pending CN105989161A (en) 2015-03-04 2015-03-04 Big data processing method and apparatus

Country Status (1)

Country Link
CN (1) CN105989161A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161005