CN107358115B

CN107358115B - A utility-aware privacy-removal method for multi-attribute data

Info

Publication number: CN107358115B
Application number: CN201710496086.5A
Authority: CN
Inventors: 陈为; 王叙萌; 关会华; 陈文龙; 劳天溢
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2017-06-26
Filing date: 2017-06-26
Publication date: 2019-09-20
Anticipated expiration: 2037-06-26
Also published as: CN107358115A

Abstract

The invention discloses a utility-aware privacy-removal method for multi-attribute data, comprising the following steps: Step 1: import preprocessed multi-attribute data; Step 2: define the required attributes and the sensitive attributes according to the attribute descriptions, set pre-grouping rules for the required attributes, and define the order of the required attributes according to their characteristics; Step 3: build a privacy exposure tree, using the attribute order and the pre-grouping rules from Step 2 as the basis for the layer order of the tree and for generating the branches of each layer, respectively; Step 4: encode the risk measurement results on the nodes and edges of the privacy exposure tree according to the sensitive attributes. The invention can flexibly select a suitable method from several common syntactic anonymization and differential privacy models to solve privacy problems, thereby meeting a variety of privacy requirements for different data. The privacy exposure tree exploits the inherent advantages of the tree structure to achieve a space-efficient layout of multidimensional sets.

Description

A utility-aware privacy-removal method for multi-attribute data

Technical field

The present invention relates to the technical field of information hiding, and in particular to a utility-aware privacy-removal method for multi-attribute data.

Background art

Before displaying data or releasing data for analysis, the data owner needs to consider whether the data involve sensitive personal information. If they do, the data must undergo privacy-removal processing in advance.

Related prior-art methods fall mainly into the following three aspects. The first aspect is privacy protection models. Many automated methods have been proposed in the field of privacy protection, among which syntactic anonymity models and differential privacy models are the two most common classes. k-anonymity (L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570, 2002.), l-diversity (A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1):3, 2007.) and t-closeness (N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the IEEE 23rd International Conference on Data Engineering, pp. 106–115. IEEE, 2007.) are the three most classic syntactic anonymity models. They judge whether the current dataset meets the privacy protection standard from three perspectives, respectively: the number of data items in an equivalence class, the number of distinct sensitive attribute values in an equivalence class, and the difference between the sensitive attribute distribution within an equivalence class and that in the whole dataset. They provide privacy protection by merging sets and thereby blurring the equivalence classes. Differential privacy models instead protect the sensitive information in the data by adding suitable noise to the relevant attribute values. Unfortunately, these automated methods all have shortcomings: for example, syntactic anonymity models have difficulty handling high-dimensional data, and differential privacy methods can lose the correlations in the data.

The second aspect is the trade-off between privacy and utility. Some syntactic anonymization methods use the distance between the merged sets to reflect the degree of information loss. Based on this criterion, Loukides (J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu. Utility-based anonymization using local recoding. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–790, 2006.) and Shao (Data utility and privacy protection trade-off in k-anonymisation. In Proceedings of the International Workshop on Privacy and Anonymity in Information Society, pp. 36–45. ACM, 2008.) resolve the trade-off by partitioning the data and determining the optimal clustering: they first list the feasible parameter settings and the optimality requirements, and then choose an appropriate combination from the privacy-utility curve. In addition, Rastogi et al. proposed the αβ anonymization algorithm (V. Rastogi, D. Suciu, and S. Hong. The boundary between privacy and utility in data publishing. In Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 531–542. VLDB Endowment, 2007.), which analyzes privacy and utility against a bounded adversary. For differential privacy models, an information-theoretic framework of information flow (M. S. Alvim, M. E. Andrés, K. Chatzikokolakis, P. Degano, and C. Palamidessi. Differential privacy: on the trade-off between utility and information leakage. In Proceedings of the International Workshop on Formal Aspects in Security and Trust, pp. 39–54. Springer, 2011.) can quantify information leakage and utility. Ghosh et al. (A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. SIAM Journal on Computing, 41(6):1673–1693, 2012.) construct a geometric mechanism that satisfies the differential privacy constraint by adding random Laplacian noise while minimizing information loss as far as possible.

However, the utility-preserving techniques proposed by the above methods all target the model itself; none of them comprehensively considers the characteristics of the data from the perspective of analysis.

The third aspect is visualization research on privacy, which divides broadly into theoretical research and related systems. On the theoretical side, van Wijk proposed a model of how the mapping of data influences the user's perception (J. J. van Wijk. The value of visualization. In Proceedings of the Visualization. IEEE, pp. 79–86, 2005.). Dasgupta and Kosara (A. Dasgupta and R. Kosara. Adaptive privacy-preserving visualization using parallel coordinates. IEEE Transactions on Visualization and Computer Graphics, 17(12):2241–2248, 2011.) hold that visual clustering and data clustering provide privacy guarantees by increasing uncertainty, and that data clustering methods achieve privacy protection at the cost of reduced utility. The related systems are mainly the privacy-removing data processing systems for trajectory data and graph data proposed by Chou et al. (J.-K. Chou, C. Bryan, and K.-L. Ma. Privacy preserving visualization for social network data with ontology information. 2017.) (J.-K. Chou, Y. Wang, and K.-L. Ma. Privacy preserving event sequence data visualization using a sankey diagram-like representation. In Proceedings of the SIGGRAPH ASIA Symposium on Visualization, 2016.).

However, the existing privacy protection methods described above do not give users sufficient feedback on utility.

Summary of the invention

The present invention provides a utility-aware privacy-removal method for multi-attribute data, which helps users measure the loss of utility in real time and lets them flexibly configure and discover the privacy problems involved in the data.

A utility-aware privacy-removal method for multi-attribute data comprises the following steps:

Step 1: import preprocessed multi-attribute data;

Step 2: define the required attributes and the sensitive attributes according to the attribute descriptions, set pre-grouping rules for the required attributes, and define the order of the required attributes according to their characteristics. Required attributes are the data that need to be processed and displayed; sensitive attributes are attributes that involve privacy. Both the pre-grouping rules and the order of the required attributes can be defined manually. To better implement the invention, sensitive attributes are preferably placed in the later layers and attributes with fewer groups in the earlier layers. Doing so lets more information be observed during processing and makes the processing more flexible. In general, the pre-grouping rules of the required attributes are set with reference to common sense.

Step 3: build a privacy exposure tree, using the attribute order and the pre-grouping rules from Step 2 as the basis for the layer order of the tree and for generating the branches of each layer, respectively. The tree is built as follows. Each node of the tree represents a set, and the set represented by a node on layer i is constrained only by the attribute values corresponding to layers 1, 2, ..., i. In addition, to simplify the topology, the child nodes on each layer are merged into cluster nodes according to the attribute value of that layer. Both kinds of node can be displayed in the view at the same time.
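The construction in Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `build_levels` and `cluster_nodes`, the `group_fns` mapping, the record layout, and the sample values are all hypothetical and do not appear in the original text.

```python
from collections import defaultdict

def build_levels(records, attr_order, group_fns):
    """Return one dict per tree layer, mapping a path of group keys to row indices.

    The set on layer i is constrained only by the attributes of layers 1..i.
    """
    levels = []
    current = {(): list(range(len(records)))}  # root: the whole dataset
    for attr in attr_order:
        nxt = defaultdict(list)
        for path, rows in current.items():
            for r in rows:
                key = group_fns[attr](records[r][attr])  # apply the pre-grouping rule
                nxt[path + (key,)].append(r)
        current = dict(nxt)
        levels.append(current)
    return levels

def cluster_nodes(level):
    """Merge the child nodes of one layer by that layer's own attribute value."""
    clusters = defaultdict(list)
    for path, rows in level.items():
        clusters[path[-1]].extend(rows)
    return dict(clusters)

# Hypothetical sample data and pre-grouping rules.
records = [
    {"children": 0, "elderly": 0, "income": 30000},
    {"children": 0, "elderly": 1, "income": 52000},
    {"children": 2, "elderly": 0, "income": 75000},
    {"children": 2, "elderly": 1, "income": 41000},
]
group_fns = {
    "children": lambda v: "0" if v == 0 else "1+",
    "elderly": lambda v: "0" if v == 0 else "1+",
}
levels = build_levels(records, ["children", "elderly"], group_fns)
second_layer_clusters = cluster_nodes(levels[1])
```

Here each key of `levels[i]` plays the role of a child node (a set constrained by every attribute up to that layer), while `cluster_nodes` groups those child nodes by the current layer's attribute alone.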

Step 4: encode the risk measurement results on the nodes and edges of the privacy exposure tree according to the sensitive attributes;

The risk measurement results comprise risk increments, which are encoded on edges, and risk values, which are encoded on nodes. The privacy exposure risk of each set is measured following the ideas of the three classic syntactic anonymity models (k-anonymity, l-diversity, t-closeness). The three measures are, specifically: k corresponds to the number of data items in the set; l corresponds to the number of distinct sensitive attribute values among the data items in the set; t corresponds to the difference between the sensitive attribute distribution in the set and the overall distribution. If there are sensitive attributes A₁, A₂, …, A_n, then 1+2n risk measures are generated in turn, ordered as k, l(A₁), t(A₁), l(A₂), t(A₂), …, l(A_n), t(A_n), and they are encoded on the nodes as ribbons using the transparency of three different colors. Because child nodes are numerous and have little space, for them only the maximum of all risk measures is encoded, using the transparency of gray. In addition, the risk increment between a parent node and a child node is encoded on the edge, also using the transparency of gray.
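The three measures can be illustrated with a short sketch. The text only says that t is "the difference" between two distributions; the total variation distance used below is an assumption chosen for simplicity, and the function names and sample records are hypothetical. How the raw values are normalized into a risk value or a transparency is not specified in the text and is left out.

```python
from collections import Counter

def k_value(rows):
    """k: number of data items in the set."""
    return len(rows)

def l_value(rows, records, attr):
    """l: number of distinct sensitive values among the data items in the set."""
    return len({records[r][attr] for r in rows})

def t_value(rows, records, attr):
    """t: distance between the set's distribution and the overall distribution.

    Assumption: total variation distance; the source does not name a metric.
    """
    overall = Counter(rec[attr] for rec in records)
    subset = Counter(records[r][attr] for r in rows)
    n_all, n_sub = len(records), len(rows)
    return 0.5 * sum(abs(subset[v] / n_sub - overall[v] / n_all) for v in overall)

def risk_measures(rows, records, sensitive_attrs):
    """Return the 1 + 2n measures in the order k, l(A1), t(A1), ..., l(An), t(An)."""
    measures = [("k", k_value(rows))]
    for a in sensitive_attrs:
        measures.append((f"l({a})", l_value(rows, records, a)))
        measures.append((f"t({a})", t_value(rows, records, a)))
    return measures

# Hypothetical records with one sensitive attribute.
records = [
    {"elderly": "0", "income": "low"},
    {"elderly": "0", "income": "low"},
    {"elderly": "1+", "income": "high"},
    {"elderly": "1+", "income": "high"},
]
measures = risk_measures([0, 1], records, ["income"])
```

In this toy case the set holds two items that share a single income value, so k is small, l is 1, and the set's distribution is far from the overall one.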

Preferably, the method further comprises Step 5: perform attribute value merging based on syntactic anonymization on the privacy exposure tree. When nodes are merged by dragging, the (exact) values of the corresponding attribute for all data in the two sets represented by the two nodes are replaced with one and the same fuzzy range, which covers all original attribute values in the two sets. The merged data thus have their exact attribute values anonymized, which protects privacy.
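The merge in Step 5 can be sketched for a numerical attribute as follows. This is an illustrative assumption rather than the original implementation: the names `as_range` and `merge_sets` are hypothetical, ranges are represented as `(low, high)` tuples, and categorical attributes (which would need a generalization hierarchy) are not handled.

```python
def as_range(value):
    """Treat an exact value v as the degenerate range (v, v)."""
    return value if isinstance(value, tuple) else (value, value)

def merge_sets(records, rows_a, rows_b, attr):
    """Replace attr in both sets with one range covering every original value."""
    merged = list(rows_a) + list(rows_b)
    ranges = [as_range(records[r][attr]) for r in merged]
    lo = min(low for low, _ in ranges)
    hi = max(high for _, high in ranges)
    for r in merged:
        records[r][attr] = (lo, hi)  # exact values are no longer recoverable
    return merged

# Hypothetical example: merge the "1 elderly" node with the "2 or more" node.
records = [{"elderly": 1}, {"elderly": 2}, {"elderly": 3}, {"elderly": 0}]
merged = merge_sets(records, [0], [1, 2], "elderly")
```

Accepting already-merged ranges as input lets the same function be applied repeatedly, which matches the interactive drag-to-merge workflow described above.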

Preferably, the method further comprises Step 6: on the privacy exposure tree, add noise of different magnitudes, based on differential privacy, to specific attributes of specific sets. Noise based on differential privacy is added to the corresponding attribute values of the data in the selected set. The noise increases the uncertainty of the data, so that a person observing the data finds it difficult to determine the actual attribute values.
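A minimal sketch of Step 6, assuming the standard Laplace mechanism with scale `sensitivity / epsilon`. The text does not state which noise distribution or parameters are used, so the function name `add_laplace_noise`, the `sensitivity` and `epsilon` parameters, and the sample values are all assumptions made for illustration.

```python
import random

def add_laplace_noise(records, rows, attr, sensitivity, epsilon, rng=None):
    """Add Laplace(0, sensitivity / epsilon) noise to attr for the selected rows only."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon  # smaller epsilon means larger noise
    for r in rows:
        # The difference of two i.i.d. exponential variables is Laplace-distributed.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        records[r][attr] += noise

# Hypothetical example: perturb income only in the selected set (rows 0 and 1).
records = [{"income": 30000.0}, {"income": 52000.0}, {"income": 75000.0}]
add_laplace_noise(records, [0, 1], "income",
                  sensitivity=1000.0, epsilon=0.5, rng=random.Random(0))
```

Choosing a different `epsilon` per set is one way to realize the "noise of different magnitudes" mentioned above: riskier sets can be given a smaller `epsilon`.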

Preferably, the method further comprises the following step:

Step 7: expand a two-dimensional matrix over the required attributes. Each cell in the upper-right part of the matrix shows the corresponding joint distribution of the original data, the diagonal shows statistical charts of the distribution of each attribute, and each cell in the lower-left part shows the corresponding joint distribution of the processed data. The statistical charts can be histograms, line charts, and so on. The two-dimensional matrix is linked with the privacy exposure tree: while processing data on the privacy exposure tree, the user can follow the current change in utility in real time and comprehensively through the charts and the quantified values in the two-dimensional matrix.

Preferably, in Step 7, the types of joint distribution chart for the original data include: categorical-categorical original chart: the number of data items of the same kind is encoded by the radius of a point; categorical-numerical original chart: the point is deformed into a rectangle, and the number of data items of the same kind is encoded by the length of the rectangle along the categorical axis; numerical-numerical original chart: a scatter plot.

Preferably, in Step 7, the types of joint distribution chart for the processed data include: result of processing based on syntactic anonymization: a matrix chart; result of processing based on differential privacy: a scatter plot; combined processing result: a hybrid chart of a matrix chart and a scatter plot.
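The layout and chart choices of Step 7 amount to a small dispatch over cell position, attribute types, and the processing applied. The sketch below is only illustrative: the function names, the string labels, and the `methods` bookkeeping (which records whether an attribute was merged or had noise added) are assumptions, as is the fallback to the original chart for attribute pairs that were never processed.

```python
def original_chart(type_a, type_b):
    """Chart type for the joint distribution of the original data."""
    kinds = {type_a, type_b}
    if kinds == {"categorical"}:
        return "points sized by count"
    if kinds == {"numerical"}:
        return "scatter plot"
    return "rectangles sized by count"  # one categorical, one numerical

def processed_chart(applied):
    """Chart type for processed data; applied is a subset of {"merge", "noise"}."""
    if applied == {"merge"}:
        return "matrix chart"
    if applied == {"noise"}:
        return "scatter plot"
    if applied == {"merge", "noise"}:
        return "matrix chart + scatter plot"
    return None  # nothing applied to either attribute

def cell_content(i, j, attrs, types, methods):
    """Describe cell (row i, column j) of the two-dimensional matrix."""
    a, b = attrs[i], attrs[j]
    if i == j:
        return f"distribution of {a}"
    original = "original: " + original_chart(types[a], types[b])
    if i < j:  # upper-right part
        return original
    applied = methods.get(a, set()) | methods.get(b, set())
    chart = processed_chart(applied)
    return original if chart is None else "processed: " + chart

# Hypothetical configuration.
attrs = ["insurance", "income", "children", "elderly"]
types = {"insurance": "numerical", "income": "numerical",
         "children": "categorical", "elderly": "categorical"}
methods = {"insurance": {"merge"}, "income": {"merge"}}
upper = cell_content(0, 1, attrs, types, methods)
lower = cell_content(1, 0, attrs, types, methods)
```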

Preferably, the method further comprises the following step:

Step 8: calculate a utility index for each required attribute in real time, and display and update it in the two-dimensional matrix.

The specific method is as follows. For each data item, let the attribute values before and after processing be f_D(x) and f_D'(x), respectively. If the attribute is numerical, sort the values of that attribute in the original dataset and in the processed dataset in ascending order. Assuming there are m data items, that f_D(x) ranks i-th in the original dataset, and that f_D'(x) ranks j-th in the processed data, compute:

u(f_D(x), f_D'(x)) = 1 - |i - j| / (m - 1);

For categorical data without hierarchy information, u(f_D(x), f_D'(x)) is 1 if f_D(x) = f_D'(x), and 0 otherwise.

For categorical data with hierarchy information, compute:

u(f_D(x), f_D'(x)) = level(f_D(x), f_D'(x)) / H;

where level(f_D(x), f_D'(x)) is the level of the common ancestor of f_D(x) and f_D'(x), and H is the number of levels of the whole tree.

Finally, the utility index of each attribute is calculated from u(f_D(x), f_D'(x)):

U(f_D(x), f_D'(x)) = Σ u(f_D(x), f_D'(x)) / n.
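The three cases of the utility index can be sketched as follows. Several points are assumptions rather than statements from the text: n is taken to be the number of data items being averaged over, ties in the numerical ranking are broken by position, the root of the hierarchy is assigned level 0 so that an unchanged leaf value scores 1, and all function names and the `parent` mapping are hypothetical.

```python
def ranks(values):
    """Rank of each item after an ascending sort (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    rank = [0] * len(values)
    for pos, idx in enumerate(order):
        rank[idx] = pos
    return rank

def numeric_utility(original, processed):
    """Mean of 1 - |i - j| / (m - 1) over all m data items."""
    m = len(original)
    if m < 2:
        return 1.0
    pairs = zip(ranks(original), ranks(processed))
    return sum(1 - abs(i - j) / (m - 1) for i, j in pairs) / m

def categorical_utility(original, processed):
    """Share of data items whose value is unchanged."""
    return sum(a == b for a, b in zip(original, processed)) / len(original)

def level(node, parent):
    """Depth of a node in the hierarchy; the root has level 0 (assumption)."""
    depth = 0
    while parent.get(node) is not None:
        node = parent[node]
        depth += 1
    return depth

def common_ancestor_level(a, b, parent):
    """Level of the deepest common ancestor; a and b must share one root."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)
    node = b
    while node not in ancestors:
        node = parent[node]
    return level(node, parent)

def hierarchical_utility(original, processed, parent, height):
    """Mean of level(common ancestor) / H over all data items."""
    total = sum(common_ancestor_level(a, b, parent)
                for a, b in zip(original, processed))
    return total / (height * len(original))

# Hypothetical two-level hierarchy: "*" -> low/high -> leaves.
parent = {"*": None, "low": "*", "high": "*", "L1": "low", "L2": "low", "H1": "high"}
u_num = numeric_utility([10, 20, 30], [30, 20, 10])
u_cat = categorical_utility(["a", "b", "c", "d"], ["a", "b", "x", "y"])
u_hier = hierarchical_utility(["L1", "L1", "L1"], ["L1", "low", "*"], parent, 2)
```

Note that the rank-based formula assumes the processed values can still be sorted; values replaced by ranges would need an additional ordering rule that the text does not specify.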

The utility matrix is updated after each data processing operation, and it shows the current utility index of each attribute together with the change in the index caused by the previous operation.

To provide a more intuitive way of comparing distributions, the method preferably further comprises the following step:

Step 9: unify the joint distributions of the original data and of the processed data from Step 7 to the same granularity. If the original granularities of the data are inconsistent, the data are restored to a uniform distribution at their original granularity, and the data items are then counted at the newly defined granularity;

The two charts at the unified granularity are then subtracted: in each cell, the number of data items of the original data is subtracted from the number of data items of the processed data to obtain a difference. The value of this difference, from positive through zero to negative, is mapped onto a color gradient between a pair of contrasting colors centered on white. The depth and distribution of the colors clearly convey to the user how the two-dimensional joint distributions in the data have changed.
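Step 9 can be illustrated in one dimension (the two-dimensional case applies the same idea per axis). The sketch is an assumption-laden simplification: the function names are hypothetical, coarse ranges are assumed to have non-zero width, and red and blue are chosen arbitrarily as the pair of contrasting colors, which the text does not name.

```python
def spread_uniform(coarse_counts, fine_edges):
    """Spread each coarse range's count uniformly over the fine bins it overlaps.

    coarse_counts: list of ((low, high), count) with high > low.
    fine_edges: sorted edges of the new, unified granularity.
    """
    fine = [0.0] * (len(fine_edges) - 1)
    for (lo, hi), count in coarse_counts:
        width = hi - lo
        for b in range(len(fine)):
            overlap = min(hi, fine_edges[b + 1]) - max(lo, fine_edges[b])
            if overlap > 0:
                fine[b] += count * overlap / width
    return fine

def difference(processed, original):
    """Processed count minus original count, cell by cell."""
    return [p - o for p, o in zip(processed, original)]

def diverging_color(value, max_abs):
    """Map a difference to RGB: white at zero, red if positive, blue if negative."""
    if max_abs == 0:
        return (255, 255, 255)
    strength = min(abs(value) / max_abs, 1.0)
    fade = round(255 * (1 - strength))
    return (255, fade, fade) if value > 0 else (fade, fade, 255)

# Hypothetical example: four fine bins, processed data merged into two ranges.
fine_edges = [0, 10, 20, 30, 40]
original_counts = [4, 0, 0, 4]
processed_counts = spread_uniform([((0, 20), 4), ((20, 40), 4)], fine_edges)
diff = difference(processed_counts, original_counts)
colors = [diverging_color(d, max(abs(x) for x in diff)) for d in diff]
```

In this example merging flattens the two peaks at the ends of the range, so the end bins come out blue (items lost) and the middle bins red (items gained).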

Beneficial effects of the present invention:

The utility-aware privacy-removal method for multi-attribute data of the present invention can flexibly select a suitable method from several common syntactic anonymization and differential privacy models to solve privacy problems, thereby meeting a variety of privacy requirements for different data. The privacy exposure tree exploits the inherent advantages of the tree structure to achieve a space-efficient layout of multidimensional sets. At the same time, the collapsible branches and the aggregation design help users browse multidimensional data efficiently, and the increment encoding on the tree edges further helps users quickly locate the source of a privacy problem. The utility matrix provides visual feedback that earlier automatic algorithms could not provide, which makes the preservation of utility meaningful.

Brief description of the drawings

Fig. 1 is a schematic diagram of the process by which the method of the present invention constructs the privacy exposure tree.

Fig. 2 is a schematic diagram of the privacy exposure tree constructed by the method of the present invention.

Fig. 3 is a schematic diagram of the utility matrix constructed by the method of the present invention.

Fig. 4 shows the comparison between the processing result and the original data after the method of the present invention processes the data by syntactic anonymization.

Fig. 5 shows the comparison between the processing result and the original data after the method of the present invention processes the data by differential privacy.

Fig. 6 is a schematic diagram showing that, once the branches with the main problems are collapsed in the method of the present invention, the privacy exposure risk of the other parts can be observed more intuitively.

Fig. 7 shows the tree after two branches are merged in the method of the present invention; comparison with Fig. 2 shows that most of the privacy exposure risk has been resolved.

Detailed description of the embodiment

In this embodiment, the dataset used is part of the Public Use Microdata Sample (PUMS) collected during the 2015 census of the State of Wyoming, USA. Much of the information in this dataset is recorded per household.

The utility-aware privacy-removal method for multi-attribute data of this embodiment comprises the following steps:

Step 1: import preprocessed multi-attribute data. The following four attributes are extracted from the dataset: "insurance expenditure" (annual), "family income" (over the past year), "children" (number of household members under 18), and "elderly" (number of household members over 65). Family income is regarded as the sensitive attribute that needs to be protected.

Step 2: define the required attributes and the sensitive attributes according to the attribute descriptions, set pre-grouping rules for the required attributes, and define the order of the required attributes according to their characteristics. The pre-grouping rules are set with reference to common sense: the attribute values of categorical data are used directly as the grouping basis; for numerical attributes, specific grouping rules are defined according to the meaning of the attribute and the quantiles of the data. For the ordering, sensitive attributes are placed in the later layers and attributes with fewer groups in the earlier layers.
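Quantile-based pre-grouping of a numerical attribute might look like the sketch below. It uses the standard-library `statistics.quantiles` with its default method; the function name `quantile_grouper`, the choice of four groups, and the sample values are assumptions for illustration only, and in practice the cut points would also be adjusted by hand to respect the meaning of the attribute.

```python
import bisect
import statistics

def quantile_grouper(values, n_groups=4):
    """Return quantile cut points and a function mapping a value to a group index."""
    cuts = statistics.quantiles(values, n=n_groups)  # n_groups - 1 cut points

    def group(value):
        # Index of the first group whose upper cut point exceeds the value.
        return bisect.bisect_right(cuts, value)

    return cuts, group

# Hypothetical annual insurance expenditures.
expenditures = [300, 500, 650, 800, 900, 1100, 1300]
cuts, group = quantile_grouper(expenditures, n_groups=4)
groups = [group(v) for v in expenditures]
```

The resulting `group` function can serve directly as the pre-grouping rule for one layer of the privacy exposure tree.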

Step 3: build a privacy exposure tree, using the attribute order and the pre-grouping rules from Step 2 as the basis for the layer order of the tree and for generating the branches of each layer, respectively. The privacy exposure tree is constructed as shown in Figs. 1 and 2. An edge represents the containment relationship between a parent node and a child node, and its shade encodes the risk increment after classification. A child node is the subset formed by combining all attributes of the current layer and the layers above it; a cluster node is the subset classified by the current attribute only. From the second layer onward, a cluster node therefore usually contains multiple child nodes.

Observing the privacy exposure tree shown in Fig. 2 quickly reveals that households with at least one elderly member are more prone to privacy exposure.

Step 4: encode the risk measurement results on the nodes and edges of the privacy exposure tree according to the sensitive attributes. As shown in Fig. 2, the privacy exposure risk of each set is measured following the ideas of the three classic syntactic anonymity models (k-anonymity, l-diversity, t-closeness). The three measures are, specifically:

k corresponds to the number of data items in the set;

l corresponds to the number of distinct sensitive attribute values among the data items in the set;

t corresponds to the difference between the sensitive attribute distribution in the set and the overall distribution.

The only sensitive attribute here is family income, so 3 risk measures are generated, ordered as k, l(family income), t(family income), and they are encoded on the nodes as ribbons using the transparency of three different colors. Because child nodes are numerous and have little space, for them only the maximum of all risk measures is encoded, using the transparency of gray. In addition, the risk increment between a parent node and a child node is encoded on the edge, also using the transparency of gray.

Step 5: perform attribute value merging based on syntactic anonymization on the privacy exposure tree. When the two most problematic branches are collapsed (the branch with one elderly member and the branch with two or more elderly members), as shown in Fig. 6, the color of the whole tree fades; that is, most of the problems are related to these two attribute values. To solve this, the branches with privacy problems are merged, and Fig. 7 shows the privacy exposure tree after merging. Two nodes beyond the third layer are then found to retain some risk, and in the first attempt the merge operation is applied again. However, when the original correlations are inspected through the utility matrix, the earlier chart information turns out to have been affected to a large extent, as shown in Fig. 7.

Step 6: on the privacy exposure tree, add noise of different magnitudes, based on differential privacy, to specific attributes of specific sets.

Step 7: expand a two-dimensional matrix over the required attributes. Each cell in the upper-right part of the matrix shows the corresponding joint distribution of the original data, the diagonal shows statistical charts of the distribution of each attribute, and each cell in the lower-left part shows the corresponding joint distribution of the processed data. As shown in Fig. 3, the two-dimensional matrix is the utility matrix, in which the diagonal holds the one-dimensional distribution chart of each attribute. Above the diagonal are the original charts: children-elderly is a categorical-categorical original chart, insurance expenditure-family income is a numerical-numerical original chart, and the rest are all numerical-categorical original charts. Below the diagonal are the charts of the results of processing based on syntactic anonymization; the elderly and children attributes still show the original chart because they have not been processed. The pre-clustering on each dimension in this embodiment is relatively coarse, yet it is still easy to see that family income and insurance expenditure are positively correlated. Compared with households without children (most of which spend 500-900 dollars per year), households with children (most of which spend 900-1300 dollars per year) tend to buy more insurance.

Step 8: calculate a utility index for each required attribute in real time, and display and update it in the two-dimensional matrix. The values are shown in the data at the top of Fig. 3.

Step 9: unify the joint distributions of the original data and of the processed data from Step 7 to the same granularity. Because the original granularities of the data are inconsistent (the processed data are coarser), the data are restored to a uniform distribution at their original granularity, and the data items are then counted at the newly defined granularity;

The two charts at the unified granularity are then subtracted: in each cell, the number of data items of the original data is subtracted from the number of data items of the processed data to obtain a difference. The value of this difference, from positive through zero to negative, is mapped onto a color gradient between a pair of contrasting colors centered on white.

Fig. 4 shows the result observed with the above scheme. The part outlined in black is the distribution feature we wish to retain, but the figure shows that this part is colored very deeply, which means it has changed considerably. We therefore roll back and apply the differential privacy technique instead. In Fig. 5, the outlined part is colored more lightly, so the loss of utility is kept at a more acceptable level.

Claims

1. A utility-aware privacy-removal method for multi-attribute data, characterized by comprising the following steps:

Step 1: import preprocessed multi-attribute data;

Step 2: define the required attributes and the sensitive attributes according to the attribute descriptions, set pre-grouping rules for the required attributes, and define the order of the required attributes according to their characteristics;

Step 3: build a privacy exposure tree, using the attribute order and the pre-grouping rules from Step 2 as the basis for the layer order of the privacy exposure tree and for generating the branches of each layer, respectively;

Step 4: encode the risk measurement results on the nodes and edges of the privacy exposure tree according to the sensitive attributes.

2. The utility-aware privacy-removal method for multi-attribute data according to claim 1, characterized by further comprising Step 5: performing attribute value merging based on syntactic anonymization on the privacy exposure tree.

3. The utility-aware privacy-removal method for multi-attribute data according to claim 1, characterized by further comprising Step 6: on the privacy exposure tree, adding noise of different magnitudes, based on differential privacy, to specific attributes of specific sets.

4. The utility-aware privacy-removal method for multi-attribute data according to claim 1, characterized by further comprising the following step:

Step 7: expand a two-dimensional matrix over the required attributes, wherein each cell in the upper-right part of the matrix shows the corresponding joint distribution of the original data, the diagonal shows statistical charts of the distribution of each attribute, and each cell in the lower-left part shows the corresponding joint distribution of the processed data.

5. The utility-aware privacy-removal method for multi-attribute data according to claim 4, characterized in that, in Step 7, the types of joint distribution chart for the original data include: categorical-categorical original chart: the number of data items of the same kind is encoded by the radius of a point; categorical-numerical original chart: the point is deformed into a rectangle, and the number of data items of the same kind is encoded by the length of the rectangle along the categorical axis; numerical-numerical original chart: a scatter plot.

6. The utility-aware privacy-removal method for multi-attribute data according to claim 4, characterized in that, in Step 7, the types of joint distribution chart for the processed data include: result of processing based on syntactic anonymization: a matrix chart; result of processing based on differential privacy: a scatter plot; combined processing result: a hybrid chart of a matrix chart and a scatter plot.

7. The utility-aware privacy-removal method for multi-attribute data according to claim 4, characterized by further comprising the following step:

8. The utility-aware privacy-removal method for multi-attribute data according to claim 7, characterized by further comprising the following step:

Step 9: unify the joint distributions of the original data and of the processed data from Step 7 to the same granularity; if the original granularities of the data are inconsistent, restore the data to a uniform distribution at their original granularity, and then count the data items at the newly defined granularity;

subtract the two charts at the same granularity: in each cell, subtract the number of data items of the original data from the number of data items of the processed data to obtain a difference, and map the value of this difference, from positive to negative, onto a color gradient between contrasting colors, the gradient being centered on white, which corresponds to zero.