CN107832631A

CN107832631A - Privacy protection method and system for data publishing

Info

Publication number: CN107832631A
Application number: CN201711115389.4A
Authority: CN
Inventors: 唐雪琴
Original assignee: Shanghai Feixun Data Communication Technology Co Ltd
Current assignee: Taizhou Jiji Intellectual Property Operation Co.,Ltd.
Priority date: 2017-11-13
Filing date: 2017-11-13
Publication date: 2018-03-23

Abstract

The invention discloses a privacy protection method for data publishing, comprising the following steps. S10: receive data and perform a diversity judgment on the number of distinct sensitive attribute values in the data, so that the subsequent equivalence classes have the same diversity. S20: partition the data into equivalence classes according to the diversity judgment result. S30: split the partitioned data. The method is simple to implement; data processed with it have a higher degree of privacy protection, lower information loss and better usability, and can resist a variety of privacy attacks.

Description

Privacy protection method and system for data publishing

Technical field

The present invention relates to the field of information security protection, and in particular to a privacy protection method and system for data publishing.

Background technology

With the rapid development of the Internet, people depend on networks ever more deeply and the volume of data grows quickly. Networks bring convenience (online shopping, money transfers and flight booking can all be done quickly without leaving home), but they also carry a substantial risk of information leakage. Personal private information, medical data, account passwords, bank card information and trade secrets are easily exploited once they spread over the network, leading to identity disclosure and property loss, and in serious cases even endangering life and health. This shows the importance of information security protection. After the PRISM incident, countries have also been strengthening network security, which brings new opportunities and challenges for data security and privacy protection.

To ensure data privacy, data must undergo privacy-protection processing when they are published and shared. At present, the attributes of a published data table are generally divided into three classes: (1) individual identifier attributes (Individually Identifier Attribute, ID), which identify an individual; (2) quasi-identifier attributes (Quasi-identifier, QI), which appear both in the private table and in external tables and can be linked to infer individual information; and (3) sensitive attributes (Sensitive Attribute, SA), which contain private information that the user does not want others to know.

Merely deleting the QI or ID attributes when publishing these three classes of attributes does not prevent privacy leakage: when the published data are linked with other data, identity information and sensitive attributes may still be disclosed. In 2002 Sweeney et al. proposed the k-anonymity privacy protection model, which effectively prevents linking attacks; however, k-anonymity places no constraint on sensitive attribute values and therefore cannot prevent background-knowledge attacks and homogeneity attacks. To address these problems, l-diversity, (α, k)-anonymity, t-closeness and other models were proposed in succession. These models process data mainly by generalization, which largely preserves the original semantic information but causes information loss and reduces data utility.

In recent years clustering algorithms have been widely used in data mining. Privacy-preserving data publishing requires each cluster in the published data set to be generalized to the same quasi-identifier, which closely resembles the clustering process in data mining; this has led to research on achieving l-diversity with clustering methods.

For example, the patent document with publication number CN104317904A discloses "a generalization method for weighted social networks", comprising: sorting nodes in descending order of node degree and grouping them; generalizing the weights of existing edges and computing the edge existence probability; traversing all anonymous group sets and extracting the sensitive attributes of all nodes to form sensitive attribute bags; computing the maximum similarity of the sensitive attribute bags between nodes and obtaining the generalized bag of each sensitive attribute bag from a generalization tree; and traversing the K-weight anonymous group sets to finally obtain a scheme satisfying K-Weighted-inv-l-diversityanonymous. That invention considers both edge weights and multiple sensitive attributes, so its privacy protection method is better suited to real social networks and can more completely protect multiple sensitive attributes in weighted graphs.

As another example, the patent document with publication number CN106874788A discloses "a privacy protection method in sensitive data publishing", comprising: receiving a data set from a user together with multiple corresponding generalization input trees; traversing each group of data in the data set and determining in turn whether each column of data in the group has a corresponding generalization input tree; if so, looking up the corresponding node in that tree according to the attribute value of the data and writing the node's information into a coordinate array; if not, writing the attribute value of the data directly into the coordinate array, thereby obtaining m rows of coordinate arrays; adding a flag bit with initial value 0 to each coordinate array; and establishing p clusters, where p rows of coordinate arrays are randomly selected from the m rows as the centre points of the p clusters. By clustering first and generalizing afterwards, that method improves computational efficiency and provides a safeguard for data privacy.

Summary of the invention

The technical problem to be solved by the present invention, in view of the above deficiencies of the prior art, is to provide a privacy protection method and system for data publishing that are simple to implement and offer a higher degree of privacy protection, lower information loss and better usability.

To achieve this object, the technical solution adopted by the present invention is:

A privacy protection method for data publishing, comprising the following steps:

S10: receive data and perform a diversity judgment on the number of distinct sensitive attribute values of the data;

S20: partition the data into equivalence classes according to the diversity judgment result;

S30: split the partitioned data.

Further, the diversity judgment in step S10 is specifically: comparing the number of distinct sensitive attribute values of the data with a diversity parameter L;

Step S20 comprises the following steps:

S21: if the number of distinct sensitive attribute values is greater than or equal to the diversity parameter L, single equivalence class partitioning is performed;

S22: if the number of distinct sensitive attribute values is less than the diversity parameter L, candidate equivalence class partitioning is performed.

Further, in step S10, data preprocessing is performed after the data are received and before the diversity judgment, comprising the following steps:

S11: standardize each quasi-identifier attribute value, mapping it to the range [0, 1];

S12: compute the weight of each quasi-identifier attribute;

S13: compute the integrated value of each record using the following formula:

$$W_i = \sum_{j=1}^{n} w_j \times x_{ij}, \quad 1 \le i \le \eta$$

where W_i is the integrated value, w_j is the weight, x_ij is the standardized quasi-identifier attribute value, n is the number of quasi-identifier attributes and η is the number of records.

Further, the data preprocessing also comprises:

S14: sort the records in the data by integrated value.
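Steps S13 and S14 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names, the record layout and the weights `[0.5, 0.25, 0.25]` are hypothetical, and the weights are simply passed in instead of being derived by step S12.

```python
from typing import Dict, List, Sequence


def integrated_value(x_row: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum W_i = sum_j w_j * x_ij over the normalized QI values of one record."""
    if len(x_row) != len(weights):
        raise ValueError("one weight per quasi-identifier attribute is required")
    return sum(w * x for w, x in zip(weights, x_row))


def sort_by_integrated_value(records: List[Dict], weights: Sequence[float]) -> List[Dict]:
    """Attach W to every record (S13) and return the records in ascending order of W (S14)."""
    scored = [dict(rec, W=integrated_value(rec["qi"], weights)) for rec in records]
    return sorted(scored, key=lambda rec: rec["W"])


# Hypothetical weights and already-normalized quasi-identifier values.
example_weights = [0.5, 0.25, 0.25]
example_records = [
    {"id": "r1", "qi": [0.8, 0.4, 0.4]},
    {"id": "r2", "qi": [0.2, 0.4, 0.8]},
    {"id": "r3", "qi": [0.4, 0.8, 0.4]},
]
ordered = sort_by_integrated_value(example_records, example_weights)
```

Sorting on a single scalar keeps the later grouping step a simple pass over adjacent records, which appears to be the reason the method reduces each record to one integrated value.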

Further, the equivalence class partitioning of the data in step S20 comprises the following steps:

S201: in order of integrated value, select a preset number of records at a time and place them in the same equivalence class;

S202: determine whether any records remain that have not been processed by step S201;

S203: if records remain that have not been processed by step S201, perform candidate equivalence class partitioning.

Further, candidate equivalence class partitioning is specifically:

Determine whether a candidate equivalence class exists for the data; if so, assign the data to the best candidate equivalence class; if not, assign the data to the optimal equivalence class.

Further, after step S30 the method also comprises the following step:

S40: join the split data on a common attribute and then publish them.

A privacy protection system for data publishing, the system comprising:

A judgment module, configured to receive data and perform a diversity judgment on the number of distinct sensitive attribute values of the data;

An equivalence class partitioning module, configured to partition the data into equivalence classes according to the diversity judgment result;

A splitting module, configured to split the partitioned data.

Further, the system also comprises:

A publishing module, configured to join the split data on a common attribute and then publish them;

The judgment module comprises:

A computing unit, configured to compute the integrated value of each record;

A sorting unit, configured to sort the records in the data by integrated value;

A comparing unit, configured to compare the number of distinct sensitive attribute values of the data with the diversity parameter L.

Further, the equivalence class partitioning module comprises:

A first partitioning unit, configured to perform single equivalence class partitioning;

A second partitioning unit, configured to perform candidate equivalence class partitioning;

The first partitioning unit comprises:

A first judgment subunit, configured to determine whether any records remain that have not yet been selected, a preset number at a time in order of integrated value, into an equivalence class;

The second partitioning unit comprises:

A second judgment subunit, configured to determine whether a candidate equivalence class exists for the data;

A classification subunit, configured to assign the data to the best candidate equivalence class if a candidate equivalence class exists, and to the optimal equivalence class otherwise.

With the above technical solution, the beneficial effects of the invention are:

By partitioning the data into equivalence classes based on diversity, the original information of the data is not destroyed during partitioning and similar elements are gathered together, which gives the data better usability and makes them more regular;

By splitting the partitioned data to reduce the links among them, the original values of the sensitive attributes and quasi-identifier attributes are left unchanged, so little data is lost, and privacy is better protected after publication, making the data safer;

By dividing equivalence class partitioning into two different modes, single equivalence class partitioning and candidate equivalence class partitioning, the data can be processed in a more targeted and more complete way, ensuring privacy protection for every record;

By computing the integrated value of every record and gathering the records with the best similarity into one equivalence class by integrated value, processing is simple and easy to implement and the data are arranged in a more orderly, more practical way, which benefits subsequent mining and integration of the data, greatly improves their usability and keeps information loss low throughout;

The split data are joined on a common attribute before publication. This common attribute can be a custom attribute defined from the partitioning result; redundant information is typically added to obscure the real information, achieving privacy protection with low data loss and a good protective effect.

Brief description of the drawings

To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the accompanying drawings are as follows:

Fig. 1 is a flowchart of the privacy protection method for data publishing provided by Embodiment 1 of the invention;

Fig. 2 is a flowchart of the privacy protection method for data publishing provided by Embodiment 2 of the invention;

Fig. 3 is a flowchart of the privacy protection method for data publishing provided by Embodiment 3 of the invention;

Fig. 4 is a block diagram of the privacy protection system for data publishing provided by Embodiment 4 of the invention.

Embodiment

The following specific embodiments further describe the technical solution of the invention with reference to the accompanying drawings; the invention is not limited to these embodiments.

Privacy-preserving data publishing mainly processes quasi-identifier attributes: whether generalization/suppression methods or clustering methods are used, it is the quasi-identifier attributes that are processed.

The invention mainly uses two techniques, clustering and splitting, to protect the privacy of published data, in particular static data.

Clustering gathers similar elements together by analysing the similarity of the data. Different scenarios have different requirements, so clustering algorithms also differ. Designing the clustering model is a key step in implementing the invention, and different clustering models have different clustering algorithms.

The splitting technique offers a new approach to privacy protection in data publishing. It does not change the original values of the sensitive attributes and quasi-identifier attributes, and protects privacy by weakening the link between them. Typically, splitting divides the quasi-identifier attributes (QI) and sensitive attributes (SA) of a data set into two unlinked data sets for publication; the data set can also be divided into more than two according to its correlation patterns.

The invention is described in more detail as follows:

Embodiment 1

As shown in Fig. 1, this embodiment provides a privacy protection method for data publishing, comprising the following steps:

S10: receive data and perform a diversity judgment on the number of distinct sensitive attribute values of the data. In this step, data requiring privacy protection, usually a data table, are received. The individual identifier attributes, quasi-identifier attributes and sensitive attributes of the table are first identified and the sensitive data in the original data are determined; that is, the sensitive attribute values are examined and the number of distinct values in the sensitive attribute column is counted. Denote this number S_c. The table is then divided differently depending on the size of S_c: typically a diversity parameter l is set, the relation between S_c and l is judged, and the records satisfying l-diversity are grouped. Some approaches set two diversity parameters l and l' with different values, divide the sensitive attributes into two classes (such as primary and auxiliary sensitive attributes) and then partition the data in a more targeted way.

S20: partition the data into equivalence classes according to the diversity judgment result. This step is the clustering process; depending on the diversity of the sensitive attribute values, different partitioning modes are executed. For example, if S_c ≥ l, single equivalence class partitioning is performed: every l-1 records in the table are grouped into one equivalence class, the records within a class having high similarity to one another. If records remain unclassified in the original table after single equivalence class partitioning finishes, candidate equivalence class partitioning is performed, checking whether there is a candidate equivalence class into which a remaining record can be placed; a candidate equivalence class is defined as a class that has high similarity to the record and contains the same sensitive attribute value. If S_c < l, candidate equivalence class partitioning is performed directly.

S30: split the partitioned data. In this step, the table clustered in step S20 is split into several separate tables according to the equivalence classes. The split can be made by attribute category (individual identifier, quasi-identifier, sensitive), which reduces the links between data and thereby protects privacy.

After the steps of this embodiment, the data table has low information distortion, a low probability of privacy leakage after publication and good later usability.

Embodiment 2

As shown in Fig. 2, this embodiment differs from the previous embodiment in that it provides a privacy protection method for data publishing with a specific clustering algorithm. The diversity judgment in step S10 is specifically: the number of distinct sensitive attribute values of the data is compared with a diversity parameter L, where L is a preset value. Typically, if there are for example 7 distinct sensitive attribute values, L is set to 7 divided by 2 rounded down to an integer, i.e. L = 3;

Step S20 comprises the following steps:

S21: if the number of distinct sensitive attribute values is greater than or equal to the diversity parameter L, single equivalence class partitioning is performed. During single equivalence class partitioning in this step, a loop checks whether the length of the equivalence class is less than L; once it is greater than L, the next equivalence class partitioning pass is started, ensuring that every equivalence class satisfies L-diversity.
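The diversity judgment described in this embodiment can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and integer floor division is an assumed reading of "7 divided by 2 ... 3" in the text above.

```python
from typing import Hashable, Iterable


def count_kinds(sensitive_values: Iterable[Hashable]) -> int:
    """Number of distinct sensitive attribute values in the column."""
    return len(set(sensitive_values))


def default_diversity_parameter(sensitive_values: Iterable[Hashable]) -> int:
    """Preset L as half the number of kinds, rounded down (assumption: floor)."""
    return count_kinds(sensitive_values) // 2


def choose_partition_mode(sensitive_values: Iterable[Hashable], L: int) -> str:
    """'single' when kinds >= L (S21), otherwise 'candidate' (S22)."""
    return "single" if count_kinds(sensitive_values) >= L else "candidate"


# Seven hypothetical sensitive values, all distinct.
column = ["a", "b", "c", "d", "e", "f", "g"]
L = default_diversity_parameter(column)
mode = choose_partition_mode(column, L)
```

Because L is described as a preset value, `choose_partition_mode` accepts it as an argument rather than always deriving it from the column.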

Embodiment 3

As shown in Fig. 3, this embodiment differs from Embodiment 1 in that, in step S10, data preprocessing is performed after the data are received and before the diversity judgment. The preprocessing takes into account that different attributes carry different weights during data publishing; clustering data by weight is simple to implement and better reflects the similarity of the data. It comprises the following steps:

S11: standardize each quasi-identifier attribute value, mapping it to the range [0, 1]. Numerical attribute values are mapped to [0, 1] using the range (min-max) standardization formula: new quasi-identifier attribute value = (original quasi-identifier attribute value - minimum) / (maximum - minimum);

Specifically, suppose the raw data table is T = <ID, QI₁, QI₂, ..., QI_n, SA>, with n quasi-identifier attributes and η records, i.e. η objects to be clustered, each with n features;

Suppose the domain of an attribute is [x_min, x_max]. The range standardization formula maps it to [0, 1] as follows:

$$x_{ij} = \frac{x'_{ij} - x_{\min}}{x_{\max} - x_{\min}}$$

where x_ij is the standardized quasi-identifier attribute value (x_ij has this meaning hereafter as well) and x'_ij is the value of the j-th quasi-identifier attribute (QI) of the i-th record, i.e. the original quasi-identifier attribute value;

For categorical attributes, the attribute values are first mapped to a sequence of natural numbers; for example, sex has two values, male and female, which are mapped to 1 and 2. The range standardization formula is then applied to map them to [0, 1]. Note that when there are only two categories, the domain of the attribute needs to be extended to [0, 3], which keeps the calculation error relatively small.
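A minimal Python sketch of step S11 follows. The function names are hypothetical. The Age domain [14, 60] is not stated in the text; it is an assumption chosen because it reproduces the Age column of Table 2 (for example 18 maps to about 0.087). The category order `["F", "M"]` is likewise assumed so that the output matches the Sex column of Table 2, and using the domain [1, k] for attributes with more than two categories is one reading of the text, not a stated rule.

```python
from typing import List


def min_max(value: float, lo: float, hi: float) -> float:
    """Range standardization: (value - lo) / (hi - lo), mapping [lo, hi] to [0, 1]."""
    if hi <= lo:
        raise ValueError("domain upper bound must exceed lower bound")
    return (value - lo) / (hi - lo)


def encode_category(value: str, categories: List[str]) -> float:
    """Map a category to 1..k by list position, then standardize.

    With exactly two categories the domain is widened to [0, 3], as the text
    prescribes; otherwise the domain [1, k] is assumed.
    """
    code = categories.index(value) + 1
    if len(categories) == 2:
        return min_max(code, 0, 3)
    return min_max(code, 1, len(categories))


age_18 = min_max(18, 14, 60)               # assumed Age domain [14, 60]
sex_f = encode_category("F", ["F", "M"])   # code 1 in domain [0, 3]
sex_m = encode_category("M", ["F", "M"])   # code 2 in domain [0, 3]
```

Widening the two-category domain keeps both codes away from the endpoints 0 and 1, so a binary attribute does not dominate the weighted sum purely because of its encoding.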

S12: compute the weight of each quasi-identifier attribute. Each quasi-identifier attribute occupies one column of the data table, and the variance of a column reflects how tightly its values are clustered. When an attribute with smaller variance and an attribute with larger variance change by the same amount, the attribute with larger variance loses less information; that is, the attribute with larger variance carries greater weight. The weights in this step are therefore computed from the variance, the calculation formula being as follows:

where V_j (1 ≤ j ≤ n) is the variance of each column, avg_j (1 ≤ j ≤ n) is the mean of each column and w_j (1 ≤ j ≤ n) is the weight of each column.

Further, the data preprocessing also comprises:

S14: sort the records in the data by integrated value, in ascending or descending order.

Suppose the raw data table is Table 1:

Table 1

Name	Age	Race	Sex	Disease
Alicy	21	Black	F	Flu
Lucy	45	White	F	HIV
Tom	36	Black	M	Gastritis
Helen	18	White	M	Obesity
David	56	White	M	Cancer
Bob	21	Black	M	Dyspepsia
Linda	43	Black	F	Gastritis

where {Name} is the individual identifier attribute, the attribute set {Age, Race, Sex} forms the quasi-identifier attributes and {Disease} is the sensitive attribute.

After the data preprocessing of this embodiment, the new data table is Table 2:

Table 2

Name	Age	Race	Sex	Disease	Integrated value
Helen	0.086956522	0.333	0.667	Obesity	0.277806565
Alicy	0.152173913	0.667	0.333	Flu	0.312889739
Bob	0.152173913	0.667	0.667	Dyspepsia	0.390053425
Lucy	0.673913043	0.333	0.333	HIV	0.516391444
Tom	0.47826087	0.667	0.667	Gastritis	0.565469295
Linda	0.630434783	0.667	0.333	Gastritis	0.570166348
David	0.913043478	0.333	0.667	Cancer	0.722193435

S201: in order of integrated value, select a preset number of records at a time and place them in the same equivalence class. Because this step partitions by similarity of integrated value, it clusters together the records with the greatest mutual similarity.

S202: determine whether any records remain that have not been processed by step S201; this ensures the completeness of the equivalence class partitioning.

Candidate equivalence class partitioning is specifically:

Determine whether a candidate equivalence class exists for the data, a candidate equivalence class being defined as the equivalence class with the smallest difference in integrated value that also has the same sensitive attribute value;

If one exists, the data are assigned to the best candidate equivalence class; if not, the data are assigned to the optimal equivalence class, defined as the equivalence class with the smallest difference in integrated value.
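Steps S201 to S203 and the candidate rule can be sketched in Python as follows, using the names, integrated values and diseases from Table 2. This is a hedged illustration, not the patented implementation: the function names are hypothetical, the class size of 2 is an assumed preset number, and measuring the "difference in integrated value" against the class mean is an assumption, since the text does not say how the difference to a class is computed.

```python
from statistics import mean
from typing import List, Tuple

Record = Tuple[str, float, str]  # (name, integrated value, sensitive value)


def place_leftover(rec: Record, classes: List[List[Record]]) -> None:
    """Put a leftover record into the best candidate class, else the closest class."""
    if not classes:
        classes.append([rec])
        return

    def gap(cls: List[Record]) -> float:
        # Assumed distance: |mean integrated value of the class - record's value|.
        return abs(mean(r[1] for r in cls) - rec[1])

    # Candidate classes already contain the record's sensitive value.
    candidates = [c for c in classes if any(r[2] == rec[2] for r in c)]
    min(candidates or classes, key=gap).append(rec)


def partition(records: List[Record], size: int) -> List[List[Record]]:
    """S201: chunk sorted records into classes of `size`; S202/S203: place the rest."""
    if size < 1:
        raise ValueError("size must be at least 1")
    ordered = sorted(records, key=lambda r: r[1])
    full = len(ordered) // size
    classes = [ordered[i * size:(i + 1) * size] for i in range(full)]
    for rec in ordered[full * size:]:
        place_leftover(rec, classes)
    return classes


table2 = [
    ("Helen", 0.277806565, "Obesity"), ("Alicy", 0.312889739, "Flu"),
    ("Bob", 0.390053425, "Dyspepsia"), ("Lucy", 0.516391444, "HIV"),
    ("Tom", 0.565469295, "Gastritis"), ("Linda", 0.570166348, "Gastritis"),
    ("David", 0.722193435, "Cancer"),
]
groups = partition(table2, size=2)
names = [[r[0] for r in g] for g in groups]
```

Under these assumptions the leftover record (David) has no candidate class, so it joins the class whose mean integrated value is closest, giving three groups that agree with the GroupID column of Tables 4 and 5.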

After the equivalence class partitioning of this embodiment is applied to Table 2 above, the data are as shown in Table 3:

Table 3

where GroupID is the equivalence class number; records with the same number belong to the same equivalence class.

After equivalence class partitioning finishes in this embodiment, the sensitive attributes and quasi-identifier attributes are separated using the splitting technique, and the data of Table 3 are split into two tables, Table 4 and Table 5:

Table 4

Age	Race	Sex	GroupID
18	White	M	1
21	Black	F	1
21	Black	M	2
45	White	F	2
36	Black	M	3
43	Black	F	3
56	White	M	3

Table 5

GroupID	Disease
1	Obesity
1	Flu
2	Dyspepsia
2	HIV
3	Gastritis
3	Gastritis
3	Cancer
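The split into a quasi-identifier table and a sensitive-attribute table can be sketched in Python as follows. The function name and the dictionary-based row layout are hypothetical; the three input rows combine values from Table 1 with the GroupID values of Tables 4 and 5 and stand in for the partitioned table.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def split_table(
    rows: List[Row],
    qi_cols: Sequence[str],
    sa_cols: Sequence[str],
    group_col: str = "GroupID",
) -> Tuple[List[Row], List[Row]]:
    """Return (QI table, SA table); both keep only the group number as a link.

    Columns not listed (such as the individual identifier) are dropped.
    """
    qi_table = [{c: r[c] for c in (*qi_cols, group_col)} for r in rows]
    sa_table = [{c: r[c] for c in (group_col, *sa_cols)} for r in rows]
    return qi_table, sa_table


partitioned = [
    {"Name": "Helen", "Age": 18, "Race": "White", "Sex": "M", "Disease": "Obesity", "GroupID": 1},
    {"Name": "Bob", "Age": 21, "Race": "Black", "Sex": "M", "Disease": "Dyspepsia", "GroupID": 2},
    {"Name": "David", "Age": 56, "Race": "White", "Sex": "M", "Disease": "Cancer", "GroupID": 3},
]
qi_table, sa_table = split_table(partitioned, ["Age", "Race", "Sex"], ["Disease"])
```

After the split, a row of the quasi-identifier table can only be tied to a disease at the granularity of its equivalence class, not to one specific record.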

After step S30 the method also comprises the following step:

S40: join the split data on a common attribute and then publish them.

In this step, the split data are joined by Cartesian product, so that the redundant information this generates protects privacy, without destroying the information and while keeping information loss low.
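The join of step S40 can be sketched in Python as a per-group Cartesian product of the rows of Tables 4 and 5, with GroupID as the common attribute. This is an illustrative sketch and does not claim to reproduce the layout of the published table; the tuple layout and variable names are assumptions.

```python
from typing import List, Tuple

QiRow = Tuple[int, str, str, int]   # (Age, Race, Sex, GroupID), as in Table 4
SaRow = Tuple[int, str]             # (GroupID, Disease), as in Table 5

qi_rows: List[QiRow] = [
    (18, "White", "M", 1), (21, "Black", "F", 1),
    (21, "Black", "M", 2), (45, "White", "F", 2),
    (36, "Black", "M", 3), (43, "Black", "F", 3), (56, "White", "M", 3),
]
sa_rows: List[SaRow] = [
    (1, "Obesity"), (1, "Flu"),
    (2, "Dyspepsia"), (2, "HIV"),
    (3, "Gastritis"), (3, "Gastritis"), (3, "Cancer"),
]

# Pair every QI row with every sensitive value of the same equivalence class.
joined = [
    (age, race, sex, group, disease)
    for (age, race, sex, group) in qi_rows
    for (sa_group, disease) in sa_rows
    if group == sa_group
]
```

Each quasi-identifier row is paired with every sensitive value in its class, so the true pairing is hidden among the redundant combinations while no original value is altered.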

The data table after joining by Cartesian product is shown in Table 6:

Table 6

The privacy protection method for data publishing provided by this embodiment arranges the data in a more orderly and more complete way, is more practical, can resist a variety of attacks and further protects the privacy of the published data.

Embodiment 4

As shown in Fig. 4, this embodiment provides a privacy protection system for data publishing, used to implement the privacy protection method for data publishing of Embodiments 1 and 2 above. The system comprises:

A judgment module 100, configured to receive data and perform a diversity judgment on the number of distinct sensitive attribute values of the data;

An equivalence class partitioning module 200, configured to partition the data into equivalence classes according to the diversity judgment result;

A splitting module 300, configured to split the partitioned data.

The system also comprises:

A publishing module 400, configured to join the split data on a common attribute and then publish them;

The judgment module 100 comprises:

A computing unit 110, configured to compute the integrated value of each record;

A sorting unit 120, configured to sort the records in the data by integrated value;

A comparing unit 130, configured to compare the number of distinct sensitive attribute values of the data with the diversity parameter L.

The equivalence class partitioning module 200 comprises:

A first partitioning unit 210, configured to perform single equivalence class partitioning;

A second partitioning unit 220, configured to perform candidate equivalence class partitioning;

The first partitioning unit 210 comprises:

A first judgment subunit 211, configured to determine whether any records remain that have not yet been selected, a preset number at a time in order of integrated value, into an equivalence class;

The second partitioning unit 220 comprises:

A second judgment subunit 221, configured to determine whether a candidate equivalence class exists for the data;

A classification subunit 222, configured to assign the data to the best candidate equivalence class if a candidate equivalence class exists, and to the optimal equivalence class otherwise.

This embodiment provides a privacy protection system for data publishing that uses a more practical clustering algorithm to cluster the data with the best similarity, so the integrity and usability of the data are higher. Data are also generalized by way of configured inputs, which reduces data loss over the whole process; after a reasonable split, the data are joined again so that the added redundant information obscures them, greatly improving the degree of privacy protection.

The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art to which the invention belongs may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.

Claims

1. A privacy protection method for data publishing, characterized in that the method comprises the following steps:

2. The privacy protection method for data publishing according to claim 1, characterized in that the diversity judgment on the sensitive attribute values of the data in step S10 is specifically: comparing the number of distinct sensitive attribute values of the data with a diversity parameter L;

Step S20 comprises the following steps:

3. The privacy protection method for data publishing according to claim 1, characterized in that, in step S10, data preprocessing is performed after the data are received and before the diversity judgment, comprising the following steps:

S12: compute the weight of each quasi-identifier attribute;

$$W_i = \sum_{j=1}^{n} w_j \times x_{ij}, \quad 1 \le i \le \eta$$

where W_i is the integrated value, w_j is the weight, x_ij is the standardized quasi-identifier attribute value, n is the number of quasi-identifier attributes and η is the number of records.

4. The privacy protection method for data publishing according to claim 3, characterized in that the data preprocessing also comprises:

S14: sort the records in the data by integrated value.

5. The privacy protection method for data publishing according to claim 4, characterized in that the equivalence class partitioning of the data in step S20 comprises the following steps:

S202: determine whether any records remain that have not been processed by step S201;

6. The privacy protection method for data publishing according to claim 2 or 5, characterized in that candidate equivalence class partitioning is specifically:

Determine whether a candidate equivalence class exists for the data; if so, assign the data to the best candidate equivalence class; if not, assign the data to the optimal equivalence class.

7. The privacy protection method for data publishing according to claim 1, characterized in that after step S30 the method also comprises the following step:

S40: join the split data on a common attribute and then publish them.

8. A privacy protection system for data publishing, characterized in that the system comprises:

9. The privacy protection system for data publishing according to claim 8, characterized in that the system also comprises:

The judgment module comprises:

A computing unit, configured to compute the integrated value of each record;

10. The privacy protection system for data publishing according to claim 8, characterized in that the equivalence class partitioning module comprises:

A first partitioning unit, configured to perform single equivalence class partitioning;

A second partitioning unit, configured to perform candidate equivalence class partitioning;

The first partitioning unit comprises:

A first judgment subunit, configured to determine whether any records remain that have not yet been selected, a preset number at a time in order of integrated value, into an equivalence class;

The second partitioning unit comprises:

A classification subunit, configured to assign the data to the best candidate equivalence class if a candidate equivalence class exists, and to the optimal equivalence class otherwise.