CN109446844A - Privacy protection method and system for big data publication - Google Patents
Privacy protection method and system for big data publication Download PDF Info
- Publication number
- CN109446844A CN109446844A CN201811356234.4A CN201811356234A CN109446844A CN 109446844 A CN109446844 A CN 109446844A CN 201811356234 A CN201811356234 A CN 201811356234A CN 109446844 A CN109446844 A CN 109446844A
- Authority
- CN
- China
- Prior art keywords
- value
- data
- scheme
- anonymization
- privacy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6227—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
Abstract
The invention discloses a privacy protection method and system for big data publication. The method first retrieves data according to the user's data requirement range; it then determines the user's security level from the user's identity and intended data use, and on that basis determines the corresponding anonymization scheme and initial privacy parameters. A privacy protection effect evaluation is then performed against the privacy protection requirement of the data provider and the data quality requirement of the user side. If the requirements are not met, the parameters are adjusted; if parameter adjustment is ineffective, the scheme itself is changed. After every adjustment, the privacy protection effect must be re-evaluated. Once the evaluation passes, the retrieved data to be published is given privacy protection processing according to the selected anonymization method and parameters, forming the final published data. With the present invention, the most suitable anonymization method and privacy parameters can be chosen, so that the processed data both achieves the privacy protection effect desired by the data provider and satisfies the user side's requirement on data availability.
Description
Technical field
The present invention relates to the field of information security technology, and in particular to a privacy protection method and system for big data publication.
Background art
In the fields of data trading and data sharing, how to provide data to those who need it without revealing privacy has become a difficult problem. To solve it, industry has proposed many privacy protection techniques for data publication. By implementation approach, the common techniques fall roughly into the following classes: data transformation methods, data anonymization methods, secure multi-party computation methods, and hybrid methods. Among these, anonymization methods have been the most widely applied because of their notable safety and effectiveness.
The best-known anonymization algorithm is k-anonymity. In 1998, Sweeney et al. first proposed the k-anonymity algorithm, which effectively prevents linking attacks, in which an attacker who holds public data learns private information by matching it against certain attributes in published records. k-anonymity requires the records in a data set to be divided into equivalence classes: within each equivalence class, the attributes that could reveal private information all take the same value, and each class contains at least k records, so the success probability of a linking attack is at most 1/k.
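The grouping requirement above can be sketched as a small check. This is an illustrative sketch only, not the patent's implementation; the record layout and attribute names are assumed for the example:

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """Check whether every equivalence class (tuples sharing the same
    quasi-identifier values) contains at least k records."""
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(size >= k for size in groups.values())

# Toy table: generalized ZIP code and age band act as quasi-identifiers.
table = [
    {"zip": "100*", "age": "20-30", "disease": "flu"},
    {"zip": "100*", "age": "20-30", "disease": "cold"},
    {"zip": "200*", "age": "30-40", "disease": "flu"},
    {"zip": "200*", "age": "30-40", "disease": "asthma"},
]
print(is_k_anonymous(table, ["zip", "age"], 2))  # True: each class holds 2 records
print(is_k_anonymous(table, ["zip", "age"], 3))  # False
```

With k = 2, an attacker matching on zip and age can narrow a target down only to a class of two records, so the linking-attack probability is at most 1/2.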
Machanavajjhala et al. proposed the l-diversity algorithm in 2006; it requires each equivalence class to contain at least l distinct sensitive attribute values. Later, Ninghui Li proposed the t-closeness algorithm, which requires that the difference between the distribution of sensitive attribute values within a group and their distribution in the whole data table not exceed a threshold t.
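Both constraints can be sketched similarly. The check below uses distinct-value l-diversity, and the total variation distance as a simple stand-in for the distribution difference (t-closeness is usually stated with the Earth Mover's Distance); names and data are illustrative:

```python
from collections import Counter

def is_l_diverse(equivalence_class, sensitive, l):
    """Distinct l-diversity: at least l different sensitive values per class."""
    return len({r[sensitive] for r in equivalence_class}) >= l

def distribution(rows, sensitive):
    counts = Counter(r[sensitive] for r in rows)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def closeness_distance(equivalence_class, table, sensitive):
    """Total variation distance between the in-class and global sensitive-value
    distributions (a simple stand-in for the EMD used by t-closeness)."""
    p = distribution(equivalence_class, sensitive)
    q = distribution(table, sensitive)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))

table = [{"disease": d} for d in ["flu", "flu", "cold", "asthma"]]
group = table[:2]                                  # both records carry "flu"
print(is_l_diverse(group, "disease", 2))           # False: only one distinct value
print(closeness_distance(group, "disease" and group, "disease") if False else
      closeness_distance(group, table, "disease")) # 0.5: far from the global distribution
```

The homogeneous group fails l-diversity for l = 2, and its distance of 0.5 from the global distribution would violate any t-closeness threshold below 0.5.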
Although the above methods achieve privacy-protected data publication to some extent, each has drawbacks. k-anonymity places no constraint on sensitive attribute data: when all records in a group share the same sensitive value, the sensitive information is uniquely determined and an attacker can obtain the private information with ease. That is, k-anonymity causes relatively little information loss and change to the data, but its degree of privacy protection is lower. l-diversity guarantees at least l distinct sensitive values in each equivalence class, but when one value accounts for a very large share of a group, that value is very likely to be inferred as the sensitive information, which can still leak privacy; l-diversity thus protects privacy more strongly than k-anonymity, but its information loss is also greater. t-closeness requires the distribution of sensitive values within a group to approximate their distribution in the whole data table, solving the problem of l-diversity; however, because the t-closeness privacy requirement is stricter, the resulting data quality is harder to bring up to the user's requirement than with the other two methods, and its applicability is narrower, so data satisfying the t-closeness constraint is difficult to use for applications such as data mining and data analysis. Since it protects privacy the most strongly, it is suited to higher-risk publication scenarios.
It can be seen that the three anonymization methods above each have their own advantages and limitations, and the privacy parameter of each method also affects the privacy protection effect and the data quality. In real environments, users raise different privacy protection requirements for data depending on their purpose of use, and the sensitivity of data types also varies, so a single anonymization method can hardly meet the privacy protection needs of data put to many different uses. For different privacy protection needs, how to choose the most appropriate method scientifically and rationally, and how to guarantee the privacy protection effect of the data by automatically finding optimal parameters, remain research questions without practically applicable results.
Summary of the invention
The object of the present invention is to provide a privacy protection method and system for big data publication that can choose the most suitable anonymization method and privacy parameters, so that the processed data both achieves the privacy protection effect desired by the data provider and satisfies the availability requirement of the data user.
To achieve the above object, the present invention provides the following scheme:
A privacy protection method for big data publication, the privacy protection method comprising:
Step 101: obtain the user's identity information, data requirement range, and data use description;
Step 102: determine the user security level according to the user's identity information and data use description, and retrieve the user-required data according to the user's data requirement range;
Step 103: determine an anonymization scheme and the initial privacy parameter value corresponding to that scheme according to the user security level and a security level to anonymization scheme lookup table; the anonymization schemes comprise directly providing the retrieved data, the k-anonymity scheme, the l-diversity scheme, and the t-closeness scheme; the privacy parameter of the k-anonymity scheme is the k value, that of the l-diversity scheme is the l value, and that of the t-closeness scheme is the t value;
Step 104: determine the privacy leakage probability and the data quality value according to the anonymization scheme and its initial privacy parameter value;
Step 105: judge whether the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, obtaining a first judgment result; the maximum privacy leakage threshold is provided by the data provider, and the data quality threshold is provided by the user side;
Step 106: if the first judgment result indicates that the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, apply the anonymization scheme and privacy parameter value that satisfied both thresholds to the retrieved user-required data to perform privacy protection processing, obtain the data to be published, and publish that data to the user;
Step 107: if the first judgment result indicates that the privacy leakage probability is greater than or equal to the maximum privacy leakage threshold, or that the data quality value is less than or equal to the data quality threshold, judge whether the privacy parameter value of the anonymization scheme can still be adjusted, obtaining a second judgment result;
Step 108: if the second judgment result indicates that the privacy parameter value can be adjusted, adjust it and return to step 104, re-determining the privacy leakage probability and data quality value with the adjusted parameter;
Step 109: if the privacy parameter value cannot be adjusted, judge whether the anonymization scheme can be changed, obtaining a third judgment result;
Step 110: if the third judgment result indicates that the anonymization scheme can be changed, change it and return to step 104, re-determining the privacy leakage probability and data quality value with the changed scheme and its corresponding privacy parameter value;
Step 111: if the third judgment result indicates that the anonymization scheme cannot be changed, lower the data quality threshold and return to step 104, stopping once the judgment condition of step 105 is met.
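Steps 104 to 111 amount to a search loop over schemes and parameters. A minimal sketch, assuming the privacy leakage probability and data quality evaluations are given as functions; the names, the linear search order, and the relaxation step are illustrative, not taken from the patent:

```python
def select_publication(schemes, leak_prob, quality, max_leak, quality_min,
                       min_quality_floor=0.0, step=0.05):
    """Steps 104-111: try each scheme's parameters in ascending order; if no
    (scheme, parameter) pair satisfies both thresholds, lower the data
    quality threshold and search again."""
    while quality_min >= min_quality_floor:
        for scheme, params in schemes:            # steps 109/110: next scheme
            for p in params:                      # steps 107/108: next parameter
                if leak_prob(scheme, p) < max_leak and quality(scheme, p) > quality_min:
                    return scheme, p, quality_min  # step 105 satisfied -> step 106
        quality_min -= step                       # step 111: relax data quality
    raise ValueError("no acceptable scheme/parameter combination")

# Toy evaluation model: larger parameters leak less but cost quality.
leak = lambda s, p: 1.0 / p
qual = lambda s, p: 1.0 - 0.1 * p
scheme, p, q = select_publication([("k-anonymity", [2, 3, 5, 10])],
                                  leak, qual, max_leak=0.25, quality_min=0.6)
print(scheme, p)  # k-anonymity 5
```

No parameter meets the original quality threshold of 0.6 here, so the loop relaxes the threshold until k = 5 passes both checks, mirroring step 111.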
Optionally, retrieving the user-required data according to the user's data requirement range specifically comprises:
using an SQL query statement to extract the data meeting the user's data requirement range from the source database and storing it in the form of a data table.
Optionally, the security level to anonymization scheme lookup table contains four correspondences: when the security level is 0, the anonymization scheme is to directly provide the retrieved data; when the security level is 1, the scheme is k-anonymity; when it is 2, l-diversity; and when it is 3, t-closeness.
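The four correspondences can be transcribed literally as a lookup table; the function name and error handling are illustrative:

```python
# Security level -> anonymization scheme, a literal transcription of the
# four correspondences described above.
SCHEME_TABLE = {
    0: "direct release of retrieved data",
    1: "k-anonymity",
    2: "l-diversity",
    3: "t-closeness",
}

def scheme_for(security_level):
    """Map a user security level to its anonymization scheme."""
    try:
        return SCHEME_TABLE[security_level]
    except KeyError:
        raise ValueError(f"unknown security level: {security_level}")

print(scheme_for(2))  # l-diversity
```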
Optionally, before judging whether the privacy parameter value of the anonymization scheme can be adjusted, the big data publication privacy protection method further comprises:
determining the value range of the privacy parameter of the anonymization scheme.
Optionally, determining the value range of the privacy parameter of the anonymization scheme specifically comprises:
determining a k value set, an l value set, and a t value set; the k value set is the value range of the k parameter in the k-anonymity scheme, the l value set is the value range of the l parameter in the l-diversity scheme, and the t value set is the value range of the t parameter in the t-closeness scheme.
Optionally, determining the k value set specifically comprises:
calculating the minimum of the k value set from the maximum privacy leakage threshold;
determining the data quality threshold and calculating the maximum of the k value set from it, the data quality threshold being determined under the premise that the data table meets the maximum privacy leakage threshold, where the data table is the storage table of the retrieved data;
determining the k value set from its minimum and maximum.
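Since a linking attack succeeds with probability at most 1/k, the smallest admissible k follows directly from the maximum privacy leakage threshold, while the largest admissible k is bounded by the data quality threshold. A sketch under the assumption that data quality is non-increasing in k; the patent does not specify the quality model, so the one below is illustrative:

```python
import math

def k_value_set(max_leak, quality_of, quality_min, k_cap=100):
    """k_min: smallest k with 1/k < max_leak.
    k_max: largest k (up to k_cap) whose anonymized table still exceeds the
    data quality threshold, assuming quality_of(k) is non-increasing in k."""
    k_min = math.floor(1 / max_leak) + 1          # guarantees 1/k_min < max_leak
    k_max = k_min - 1
    for k in range(k_min, k_cap + 1):
        if quality_of(k) <= quality_min:
            break
        k_max = k
    return list(range(k_min, k_max + 1))

# Toy quality model: each unit of k costs 5% data quality.
ks = k_value_set(max_leak=0.25, quality_of=lambda k: 1.0 - 0.05 * k, quality_min=0.6)
print(ks)  # [5, 6, 7]
```

With a 25% leakage ceiling, k must be at least 5; under the toy quality model, k beyond 7 drops quality to or below the threshold.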
Optionally, determining the l value set specifically comprises:
determining the maximum of the l value set by the information entropy method;
finding, in the retrieved data table, the equivalence class with the fewest sensitive attribute categories, and taking the number of categories it contains as the minimum of the l value set, where the data table is the storage table of the retrieved data;
determining the l value set from its minimum and maximum.
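The patent names the information entropy method without detail. Under the common entropy l-diversity reading, every equivalence class must have sensitive-value entropy of at least log2(l), so the largest feasible l is bounded by 2 raised to the smallest per-class entropy; the smallest l is the category count of the least diverse class. A sketch under that assumed reading:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of sensitive values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def l_value_set(equivalence_classes, sensitive):
    """l_min: sensitive-category count of the class with the fewest categories.
    l_max: floor(2**H_min), where H_min is the smallest per-class entropy
    (entropy l-diversity needs H >= log2(l) in every class)."""
    category_counts = [len({r[sensitive] for r in ec}) for ec in equivalence_classes]
    l_min = min(category_counts)
    h_min = min(entropy([r[sensitive] for r in ec]) for ec in equivalence_classes)
    l_max = math.floor(2 ** h_min)
    return list(range(l_min, l_max + 1))

classes = [
    [{"d": "flu"}, {"d": "cold"}, {"d": "asthma"}, {"d": "flu"}],
    [{"d": "flu"}, {"d": "cold"}],
]
print(l_value_set(classes, "d"))  # [2] -- the second class caps both ends
```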
Optionally, determining the t value set specifically comprises:
adjusting the data table so that the adjusted table meets the maximum privacy leakage threshold and the data quality threshold, and determining the equivalence class set of the adjusted table;
determining the t value set from that equivalence class set; the t value set is the set of all Di values, where Di denotes the distance between the distribution of sensitive attribute values in the i-th equivalence class and the global distribution.
Optionally, judging whether the privacy parameter value of the anonymization scheme can be adjusted specifically comprises:
stepping through the value range of the privacy parameter in ascending order, starting from its minimum; once the parameter value would exceed the maximum of the range, no further adjustment of that parameter is made.
A privacy protection system for big data publication, the privacy protection system comprising:
a user information acquisition module, for obtaining the user's identity information, data requirement range, and data use description;
a user security level determination module, for determining the user security level according to the user's identity information and data use description;
a user-required data determination module, for retrieving the user-required data according to the user's data requirement range;
an anonymization scheme and privacy parameter determination module, for determining an anonymization scheme and its initial privacy parameter value according to the user security level and the security level to anonymization scheme lookup table; the anonymization schemes comprise directly providing the retrieved data, the k-anonymity scheme, the l-diversity scheme, and the t-closeness scheme, whose privacy parameters are the k value, the l value, and the t value respectively;
a privacy leakage probability and data quality value determination module, for determining the privacy leakage probability and the data quality value according to the anonymization scheme and its initial privacy parameter value;
a first judgment result module, for judging whether the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, obtaining a first judgment result; the maximum privacy leakage threshold is provided by the data provider, and the data quality threshold is provided by the user side;
a data publication module, for, when the first judgment result indicates that the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, applying the anonymization scheme and privacy parameter value that satisfied both thresholds to the retrieved user-required data to perform privacy protection processing, obtaining the data to be published, and publishing that data to the user;
a second judgment result module, for, when the first judgment result indicates that the privacy leakage probability is greater than or equal to the maximum privacy leakage threshold or the data quality value is less than or equal to the data quality threshold, judging whether the privacy parameter value of the anonymization scheme can be adjusted, obtaining a second judgment result;
a privacy parameter adjustment module, for, when the second judgment result indicates that the privacy parameter value can be adjusted, adjusting it and returning to the privacy leakage probability and data quality value determination module;
a third judgment result module, for, when the privacy parameter value cannot be adjusted, judging whether the anonymization scheme can be changed, obtaining a third judgment result;
an anonymization scheme adjustment module, for, when the third judgment result indicates that the anonymization scheme can be changed, changing it and returning to the privacy leakage probability and data quality value determination module;
a data quality threshold reduction module, for, when the third judgment result indicates that the anonymization scheme cannot be changed, lowering the data quality threshold and returning to the privacy leakage probability and data quality value determination module.
According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:
The present invention provides a privacy protection method and system for big data publication. The method retrieves the user-required data according to the user's demand, determines an anonymization scheme and its corresponding privacy parameters according to the user security level, and uses the scheme and parameters to determine the privacy leakage probability and the data quality value. It then judges whether the privacy leakage probability is less than the maximum privacy leakage threshold and whether the data quality value is greater than the data quality threshold; if so, the anonymization scheme and its privacy parameters are applied directly to the retrieved user-required data for privacy protection and the processed data is published; otherwise, the anonymization scheme and its privacy parameters are adjusted, stopping once the condition is met. With the present invention, the most suitable anonymization method and privacy parameters can be chosen, so that the processed data both achieves the privacy protection effect desired by the data provider and satisfies the availability requirement of the data user.
Detailed description of the invention
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a flow diagram of the anonymization-based big data publication privacy protection method of an embodiment of the present invention;
Fig. 2 is a structural diagram of the privacy protection system for big data publication of an embodiment of the present invention;
Fig. 3 is a structural diagram of the big data publication privacy protection platform of the present invention;
Fig. 4 is a flow diagram of the k value selection method in the k-anonymity algorithm of the present invention;
Fig. 5 is a flow diagram of the l value selection method in the l-diversity algorithm of the present invention;
Fig. 6 is a flow diagram of the t value selection method in the t-closeness algorithm of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below in combination with the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The object of the present invention is to provide a privacy protection method and system for big data publication that can choose the most suitable anonymization method and privacy parameters, so that the processed data both achieves the privacy protection effect desired by the data provider and satisfies the availability requirement of the data user.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Explanation of terms
Anonymization: blurring data to achieve the purpose of privacy protection.
Explicit identifier attribute: an attribute that can uniquely identify a single individual, such as ID card number, name, or phone number.
Quasi-identifier attribute: an attribute that identifies an individual from one aspect, such as date of birth, gender, or address; taken together, quasi-identifier attributes may determine a specific individual.
Sensitive attribute: an attribute containing private information, such as health status, disease, or income.
Generalization: a typical technique for realizing anonymization. Its main idea is to reduce the precision of quasi-identifier attribute values, so that the number of tuples in the data table sharing the same quasi-identifier values increases.
Equivalence class/group: the set of tuples sharing the same quasi-identifier values; within an equivalence class, the probability that an attacker learns an individual's identity or sensitive information through the quasi-identifier attributes is greatly reduced.
k-anonymity algorithm: the k-anonymization algorithm.
l-diversity algorithm: the l-diversity algorithm.
t-closeness algorithm: the t-closeness algorithm.
Fig. 1 is a flow diagram of the anonymization-based big data publication privacy protection method of an embodiment of the present invention. As shown in Fig. 1, the method provided by the embodiment specifically includes the following steps.
Step 101: obtain the user's identity information, data requirement range, and data use description.
Step 102: determine the user security level according to the user's identity information and data use description, and retrieve the user-required data according to the user's data requirement range.
In the present invention, an SQL query statement is used to extract the data meeting the user's data requirement range from the source database and store it in the form of a data table.
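The retrieval step can be illustrated with Python's built-in sqlite3 module; the table name, columns, and requirement range are assumptions for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the source database
conn.execute("CREATE TABLE patients (name TEXT, age INTEGER, disease TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                 [("alice", 34, "flu"), ("bob", 52, "asthma"), ("carol", 29, "flu")])

# The user's data requirement range expressed as an SQL query; the result
# set is materialized as the working data table for later anonymization.
# The explicit identifier (name) is excluded from the projection.
data_table = conn.execute(
    "SELECT age, disease FROM patients "
    "WHERE age BETWEEN 25 AND 40 ORDER BY age").fetchall()
print(data_table)  # [(29, 'flu'), (34, 'flu')]
```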
Step 103: determine an anonymization scheme and the initial privacy parameter value corresponding to that scheme according to the user security level and the security level to anonymization scheme lookup table; the anonymization schemes comprise directly providing the retrieved data, the k-anonymity scheme, the l-diversity scheme, and the t-closeness scheme; the privacy parameter of the k-anonymity scheme is the k value, that of the l-diversity scheme is the l value, and that of the t-closeness scheme is the t value.
Step 104: determine the privacy leakage probability and the data quality value according to the anonymization scheme and its initial privacy parameter value.
Next, the privacy protection effect is evaluated.
Step 105: judge whether the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, obtaining a first judgment result. If the first judgment result indicates that the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, execute step 106; if it indicates that the privacy leakage probability is greater than or equal to the maximum privacy leakage threshold or the data quality value is less than or equal to the data quality threshold, execute step 107. The maximum privacy leakage threshold is provided by the data provider, and the data quality threshold is provided by the user side.
Step 106: apply the anonymization scheme and privacy parameter value that satisfied both thresholds to the retrieved user-required data to perform privacy protection processing, obtain the data to be published, and publish that data to the user.
Step 107: judge whether the privacy parameter value of the anonymization scheme can still be adjusted, obtaining a second judgment result. If the second judgment result indicates that the privacy parameter value can be adjusted, execute step 108; if it cannot be adjusted, execute step 109.
Before judging whether the privacy parameter of the anonymization scheme can be adjusted, the value range of that parameter must first be determined, i.e. the k value set, the l value set and the t value set. The k value set is the value range of the parameter k in the k-anonymity scheme, the l value set is the value range of the parameter l in the l-diversity scheme, and the t value set is the value range of the parameter t in the t-closeness scheme.
Determining the k value set specifically includes:
Calculating the minimum value of the k value set according to the maximum privacy leakage threshold.
Determining the data quality threshold, and calculating the maximum value of the k value set from it; the data quality threshold is determined under the premise that the data table satisfies the maximum privacy leakage threshold. The data table is the storage table of the retrieved data.
Determining the k value set from its minimum and maximum values.
Determining the l value set specifically includes:
Determining the maximum value of the l value set according to the information entropy.
Determining the equivalence class of the retrieved data table with the fewest sensitive-attribute categories, and taking the number of sensitive-attribute categories it contains as the minimum value of the l value set. The data table is the storage table of the retrieved data.
Determining the l value set from its minimum and maximum values.
Determining the t value set specifically includes:
Adjusting the data table so that the adjusted table satisfies the maximum privacy leakage threshold and the data quality threshold, and determining the equivalence class set of the adjusted table.
Determining the t value set from that equivalence class set; the t value set is the set of all values D_i, where D_i denotes the distance between the distribution of sensitive-attribute values in the i-th equivalence class and the global distribution.
Step 107 specifically includes: starting from the minimum of the value range of the privacy parameter of the anonymization scheme, the values are tried in ascending order; once the privacy parameter value would exceed the maximum of the value range, no further adjustment of the privacy parameter is performed.
Step 108: adjust the privacy parameter value of the anonymization scheme and return to step 104, redetermining the privacy leakage probability and the data quality value with the anonymization scheme and the adjusted privacy parameter value.
Step 109: judge whether the anonymization scheme itself can be adjusted, obtaining a third judgment result. If the third judgment result indicates that the scheme can be adjusted, execute step 110; if it cannot, execute step 111.
Step 110: adjust the anonymization scheme and return to step 104, redetermining the privacy leakage probability and the data quality value with the adjusted scheme and its corresponding privacy parameter value.
Step 111: reduce the data quality threshold and return to step 104, stopping when the judgment condition of step 105 is satisfied.
In an actual data publication scenario, the chosen anonymization scheme differs with the data volume and data items required by the data provider and the user side, with the data mining method, and with the required security. Across the three classical anonymization algorithms, k-anonymity, l-diversity and t-closeness, the privacy protection effect increases, but so does the damage done to the data. The k-anonymity algorithm changes the data least and gives the best data availability, but the weakest privacy protection; the t-closeness algorithm gives the strongest privacy protection but changes the data most, so its availability is worst; the l-diversity algorithm sits between the two in both respects.
No single anonymization algorithm can solve every privacy problem; each has its own advantages and disadvantages, and a trade-off must be made. It is therefore necessary to select an appropriate anonymization algorithm for each data purpose. Table 1 is the comparison table that determines the data security level from the user identity and data purpose and selects the corresponding anonymization processing scheme.
Table 1. Security level and anonymization processing scheme comparison table
It can be seen that the anonymization-based big data publication privacy protection method provided by the embodiment of the present invention offers controlled data access for different service objects and different data purposes, and better satisfies both the user side's data availability requirement and the data provider's privacy protection requirement.
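The correspondence of Table 1, as spelled out in claim 3 (level 0 through level 3), can be expressed as a simple lookup; the scheme name strings used below are informal labels chosen here for illustration.

```python
# Security level -> anonymization scheme, per Table 1 / claim 3.
SCHEME_TABLE = {
    0: "direct-release",   # directly provide the retrieved data
    1: "k-anonymity",
    2: "l-diversity",
    3: "t-closeness",
}

def choose_scheme(security_level):
    """Return the anonymization scheme for a user's security level."""
    return SCHEME_TABLE[security_level]
```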
To achieve the above object, the present invention also provides a privacy protection system for big data publication.
Fig. 2 is a structural diagram of the big data publication privacy protection system of the embodiment of the present invention. As shown in Fig. 2, the system provided by the invention comprises:
A user information acquisition module 1, for obtaining the user's identity information, data requirement range and data purpose.
A user security level determination module 2, for determining the user security level according to the user's identity information and data purpose.
A user-requested-data determination module 3, for retrieving the user-requested data according to the user's data requirement range.
An anonymization scheme and privacy parameter determination module 4, for determining the anonymization scheme and its corresponding initial privacy parameter value according to the user security level and the security-level/anonymization-scheme comparison table. The anonymization scheme is one of: directly providing the retrieved data, k-anonymity processing, l-diversity processing, and t-closeness processing; their respective privacy parameters are the k value, the l value and the t value.
A privacy leakage probability and data quality value determination module 5, for determining the privacy leakage probability and the data quality value according to the anonymization scheme and its corresponding initial privacy parameter value.
A first judgment module 6, for judging whether the privacy leakage probability is less than the maximum privacy leakage threshold and whether the data quality value is greater than the data quality threshold, obtaining a first judgment result. The maximum privacy leakage threshold is given by the data provider; the data quality threshold is given by the user side.
A data publication module 7, for, when the first judgment result indicates that the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, applying the corresponding anonymization scheme and privacy parameter value to the retrieved user-requested data, obtaining publishable data, and publishing them to the user.
A second judgment module 8, for, when the first judgment result indicates that the privacy leakage probability is greater than or equal to the maximum privacy leakage threshold or the data quality value is less than or equal to the data quality threshold, judging whether the privacy parameter value of the anonymization scheme can be adjusted, obtaining a second judgment result.
A privacy parameter adjustment module 9, for, when the second judgment result indicates that the privacy parameter value of the anonymization scheme can be adjusted, adjusting it and returning to the privacy leakage probability and data quality value determination module 5.
A third judgment module 10, for, when the privacy parameter value of the anonymization scheme cannot be adjusted, judging whether the anonymization scheme itself can be adjusted, obtaining a third judgment result.
An anonymization scheme adjustment module 11, for, when the third judgment result indicates that the anonymization scheme can be adjusted, adjusting the anonymization scheme and returning to the privacy leakage probability and data quality value determination module 5.
A data quality threshold reduction module 12, for, when the third judgment result indicates that the anonymization scheme cannot be adjusted, reducing the data quality threshold and returning to the privacy leakage probability and data quality value determination module 5.
Fig. 3 is a structural diagram of the big data publication privacy protection platform of the present invention. As shown in Fig. 3, the user first logs in or registers, and the platform identifies the user's identity; the user then submits a data requirement range and data purpose. After the data layer receives the user's data requirement, it executes an SQL query and retrieves the data the user needs. The security level judgment module determines a security level for the user from the user's identity and data purpose, and passes the level information to the privacy protection processing part. According to the user's security level, the privacy protection processing part selects a suitable anonymization scheme from the three anonymization algorithms of differing protection strength. After the anonymization scheme is determined, its privacy parameter is selected. The protection effect is then evaluated against the data provider's privacy constraints; if the requirements are not met, the privacy parameter is adjusted and the evaluation repeated. Once parameter adjustment is complete, the privacy protection data processing module processes the retrieved data with the selected anonymization scheme and privacy parameter, forming publishable data, which the data publication module finally distributes to the user.
Choosing the k value in the k-anonymity algorithm.
The influence of the k value on the data table (the table storing the retrieved data, called the k-anonymous table in the k-anonymity scheme) is as follows. The larger k is, the larger each equivalence class in the k-anonymous table; to satisfy the k-anonymity constraint, more attribute values must be generalized (over wider ranges), so the data quality is worse. At the same time, each equivalence class corresponds to more entities, so the probability of guessing any entity's sensitive information is smaller and the privacy protection is stronger. The smaller k is, the smaller the equivalence classes in the k-anonymous table; fewer attribute values need to be generalized to satisfy the constraint, so the data quality is better. At the same time, each equivalence class corresponds to fewer entities, so the probability of guessing an entity's sensitive information is larger and the privacy protection weaker.
The choice of k is therefore extremely important and must balance the degree of privacy protection (the maximum privacy leakage threshold, given by the data provider) against data availability (the data quality threshold, given by the data demander, i.e. the user side); a poor choice degrades the privacy protection effect. The embodiment of the invention therefore provides a k-value selection method whose goal is a k that satisfies the privacy protection requirement and the data availability requirement simultaneously. To measure the degree of privacy protection, the embodiment proposes the maximum privacy leakage threshold P_max. Its calculation is the ratio of the maximum repetition count of a sensitive attribute value in an equivalence class of the k-anonymous table to the number of tuples in that class; its meaning is that the probability of deducing an entity's private information from any equivalence class must not exceed P_max.
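The leakage measure just described, the most repeated sensitive value in an equivalence class divided by the class size, maximized over all classes, can be sketched directly; the representation of classes as plain lists of sensitive values is an assumption for illustration.

```python
from collections import Counter

def leakage_probability(equivalence_classes):
    """Table-level privacy leakage probability of a k-anonymous table:
    for each equivalence class, the ratio of the most frequent sensitive
    value's count to the class size; the table value is the maximum.

    equivalence_classes -- list of lists of sensitive attribute values.
    """
    worst = 0.0
    for cls in equivalence_classes:
        top = Counter(cls).most_common(1)[0][1]  # most repeated sensitive value
        worst = max(worst, top / len(cls))
    return worst
```

A table satisfies the provider's constraint when this value is below P_max.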
Fig. 4 is a flow diagram of the k-value selection method in the k-anonymity algorithm of the present invention. As shown in Fig. 4, the k-value selection method includes:
Step 1: determine the maximum privacy leakage threshold. The maximum privacy leakage threshold P_max is given by the data provider.
Step 2: obtain the minimum k value; specifically, calculate the minimum of the k value set from the maximum privacy leakage threshold P_max. The minimum of the k value set is K_min = ⌈T / P_max⌉, where T denotes the maximum repetition count of a sensitive attribute value in an equivalence class of the data table and K_min denotes the minimum value of the k value set.
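Given the definition of P_max, requiring that the inference probability T/k not exceed P_max yields the bound K_min = ⌈T / P_max⌉ (the formula itself is an image in the original publication, so this is a reconstruction from the surrounding definitions):

```python
import math

def k_min(t_max_repeats, p_max):
    """K_min = ceil(T / P_max): the smallest k such that T repetitions of
    a sensitive value inside an equivalence class of size k keep the
    inference probability T/k at or below P_max."""
    return math.ceil(t_max_repeats / p_max)
```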
Step 3: determine the data quality threshold, i.e. the data quality threshold mentioned above.
As the measure of data availability, the embodiment of the present invention uses the discernibility metric as the evaluation criterion for data quality. To calculate the relationship between the parameter k and the data quality more accurately, the embodiment improves the data quality formula to C_DM = Σ_{i=1}^{N} Q_i², where Q_i denotes the scale of the i-th equivalence class in the data table and N is the number of equivalence classes. The smaller the C_DM value, the more uniform the equivalence class scales in the table and the better the table's data quality; the larger the C_DM value, the more the class scales differ and the worse the data quality. The table is processed to satisfy the K_min constraint, and the resulting initial C_DM value is taken as the data quality threshold.
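A minimal sketch of the quality measure, assuming the discernibility metric takes the classical sum-of-squared-equivalence-class-sizes form (the exact "improved" formula is an image not reproduced in the text):

```python
def discernibility(class_sizes):
    """Discernibility metric C_DM = sum of squared equivalence-class sizes.
    For a fixed number of tuples, smaller values correspond to more uniform
    classes and hence better data quality."""
    return sum(q * q for q in class_sizes)
```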
Step 4: obtain the maximum k value; specifically, calculate the maximum of the k value set from the data quality threshold: K_max is the largest k for which the table's C_DM value does not exceed the data quality threshold.
Step 5: obtain the optimal k value; specifically, from the minimum and maximum of the k value set, determine the selection range [K_min, K_max]. When k takes its minimum, the data quality of the table is best; when k takes its maximum, the privacy protection of the table is strongest. Starting from the minimum of the k value set, k is gradually increased without exceeding the maximum, until the requirements are met, yielding the optimal k value.
Note that if K_max is less than K_min, the data availability constraint and the privacy protection constraint cannot be satisfied simultaneously.
Choosing the l value in the l-diversity algorithm.
For the selection of the l value in the l-diversity algorithm, the prevailing method determines l from the information entropy. On the basis of the entropy-based method, the embodiment of the present invention proposes a selection method that also satisfies the given privacy protection constraint (the maximum privacy leakage threshold, given by the data provider) and the data quality constraint (the data quality threshold, given by the data demander, i.e. the user side).
Fig. 5 is a flow diagram of the l-value selection method in the l-diversity algorithm of the present invention. As shown in Fig. 5, the l-value selection method includes:
Step 1: calculate the maximum l value.
The maximum of the l value set is determined from the information entropy as follows. Let A be the quasi-identifier attributes and SA the sensitive attribute, with sensitive values S = {s_i, …, s_j} and equivalence class set E = {E_i, …, E_j}. With p(E_i, s) the frequency with which sensitive value s occurs in equivalence class E_i, the entropy of E_i is H(E_i) = -Σ_{s∈S} p(E_i, s) log p(E_i, s), and the maximum of the l value set is the largest l satisfying log l ≤ min_i H(E_i).
The information entropy reflects the distribution of the attribute: the larger the entropy, the more uniform the distribution of sensitive values in the equivalence class, and the harder it is to deduce a specific individual. From the above formula, the maximum of the l value set is obtained.
The minimum of the l value set is the number of sensitive-attribute categories contained in the equivalence class of the data table with the fewest such categories.
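Assuming the entropy l-diversity condition H(E_i) ≥ log l with natural logarithms (the logarithm base is not stated in the text), the entropy-based maximum l can be sketched as:

```python
import math
from collections import Counter

def entropy(cls):
    """Shannon entropy H(E_i) of the sensitive values in one equivalence
    class, using natural logarithms."""
    n = len(cls)
    return -sum((c / n) * math.log(c / n) for c in Counter(cls).values())

def l_max(equivalence_classes):
    """Entropy l-diversity requires H(E_i) >= log(l) for every class, so
    the largest admissible l is floor(exp(min_i H(E_i))).  A tiny epsilon
    guards against floating-point round-off."""
    h_min = min(entropy(c) for c in equivalence_classes)
    return math.floor(math.exp(h_min) + 1e-9)
```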
Step 2: starting from the minimum of the l value set, the value is gradually increased, terminating at the maximum. Specifically:
Judge whether the data table satisfies the l-diversity constraints, which comprise the privacy protection constraint and the data quality constraint. If satisfied, the privacy parameter adjustment ends; the privacy parameter is the l value obtained at this step. If not satisfied, modify the data table so that it satisfies the l-diversity constraint, i.e. add tuples whose quasi-identifier values equal those of the equivalence class with the fewest sensitive-attribute categories. The sensitive values of the added tuples are categories that occur in the data table but not in that equivalence class, until the table satisfies the l-diversity constraint.
Then calculate the data quality value of the table (its C_DM value) and judge whether it is less than the data quality threshold of the l-diversity data quality constraint. If so, the data availability of the l-diversity-processed table meets the user's requirement and the privacy parameter adjustment ends. If not, l is increased by 1, up to the maximum value.
If l has been tried up to the maximum value and the conditions are still not satisfied, the privacy parameter selection fails.
A privacy parameter obtained by this l-value selection method satisfies the given privacy protection constraint and the data quality constraint simultaneously. If the selection fails, the privacy protection constraint and the data quality constraint conflict, and the l-value selection method cannot satisfy both at once.
Choosing the t value in the t-closeness algorithm.
The l-diversity algorithm is a great improvement over the k-anonymity algorithm, but it still cannot avoid similarity attacks. A similarity attack means that when a certain sensitive value's proportion is too large, an attacker can deduce an individual's private information with high probability. Against this shortcoming, the t-closeness algorithm further considers the distribution of sensitive values: it requires that the difference between the distribution of sensitive values within any equivalence class and the distribution of that attribute over the entire data table not exceed a pre-given threshold t, thereby solving the similarity attack problem.
As with the privacy parameter selection method of the l-diversity algorithm, the goal is a t value satisfying the data availability constraint. The parameter t is adjusted iteratively until C_DM falls below the initial value (the C_DM value of the table satisfying the minimal t-closeness constraint); the data availability of the table then meets the user's requirement.
Fig. 6 is a flow diagram of the t-value selection method in the t-closeness algorithm of the present invention. As shown in Fig. 6, the t-value selection method includes:
Step 1: determine the equivalence class set. Let the data table satisfy the k-anonymity constraint, with equivalence class set E = {E_i, …, E_j}; P is the set of all values D_i, where D_i denotes the distance between the distribution of sensitive values in the i-th equivalence class and the global distribution.
Step 2: determine the t value set. Let D_min = min{D_i} and D_max = max{D_i}.
The distance is measured with EMD (Earth Mover's Distance), an algorithm for measuring the difference between distributions; here it denotes the minimum cost required to transform one distribution into the other.
Step 3: perform t-closeness processing on the user-requested data and calculate the data quality value; specifically, let the table satisfy the t-closeness constraint with t = D_min and calculate the table's data quality value.
Step 4: judge whether the table's data quality value is less than the data quality threshold. If so, the data availability constraint is met and the privacy parameter selection ends with the current value of t (initially D_min).
Step 5: if not satisfied, t is increased by D_min and the method returns to step 3, stopping when t ≥ D_max. If the condition is still unmet when t ≥ D_max, the privacy parameter selection fails.
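The t-selection loop of steps 3 to 5 can be sketched as follows; `d_quality` is a placeholder for the C_DM computation on the table processed under a given t-closeness constraint, and the epsilon on the loop bound is an assumption to tolerate floating-point accumulation.

```python
def choose_t(distances, d_quality, q_threshold):
    """Sketch of the t-selection loop: starting from D_min, increase t in
    steps of D_min until the table satisfying that t-closeness constraint
    meets the data quality threshold, or t passes D_max.

    distances   -- the D_i values of all equivalence classes
    d_quality   -- function t -> C_DM value of the table under that t
    q_threshold -- data quality threshold given by the user side
    """
    d_min, d_max = min(distances), max(distances)
    t = d_min
    while t <= d_max + 1e-12:
        if d_quality(t) < q_threshold:
            return t          # selection succeeds with the current t
        t += d_min
    return None               # selection fails
```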
Because the t-closeness constraint is stricter, its data quality requirement is harder to satisfy than with the other two methods, and its applicability is narrower. Data satisfying the t-closeness constraint are therefore difficult to use for applications such as data mining and data analysis; but because its protection of privacy is the strongest, it suits data publication scenarios with higher risk.
Specific examples are used herein to illustrate the principle and implementation of the present invention; the above embodiments are merely intended to help understand the method of the present invention and its core idea. Meanwhile, those skilled in the art may, according to the idea of the present invention, make changes in the specific implementation and the application scope. In conclusion, the content of this specification shall not be construed as limiting the present invention.
Claims (10)
1. A privacy protection method for big data publication, characterized in that the privacy protection method comprises:
Step 101: obtaining a user's identity information, data requirement range and data purpose;
Step 102: determining a user security level according to the user's identity information and data purpose, and retrieving user-requested data according to the user's data requirement range;
Step 103: determining an anonymization scheme and its corresponding initial privacy parameter value according to the user security level and a security-level/anonymization-scheme comparison table; the anonymization scheme comprising directly providing the retrieved data, k-anonymity processing, l-diversity processing and t-closeness processing; the privacy parameter value of the k-anonymity scheme being the k value, that of the l-diversity scheme being the l value, and that of the t-closeness scheme being the t value;
Step 104: determining a privacy leakage probability and a data quality value according to the anonymization scheme and its corresponding initial privacy parameter value;
Step 105: judging whether the privacy leakage probability is less than a maximum privacy leakage threshold and whether the data quality value is greater than a data quality threshold, obtaining a first judgment result; wherein the maximum privacy leakage threshold is given by the data provider and the data quality threshold is given by the user side;
Step 106: if the first judgment result indicates that the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, performing privacy protection processing on the retrieved user-requested data with the corresponding anonymization scheme and privacy parameter value, obtaining publishable data, and publishing the publishable data to the user;
Step 107: if the first judgment result indicates that the privacy leakage probability is greater than or equal to the maximum privacy leakage threshold or the data quality value is less than or equal to the data quality threshold, judging whether the privacy parameter value of the anonymization scheme can be adjusted, obtaining a second judgment result;
Step 108: if the second judgment result indicates that the privacy parameter value can be adjusted, adjusting it and returning to step 104, redetermining the privacy leakage probability and the data quality value with the anonymization scheme and the adjusted privacy parameter value;
Step 109: if the privacy parameter value of the anonymization scheme cannot be adjusted, judging whether the anonymization scheme can be adjusted, obtaining a third judgment result;
Step 110: if the third judgment result indicates that the anonymization scheme can be adjusted, adjusting the anonymization scheme and returning to step 104, redetermining the privacy leakage probability and the data quality value with the adjusted scheme and its corresponding privacy parameter value;
Step 111: if the third judgment result indicates that the anonymization scheme cannot be adjusted, reducing the data quality threshold and returning to step 104, stopping when the judgment condition of step 105 is satisfied.
2. The privacy protection method according to claim 1, characterized in that retrieving the user-requested data according to the user's data requirement range specifically comprises:
using an SQL query statement, extracting the data meeting the user's data requirement range from the source database and storing them in the form of a data table.
3. The privacy protection method according to claim 1, characterized in that the security-level/anonymization-scheme comparison table comprises four correspondences: when the security level is 0, the anonymization scheme is directly providing the retrieved data; when the security level is 1, the anonymization scheme is k-anonymity processing; when the security level is 2, the anonymization scheme is l-diversity processing; when the security level is 3, the anonymization scheme is t-closeness processing.
4. The privacy protection method according to claim 1, characterized in that, before judging whether the privacy parameter value of the anonymization scheme can be adjusted, the big data publication privacy protection method further comprises:
determining the value range of the privacy parameter value of the anonymization scheme.
5. The privacy protection method according to claim 4, characterized in that determining the value range of the privacy parameter value of the anonymization scheme specifically comprises:
determining a k value set, an l value set and a t value set; the k value set being the value range of the parameter k in the k-anonymity scheme, the l value set being the value range of the parameter l in the l-diversity scheme, and the t value set being the value range of the parameter t in the t-closeness scheme.
6. The privacy protection method according to claim 5, characterized in that determining the k value set specifically comprises:
calculating the minimum value of the k value set according to the maximum privacy leakage threshold;
determining the data quality threshold, and calculating the maximum value of the k value set from it, the data quality threshold being determined under the premise that the data table satisfies the maximum privacy leakage threshold, the data table being the storage table of the retrieved data; and
determining the k value set from its minimum and maximum values.
7. The privacy protection method according to claim 5, wherein determining the l value set specifically comprises:
determining the maximum value of the l value set according to information entropy;
determining the equivalence class with the fewest sensitive-attribute categories in the retrieved data table, and taking the number of sensitive-attribute categories contained in that equivalence class as the minimum value of the l value set, wherein the data table is the storage table of the retrieved data; and
determining the l value set according to the minimum value and the maximum value of the l value set.
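A minimal sketch of the claim-7 bounds. The claim only says the maximum comes from "information entropy"; the concrete bound below (entropy l-diversity requires H(S) ≥ log l, hence l ≤ exp(H) with natural logarithms) is an assumption:

```python
import math
from collections import Counter

def l_value_set(equivalence_classes, sensitive_global):
    """Candidate l values for the l-diversity scheme (sketch of claim 7).

    - Maximum: entropy l-diversity needs H(S) >= log(l) for the sensitive
      attribute S, so l is capped by floor(exp(H)) of the table-wide
      sensitive-value distribution (natural log assumed).
    - Minimum: the number of distinct sensitive values in the equivalence
      class that contains the fewest of them, as stated in the claim.
    """
    counts = Counter(sensitive_global)
    n = sum(counts.values())
    entropy = -sum(c / n * math.log(c / n) for c in counts.values())
    l_max = math.floor(math.exp(entropy))
    l_min = min(len(set(ec)) for ec in equivalence_classes)
    return list(range(l_min, l_max + 1))
```

The `equivalence_classes` argument (lists of sensitive values per class) and the flat `sensitive_global` column are illustrative data shapes, not structures named by the patent.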
8. The privacy protection method according to claim 6, wherein determining the t value set specifically comprises:
adjusting the data table so that the adjusted data table meets the maximum privacy leakage threshold and the data quality threshold, and determining the equivalence class set of the adjusted data table; and
determining the t value set according to the equivalence class set, wherein the t value set is the set of all Di values, and Di denotes the distance between the distribution of sensitive attribute values in the i-th equivalence class and the global distribution.
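The Di computation can be sketched as below. The claim only says "distance"; t-closeness conventionally uses the Earth Mover's Distance, which for categorical values under a uniform ground distance reduces to total variation distance — that reduction is the assumption made here:

```python
from collections import Counter

def t_value_set(equivalence_classes):
    """One candidate t value per equivalence class (sketch of claim 8):
    D_i = distance between the class's sensitive-value distribution and
    the table-wide (global) distribution, using total variation distance
    as an assumed stand-in for the Earth Mover's Distance.
    """
    all_vals = [v for ec in equivalence_classes for v in ec]
    n = len(all_vals)
    global_p = {v: c / n for v, c in Counter(all_vals).items()}
    d_values = []
    for ec in equivalence_classes:
        local = Counter(ec)
        m = len(ec)
        # total variation: half the L1 distance between the two distributions
        d = 0.5 * sum(abs(local.get(v, 0) / m - p) for v, p in global_p.items())
        d_values.append(d)
    return d_values
```

A table t-satisfies t-closeness exactly when every returned Di is at most t, so the set of Di values is a natural candidate range for t.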
9. The privacy protection method according to claim 4, wherein judging whether the privacy parameter value corresponding to the anonymization scheme can be adjusted specifically comprises:
within the value range of the privacy parameter value corresponding to the anonymization scheme, trying values successively in ascending order starting from the minimum value of the range; once the privacy parameter value corresponding to the anonymization scheme would exceed the maximum value of the range, no further adjustment of that privacy parameter value is performed.
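The ascending scan of claim 9 amounts to a simple bounded search; in this sketch the `meets_constraints` callback (leakage below threshold and quality above threshold) is an illustrative assumption, since the claims do not fix its form:

```python
def adjust_parameter(value_range, meets_constraints):
    """Try privacy-parameter values from the minimum of the range upward
    (sketch of claim 9). Returns the first admissible value, or None once
    the range is exhausted -- i.e. the parameter can no longer be adjusted.
    """
    for value in sorted(value_range):
        if meets_constraints(value):
            return value  # smallest value that satisfies both thresholds
    return None

# e.g. the smallest k in [2..10] whose 1/k leakage is below 0.3:
best = adjust_parameter(range(2, 11), lambda k: 1.0 / k < 0.3)
```

Scanning from the minimum upward favours the least-restrictive admissible parameter, which keeps data quality as high as possible.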
10. A privacy protection system for big data publication, characterized in that the privacy protection system comprises:
a user information acquisition module, configured to acquire the identity information, data requirement range, and data usage description of a user;
a user security level determination module, configured to determine the user security level according to the identity information and data usage description of the user;
a user-requested-data determination module, configured to retrieve the user-requested data according to the data requirement range of the user;
an anonymization scheme and privacy parameter determination module, configured to determine an anonymization scheme and its corresponding initial privacy parameter value according to the user security level and a comparison table of security levels and anonymization processing schemes, wherein the anonymization scheme comprises a scheme of directly providing the retrieved data, a k-anonymity processing scheme, an l-diversity processing scheme, and a t-closeness processing scheme; the privacy parameter value corresponding to the k-anonymity processing scheme is the k value, the privacy parameter value corresponding to the l-diversity processing scheme is the l value, and the privacy parameter value corresponding to the t-closeness processing scheme is the t value;
a privacy leakage probability and data quality value determination module, configured to determine the privacy leakage probability and the data quality value according to the anonymization scheme and its corresponding initial privacy parameter value;
a first judgment result obtaining module, configured to judge whether the privacy leakage probability is less than a maximum privacy leakage threshold and whether the data quality value is greater than a data quality threshold, to obtain a first judgment result, wherein the maximum privacy leakage threshold is provided by the data provider and the data quality threshold is provided by the user side;
a publication data publishing module, configured to, when the first judgment result indicates that the privacy leakage probability is less than the maximum privacy leakage threshold and the data quality value is greater than the data quality threshold, perform privacy protection processing on the retrieved user-requested data using the anonymization scheme and corresponding privacy parameter value under which those two conditions hold, to obtain publication data, and publish the publication data to the user;
a second judgment result obtaining module, configured to, when the first judgment result indicates that the privacy leakage probability is greater than or equal to the maximum privacy leakage threshold or the data quality value is less than or equal to the data quality threshold, judge whether the privacy parameter value corresponding to the anonymization scheme can be adjusted, to obtain a second judgment result;
a privacy parameter adjustment module, configured to, when the second judgment result indicates that the privacy parameter value corresponding to the anonymization scheme can be adjusted, adjust the privacy parameter value corresponding to the anonymization scheme and return to the privacy leakage probability and data quality value determination module;
a third judgment result obtaining module, configured to, when the privacy parameter value corresponding to the anonymization scheme cannot be adjusted, judge whether the anonymization scheme can be adjusted, to obtain a third judgment result;
an anonymization scheme adjustment module, configured to, when the third judgment result indicates that the anonymization scheme can be adjusted, adjust the anonymization scheme and return to the privacy leakage probability and data quality value determination module; and
a data quality threshold reduction module, configured to, when the third judgment result indicates that the anonymization scheme cannot be adjusted, reduce the data quality threshold and return to the privacy leakage probability and data quality value determination module.
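The module chain of claim 10 forms a three-level fallback loop: adjust the parameter, then the scheme, then relax the user-side quality threshold. A condensed sketch of that control flow follows; the `evaluate` callback (returning a leakage/quality pair for a scheme and parameter) and the 10% threshold-reduction step are illustrative assumptions, not part of the claims.

```python
def publish(data, schemes, max_leakage, quality_threshold, evaluate):
    """Sketch of the claim-10 control flow.

    `schemes` is an ordered list of (scheme_name, candidate_parameters);
    `evaluate` maps (data, scheme, parameter) to (leakage, quality).
    The nesting mirrors the modules: inner loop = privacy-parameter
    adjustment, middle loop = anonymization-scheme adjustment, outer
    loop = data-quality-threshold reduction.
    """
    while True:
        for scheme, param_values in schemes:
            for param in param_values:
                leakage, quality = evaluate(data, scheme, param)
                if leakage < max_leakage and quality > quality_threshold:
                    return scheme, param  # publish with this configuration
        # no scheme/parameter pair worked: relax the quality threshold
        quality_threshold *= 0.9  # assumed reduction step
```

Note the sketch loops forever if the leakage condition can never be met; a production version would bound the number of threshold reductions.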
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811356234.4A CN109446844B (en) | 2018-11-15 | 2018-11-15 | Privacy protection method and system for big data release |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109446844A true CN109446844A (en) | 2019-03-08 |
CN109446844B CN109446844B (en) | 2020-06-05 |
Family
ID=65553616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811356234.4A Active CN109446844B (en) | 2018-11-15 | 2018-11-15 | Privacy protection method and system for big data release |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109446844B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103596133A (en) * | 2013-11-27 | 2014-02-19 | 哈尔滨工业大学深圳研究生院 | Location anonymous method and device for continuous queries and privacy protection system |
CN105512566A (en) * | 2015-11-27 | 2016-04-20 | 电子科技大学 | Health data privacy protection method based on K-anonymity |
WO2017008144A1 (en) * | 2015-07-15 | 2017-01-19 | Privacy Analytics Inc. | Re-identification risk measurement estimation of a dataset |
CN107145796A (en) * | 2017-04-24 | 2017-09-08 | 公安海警学院 | Track data k anonymities method for secret protection under a kind of uncertain environment |
CN107707530A (en) * | 2017-09-12 | 2018-02-16 | 福建师范大学 | A kind of method for secret protection and system of mobile intelligent perception |
CN108133146A (en) * | 2017-06-01 | 2018-06-08 | 徐州医科大学 | Sensitive Attributes l-diversity method for secret protection based on secondary division |
CN108268786A (en) * | 2016-12-30 | 2018-07-10 | 广东精点数据科技股份有限公司 | A kind of data desensitization technology based on T-Closeness algorithms |
CN108540936A (en) * | 2017-12-18 | 2018-09-14 | 西安电子科技大学 | Method for secret protection based on prediction |
Non-Patent Citations (1)
Title |
---|
SONG Jinling et al., "Optimal selection algorithm for the K value in the K-anonymity privacy protection model", Journal of Chinese Computer Systems * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008742A (en) * | 2019-03-21 | 2019-07-12 | 九江学院 | It is a kind of to regularly publish the anonymous guard method of the leakage of the efficient Q value zero in private data for SRS |
CN110543782A (en) * | 2019-07-10 | 2019-12-06 | 暨南大学 | Method and system for realizing desensitization of data set based on k-anonymity algorithm |
CN110543782B (en) * | 2019-07-10 | 2022-03-29 | 暨南大学 | Method and system for realizing desensitization of data set based on k-anonymity algorithm |
CN110995592B (en) * | 2019-12-16 | 2021-09-07 | 北京信息科技大学 | Novel self-maintenance method and route forwarding method of undetermined interest table |
CN110995592A (en) * | 2019-12-16 | 2020-04-10 | 北京信息科技大学 | Novel self-maintenance method and route forwarding method of undetermined interest table |
CN111079185A (en) * | 2019-12-20 | 2020-04-28 | 南京医康科技有限公司 | Database information processing method and device, storage medium and electronic equipment |
CN111723396A (en) * | 2020-05-20 | 2020-09-29 | 华南理工大学 | SaaS-based general cloud data privacy protection platform and method |
CN111723396B (en) * | 2020-05-20 | 2023-02-10 | 华南理工大学 | SaaS-based universal cloud data privacy protection platform and method |
CN112487415A (en) * | 2020-12-09 | 2021-03-12 | 华控清交信息科技(北京)有限公司 | Method and device for detecting safety of computing task |
CN112487415B (en) * | 2020-12-09 | 2023-10-03 | 华控清交信息科技(北京)有限公司 | Method and device for detecting security of computing task |
CN112765659A (en) * | 2021-01-20 | 2021-05-07 | 丁同梅 | Data leakage protection method for big data cloud service and big data server |
CN112765659B (en) * | 2021-01-20 | 2021-09-21 | 曙光星云信息技术(北京)有限公司 | Data leakage protection method for big data cloud service and big data server |
CN115374460A (en) * | 2022-08-31 | 2022-11-22 | 北京华宜信科技有限公司 | Method for anonymously submitting data by multiple users |
CN115310135A (en) * | 2022-10-09 | 2022-11-08 | 北京中超伟业信息安全技术股份有限公司 | Storage data safe storage method and system based on hidden model |
Also Published As
Publication number | Publication date |
---|---|
CN109446844B (en) | 2020-06-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109446844A (en) | Privacy protection method and system for big data publication | |
Karr et al. | Secure regression on distributed databases | |
Wei et al. | Differential privacy-based location protection in spatial crowdsourcing | |
CN107395430B (en) | Cloud platform dynamic risk access control method | |
EP2930646B1 (en) | Systems and methods for anonymized user list counts | |
İmrohoroğlu et al. | On the political economy of income redistribution and crime | |
Wei et al. | Some geometric aggregation operators based on interval-valued intuitionistic fuzzy sets and their application to group decision making | |
US7818335B2 (en) | Selective privacy guarantees | |
US20110179011A1 (en) | Data obfuscation system, method, and computer implementation of data obfuscation for secret databases | |
US7769707B2 (en) | Data diameter privacy policies | |
CN106940777A (en) | An identity information privacy protection method based on sensitive-information measurement | |
WO2020177484A1 (en) | Localized difference privacy urban sanitation data report and privacy calculation method | |
CN109117669B (en) | Privacy protection method and system for MapReduce similar connection query | |
CN106980795A (en) | Privacy protection method for social network data | |
CN108600271A (en) | A privacy protection method for trust state assessment | |
CN112738172B (en) | Block chain node management method and device, computer equipment and storage medium | |
CN109154952A (en) | For protecting the method and system of storing data | |
CN109359480A (en) | A user privacy protection method and system for digital libraries | |
CN110110545A (en) | A spatial crowdsourcing quality control model based on location privacy protection and cheater detection | |
CN109524065A (en) | Medical data querying method, medical data platform and relevant apparatus | |
Kuang et al. | A privacy protection model of data publication based on game theory | |
CN108768968A (en) | A method and system for processing service requests based on a data security management engine | |
CN113806799B (en) | Block chain platform safety intensity assessment method and device | |
CN117171801B (en) | Efficient space query method and system with adjustable privacy protection intensity | |
Xu et al. | Privacy preserving online matching on ridesharing platforms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 2024-04-29; Patentee after: Beijing jiuweiwei'an Technology Co., Ltd., 1211, Building A, China International Science and Technology Exhibition Center, No. 12 Yumin Road, Chaoyang District, Beijing 100029, China; Patentee before: Beijing Information Science and Technology University, No. 12, Xiaoying East Road, Qinghe, Haidian District, Beijing, China |