CN106570412A

CN106570412A - Privacy protection algorithm for incremental distribution of stream-type biologic data

Info

Publication number: CN106570412A
Application number: CN201610876548.1A
Authority: CN
Inventors: 吴响; 俞啸; 魏裕阳; 林童; 王换换
Original assignee: Xuzhou Medical University
Current assignee: Xuzhou Medical University
Priority date: 2016-10-08
Filing date: 2016-10-08
Publication date: 2017-04-19
Anticipated expiration: 2036-10-08
Also published as: CN106570412B

Abstract

The invention discloses a privacy protection algorithm for the incremental release of streaming biological data and relates to the technical field of anonymous privacy protection. Based on the k-anonymity model, the algorithm extracts the earliest-arriving tuple from the streaming biological data and inserts it into a set Set_w, which temporarily stores tuples awaiting release; it then compares the tuple in Set_w with the longest waiting time against a delay constraint δ and takes the corresponding action. The algorithm is characterized by its use of a delay constraint, which allows the information loss incurred in the incremental anonymous release of streaming biological data to be controlled effectively. Experiments show that the algorithm anonymizes streaming biological data effectively while preserving high usability of the released data, and that it has a clear advantage in processing streaming biological data.

Description

A privacy protection algorithm for the incremental release of streaming biological data

Technical field

The present invention relates to the technical field of anonymous privacy protection in data publishing, and specifically to a privacy protection algorithm for the incremental release of streaming biological data.

Background art

With the development of DNA sequencing technology, the cost of DNA sequencing has fallen rapidly and the Human Genome Project has been completed. Since then, biological data, chiefly genetic data, have continued to be produced in large quantities, and through sharing these data are widely used in medical research and clinical diagnosis. When dynamic biological data from different sources reach the collector in the form of a data stream, the published data set can be updated with them in a timely manner. However, publishing biological data carries a potential privacy disclosure problem: the identity of the data provider can easily be revealed. This hinders the sharing of biological data and makes them difficult to supply for medical research. Biological data should therefore be published with reasonable privacy protection, so that the provider's identity cannot be identified.

At present, the main method for protecting the privacy of biological data is the DNALA algorithm, a k-anonymity algorithm based on the DNA generalization lattice shown in Fig. 2. The algorithm applies generalization directly to genome sequences so that the published biological data table satisfies 2-anonymity. For DNALA, Malin showed that if k > 2 the anonymized genomic data tend to be over-generalized, which gives the published data set low utility. To preserve data availability, DNALA groups sequences into clusters of two wherever possible and then generalizes each cluster so that the genomes in each cluster share the same base sequence. Because DNALA forms only a small number of clusters containing three tuples, it preserves data availability while satisfying 2-anonymity. However, DNALA is designed for static biological data; using it for the incremental release of dynamic data takes a great deal of time, so newly arrived biological data cannot be published promptly. To address this, Li proposed the Hybrid algorithm, which can anonymize and publish streaming biological data in a timely manner, but which often forms a large number of clusters containing three genomes, so that the published data set has low availability.
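The generalization of a cluster to a single shared base sequence can be illustrated with a minimal Python sketch. It assumes the sequences are already aligned to equal length without gaps, and it uses the standard IUPAC nucleotide ambiguity codes as a stand-in for the lattice of Fig. 2, which is not reproduced in this text; the names `IUPAC`, `generalize_symbol` and `generalize_cluster` are hypothetical.

```python
# Assumption: IUPAC ambiguity codes stand in for the DNA generalization
# lattice of Fig. 2. Each code maps to the set of bases it may represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> the code that represents exactly that set.
CODE_OF = {frozenset(bases): code for code, bases in IUPAC.items()}


def generalize_symbol(x, y):
    """Return the most specific code that covers both symbols."""
    return CODE_OF[frozenset(IUPAC[x]) | frozenset(IUPAC[y])]


def generalize_cluster(seqs):
    """Generalize aligned sequences column by column into one shared sequence."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be non-empty and aligned to equal length")
    out = []
    for column in zip(*seqs):
        symbol = column[0]
        for other in column[1:]:
            symbol = generalize_symbol(symbol, other)
        out.append(symbol)
    return "".join(out)
```

In this sketch a two-sequence cluster such as `["ACGT", "ACGA"]` generalizes to `"ACGW"`, while adding a third, different sequence can only widen each column further, which is the over-generalization effect described above for larger clusters.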

Summary of the invention

To overcome the above shortcomings of the prior art, the present invention provides a privacy protection algorithm for the incremental release of streaming biological data, which significantly improves the practicality of the published DNA data set and thereby gives it higher mining value.

The invention is realized by the following technical scheme: a privacy protection algorithm for the incremental release of streaming biological data. Input: the streaming biological data set S; the published data set A; the delay constraint δ; the average distance AD (Average Distance) of the published data set A; and the clustering result of the published data set A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in the published data set A belongs to one of these m clusters. Output: the updated anonymous table A'. The specific steps are as follows:

1) First, create an empty set Set_w to hold the data awaiting release;

2) While the data set S is not empty, take the tuple s with the smallest ts value out of the streaming biological data set S and insert it into Set_w, where ts is the time at which a tuple reached the collector;

3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);

4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;

5) If dist(r, s) is less than the average distance AD of the published data set A, take tuples r and s out of Set_w, put the cluster they form into the published data set A, generalize r and s, and then go to step 7); otherwise, go directly to step 7);

6) Take the tuple a with the smallest ts in Set_w, find the sequence b in data set A that is nearest to a, and add a to the tuple cluster n_i containing b. The new cluster n_i is then handled according to the number of tuples it contains: if n_i now contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split n_i into two clusters g and h with equal numbers of tuples such that the sum of the distances within the two clusters is minimal, and then generalize g and h;

7) Jump to step 2) until the streaming biological data set S is empty;

8) Obtain the updated anonymous table A'.
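The control flow of steps 1) to 8) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the distance function is taken to be the Hamming distance between aligned sequences (the text does not define dist), AD is held fixed at its input value, generalization of each cluster is omitted, and A is assumed to contain at least one cluster. The names `hamming`, `split_four` and `nspsgd_sketch` are hypothetical. The text does not say what happens to tuples still waiting in Set_w when S runs out, so the sketch simply returns them.

```python
def hamming(x, y):
    """Assumed distance: number of differing positions in aligned sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(x, y))


def split_four(cluster, dist):
    """Split 4 sequences into the two pairs with minimal summed pair distance."""
    a, b, c, d = cluster
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    g, h = min(pairings, key=lambda p: dist(*p[0]) + dist(*p[1]))
    return [list(g), list(h)]


def nspsgd_sketch(stream, clusters, delta, ad, dist=hamming):
    """stream: list of (ts, sequence); clusters: non-empty list of sequence lists.

    Returns (updated clusters, tuples still waiting in Set_w).
    """
    clusters = [list(c) for c in clusters]            # work on a copy of A
    set_w = []                                        # step 1
    for s in sorted(stream):                          # steps 2 and 7: smallest ts first
        set_w.append(s)
        if len(set_w) <= delta:                       # step 3
            waiting = set_w[:-1]
            if waiting:
                r = min(waiting, key=lambda t: dist(t[1], s[1]))   # step 4
                if dist(r[1], s[1]) < ad:             # step 5: release r and s as a pair
                    set_w.remove(r)
                    set_w.remove(s)
                    clusters.append([r[1], s[1]])
        else:                                         # step 6: oldest tuple has waited too long
            a = min(set_w)                            # tuple with the smallest ts
            set_w.remove(a)
            target = min(clusters, key=lambda c: min(dist(a[1], b) for b in c))
            target.append(a[1])
            if len(target) == 4:                      # split into two pairs
                clusters.remove(target)
                clusters.extend(split_four(target, dist))
    return clusters, set_w                            # step 8
```

With four sequences there are only three ways to form two pairs, so `split_four` enumerates them directly rather than running a general matching algorithm.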

The beneficial effects of the invention are as follows: it protects the privacy of biological data effectively, overcomes the tendency of the existing Hybrid algorithm to over-generalize when anonymizing streaming biological data, publishes more accurate data sets, and substantially improves the availability of the published biological data sets.

Brief description of the drawings

Fig. 1 is a flow chart of the present invention;

Fig. 2 is a schematic diagram of the DNA generalization lattice used by the DNALA algorithm;

Fig. 3 is a schematic diagram of the multiple sequence alignment (MSA) and pairwise sequence alignment (PSA) mechanisms;

Fig. 4 is an example of newly arrived biological data being added to the published data set under the Hybrid algorithm;

Fig. 5 is an example of newly arrived biological data being added to the published data set under the NSPSGD algorithm;

Fig. 6a compares the anonymization results of the NSPSGD and Hybrid algorithms on data set I with δ = 40;

Fig. 6b compares the anonymization results of the NSPSGD and Hybrid algorithms on data set II with δ = 40;

Fig. 6c compares the anonymization results of the NSPSGD and Hybrid algorithms on data set III with δ = 80;

Fig. 7a shows, for data set I, the average distance as a function of the delay δ and the amount of data released;

Fig. 7b shows, for data set II, the average distance as a function of the delay δ and the amount of data released;

Fig. 7c shows, for data set III, the average distance as a function of the delay δ and the amount of data released.

Specific embodiments

The present invention proposes a privacy protection algorithm for the incremental release of streaming biological data. The concepts of k-anonymity and of streaming genomic data used by the invention are defined below.

Definition 1 (k-anonymity model): If every record in a published data set is indistinguishable from at least k-1 other records, the published data set satisfies k-anonymity. Under this principle, the k-anonymity model guarantees that the probability of re-identifying an individual in the published data set is at most 1/k. Table 1 gives an example: it shows an original data set and its k-anonymous transformation, in which the age and sex attributes are generalized and the last entry in the table is suppressed. As the table shows, the transformed data set satisfies 2-anonymity.

Table 1
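Definition 1 can be checked mechanically by grouping records on their quasi-identifier values and confirming that every group has at least k members. The following Python sketch shows one way to do this; the function name `is_k_anonymous` and the sample records are hypothetical and do not reproduce the contents of Table 1.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values occurs >= k times."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in classes.values())


# Hypothetical released table: age generalized to a range, sex partly suppressed.
released = [
    {"age": "20-30", "sex": "*"},
    {"age": "20-30", "sex": "*"},
    {"age": "30-40", "sex": "F"},
    {"age": "30-40", "sex": "F"},
]
```

Here `is_k_anonymous(released, ["age", "sex"], 2)` holds because each equivalence class contains two records, whereas the same table does not satisfy 3-anonymity.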

Definition 2 (k-anonymity of streaming genomic data): Let S be a streaming genomic data set with attributes A_S = (pid, DNA sequence, ts), where pid is the serial number identifying an individual, DNA sequence is the gene sequence, and ts is the arrival time of a tuple in S. Let S' be the data obtained by anonymizing S; S' does not contain the pid and ts attributes. S' satisfies k-anonymity if the following conditions hold:

(1) for every $t \in S$ there exists $t' \in S'$ such that t' is obtained by generalizing t;

(2) for every $t' \in S'$, $|EQ(t')| \ge k$, where all tuples in EQ(t') are identical to t' and |EQ(t')| denotes the number of tuples in EQ(t').

S' is then called a streaming genomic data set satisfying k-anonymity. Table 2 gives an example: the data set on the left of the table is the original streaming genomic data, and the data set on the right satisfies 2-anonymity after anonymization. The tuples with pid 3201 and 3202 form one EQ(t'), for which |EQ(t')| = 2.

Table 2

Definition 3 (delay constraint δ): Let P be an anonymization scheme for a dynamic genomic data set. If the k-anonymous data set S' output by P satisfies $t'.ts - t.ts < \delta$ for every $t' \in S'$, where t is the tuple in S corresponding to t' and δ is a given real number with δ > 0, then P is said to satisfy the delay constraint δ.
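A minimal Python sketch of the check in Definition 3 follows. It assumes that t'.ts denotes the time at which the anonymized tuple is released and that the publisher can pair each release time with the arrival time t.ts of the corresponding original tuple; the function name `satisfies_delay_constraint` and the pair representation are hypothetical.

```python
def satisfies_delay_constraint(pairs, delta):
    """pairs: iterable of (arrival_ts, release_ts) for corresponding tuples t and t'.

    Returns True if every tuple is released strictly less than delta after arrival.
    """
    if delta <= 0:
        raise ValueError("delta must be a positive real number")
    return all(release - arrival < delta for arrival, release in pairs)
```

The inequality is strict, so a tuple released exactly δ time units after it arrived violates the constraint.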

To address the shortcomings of the existing DNALA and Hybrid algorithms in processing dynamic genetic data, an improved k-anonymity algorithm is proposed. First, DNALA is an algorithm for static genomic data and takes a long time to process dynamic sequences. Second, it has been shown for DNALA that generalizing clusters containing three tuples tends to cause over-generalization and reduces data availability, yet the Hybrid algorithm forms a large number of three-tuple clusters when processing dynamic biological data, which over-generalizes the data set. To solve this problem, the algorithm of the present invention clusters tuples in pairs wherever possible before generalizing them, so that the anonymized data table satisfies k = 2 while containing as many two-tuple clusters as possible.

As shown in Fig. 1, on the basis of the foregoing, the privacy protection algorithm for the incremental release of streaming biological data (the NSPSGD algorithm) takes the following input: the streaming biological data set S; the published data set A; the delay constraint δ; the average distance AD (Average Distance) of the published data set A; and the clustering result of the published data set A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in the published data set A belongs to one of these m clusters. Output: the updated anonymous table A'. The specific steps are steps 1) to 6) as set out above, followed by:

7) Jump to step 2) until the streaming biological data set S is empty;

8) Obtain the updated anonymous table A'.

In the steps above, step 2) of the NSPSGD algorithm takes the earliest-arriving tuple s out of S and inserts it into the temporary storage set Set_w of tuples awaiting release. Step 3) determines whether any tuple in Set_w has waited longer than the delay. In steps 4) and 5), the number of tuples in Set_w does not exceed δ, so the waiting time of the earliest-arriving tuple in Set_w has not exceeded δ. The tuple r nearest to s is found in Set_w and the distance dist(r, s) between r and s is computed. If dist(r, s) is less than AD, the cluster formed by r and s is added to A; this ensures that adding a new cluster to the published data set does not increase its information loss. In step 6), the number of tuples in Set_w exceeds δ, so the waiting time of the earliest-arriving tuple a in Set_w has exceeded the given delay. Tuple a is taken out, its nearest neighbor b is found in A, and a is inserted into the cluster containing b; this ensures that every sequence is eventually published. If the cluster contains four sequences after a is inserted, it is split into two smaller clusters of two tuples each, and these sequences are then generalized; if the newly formed cluster contains three sequences, they are generalized directly.

Fig. 3 is a schematic diagram of the multiple sequence alignment (MSA) and pairwise sequence alignment (PSA) mechanisms. Although the NSPSGD algorithm cannot guarantee that every cluster in the published data set contains only two genomes, the delay constraint effectively reduces the formation of clusters containing three tuples. This gives the published data set higher availability while keeping the genetic data secure and preventing the disclosure of personal privacy. It follows that streaming genetic data anonymized by NSPSGD lose less information than data anonymized by Hybrid. Fig. 5 illustrates the NSPSGD algorithm processing streaming data, and Fig. 4 illustrates the Hybrid algorithm doing the same. As the figures show, Fig. 5 contains fewer three-sequence clusters than Fig. 4, so the NSPSGD algorithm achieves higher accuracy.

Experimental verification and interpretation of result

Experimental data sets and environment: To evaluate the NSPSGD algorithm and test its performance in processing newly arrived genetic data, the experiments use three data sets from NCBI containing 327, 540 and 711 tuples respectively; details are given in Table 3. To simulate a high-volume data stream, one third of each data set is treated as the static data set and anonymized with Hybrid and other MWM-based algorithms. The remaining two thirds serve as dynamic update data and are anonymized dynamically by the NSPSGD algorithm.

Table 3
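The one-third / two-thirds partition used in the experiments can be sketched in Python as follows. The text does not say how the static third is selected, so taking the first third of each data set in order is an assumption, and the function name `split_static_dynamic` is hypothetical.

```python
def split_static_dynamic(records):
    """Return (static third, dynamic remainder) of a data set, keeping order."""
    cut = len(records) // 3
    return records[:cut], records[cut:]


# Sizes of the three data sets reported in the text.
for size in (327, 540, 711):
    static, dynamic = split_static_dynamic(list(range(size)))
    print(size, len(static), len(dynamic))
```

For the three reported sizes this gives 109 static and 218 dynamic tuples, 180 and 360, and 237 and 474 respectively.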

The platform used to test the NSPSGD algorithm was configured as follows: AMD Athlon(tm) II 2.1 GHz CPU, 4 GB of memory, Windows 10. Each experimental result reported below is the average over 10 runs.

Analysis of results

Fig. 6a, Fig. 6b and Fig. 6c show how the average distance changes with the number of streaming gene sequences added. Fig. 6a shows that the average distance after processing by NSPSGD is smaller than that of Hybrid; the average distance of the data anonymized by NSPSGD decreases steadily, whereas that of Hybrid declines overall but with fluctuations. In this process, Hybrid's generalization produces many three-sequence clusters, which increases the average distance, whereas NSPSGD can find suitable two-sequence clusters in Set_w, which reduces it. The data anonymized by NSPSGD therefore have a smaller average distance and information loss (IL) than the output of Hybrid. Fig. 6b and Fig. 6c lead to the same conclusion: when processing streaming data, NSPSGD is more accurate than Hybrid.

Fig. 7a, Fig. 7b and Fig. 7c evaluate how the NSPSGD algorithm's own parameters affect its results: they plot the average distance as a function of the number of sequences released and the delay δ. A general pattern is visible: for the same amount of update data, the average distance decreases as the delay increases.

In summary, the NSPSGD algorithm performs better overall than the Hybrid algorithm. The experimental results also show that the algorithm follows a general rule: throughout the process, the larger the suppression threshold, the smaller the information loss. The algorithm retains the characteristics of the Hybrid algorithm and protects the privacy of biological data effectively, while overcoming Hybrid's tendency to generate many three-sequence clusters. It publishes more accurate data sets and shortens the time needed to anonymize incremental streaming biological data, which greatly improves the practicality of the published biological data sets.

Claims

1. A privacy protection algorithm for the incremental release of streaming biological data, characterized in that: Input: the streaming biological data set S; the published data set A; the delay constraint δ; the average distance AD of the published data set A; and the clustering result of the published data set A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in the published data set A belongs to one of these m clusters. Output: the updated anonymous table A'. The specific steps are as follows:

1) First, create an empty set Set_w to hold the data awaiting release;

2) While the data set S is not empty, take the tuple s with the smallest ts value out of the streaming biological data set S and insert it into Set_w, where ts is the time at which a tuple reached the collector;

3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);

4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;

5) If dist(r, s) is less than the average distance AD of the published data set A, take tuples r and s out of Set_w, put the cluster they form into the published data set A, generalize r and s, and then go to step 7); otherwise, go directly to step 7);

6) Take the tuple a with the smallest ts in Set_w, find the sequence b in data set A that is nearest to a, and add a to the tuple cluster n_i containing b. The new cluster n_i is then handled according to the number of tuples it contains: if n_i now contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split n_i into two clusters g and h with equal numbers of tuples such that the sum of the distances within the two clusters is minimal, and then generalize g and h;

7) Jump to step 2) until the streaming biological data set S is empty;

8) Obtain the updated anonymous table A'.