CN106570412A - Privacy protection algorithm for incremental distribution of stream-type biologic data - Google Patents
Privacy protection algorithm for incremental distribution of stream-type biologic data
- Publication number: CN106570412A
- Application number: CN201610876548.1A
- Authority: CN (China)
- Prior art keywords: tuple, data, cluster, data collection, anonymous
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
Abstract
The invention discloses a privacy protection algorithm for incremental distribution of stream-type biological data and relates to the technical field of anonymous privacy protection. Based on the k-anonymity model, the algorithm extracts the earliest-arriving tuple from the streaming biological data and inserts it into a set Set_w, which temporarily stores tuples awaiting release; it then compares the tuple in Set_w with the longest waiting time against a delay constraint δ and acts accordingly. The algorithm is characterized by its use of a delay constraint, which effectively controls information loss in the incremental anonymous release of streaming biological data. Experiments show that the algorithm anonymizes streaming biological data effectively while keeping the released data highly usable, which gives it a significant advantage in processing streaming biological data.
Description
Technical field
The present invention relates to the technical field of anonymous privacy protection in data publishing, and specifically to a privacy protection algorithm for the incremental publication of streaming biological data.
Background
With the development of DNA sequencing technology, the cost of DNA sequencing has fallen rapidly and the Human Genome Project has been completed. Since then, biological data, chiefly gene data, have continued to be produced in large quantities, and through sharing these data are widely used in medical research and clinical diagnosis. When dynamic biological data from different sources reach the collector as a data stream, they can be merged into the already published dataset in a timely manner. However, publishing biological data carries a risk of privacy leakage, because the identity of the data provider can easily be recovered. This hinders the sharing of biological data and makes the data hard to supply to medical research. Biological data should therefore be published with reasonable privacy protection that prevents the provider from being identified.
At present, the main method for protecting the privacy of biological data is the DNALA algorithm, a k-anonymity algorithm based on the DNA generalization lattice shown in Fig. 2. The algorithm generalizes genome sequences directly so that the published biological data table satisfies 2-anonymity. In DNALA, Malin showed that if k > 2 the anonymized genomic data are easily over-generalized, which gives the published dataset low utility. To preserve data availability, DNALA groups sequences into clusters of two wherever possible and then generalizes each cluster so that the genomes in a cluster share the same base sequence. Because DNALA forms only a small number of clusters containing three tuples, it preserves data availability while satisfying 2-anonymity. However, DNALA is designed for static biological data. Using it for the incremental publication of dynamic data takes a great deal of time, so newly arrived biological data cannot be published promptly. To address this, Li proposed the Hybrid algorithm, which can anonymize and publish streaming biological data in a timely manner, but Hybrid often forms a large number of clusters containing three genomes, which lowers the availability of the published dataset.
Summary of the invention
To overcome the shortcomings of the prior art described above, the present invention provides a privacy protection algorithm for the incremental publication of streaming biological data that significantly improves the utility of the published DNA datasets and gives them higher value for data mining.
The invention is realized by the following technical scheme: a privacy protection algorithm for the incremental publication of streaming biological data. Input: the streaming biological dataset S; the published dataset A; the delay constraint δ; the average distance AD (Average Distance) of the published dataset A; and the clustering result of A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in A belongs to one of these m clusters. Output: the updated anonymous table A'. The specific steps are as follows:
1) First, create an empty set Set_w to hold the data awaiting publication;
2) While the dataset S is non-empty, take the tuple s with the smallest ts value out of S and insert it into Set_w, where ts is the time at which the tuple reaches the collector;
3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);
4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;
5) If dist(r, s) is less than the average distance AD of the published dataset A, take r and s out of Set_w, put the cluster they form into A, and generalize r and s; then go to step 7). Otherwise, go directly to step 7);
6) Take the tuple a with the smallest ts in Set_w, find the sequence b in A that is nearest to a, and add a to the cluster n_i that contains b. Handle the resulting cluster n_i according to the number of tuples it now contains: if n_i contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split it into two clusters g and h of equal size such that the sum of the distances between the elements inside the two clusters is minimal, then generalize g and h;
7) Return to step 2) until the streaming biological dataset S is empty;
8) Obtain the updated anonymous table A'.
The beneficial effects of the invention are as follows: it protects the privacy of biological data effectively, overcomes the tendency of the existing Hybrid algorithm to over-generalize when anonymizing streaming biological data, publishes more accurate datasets, and substantially improves the availability of the published biological datasets.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the DNA generalization lattice used by the DNALA algorithm;
Fig. 3 is a schematic diagram of multiple sequence alignment (MSA) and pairwise sequence alignment (PSA);
Fig. 4 is an example of newly arrived biological data being merged into the published dataset under the Hybrid algorithm;
Fig. 5 is an example of newly arrived biological data being merged into the published dataset under the NSPSGD algorithm;
Fig. 6a compares the anonymization results of the NSPSGD and Hybrid algorithms on dataset I with δ = 40;
Fig. 6b compares the anonymization results of the NSPSGD and Hybrid algorithms on dataset II with δ = 40;
Fig. 6c compares the anonymization results of the NSPSGD and Hybrid algorithms on dataset III with δ = 80;
Fig. 7a shows, for dataset I, the average distance as a function of the delay δ and the amount of published data;
Fig. 7b shows, for dataset II, the average distance as a function of the delay δ and the amount of published data;
Fig. 7c shows, for dataset III, the average distance as a function of the delay δ and the amount of published data.
Detailed description
The present invention proposes a privacy protection algorithm for the incremental publication of streaming biological data. The concepts of k-anonymity and of streaming genomic data used by the invention are defined below.
Definition 1 (k-anonymity model): If every record in a published dataset is indistinguishable from at least k-1 other records, the published dataset satisfies k-anonymity. Under this principle, the k-anonymity model guarantees that the probability of re-identifying an individual in the published dataset does not exceed 1/k. Table 1 gives a concrete example: it shows an original dataset and its k-anonymous transformation, in which the age and sex attributes are generalized and the last entry is suppressed. As the table shows, the transformed dataset satisfies 2-anonymity.
Table 1
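Definition 1 can be checked mechanically by grouping records on their quasi-identifier values and confirming that every group has at least k members. The following is a minimal sketch only: the function name `is_k_anonymous`, the attribute names, and the sample rows are hypothetical illustrations and are not taken from Table 1.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    return all(count >= k for count in Counter(keys).values())

# Hypothetical generalized table: ages bucketed into ranges, sex masked with '*'.
released = [
    {"age": "20-29", "sex": "*", "diagnosis": "flu"},
    {"age": "20-29", "sex": "*", "diagnosis": "asthma"},
    {"age": "30-39", "sex": "*", "diagnosis": "flu"},
    {"age": "30-39", "sex": "*", "diagnosis": "diabetes"},
]

two_anonymous = is_k_anonymous(released, ["age", "sex"], 2)
three_anonymous = is_k_anonymous(released, ["age", "sex"], 3)
print(two_anonymous, three_anonymous)
```

Each group here has exactly two members, so the sample table satisfies 2-anonymity but not 3-anonymity.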
Definition 2 (k-anonymity of streaming genomic data): Let S be a streaming genomic dataset with attributes A_S = (pid, DNA sequence, ts), where pid identifies the individual, DNA sequence is the gene sequence, and ts is the arrival time of the tuple in S. Let S' be the data obtained by anonymizing S; S' contains neither the pid nor the ts attribute. S' satisfies k-anonymity if the following conditions hold:
(1) for every t ∈ S there exists a t' ∈ S' such that t' is obtained by generalizing t;
(2) for every t' ∈ S', |EQ(t')| >= k, where EQ(t') is the set of tuples in S' that are identical to t' and |EQ(t')| denotes the number of tuples in EQ(t').
S' is then called a k-anonymous streaming gene dataset. For example, in Table 2 the dataset on the left is the original streaming gene data and the dataset on the right is the 2-anonymous dataset obtained after anonymization. The tuples with pid 3201 and 3202 form one EQ(t'), so |EQ(t')| = 2.
Table 2
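To make Definition 2 concrete, the sketch below generalizes two aligned sequences position by position into a common t', which then forms an equivalence class of size 2. It assumes the IUPAC nucleotide ambiguity codes as the generalization hierarchy; the lattice actually used by DNALA (Fig. 2) may differ. The sequences attached to pid 3201 and 3202 are invented for illustration and are not the values in Table 2.

```python
# IUPAC ambiguity codes, used here as an ASSUMED generalization hierarchy.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def generalize(sequences):
    """Generalize equal-length, pre-aligned A/C/G/T sequences column by column."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(IUPAC[frozenset(column)] for column in zip(*sequences))

# Hypothetical stream tuples (pid, DNA sequence, ts).
stream = [
    {"pid": 3201, "dna": "ACGT", "ts": 1},
    {"pid": 3202, "dna": "ACAT", "ts": 2},
]

t_prime = generalize([t["dna"] for t in stream])
# The released tuples drop pid and ts and share the same generalized sequence.
released = [{"dna": t_prime} for _ in stream]
print(t_prime, len(released))
```

The third position differs (G versus A), so it is generalized to the purine code R, while the agreeing positions are kept unchanged.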
Definition 3 (delay constraint δ): Let P be an anonymization scheme for a dynamic genomic dataset. If the k-anonymous dataset S' output by P satisfies t'.ts - t.ts < δ for every t' in S', where t is the tuple in S that corresponds to t' and δ is a given real number with δ > 0, then P is said to satisfy the delay constraint δ.
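The delay constraint of Definition 3 amounts to a per-tuple comparison between arrival time and release time. The sketch below is illustrative only: the function name `satisfies_delay`, the dictionaries keyed by pid, and the timestamp values are hypothetical, and the time unit is left unspecified as in the source.

```python
def satisfies_delay(arrival_ts, release_ts, delta):
    """True if every tuple is released strictly less than delta after it arrived."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    return all(release_ts[pid] - ts < delta for pid, ts in arrival_ts.items())

# Hypothetical arrival times t.ts and release times t'.ts, keyed by pid.
arrival = {3201: 10.0, 3202: 12.5}
release = {3201: 14.0, 3202: 14.0}

within_five = satisfies_delay(arrival, release, 5.0)
within_four = satisfies_delay(arrival, release, 4.0)
print(within_five, within_four)
```

The inequality is strict, so a tuple that waits exactly δ already violates the constraint.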
To address the shortcomings of the existing DNALA and Hybrid algorithms in processing dynamic gene data, we propose an improved k-anonymity algorithm. First, DNALA is designed for static genomic data and takes a long time to process dynamic sequences. Second, it has been shown in DNALA that generalizing clusters of three tuples easily causes over-generalization and reduces data availability, yet the Hybrid algorithm forms a large number of three-tuple clusters when processing dynamic biological data, which over-generalizes the dataset. To solve this problem, the algorithm of the present invention groups tuples into pairs wherever possible before generalizing them, so that the anonymized data table satisfies k = 2 while containing as many two-tuple clusters as possible.
As shown in Fig. 1, and based on the foregoing, the privacy protection algorithm for the incremental publication of streaming biological data (the NSPSGD algorithm) is as follows. Input: the streaming biological dataset S; the published dataset A; the delay constraint δ; the average distance AD (Average Distance) of the published dataset A; and the clustering result of A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in A belongs to one of these m clusters. Output: the updated anonymous table A'. The specific steps are as follows:
1) First, create an empty set Set_w to hold the data awaiting publication;
2) While the dataset S is non-empty, take the tuple s with the smallest ts value out of S and insert it into Set_w, where ts is the time at which the tuple reaches the collector;
3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);
4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;
5) If dist(r, s) is less than the average distance AD of the published dataset A, take r and s out of Set_w, put the cluster they form into A, and generalize r and s; then go to step 7). Otherwise, go directly to step 7);
6) Take the tuple a with the smallest ts in Set_w, find the sequence b in A that is nearest to a, and add a to the cluster n_i that contains b. Handle the resulting cluster n_i according to the number of tuples it now contains: if n_i contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split it into two clusters g and h of equal size such that the sum of the distances between the elements inside the two clusters is minimal, then generalize g and h;
7) Return to step 2) until the streaming biological dataset S is empty;
8) Obtain the updated anonymous table A'.
In the steps above, step 2) takes the earliest-arriving tuple s out of S and inserts it into Set_w, the set that temporarily stores tuples awaiting publication. Step 3) determines whether any tuple in Set_w has waited longer than the delay. Steps 4) and 5) apply when the number of tuples in Set_w is not greater than δ, in which case the waiting time of the earliest-arriving tuple in Set_w is less than δ. The algorithm finds the tuple r in Set_w that is nearest to s and computes the distance dist(r, s) between them. If dist(r, s) is less than AD, the cluster formed by r and s is added to A; this ensures that adding a new cluster to the published dataset does not increase its information loss. Step 6) applies when the number of tuples in Set_w is greater than δ, in which case the waiting time of the earliest-arriving tuple a in Set_w has exceeded the given delay. The algorithm takes out a, finds its nearest tuple b in A, and inserts a into the cluster that contains b; this step ensures that every sequence is eventually published. If the cluster contains four sequences after a is inserted, it is split into two smaller clusters of two tuples each, and these sequences are then generalized. If the new cluster contains three sequences, they are generalized directly.
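The control flow of steps 1) to 8) can be sketched in Python as follows. This is a hedged illustration, not the patented implementation. Several pieces are assumptions made only so that the sketch runs: `dist` is a Hamming distance on equal-length sequences, `generalize` masks disagreeing positions with `N`, AD is held fixed during the run, A must start non-empty, and tuples still in Set_w when the stream ends are simply returned, because the listed steps do not say how they are flushed. The names `nspsgd` and `split_four` and the sample data are hypothetical.

```python
def dist(x, y):
    """Stand-in distance (assumption): Hamming distance of equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))

def generalize(cluster):
    """Stand-in generalization (assumption): mask disagreeing positions with 'N'."""
    return "".join(col[0] if len(set(col)) == 1 else "N" for col in zip(*cluster))

def split_four(cluster):
    """Split four sequences into two pairs with minimum total intra-pair distance."""
    first, rest = cluster[0], cluster[1:]
    options = []
    for i, partner in enumerate(rest):
        g, h = [first, partner], rest[:i] + rest[i + 1:]
        options.append((dist(*g) + dist(*h), g, h))
    _, g, h = min(options, key=lambda option: option[0])
    return g, h

def nspsgd(stream, clusters, delta, ad):
    """stream: (ts, sequence) pairs; clusters: non-empty list of 2- or 3-sequence lists."""
    clusters = [list(c) for c in clusters]      # work on a copy of A's clusters
    waiting = []                                # step 1: Set_w starts empty
    for item in sorted(stream):                 # step 2: smallest ts first
        waiting.append(item)
        if len(waiting) <= delta:               # step 3: count compared with delta
            others = waiting[:-1]
            if others:
                r = min(others, key=lambda t: dist(t[1], item[1]))  # step 4
                if dist(r[1], item[1]) < ad:    # step 5: publish r and s as a pair
                    waiting.pop()
                    waiting.remove(r)
                    clusters.append([r[1], item[1]])
        else:                                   # step 6: oldest tuple waited too long
            a = min(waiting)
            waiting.remove(a)
            target = min(clusters, key=lambda c: min(dist(a[1], b) for b in c))
            target.append(a[1])
            if len(target) == 4:                # split a 4-cluster into two pairs
                clusters.remove(target)
                clusters.extend(split_four(target))
    released = [(generalize(c), len(c)) for c in clusters]  # step 8: A'
    return released, waiting

# Hypothetical published clusters and stream.
published = [["AAAA", "AAAT"], ["CCCC", "CCCG", "CCGG"]]
incoming = [(1, "AATT"), (2, "GGGG"), (3, "AATA"),
            (4, "CCGC"), (5, "TTTT"), (6, "CCCA")]
released, leftover = nspsgd(incoming, published, delta=3, ad=2)
print(released)
print(leftover)
```

In this run the pair AATT and AATA is published directly through step 5), while GGGG eventually exceeds the delay, joins the three-sequence cluster through step 6), and triggers a split into two pairs. A four-element cluster has only three possible pairings, so `split_four` enumerates them exhaustively.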
Fig. 3 is a schematic diagram of multiple sequence alignment (MSA) and pairwise sequence alignment (PSA). Although the NSPSGD algorithm cannot guarantee that every cluster in the published dataset contains only two genomes, the delay constraint effectively reduces the formation of clusters containing three tuples. This gives the published dataset higher availability while still keeping the gene data secure and preventing the leakage of individual privacy. It follows that NSPSGD loses less information than Hybrid when anonymizing streaming gene data. Fig. 5 illustrates NSPSGD processing streaming data, and Fig. 4 illustrates Hybrid processing streaming data. As the figures show, Fig. 5 contains fewer three-sequence clusters than Fig. 4, so NSPSGD is more accurate.
Experimental verification and analysis of results
Experimental datasets and environment: To evaluate the NSPSGD algorithm and test its performance on newly arrived gene data, the experiments use three datasets from NCBI that contain 327, 540 and 711 tuples respectively. Details are given in Table 3. To simulate a high-volume data stream, one third of each dataset is used as the static dataset and is anonymized with Hybrid and other MWM-based algorithms. The remaining two thirds are then treated as dynamic updates and anonymized dynamically by the NSPSGD algorithm.
Table 3
The experimental platform used to test the NSPSGD algorithm is configured as follows: AMD Athlon(tm) II 2.1 GHz CPU, 4 GB of memory, Windows 10. Each experimental result reported below is the average of 10 runs.
Analysis of results
Fig. 6a, Fig. 6b and Fig. 6c show how the average distance changes with the number of streaming gene sequences that have been merged. Fig. 6a shows that the average distance after NSPSGD processing is smaller than that of Hybrid. The average distance of the data anonymized by NSPSGD decreases steadily, whereas that of Hybrid declines overall but with fluctuations. During this process, Hybrid's generalization produces many three-sequence clusters, which increases the average distance, whereas NSPSGD can find suitable two-sequence clusters in Set_w, which reduces it. The data anonymized by NSPSGD therefore have a smaller average distance and information loss (IL) than the output of Hybrid. Fig. 6b and Fig. 6c lead to the same conclusion: when processing streaming data, NSPSGD is more accurate than Hybrid.
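The link between three-sequence clusters and a larger average distance can be illustrated numerically. The source does not spell out how the average distance is computed, so the sketch below assumes one plausible stand-in: the mean pairwise Hamming distance inside each cluster, averaged over clusters. The function `cluster_stats` and the sample clusterings are hypothetical.

```python
from itertools import combinations

def dist(x, y):
    """Stand-in distance (assumption): Hamming distance of equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))

def cluster_stats(clusters):
    """Return (average intra-cluster pairwise distance, number of 3-sequence clusters)."""
    per_cluster = []
    for cluster in clusters:
        pairs = list(combinations(cluster, 2))
        per_cluster.append(sum(dist(x, y) for x, y in pairs) / len(pairs))
    average = sum(per_cluster) / len(per_cluster)
    triples = sum(1 for cluster in clusters if len(cluster) == 3)
    return average, triples

# Two hypothetical clusterings with pairs only versus one forced triple.
pairs_only = [["AAAA", "AAAT"], ["CCCC", "CCCG"]]
with_triple = [["AAAA", "AAAT", "CCCC"], ["CCCG", "CCGG"]]

stats_pairs = cluster_stats(pairs_only)
stats_triple = cluster_stats(with_triple)
print(stats_pairs, stats_triple)
```

Forcing a dissimilar sequence into an existing pair raises the average under this assumed metric, which matches the qualitative behavior described for Hybrid.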
Fig. 7a, Fig. 7b and Fig. 7c evaluate how the parameters of the NSPSGD algorithm itself affect its results. The data plotted show the average distance as a function of the number of published sequences and the delay δ. A general pattern is visible: for the same amount of updated data, the average distance decreases as the delay increases.
In summary, the NSPSGD algorithm performs better overall than the Hybrid algorithm. The experimental results also show that the algorithm follows a general rule: throughout the process, the larger the suppression threshold, the smaller the information loss. The algorithm retains the strengths of Hybrid, protects the privacy of biological data effectively, overcomes Hybrid's tendency to generate a large number of three-sequence clusters, publishes more accurate datasets, and shortens the time needed for the incremental anonymization of streaming biological data, so the practical utility of the published biological datasets is greatly improved.
Claims (1)
1. A privacy protection algorithm for the incremental publication of streaming biological data, characterized in that: Input: the streaming biological dataset S; the published dataset A; the delay constraint δ; the average distance AD of the published dataset A; and the clustering result of A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in A belongs to one of these m clusters. Output: the updated anonymous table A'. The specific steps are as follows:
1) First, create an empty set Set_w to hold the data awaiting publication;
2) While the dataset S is non-empty, take the tuple s with the smallest ts value out of S and insert it into Set_w, where ts is the time at which the tuple reaches the collector;
3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);
4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;
5) If dist(r, s) is less than the average distance AD of the published dataset A, take r and s out of Set_w, put the cluster they form into A, and generalize r and s; then go to step 7). Otherwise, go directly to step 7);
6) Take the tuple a with the smallest ts in Set_w, find the sequence b in A that is nearest to a, and add a to the cluster n_i that contains b. Handle the resulting cluster n_i according to the number of tuples it now contains: if n_i contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split it into two clusters g and h of equal size such that the sum of the distances between the elements inside the two clusters is minimal, then generalize g and h;
7) Return to step 2) until the streaming biological dataset S is empty;
8) Obtain the updated anonymous table A'.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610876548.1A CN106570412B (en) | 2016-10-08 | 2016-10-08 | Privacy protection method for incremental publication of streaming biological data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106570412A (en) | 2017-04-19 |
CN106570412B (en) | 2018-10-30 |
Family
ID=58532585
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610876548.1A Active CN106570412B (en) | Privacy protection method for incremental publication of streaming biological data | 2016-10-08 | 2016-10-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106570412B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN106570412B (en) | 2018-10-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||