CN106570412A - Privacy protection algorithm for incremental distribution of stream-type biologic data - Google Patents

Privacy protection algorithm for incremental distribution of stream-type biologic data

Info

Publication number
CN106570412A
Authority
CN
China
Prior art keywords
tuple
data
cluster
data collection
anonymous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610876548.1A
Other languages
Chinese (zh)
Other versions
CN106570412B (en)
Inventor
吴响
俞啸
魏裕阳
林童
王换换
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xuzhou Medical University
Original Assignee
Xuzhou Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xuzhou Medical University
Priority to CN201610876548.1A
Publication of CN106570412A
Application granted
Publication of CN106570412B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Abstract

The invention discloses a privacy protection algorithm for incremental distribution of stream-type biologic data and relates to the technical field of anonymous privacy protection. Based on the k-anonymity model, the algorithm takes the earliest-arriving tuple from the streaming biological data and inserts it into a set Set_w that temporarily stores tuples awaiting release. It then compares the waiting time of the longest-waiting tuple in Set_w with a delay constraint δ and handles that tuple accordingly. The algorithm is characterized by its use of a delay constraint, which effectively controls the information loss incurred during the incremental anonymous release of streaming biological data. Experiments show that the algorithm anonymizes streaming biological data effectively while keeping the released data highly usable, which gives it a clear advantage in processing streaming biological data.

Description

A privacy protection algorithm for the incremental publication of streaming biological data
Technical field
The present invention relates to the technical field of anonymous privacy protection in data publishing, and specifically to a privacy protection algorithm for the incremental publication of streaming biological data.
Background
With the development of DNA sequencing technology, the cost of DNA sequencing has fallen rapidly and the Human Genome Project has been completed. Since then, biological data, chiefly genetic data, have continued to be produced in large quantities, and through sharing these data are widely used in medical research and clinical diagnosis. When dynamic biological data from different sources reach the collector in the form of a data stream, they can be added to the published data set in a timely manner. However, publishing biological data carries a potential privacy leakage problem: the identity of the data provider can easily be re-identified. This hinders the sharing of biological data and makes the data difficult to supply for medical research. Biological data should therefore be published with reasonable privacy protection that prevents the provider's identity from being identified.
At present, the main method for protecting the privacy of biological data is DNALA, a k-anonymity algorithm based on the DNA generalization lattice shown in Fig. 2. The algorithm applies generalization directly to genome sequences so that the published biological data table satisfies 2-anonymity. For DNALA, Malin has shown that if k > 2, the anonymized genomic data are prone to over-generalization, which gives the published data set low utility. To retain data usability, DNALA groups sequences into clusters of two wherever possible and then generalizes each cluster so that all genomes in a cluster share the same base sequence. Because DNALA forms only a small number of clusters containing three tuples, it preserves data usability while satisfying 2-anonymity. However, DNALA is designed for static biological data; using it for the incremental publication of dynamic data takes a great deal of time, so newly arrived biological data cannot be published promptly. To address this, Li proposed the Hybrid algorithm, which can anonymize and publish streaming biological data in a timely manner. Hybrid, however, often forms a large number of clusters containing three genomes, which lowers the usability of the published data set.
Summary of the invention
To overcome the shortcomings of the prior art described above, the present invention provides a privacy protection algorithm for the incremental publication of streaming biological data, which significantly improves the utility of the published DNA data sets and thus gives them greater value for data mining.
The invention is realized by the following technical scheme: a privacy protection algorithm for the incremental publication of streaming biological data. Input: the streaming biological data set S; the published data set A; the delay constraint δ; the average distance AD (Average Distance) of the published data set A; and the clustering result of A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in A belongs to one of the m clusters. Output: the updated anonymous table A'. The specific steps are as follows:
1) First, create an empty set Set_w to hold the data waiting to be published;
2) While the data set S is not empty, take the tuple s with the smallest ts value out of S and insert it into Set_w, where ts is the time at which a tuple reaches the collector;
3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);
4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;
5) If dist(r, s) is less than the average distance AD of the published data set A, take r and s out of Set_w, put the cluster they form into A, generalize r and s, and then go to step 7); otherwise, go directly to step 7);
6) Take the tuple a with the smallest ts out of Set_w, find the sequence b in A that is nearest to a, and add a to the cluster n_i that contains b. Handle the resulting cluster n_i according to its size: if n_i now contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split n_i into two clusters g and h of equal size such that the sum of the in-cluster distances of the two groups is minimal, and then generalize g and h;
7) Return to step 2) until the streaming biological data set S is empty;
8) Obtain the updated anonymous table A'.
The beneficial effects of the invention are as follows: while providing effective privacy protection for biological data, it overcomes the tendency of the existing Hybrid algorithm to over-generalize when anonymizing streaming biological data, publishes more accurate data sets, and substantially improves the usability of the published biological data sets.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a schematic diagram of the DNA generalization lattice used by the DNALA algorithm;
Fig. 3 is a schematic diagram of multiple sequence alignment (MSA) and pairwise sequence alignment (PSA);
Fig. 4 is an example of newly arrived biological data being added to the published data set under the Hybrid algorithm;
Fig. 5 is an example of newly arrived biological data being added to the published data set under the NSPSGD algorithm;
Fig. 6a compares the anonymization results of the NSPSGD and Hybrid algorithms on data set I with δ = 40;
Fig. 6b compares the anonymization results of the NSPSGD and Hybrid algorithms on data set II with δ = 40;
Fig. 6c compares the anonymization results of the NSPSGD and Hybrid algorithms on data set III with δ = 80;
Fig. 7a shows, for data set I, the average distance as a function of the delay δ and the amount of published data;
Fig. 7b shows, for data set II, the average distance as a function of the delay δ and the amount of published data;
Fig. 7c shows, for data set III, the average distance as a function of the delay δ and the amount of published data.
Detailed description of the embodiments
The present invention proposes a privacy protection algorithm for the incremental publication of streaming biological data. The concepts of k-anonymity and of streaming genomic data used by the invention are defined below.
Definition 1 (k-anonymity model): If every record in a published data set is indistinguishable from at least k-1 other records, the published data set satisfies k-anonymity. Under this principle, the k-anonymity model guarantees that the probability of re-identifying an individual in the published data set is no greater than 1/k. See Table 1, which shows an original data set and its k-anonymized form: the age and sex attributes are generalized, and the last entry in the table is suppressed. As the table shows, the transformed data set satisfies 2-anonymity.
Table 1
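As an illustration of Definition 1, the following Python sketch checks whether a table satisfies k-anonymity by counting how many records share each combination of quasi-identifier values. The function name `is_k_anonymous`, the attribute names and the sample records are hypothetical and are not taken from Table 1.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k records."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical generalized table: ages are ranges, "*" is a suppressed value.
released = [
    {"age": "20-29", "sex": "*", "diagnosis": "flu"},
    {"age": "20-29", "sex": "*", "diagnosis": "asthma"},
    {"age": "30-39", "sex": "F", "diagnosis": "flu"},
    {"age": "30-39", "sex": "F", "diagnosis": "diabetes"},
]

print(is_k_anonymous(released, ["age", "sex"], k=2))
print(is_k_anonymous(released, ["age", "sex"], k=3))
```

Each group of records sharing the same quasi-identifier values plays the role of an equivalence class, so the check reduces to verifying that no class has fewer than k members.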
Definition 2 (k-anonymity of streaming genomic data): Let S be a streaming genomic data set with attributes AS = (pid, DNA sequence, ts), where pid is the identifier of an individual, DNA sequence is the gene sequence, and ts is the arrival time of the tuple in S. Let S' be the data obtained by anonymizing S, so that S' contains neither the pid nor the ts attribute. S' satisfies k-anonymity if the following conditions hold:
(1) for every t ∈ S there exists t' ∈ S' such that t' is obtained by generalizing t;
(2) for every t' ∈ S', |EQ(t')| ≥ k, where all tuples in EQ(t') are identical to t' and |EQ(t')| denotes the number of tuples in EQ(t').
S' is then called a streaming genomic data set satisfying k-anonymity. For example, in Table 2 the data set on the left is the original streaming genomic data, and the data set on the right is the anonymized data set satisfying 2-anonymity. The tuples with pid 3201 and 3202 form one EQ(t'), for which |EQ(t')| = 2.
Table 2
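The generalization of a two-tuple cluster can be sketched as follows. The lattice of Fig. 2 is not reproduced in this text, so the sketch assumes the standard IUPAC nucleotide ambiguity codes as the generalization hierarchy and assumes that the sequences have already been aligned to equal length. The Hamming-style `dist` is likewise only a stand-in for the distance used by DNALA, and all function names are hypothetical.

```python
# IUPAC nucleotide codes and the set of bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of bases -> single code covering exactly that set.
CODE_OF = {frozenset(bases): code for code, bases in IUPAC.items()}


def generalize_base(x, y):
    """Smallest IUPAC code that covers both symbols x and y."""
    return CODE_OF[frozenset(IUPAC[x]) | frozenset(IUPAC[y])]


def generalize(seq_a, seq_b):
    """Position-wise generalization of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return "".join(generalize_base(x, y) for x, y in zip(seq_a, seq_b))


def dist(seq_a, seq_b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(seq_a, seq_b))


print(generalize("ACGT", "ACTT"))
print(dist("ACGT", "ACTT"))
```

After generalization both members of the cluster are published as the same string, so they form one equivalence class of size 2. Fewer differing positions mean fewer ambiguity codes and therefore less information loss.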
Definition 3 (delay constraint δ): Let P be an anonymization scheme for a dynamic genomic data set. If the k-anonymous data set S' output by P satisfies t'.ts - t.ts < δ for every t' ∈ S', where t is the tuple in S corresponding to t' and δ is a given real number with δ > 0, then P is said to satisfy the delay constraint δ.
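A minimal sketch of how Definition 3 could be checked after the fact is given below. It assumes that t'.ts denotes the time at which the generalized tuple is released and that the publisher keeps its own log of arrival and release times keyed by pid; the function name and the timestamps are hypothetical.

```python
def satisfies_delay(arrival_ts, publish_ts, delta):
    """Return True if every published tuple waited less than delta.

    arrival_ts: dict mapping pid -> time the tuple reached the collector (t.ts)
    publish_ts: dict mapping pid -> time the generalized tuple was released
                (assumed meaning of t'.ts)
    """
    return all(publish_ts[pid] - arrival_ts[pid] < delta for pid in publish_ts)


# Hypothetical timestamps in arbitrary time units.
arrival = {3201: 0, 3202: 3, 3203: 5}
publish = {3201: 3, 3202: 3, 3203: 9}

print(satisfies_delay(arrival, publish, delta=5))
print(satisfies_delay(arrival, publish, delta=4))
```

The algorithm itself compares the number of tuples in Set_w with δ rather than comparing timestamps; the two views coincide if tuples arrive at a steady rate of one per time unit, which is an assumption of this note rather than a statement of the source.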
To address the shortcomings of the existing DNALA and Hybrid algorithms in processing dynamic genetic data, we propose an improved k-anonymity algorithm. First, DNALA is an algorithm for static genomic data, and it takes a long time to process dynamic sequences. Second, it has been shown for DNALA that generalizing clusters of three tuples easily leads to over-generalization and reduces data usability, yet the Hybrid algorithm forms a large number of three-tuple clusters when processing dynamic biological data, which over-generalizes the data set. To solve this problem, the algorithm of the present invention groups tuples into clusters of two wherever possible before generalizing them, so that the anonymized data table satisfies k = 2 while containing as many two-tuple clusters as possible.
As shown in Fig. 1, and based on the foregoing, the privacy protection algorithm for the incremental publication of streaming biological data (the NSPSGD algorithm) is as follows. Input: the streaming biological data set S; the published data set A; the delay constraint δ; the average distance AD (Average Distance) of the published data set A; and the clustering result of A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in A belongs to one of the m clusters. Output: the updated anonymous table A'. The specific steps are as follows:
1) First, create an empty set Set_w to hold the data waiting to be published;
2) While the data set S is not empty, take the tuple s with the smallest ts value out of S and insert it into Set_w, where ts is the time at which a tuple reaches the collector;
3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);
4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;
5) If dist(r, s) is less than the average distance AD of the published data set A, take r and s out of Set_w, put the cluster they form into A, generalize r and s, and then go to step 7); otherwise, go directly to step 7);
6) Take the tuple a with the smallest ts out of Set_w, find the sequence b in A that is nearest to a, and add a to the cluster n_i that contains b. Handle the resulting cluster n_i according to its size: if n_i now contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split n_i into two clusters g and h of equal size such that the sum of the in-cluster distances of the two groups is minimal, and then generalize g and h;
7) Return to step 2) until the streaming biological data set S is empty;
8) Obtain the updated anonymous table A'.
In the steps above, step 2) takes the earliest-arriving tuple s out of S and inserts it into Set_w, the temporary set of tuples waiting to be published. Step 3) checks whether any tuple in Set_w has waited longer than the delay constraint. In steps 4) and 5), the number of tuples in Set_w is not greater than δ, so the waiting time of the earliest-arriving tuple in Set_w does not exceed δ. The algorithm finds the tuple r in Set_w nearest to s and computes the distance dist(r, s). If dist(r, s) is less than AD, the cluster formed by r and s is added to A; this ensures that adding the new cluster to the published data set does not increase its information loss. In step 6), the number of tuples in Set_w is greater than δ, so the waiting time of the earliest-arriving tuple a has exceeded the given delay. The algorithm takes a out, finds its nearest neighbor b in A, and inserts a into the cluster containing b; this step ensures that every sequence is eventually published. If the cluster contains four sequences after a is inserted, it is split into two smaller clusters of two tuples each, which are then generalized; if the new cluster contains three sequences, those sequences are generalized directly.
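The steps described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the patented implementation: sequences are assumed to be pre-aligned strings, `dist` is a Hamming distance standing in for the distance of the source, AD is passed in as a fixed value `avg_dist`, and the generalization of each cluster is omitted so that only the grouping logic is shown. The names `nspsgd` and `split_four` are hypothetical, and the source does not say what happens to tuples still in Set_w when S runs out, so they are simply returned.

```python
def dist(a, b):
    """Hamming distance between equal-length sequences (assumed stand-in)."""
    return sum(x != y for x, y in zip(a, b))


def split_four(cluster):
    """Split four sequences into two pairs with minimal total in-pair distance."""
    best = None
    for i in range(1, 4):  # partner of cluster[0]; the other two form pair two
        rest = [cluster[j] for j in range(1, 4) if j != i]
        cost = dist(cluster[0], cluster[i]) + dist(rest[0], rest[1])
        if best is None or cost < best[0]:
            best = (cost, [cluster[0], cluster[i]], rest)
    return best[1], best[2]


def nspsgd(stream, clusters, delta, avg_dist):
    """stream: iterable of (ts, sequence); clusters: non-empty clusters of A.

    Returns (updated clusters, tuples still waiting in Set_w).
    """
    pending = sorted(stream)                 # S ordered by arrival time ts
    clusters = [list(c) for c in clusters]   # work on a copy of A's clusters
    waiting = []                             # Set_w, kept in arrival order
    while pending:                           # steps 2) and 7)
        s = pending.pop(0)
        waiting.append(s)
        if len(waiting) <= delta:            # step 3) -> steps 4) and 5)
            others = waiting[:-1]
            if others:
                r = min(others, key=lambda t: dist(t[1], s[1]))
                if dist(r[1], s[1]) < avg_dist:
                    waiting.remove(r)
                    waiting.pop()            # s is the last element
                    clusters.append([r[1], s[1]])
        else:                                # step 6)
            a = waiting.pop(0)               # earliest arrival in Set_w
            idx = min(range(len(clusters)),
                      key=lambda i: min(dist(a[1], b) for b in clusters[i]))
            clusters[idx].append(a[1])
            if len(clusters[idx]) == 4:
                g, h = split_four(clusters[idx])
                clusters[idx] = g
                clusters.append(h)
    return clusters, waiting
```

Because four sequences can be paired in only three ways, `split_four` enumerates all three pairings instead of using a general matching algorithm. A call such as `nspsgd([(1, "GGGG"), (2, "GGGT")], [["AAAA", "AAAT"]], delta=2, avg_dist=2)` pairs the two arriving sequences into a new cluster because their distance is below `avg_dist`.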
Fig. 3 is a schematic diagram of multiple sequence alignment (MSA) and pairwise sequence alignment (PSA). Although the NSPSGD algorithm cannot guarantee that every cluster in the published data set contains exactly two genomes, the delay constraint effectively reduces the formation of three-tuple clusters. This gives the published data set higher usability while still ensuring the security of the genetic data and preventing the leakage of personal privacy. The NSPSGD algorithm therefore loses less information than the Hybrid algorithm when anonymizing streaming genetic data. Fig. 5 illustrates how the NSPSGD algorithm processes streaming data, and Fig. 4 illustrates how the Hybrid algorithm does so. As the figures show, Fig. 5 contains fewer three-sequence clusters than Fig. 4, so the NSPSGD algorithm achieves higher accuracy.
Experimental verification and analysis of results
Experimental data sets and environment: To evaluate the NSPSGD algorithm and test its performance on newly arrived genetic data, the experiments use three data sets from NCBI containing 327, 540 and 711 tuples respectively. Details are given in Table 3. To simulate a large data stream, one third of each data set is treated as static data and anonymized with Hybrid and other MWM-based algorithms. The remaining two thirds are then treated as dynamic updates and anonymized dynamically by the NSPSGD algorithm.
Table 3
The NSPSGD algorithm was tested on the following platform: AMD Athlon(tm) II 2.1 GHz CPU, 4 GB of memory, Windows 10. Each experimental result reported below is the average over 10 runs.
Analysis of results
Fig. 6a, Fig. 6b and Fig. 6c show how the average distance changes with the number of streamed gene sequences added. Fig. 6a shows that the average distance after NSPSGD processing is smaller than that of the Hybrid algorithm: the average distance of the data anonymized by NSPSGD decreases steadily, whereas that of Hybrid declines overall but with fluctuations. During this process, Hybrid generalizes many three-sequence clusters, which increases the average distance, while NSPSGD can find suitable two-sequence clusters in Set_w that reduce the average distance. The data anonymized by NSPSGD therefore have a smaller average distance and information loss (IL) than the results of Hybrid. Fig. 6b and Fig. 6c lead to the same conclusion: when processing streaming data, the NSPSGD algorithm is more accurate than the Hybrid algorithm.
Fig. 7a, Fig. 7b and Fig. 7c evaluate the relationship between the NSPSGD algorithm's own parameters and its results; the plotted data show the average distance as a function of the number of published sequences and the delay δ. A general rule can be seen: for the same amount of updated data, the average distance decreases as the delay increases.
In summary, the NSPSGD algorithm performs better overall than the Hybrid algorithm. The experimental results also show that the algorithm follows a general rule: throughout the process, the larger the suppression threshold, the smaller the information loss. The algorithm retains the strengths of the Hybrid algorithm and protects the privacy of biological data effectively, while overcoming Hybrid's tendency to generate a large number of three-sequence clusters. It publishes more accurate data sets and shortens the time needed to anonymize incremental streaming biological data, which greatly enhances the practicality of the published biological data sets.

Claims (1)

1. A privacy protection algorithm for the incremental publication of streaming biological data, characterized in that: Input: the streaming biological data set S; the published data set A; the delay constraint δ; the average distance AD of the published data set A; and the clustering result of A, consisting of m clusters (n_1, n_2, ..., n_m), where no two clusters n_i and n_j contain the same tuple, every cluster n_i contains 2 or 3 tuples, and every tuple in A belongs to one of the m clusters. Output: the updated anonymous table A'. The specific steps are as follows:
1) First, create an empty set Set_w to hold the data waiting to be published;
2) While the data set S is not empty, take the tuple s with the smallest ts value out of S and insert it into Set_w, where ts is the time at which a tuple reaches the collector;
3) If the number of tuples in Set_w is not greater than δ, go to step 4); if the number of tuples in Set_w is greater than δ, go to step 6);
4) Find the sequence r in Set_w that is nearest to tuple s, and compute the distance dist(r, s) between r and s;
5) If dist(r, s) is less than the average distance AD of the published data set A, take r and s out of Set_w, put the cluster they form into A, generalize r and s, and then go to step 7); otherwise, go directly to step 7);
6) Take the tuple a with the smallest ts out of Set_w, find the sequence b in A that is nearest to a, and add a to the cluster n_i that contains b. Handle the resulting cluster n_i according to its size: if n_i now contains 3 tuples, generalize n_i; if n_i contains 4 tuples, split n_i into two clusters g and h of equal size such that the sum of the in-cluster distances of the two groups is minimal, and then generalize g and h;
7) Return to step 2) until the streaming biological data set S is empty;
8) Obtain the updated anonymous table A'.
CN201610876548.1A 2016-10-08 2016-10-08 A privacy protection method for the incremental publication of streaming biological data Active CN106570412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610876548.1A CN106570412B (en) 2016-10-08 2016-10-08 A privacy protection method for the incremental publication of streaming biological data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610876548.1A CN106570412B (en) 2016-10-08 2016-10-08 A privacy protection method for the incremental publication of streaming biological data

Publications (2)

Publication Number Publication Date
CN106570412A (en) 2017-04-19
CN106570412B (en) 2018-10-30

Family

ID=58532585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610876548.1A Active CN106570412B (en) 2016-10-08 2016-10-08 A privacy protection method for the incremental publication of streaming biological data

Country Status (1)

Country Link
CN (1) CN106570412B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664807A (en) * 2018-04-03 2018-10-16 Xuzhou Medical University Differentially private DNA motif discovery method based on random sampling and motif compression
CN110704865A (en) * 2019-09-02 2020-01-17 Beijing Jiaotong University Privacy protection method based on dynamic graph data release

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130198194A1 (en) * 2012-01-31 2013-08-01 International Business Machines Corporation Method and system for preserving privacy of a dataset
CN104135362A (en) * 2014-07-21 2014-11-05 Nanjing University Availability computing method of data published based on differential privacy
CN105512566A (en) * 2015-11-27 2016-04-20 University of Electronic Science and Technology of China Health data privacy protection method based on K-anonymity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《小型微型计算机系统》 *
《通信学报》 *

Also Published As

Publication number Publication date
CN106570412B (en) 2018-10-30

Similar Documents

Publication Publication Date Title
Jansen et al. Building gene regulatory networks from scATAC-seq and scRNA-seq using linked self organizing maps
Simmons et al. Realizing privacy preserving genome-wide association studies
Schmickl et al. Arabidopsis hybrid speciation processes
Wolfson et al. DataSHIELD: resolving a conflict in contemporary bioscience—performing a pooled analysis of individual-level data without sharing the data
Zhang et al. Fireworks algorithm with enhanced fireworks interaction
WO2012176923A1 (en) Anonymization index determination device and method, and anonymization process execution system and method
Funkhouser et al. Evidence for transcriptome-wide RNA editing among Sus scrofa PRE-1 SINE elements
Weiner et al. Spatial ecology of territorial populations
Das et al. OnlineCall: fast online parameter estimation and base calling for illumina's next-generation sequencing
Sankoff et al. Fractionation, rearrangement and subgenome dominance
CN106570412A (en) Privacy protection algorithm for incremental distribution of stream-type biologic data
Fujisawa et al. PCA-based unsupervised feature extraction for gene expression analysis of COVID-19 patients
Pust et al. Bacterial low-abundant taxa are key determinants of a healthy airway metagenome in the early years of human life
Conant Rapid reorganization of the transcriptional regulatory network after genome duplication in yeast
Dama et al. Non-Coding RNAs as Prognostic Biomarkers: A miRNA signature specific for aggressive early-stage lung adenocarcinomas
Li et al. Integrative analysis of the lncRNA and mRNA transcriptome revealed genes and pathways potentially involved in the anther abortion of cotton (Gossypium hirsutum l.)
Rui et al. Early warning of hand, foot, and mouth disease transmission: a modeling study in mainland, China
CN106570348A (en) Streaming biodata privacy protection increment publishing algorithm with inhibition mechanism
Wooten et al. Data-driven math model of FLT3-ITD acute myeloid leukemia reveals potential therapeutic targets
Pfaffelhuber et al. Muller’s ratchet with compensatory mutations
Huang et al. A memetic gravitation search algorithm for solving DNA fragment assembly problems
CN109801676B (en) Method and device for evaluating activation effect of compound on gene pathway
Li et al. Risk prediction: methods, challenges, and opportunities
CN109478381B (en) Secret calculation system, secret calculation device, secret calculation method, and program
Chen et al. Quantifying the Landscape and Transition Paths for Proliferation–Quiescence Fate Decisions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant