CN108446568B - Histogram data publishing method for trend analysis differential privacy protection - Google Patents


Publication number
CN108446568B
CN108446568B
Authority
CN
China
Prior art keywords
histogram
sequence
data
clustering
outlier
Prior art date
Legal status
Active
Application number
CN201810228544.1A
Other languages
Chinese (zh)
Other versions
CN108446568A (en)
Inventor
高岭
杨旭东
罗昭
毛勇
孙骞
王帆
Current Assignee
Northwestern University
Original Assignee
Northwestern University
Priority date
Filing date
Publication date
Application filed by Northwestern University filed Critical Northwestern University
Priority to CN201810228544.1A priority Critical patent/CN108446568B/en
Publication of CN108446568A publication Critical patent/CN108446568A/en
Application granted granted Critical
Publication of CN108446568B publication Critical patent/CN108446568B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques


Abstract

A method for judging the trend of a signal sequence is introduced into the detection of abnormal histogram distributions: a large number of outliers makes the data distribution fluctuate strongly and reduces its stationarity, so the histogram bucket-count distribution is treated as a continuous digital signal and outlier detection is performed from this viewpoint. Meanwhile, because the clustering objective function of traditional methods can produce a large number of outliers, an outlier balancing constraint and a similarity penalty constraint are added to balance the influence of similar buckets and outlier buckets on clustering, reducing the occurrence of outliers; the remaining outlier data are then micro-clustered based on outlier similarity.

Description

Histogram data publishing method for trend analysis differential privacy protection
Technical Field
The invention belongs to the technical field of computer information security, and particularly relates to a histogram data publishing method for trend analysis differential privacy protection.
Background
The histogram-based data publishing method is the most common data publishing mode at present: it vividly displays the data distribution, and the statistical result provides a basis for answering counting queries. A histogram divides the data table into several disjoint subsets according to the distinct values of one or more attributes, forming independent buckets, and identifies each subset (bucket) with a statistical value; the width of each bucket represents a query range, which enables range counting queries.
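For illustration, a minimal Python sketch of such a bucketed histogram answering a range counting query follows; the attribute (age), the bucket width, and the value bounds are assumptions made for the example.

```python
from collections import Counter

def build_histogram(ages, width=10, lo=0, hi=100):
    """Partition records into disjoint equal-width buckets; summing a run
    of bucket counts answers a range counting query over that range."""
    buckets = Counter()
    for a in ages:
        if lo <= a < hi:
            buckets[(a - lo) // width] += 1
    return [buckets.get(k, 0) for k in range((hi - lo) // width)]

# A range count over [20, 40) is the sum of buckets 2 and 3
h = build_histogram([23, 25, 37, 41, 68])
assert sum(h[2:4]) == 3
```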
In the publishing process of the histogram, in order to satisfy differential privacy and improve the usability of the published data, a re-partitioning idea is usually adopted: adjacent similar buckets are merged by clustering, the histogram is reconstructed, and Laplace noise is added for differential privacy protection. However, as the number of outliers in the raw histogram bucket counts increases, the global sensitivity grows, the merging probability between adjacent buckets drops, and the privacy-protection effect deteriorates; histogram reconstruction has little privacy-protection effect on such points. To solve this problem, analyzing and processing outliers is necessary.
Conventionally, an outlier is defined as an exceptionally small or large value in the data set, i.e., the absolute difference between the outlier and the other values is much larger than the absolute differences among the normal values. However, this approach is time-consuming, and the accuracy of the conventional outlier judgment is not high enough. Judging outliers efficiently and accurately is of great significance for the clustering of histogram data. Meanwhile, no prior work has provided a good method for processing the identified outliers, and how to cluster outliers is also a key problem for the differential-privacy utility of histogram data publishing.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a histogram data publishing method for trend-analysis differential privacy protection, which introduces a method for judging the trend of a signal sequence into the detection of abnormal histogram distributions: a large number of outliers makes the data distribution fluctuate strongly and reduces its stationarity, so the histogram bucket-count distribution is treated as a continuous digital signal and outlier detection is performed from this viewpoint. Meanwhile, because the clustering objective function of traditional methods can produce a large number of outliers, an outlier balancing constraint and a similarity penalty constraint are added to balance the influence of similar buckets and outlier buckets on clustering, reducing the occurrence of outliers; the remaining outlier data are then micro-clustered based on outlier similarity.
Definition of differential privacy: given a histogram publishing method A, A satisfies ε-differential privacy if, for neighboring histograms H and H′ and any possible output Ĥ, the following inequality holds:
Pr[A(H) = Ĥ] ≤ exp(ε) × Pr[A(H′) = Ĥ]
Global sensitivity: let f be a query; the global sensitivity of f is
Δf = max over neighboring H, H′ of ‖f(H) − f(H′)‖₁
Implementation of differential privacy: the Laplace mechanism
A(H) = f(H) + Lap(Δf/ε),
and the exponential mechanism, which outputs a result r with probability proportional to
exp( ε·q(H, r) / (2Δq) ),
where q is the score function and Δq its global sensitivity.
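As an illustration of the Laplace mechanism just defined, a minimal Python sketch follows; the function name is illustrative, and the default unit sensitivity reflects the standard fact that a histogram counting query has global sensitivity 1.

```python
import numpy as np

def laplace_mechanism(counts, epsilon, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity/epsilon to each bucket
    count, so that the released counts satisfy epsilon-differential
    privacy for the given global sensitivity."""
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=len(counts))
    return np.asarray(counts, dtype=float) + noise

# Example: protect a small histogram with privacy budget epsilon = 0.5
noisy = laplace_mechanism([12, 30, 7, 45], epsilon=0.5)
```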
First, the histogram is sorted through the detrending ordering process; then the histogram is clustered using a clustering algorithm with balanced outlier and similarity constraints; finally, Laplace noise is added to complete the differential privacy protection.
In order to achieve the purpose, the invention adopts the technical scheme that:
A histogram data publishing method for detrending-analysis differential privacy protection comprises the following steps:
1. Detrending histogram processing:
First, the overall orderedness of the histogram bucket counts is judged through the detrended correlation coefficient; second, the disordered histogram sequence is divided into several subsequences and adjusted into order: by comparing the detrended correlation coefficient with that of the minimum ordered subsequence, it is judged whether each subsequence conforms to the ascending or descending order of the whole sequence, and the subsequences that do not are adjusted, yielding an overall ordered histogram sequence;
2. Histogram sequence detrending analysis:
Detrended cross-correlation analysis (DCCA) is an effective model for measuring the cross-correlation of non-stationary time series. First, the profiles of the two time series are computed; second, the data are divided into [N/s] disjoint intervals of length s, characterized by small jumps between adjacent intervals; the local trend is fitted and removed with a quadratic least-squares fitting function, and the local detrended covariance function is computed; repeating this process yields fluctuation functions for different scales s. Finally, the fluctuation-function data indicate whether the sequences are cross-correlated, anti-correlated, or uncorrelated. Borrowing this idea, detrending analysis is adopted to judge the orderedness of the histogram sequence.
From the above analysis, the orderedness of a sequence can be judged by detrending analysis of its overall dispersion. First, the histogram counts are regarded as an ordered sequence and subjected to an overall detrended correlation analysis; the two time-based series are defined as the front and back halves of the bucket-count sequence. The judgment is made by the fluctuation coefficient α: α = 0.5 indicates that the series are uncorrelated, an independent random process in which the current state does not affect the future state; α < 0.5 indicates that the time series is anti-correlated, i.e., the trends of a time period and the next are opposite. The overall sequence dispersion analysis, i.e., the judgment of whether the sequence is ordered, is obtained by comparing the scale exponents of the fluctuation functions of the subsequences. When the overall fluctuation exponent of the data in the sequence is greater than the threshold, the dispersion of the histogram sequence is considered too large and the sequence needs adjustment, i.e., it does not satisfy the ordering constraint. From the anti-correlation that holds when the scale exponent of the fluctuation function is below 0.5 in detrending analysis, the histogram detrended correlation threshold θ is derived:
θ < 0.5^Ns,
where Ns = N/L, N is the total data length, and L is the minimum ordered sequence length;
3. histogram detrending ordered adjustment:
and obtaining the integral ordering degree of the histogram sequence according to trend-removing analysis of the integral data, adjusting the histogram sequence which does not meet the ordering constraint, and defining a minimum ordered sequence, namely, an initial most ordered sequence, namely, an ordered, ascending or descending sequence with the shortest length in the counting of the original buckets.
By dividing the histogram sequence into a plurality of sub-sequences according to the minimum ordered sequence, the sub-sequences in all the histogram sequences are facilitated and compared with the minimum ordered sub-sequence, and the traversed sub-sequences are adjusted when an anti-correlation relationship exists between the two sequences, and the specific algorithm is as follows:
Sequence similarity is defined and measured by the product of the Euclidean distance between corresponding elements of the sequences and the difference of the sequence lengths, as follows:
dis(Ci, Cj) = sqrt( Σ from k = 1 to n of (h_i(k) − h_j(k))² ),
Wd = L(Ci) − L(Cj),
therefore the sequence similarity is Ops(Ci, Cj) = Wd · dis(Ci, Cj) (sketched below),
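A minimal Python sketch of this similarity measure, under the assumption that subsequences of unequal length are compared over their common prefix, since the source does not specify the alignment:

```python
import math

def sequence_similarity(ci, cj):
    """Ops(Ci, Cj) = Wd * dis(Ci, Cj): the length difference times the
    Euclidean distance over paired bucket counts (truncated to the
    shorter subsequence, an assumption)."""
    n = min(len(ci), len(cj))
    dis = math.sqrt(sum((a - b) ** 2 for a, b in zip(ci[:n], cj[:n])))
    wd = len(ci) - len(cj)
    return wd * dis
```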
Dispersion descending sampling sorting based on trend analysis. Input: original histogram H. Output: approximately ordered histogram sequence H′. The steps are as follows:
1) finding a minimum ordered sequence with the length of L from the original data;
2) dividing the original sequence according to L into Ns = N/L disjoint equal-length subsequences;
3) performing the same operation on the reverse order of the sequence, to prevent the loss of end information;
4) extracting the corresponding subsequences with probability proportional to [equation image in source];
5) calculating the profile of each histogram sequence,
Y(i) = Σ from j = 1 to i of ( H(j) − H̄ ),
where H(j) is the jth data point in the sequence and H̄ is the average of the sequence;
6) within each interval v, fitting the data by least squares; the detrended series is denoted Ys(i) and represents the difference between the original sequence and the fitted value, i.e., Ys(i) = Y(i) − Ps(i), where Ps(i) is the quadratic fitting function;
7) calculating the root mean square fluctuation of the cumulative detrended time series,
F(n) = sqrt( (1/N) Σ from i = 1 to N of Ys(i)² ).
In general, F(n) increases as n becomes larger; the slope of log F(n) against log n determines
8) the scaling exponent (self-affine parameter) α, a Hurst exponent; if the curve of the log-log plot is a straight line, the self-similarity can be expressed as F(n) ∝ n^α (see the sketch after this list);
9) reordering and adjusting the sequences whose fluctuation coefficient α ≤ 0.5;
10) repeating the above process in a loop until the final sequence is obtained;
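A minimal Python sketch of steps 5) through 9) on a bucket-count sequence follows; the choice of window sizes, the handling of leftover data at the end of the sequence, and the function names are assumptions made for the example.

```python
import numpy as np

def fluctuation_exponent(counts, window):
    """Estimate the scaling exponent alpha of a bucket-count sequence:
    build the cumulative profile Y(i), remove a quadratic local trend
    Ps(i) in each window, compute the fluctuation F(n), and read alpha
    off the slope of log F(n) against log n."""
    h = np.asarray(counts, dtype=float)
    profile = np.cumsum(h - h.mean())          # step 5): Y(i)
    logs_n, logs_f = [], []
    for n in (window, 2 * window, 4 * window):  # assumed window sizes
        if n > len(profile):
            break
        mean_sq = []
        for v in range(len(profile) // n):
            seg = profile[v * n:(v + 1) * n]
            x = np.arange(n)
            coef = np.polyfit(x, seg, 2)        # step 6): quadratic fit Ps
            resid = seg - np.polyval(coef, x)   # Ys = Y - Ps
            mean_sq.append(np.mean(resid ** 2))
        logs_n.append(np.log(n))
        logs_f.append(np.log(np.sqrt(np.mean(mean_sq))))  # step 7): F(n)
    # step 8): alpha is the slope of log F(n) versus log n
    return np.polyfit(logs_n, logs_f, 1)[0]

# alpha <= 0.5 flags a subsequence for reordering (step 9)
```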
4. Histogram clustering
All initially clustered buckets are clustered with a clustering algorithm; a similarity preference penalty constraint and an outlier influence balancing constraint are added to the clustering function to balance the influence of similar data and outlier data on clustering, and the outliers remaining after clustering are clustered a second time based on outlier similarity;
1) histogram bucket clustering
In order to balance the influence of outlier data in the clustering process, an outlier function is added to the fuzzy clustering objective function: the inflation of the objective function caused by outliers is corrected, and the oversized values caused by singular points are penalized. Conversely, the similarity preference penalty constrains the dispersion: to reduce the special case in which identical data all gather together, a similarity penalty factor is added, which grows with the amount of similar data in the set. The choice of objective function affects the clustering result, and adding a similarity penalty constraint and an outlier weighting constraint to the clustering objective function helps balance the clustering of outliers and of similar points, yielding a better clustering effect. The clustering objective function therefore consists of three parts: an error function, a similarity constraint, and an outlier contribution balance. Data partition: H is the original histogram bucket counts, and Ci is a partition of the original data set, i.e., C1 = {H1, H2, …, Hi}, C2 = {Hi+1, …, Hi+n}, …, Cj = {Hi+n+1, …}, where Hi ∈ H;
outlier contribution equalization constraints
First, since the influence of neighborhood data is a key measure of outlierness, the neighborhood of a bucket in the merged data set after clustering is judged by its relation to the neighboring data: histogram neighbors are histograms with a front-back adjacency after sorting, denoted S(Hi, Hj) = {Hi : |Hi − Hj| < ε}, where Hi ∈ H;
Histogram neighborhood set: the set of all histograms with a front-back adjacency that satisfy the histogram neighborhood relation,
N(Hj) = {Hj | S(Hj, Hi) = true, Hj ∈ H\Hi}, where S(Hj, Hi) indicates that histograms Hi and Hj are neighbors; to reduce the cost of bucket merging, the histogram neighborhood set is measured mainly by the difference of the bucket counts;
Histogram weighted distance: Hi ∈ H, X_Hi is the bucket count of the histogram, and wij is the histogram outlier contribution with 0 < wij < 1; the weighted distance between histogram buckets Hi and Hj is then
d(Hi, Hj) = wij · |X_Hi − X_Hj|,
where wij is derived from w − X̄, X̄ representing the histogram mean; the farther a count lies from the mean, the greater its outlier contribution, so a larger w evidences a greater degree of outlierness;
Histogram neighborhood distance: the neighborhood distance of a histogram bucket is the average of the weighted distances between the histogram and all histograms in its neighborhood, i.e.,
dist(Hj) = (1/|N(Hj)|) Σ over Hi ∈ N(Hj) of d(Hj, Hi),
where N(Hj) denotes the histograms in the neighborhood set.
To eliminate the influence of extreme values in the neighborhood on the neighborhood-distance computation, a trimmed-average method is adopted: the distances to the extreme values in the neighborhood are removed, and the average distance between the histogram and its remaining neighborhood is then computed:
[equation image: trimmed average neighborhood distance]
Histogram neighborhood outlier coefficient: the neighborhood distance of the histogram is compared with those of its neighbors to obtain the degree of deviation of the histogram in the neighborhood space, i.e., the local outlier coefficient of the histogram with respect to the partition Ci to be aggregated:
[equation image: local outlier coefficient]
The outlier balancing constraint is then:
[equation image: outlier balancing constraint],
where the weighting factor [equation image] is the ratio of the number of histograms in the neighborhood set to the total number of histograms (a sketch of these neighborhood quantities follows);
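To make the neighborhood definitions above concrete, a minimal Python sketch follows; it uses the unweighted count difference as the distance (i.e., it takes wij = 1) and plain rather than trimmed averages, both simplifying assumptions, and the names are illustrative.

```python
import numpy as np

def local_outlier_coefficients(counts, eps):
    """Neighbors of bucket j are buckets whose counts differ by less than
    eps; its neighborhood distance is the mean count difference to those
    neighbors; its outlier coefficient compares that distance with the
    average neighborhood distance of its neighbors."""
    h = np.asarray(counts, dtype=float)
    neigh = [[i for i in range(len(h)) if i != j and abs(h[i] - h[j]) < eps]
             for j in range(len(h))]
    nd = [np.mean([abs(h[j] - h[i]) for i in neigh[j]]) if neigh[j] else 0.0
          for j in range(len(h))]
    coeffs = []
    for j in range(len(h)):
        denom = np.mean([nd[i] for i in neigh[j]]) if neigh[j] else 1.0
        coeffs.append(nd[j] / denom if denom > 0 else 0.0)
    return coeffs
```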
similar preference penalty constraints
When a large number of similar or identical bucket-count values exist in a data set, a difference at any point can prevent the data from being clustered, leaving a large number of outliers. The concept of discrete entropy is applied to the similar buckets within a cluster to reduce the negative influence of similar data buckets on outliers, specifically as follows:
when the dispersion of the data Hi+1, …, Hj within a data set partition Ci = {H1, H2, …, Hn} is smaller than a certain value, i.e., |Hi − Hj| < δ, the buckets are similar; when Count(|Hi − Hj| < δ) > δ1 (δ1 a small value), the risk of the data becoming outliers increases significantly, and under the conventional adaptive clustering function Hi+1 would become an outlier. To reduce such cases, they are penalized with a dispersion constraint; the ratio between data count values can effectively indicate the dispersion in the data set:
[equation image: ratio of count values], where i > j and Xi, Xj ∈ Ci.
The information entropy can effectively indicate the degree of dispersion of the data, so the dispersion of a cluster is:
E(Ci) = − Σ over x of P(x) log P(x).
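A short Python sketch of the cluster dispersion measured by discrete entropy as used above; estimating P(x) by the relative frequency of each count value within the cluster is an assumption.

```python
import math
from collections import Counter

def cluster_entropy(bucket_counts):
    """Discrete entropy E = -sum P(x) log P(x) over the distinct count
    values of one cluster; higher entropy means higher dispersion."""
    total = len(bucket_counts)
    freq = Counter(bucket_counts)
    return -sum((c / total) * math.log(c / total) for c in freq.values())
```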
The larger the ratio, the lower the similarity; and because the clustering objective function is to be minimized, the similarity preference penalty constraint can only act as a penalty in the positive direction;
clustering objective function design
Through the above analysis, in order to reduce the appearance of outliers and balance the privacy and availability of the published data, the objective function is designed as:
[equation image: clustering objective function combining the clustering error, the similarity penalty, and the outlier balance],
where λ1 and λ2 are adaptive weight coefficients; the objective function should ensure that the formed clusters not only have the smallest within-class distance but also the smallest ability to generate outliers, measured by the outlier contribution rate;
2) Histogram outlier data micro-clustering
For the outlier data that still exist after the outlier balancing is added, the histogram is micro-clustered into Ci according to the similarity of the outlier data, and data are published after noise is added to the clusters formed. First, to measure the difference between different outlier data sets, an outlier partition similarity OPS is introduced. If Ck ⊆ X, k = 1, 2, …, K, K < n, satisfies: 1) Ck ≠ ∅; 2) Ck1 ∩ Ck2 = ∅ (k1 ≠ k2); 3) C1 ∪ C2 ∪ … ∪ CK = H, then {C1, C2, …, CK} constitutes a partition of X. Cd = {XOd, X − XOd} denotes a data partition containing the outlier data set XOd; similarly, Cs = {XOs, X − XOs} denotes another outlier partition. Their similarity OPS can be expressed as:
[equation image: outlier partition similarity ops(Cd, Cs)]
where fsup is the support degree of Ck, fcon is the confidence degree of Ck, finc is the inclusion degree of Ck, and cis = card(XOd ∩ XOs), card denoting the cardinality of a set. A larger ops(Cs, Cd) indicates that the overall trends of the outlier sets XOd and XOs are more consistent. The support, inclusion, and confidence degrees express closeness from different angles: a larger support degree means XOs and XOd are more similar overall, the inclusion degree expresses how correctly XOs reflects XOd, and the confidence degree expresses how correct XOs itself is. Obviously 0 ≤ ops(Cd, Cs) ≤ 1, and ops(Cd, Cs) = 1 if and only if XOd = XOs.
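Since the exact combination of fsup, finc, and fcon appears only as an equation image in the source, the following Python sketch substitutes plain set-overlap ratios for the three degrees (an assumption); it does preserve the stated property that the similarity equals 1 if and only if XOd = XOs.

```python
def partition_similarity(xo_d, xo_s):
    """Approximate ops(Cd, Cs) from the overlap of two outlier sets,
    averaging stand-ins for the support, inclusion, and confidence
    degrees built from c = card(XOd intersect XOs)."""
    d, s = set(xo_d), set(xo_s)
    if not d and not s:
        return 1.0
    c = len(d & s)
    f_sup = c / len(d | s)             # overall overlap
    f_inc = c / len(d) if d else 0.0   # how much of XOd is reflected
    f_con = c / len(s) if s else 0.0   # how correct XOs itself is
    return (f_sup + f_inc + f_con) / 3.0
```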
5. Clustering algorithm with histogram similarity constraints and outlier balancing:
Given the sorted histograms, greedy partitioning from left to right without a prescribed number of groups is the basic idea of the clustering. The only criterion during clustering is the clustering objective function: clustering is the process of selecting the minimum objective function, computed in three cases:
1) when H is merged with the current cluster, the objective function is
[equation image];
2) when H is not merged with the current cluster but with the next cluster, the objective function is
[equation image];
3) when H is clustered separately and forms outlier data, err(Ci ∪ H) = 0 because H is clustered alone and there is no reconstruction error, so the objective function is
[equation image];
Clustering mainly decides whether H joins a cluster according to the size of the current objective function, as sketched below.
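A minimal Python sketch of this greedy left-to-right selection follows; the objective function is passed in as a parameter since its exact form is given above only as equation images, and folding case 2) into "open a new cluster" is a simplifying assumption (the source's case 2) defers H to the next cluster).

```python
def greedy_cluster(buckets, objective):
    """Scan bucket counts left to right; at each step keep whichever of
    'merge H into the current cluster' or 'start a new cluster with H'
    (which also covers a singleton outlier cluster) minimizes the
    supplied objective over the partition built so far."""
    clusters = []
    for h in buckets:
        candidates = [clusters + [[h]]]  # new cluster / singleton outlier
        if clusters:
            candidates.append(clusters[:-1] + [clusters[-1] + [h]])  # merge
        clusters = min(candidates, key=objective)
    return clusters

# Toy objective: within-cluster ranges plus a per-cluster penalty,
# standing in for the error / similarity / outlier terms above.
parts = greedy_cluster([3, 4, 5, 40, 41, 200],
                       lambda cs: sum(max(c) - min(c) for c in cs) + 10 * len(cs))
# parts == [[3, 4, 5], [40, 41], [200]]
```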
Algorithm: clustering algorithm for histogram constraint equalization [pseudocode given in the source as images].
Finally, Laplace noise is added to the merged clusters to realize the final differential privacy.
The invention has the beneficial effects that:
the method for judging the signal sequence trend is introduced into the judgment of abnormal distribution of the histogram, a large number of outliers can cause the problem that the data distribution has high volatility and low stability, and the counting distribution condition of the histogram barrel is regarded as continuous digital signals from the angle to perform data outliers. Meanwhile, aiming at a clustering target function which can cause a large number of outliers in the traditional method, an outlier balancing constraint and a similar punishment constraint are added to balance the influence of similar bucket and outlier bucket data on clustering, so that the occurrence of outliers is reduced; and carrying out outlier data micro-clustering on the outlier data based on the outlier similarity.
Drawings
FIG. 1 is a diagram of a differential privacy protection algorithm architecture;
FIG. 2 is a clustering research framework for outlier balancing and constraints;
FIG. 3 is a differential privacy preserving data distribution algorithm;
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Based on the invention, the method mainly comprises the following implementation steps:
1) detrending histogram ordering process;
2) adaptive histogram clustering with outlier balancing and constraints;
3) adding noise to the clustered data;
4) obtaining the histogram after differential privacy protection.
Detrending histogram processing
An ordered histogram sequence has a very important influence on the clustering of histogram buckets: publishing histograms whose similar bucket counts are clustered helps reduce the reconstruction error. The advantages of trend analysis are mainly that, on one hand, the amount of computation is reduced relative to discrete analysis of computed differences and, on the other hand, adjustment time is saved in the ordering process. Histogram ordering therefore borrows the idea of trend analysis. First, the overall orderedness of the histogram bucket counts is judged through the detrended correlation coefficient; second, the disordered histogram sequence is divided into several subsequences and adjusted: by comparing the detrended correlation coefficient with that of the minimum ordered subsequence, it is judged whether each subsequence conforms to the ascending or descending order of the whole sequence, and the unsatisfying subsequences are adjusted. Finally, an overall ordered histogram sequence is obtained.
Histogram series detrending analysis
Detrended cross-correlation analysis (DCCA) is an effective model for measuring the cross-correlation of non-stationary time series. First, the profiles of the two time series are calculated; second, the data are divided into [N/s] disjoint intervals of length s, characterized by small jumps between adjacent intervals; the local trend is fitted and removed with a quadratic least-squares fitting function, and the local detrended covariance function is computed; repeating this process yields the fluctuation functions for different scales s. Finally, the fluctuation-function data indicate whether the sequences are cross-correlated, anti-correlated, or uncorrelated. Borrowing this idea, detrending analysis is adopted to judge the orderedness of the histogram sequence.
From the above analysis, the orderedness of a sequence can be judged by detrending analysis of its overall dispersion. The histogram counts are first regarded as an ordered sequence, with an overall detrended correlation analysis performed on the counts. The two time-based series are defined as the front and back halves of the bucket-count sequence and judged by the fluctuation coefficient: α = 0.5 means the series are uncorrelated, an independent random process in which the current state does not affect the future state, such as white noise; α < 0.5 means the time series is anti-correlated, i.e., the trends of a time period and the next are opposite. For example, 3, 4, 5 and 3, 4, 6 are uncorrelated; 4, 3, 6 and 3, 4, 5 are also uncorrelated. The overall sequence dispersion analysis, i.e., the judgment of whether the sequence is ordered, is obtained by comparing the scale exponents of the fluctuation functions of the subsequences. When the overall fluctuation exponent of the data in the sequence is greater than the threshold, the dispersion of the histogram sequence is considered too large and the sequence needs adjustment, i.e., it does not satisfy the ordering constraint. From the anti-correlation that holds when the scale exponent of the fluctuation function is below 0.5 in detrending analysis, the histogram detrended correlation threshold θ is derived:
θ < 0.5^Ns,
where Ns = N/L, N is the total data length, and L is the minimum ordered sequence length.
Histogram detrending ordered adjustment
The overall orderedness of the histogram sequence is obtained from the detrending analysis of the overall data, and the histogram sequences that do not satisfy the ordering constraint are adjusted.
The minimum ordered sequence (the initial most-ordered sequence) is defined as the shortest ordered run, ascending or descending, in the original bucket counts.
The histogram sequence is divided into several subsequences according to the minimum ordered sequence; all subsequences in the histogram sequence are traversed and compared with the minimum ordered subsequence, and a traversed subsequence is adjusted when an anti-correlation relationship exists between the two sequences. The specific algorithm is as follows:
Sequence similarity is defined and measured by the product of the Euclidean distance between corresponding elements of the sequences and the difference of the sequence lengths, as follows:
dis(Ci, Cj) = sqrt( Σ from k = 1 to n of (h_i(k) − h_j(k))² ),
Wd = L(Ci) − L(Cj),
therefore the sequence similarity is Ops(Ci, Cj) = Wd · dis(Ci, Cj).
Algorithm 1: dispersion descending sampling sorting based on trend analysis.
Input: original histogram H.
Output: approximately ordered histogram sequence H′.
① finding the minimum ordered sequence with length L from the original data;
② dividing the original sequence according to L into Ns = N/L disjoint equal-length subsequences;
③ performing the same operation on the reverse order of the sequence, to prevent the loss of end information;
④ extracting the corresponding subsequences with probability proportional to [equation image in source];
⑤ calculating the profile of each histogram sequence,
Y(i) = Σ from j = 1 to i of ( H(j) − H̄ ),
where H(j) is the jth data point in the sequence and H̄ is the average of the sequence;
⑥ within each interval v, fitting the data by least squares; the detrended time series is denoted Ys(i) and represents the difference between the original sequence and the fitted value, i.e., Ys(i) = Y(i) − Ps(i), where Ps(i) is the quadratic fitting function;
⑦ calculating the root mean square fluctuation of the cumulative detrended time series,
F(n) = sqrt( (1/N) Σ from i = 1 to N of Ys(i)² ).
In general, F(n) increases as n increases; the slope of log F(n) against log n determines the scaling exponent (self-affine parameter) α, which is a Hurst exponent. When the curve of the log-log plot is a straight line, the self-similarity can be expressed by F(n) ∝ n^α;
⑧ reordering and adjusting the sequences whose fluctuation coefficient α ≤ 0.5;
⑨ repeating the above process in a loop until the final sequence is obtained.
Histogram clustering
Traditional bucket clustering only measures the error after clustering and does not balance the influence of outliers on privacy. The clustering algorithm adopted here first clusters all initially clustered buckets with a clustering function to which a similarity preference penalty constraint and an outlier influence balancing constraint are added, balancing the influence of similar data and outlier data on clustering; it then clusters the outliers remaining after clustering based on the outlier similarity, achieving a better privacy-protection effect.
1) Histogram bucket clustering
In order to balance the influence of outlier data in the clustering process, an outlier function is added to the fuzzy clustering objective function: the inflation of the objective function caused by outliers is corrected, and the oversized values caused by singular values are penalized (conversely, the similarity preference penalty constrains the dispersion: to reduce the special case in which identical data all gather together, a similarity penalty factor is added, which grows with the amount of similar data in the set).
The choice of objective function affects the clustering effect. Adding a similarity penalty constraint and an outlier weighting constraint to the clustering objective function helps balance the clustering results of outliers and similar points, so as to obtain a better clustering effect. The current clustering objective function consists of three parts: an error function, a similarity constraint, and an outlier contribution balance.
Data partition: H is the original histogram bucket counts, Ci is a partition of the original data set, i.e., C1 = {H1, H2, …, Hi}, C2 = {Hi+1, …, Hi+n}, …, Cj = {Hi+n+1, …}, where Hi ∈ H.
Outlier contribution equalization constraints
Because a large difference may result in too many outliers and privacy leakage, an outlier contribution balancing constraint is proposed to balance the relationship between outliers and reconstruction errors.
First, since the influence of neighborhood data is a key measure of outlierness, the influence of the clustered data on the bucket-merged data set is judged by its relation to the neighborhood data.
Histogram neighborhood: histogram neighbors are histograms with a front-back adjacency after sorting, denoted S(Hi, Hj) = {Hi : |Hi − Hj| < ε}, where Hi ∈ H. Histogram neighborhood set: the set of all histograms with a front-back adjacency that satisfy the histogram neighborhood relation,
N(Hj) = {Hj | S(Hj, Hi) = true, Hj ∈ H\Hi}, where S(Hj, Hi) represents the neighborhood relation of histograms Hi and Hj; to reduce the expense of bucket merging, the histogram neighborhood set is measured mainly by the difference of the bucket counts.
Histogram weighted distance: Hi ∈ H, X_Hi is the bucket count of the histogram, and wij is the histogram outlier contribution with 0 < wij < 1; the weighted distance between histogram buckets Hi and Hj is then
d(Hi, Hj) = wij · |X_Hi − X_Hj|,
where wij is derived from w − x′, x′ representing the mean of the histogram; the farther a count lies from the mean, the greater its outlier contribution, from which it can be seen that a larger w evidences a greater degree of outlierness.
Histogram neighborhood distance: the neighborhood distance of a histogram bucket is the average of the weighted distances between the histogram and all histograms in its neighborhood, i.e.,
dist(Hj) = (1/|N(Hj)|) Σ over Hi ∈ N(Hj) of d(Hj, Hi),
where N(Hj) denotes the histograms in the neighborhood set.
To eliminate the influence of extreme values in the neighborhood on the neighborhood-distance computation, a trimmed-average method is adopted: the distances to the extreme values in the neighborhood are removed, and the average distance between the histogram and its remaining neighborhood is then computed:
[equation image: trimmed average neighborhood distance]
Histogram neighborhood outlier coefficient: the neighborhood distance of the histogram is compared with those of its neighbors to obtain the degree of deviation of the histogram in the neighborhood space, i.e., the local outlier coefficient of the histogram with respect to the partition Ci to be aggregated:
[equation image: local outlier coefficient]
Then the outlier balance constraint is:
[equation image: outlier balancing constraint],
where the weighting factor [equation image] is the ratio of the number of histograms in the neighborhood set to the total number of histograms.
Similar preference penalty constraints
When there are a large number of similar or identical bucket-count values in a data set, a difference at any one point can cause the data to be non-clusterable, resulting in a large number of outliers. The concept of discrete entropy is applied to the similar buckets within a cluster to reduce the negative influence of similar data buckets on outliers, specifically as follows:
when the dispersion of the data Hi+1, …, Hj within a data set partition Ci = {H1, H2, …, Hn} is smaller than a certain value, i.e., |Hi − Hj| < δ, the buckets are similar; when Count(|Hi − Hj| < δ) > δ1 (δ1 a small value), the risk of the data becoming outliers increases significantly, and under the conventional adaptive clustering function Hi+1 would become an outlier. So, to reduce this, such situations are penalized with a dispersion constraint; the ratio between the data count values can effectively indicate the dispersion in the data set:
[equation image: ratio of count values], where i > j and Xi, Xj ∈ Ci.
The information entropy can effectively indicate the degree of dispersion of the data, so the dispersion of a cluster is:
E(Ci) = − Σ over x of P(x) log P(x).
The larger the ratio, the lower the similarity; because the clustering objective function is to be minimized, the similarity preference penalty constraint can only act as a penalty in the positive direction.
Clustering objective function design
Through the analysis, in order to reduce the appearance of outliers and balance the privacy and availability of data distribution, the objective function is designed as follows:
[equation image: clustering objective function combining the clustering error, the similarity penalty, and the outlier balance],
where λ1 and λ2 are adaptive weight coefficients.
The objective function should ensure that the formed clusters not only have the smallest within-class distance but also the smallest ability to generate outliers, measured by the outlier contribution rate.
Histogram outlier data micro-clustering
When outlier data still exist after the outlier balancing is added, privacy protection by adding noise alone is far from enough. Here, the histogram is micro-clustered into Ci according to the outlier data similarity, and data are published after noise is added to the clusters formed by the clustering.
First, to measure the difference between different outlier data sets, an outlier partition similarity OPS is introduced. If Ck ⊆ X, k = 1, 2, …, K, K < n, satisfies: 1) Ck ≠ ∅; 2) Ck1 ∩ Ck2 = ∅ (k1 ≠ k2); 3) C1 ∪ C2 ∪ … ∪ CK = H, then {C1, C2, …, CK} constitutes a partition of X. Cd = {XOd, X − XOd} denotes a data partition containing the outlier data set XOd; similarly, Cs = {XOs, X − XOs} denotes another outlier partition. Their similarity OPS can be expressed as:
[equation image: outlier partition similarity ops(Cd, Cs)]
where fsup is the support degree of Ck, fcon is the confidence degree of Ck, finc is the inclusion degree of Ck, and cis = card(XOd ∩ XOs), card denoting the cardinality of a set. A larger ops(Cs, Cd) indicates that the overall trends of the outlier sets XOd and XOs are more consistent. The support, inclusion, and confidence degrees express the closeness from different angles: a larger support degree means XOs and XOd are more similar overall, the inclusion degree expresses how correctly XOs reflects XOd, and the confidence degree expresses how correct XOs itself is. Obviously 0 ≤ ops(Cd, Cs) ≤ 1, and ops(Cd, Cs) = 1 if and only if XOd = XOs.
Clustering algorithm with histogram similarity constraints and outlier balancing:
Given the sorted histograms, greedy partitioning from left to right without a prescribed number of groups is the basic idea of the clustering. The only criterion during clustering is the clustering objective function; the clustering process is the process of selecting the minimum objective function, computed in three cases:
1) when H is merged with the current cluster, the objective function is
[equation image];
2) when H is not merged with the current cluster but with the next cluster, the objective function is
[equation image];
3) when H is clustered separately and forms outlier data, err(Ci ∪ H) = 0 because H is clustered alone and there is no reconstruction error, so the objective function is
[equation image];
Clustering mainly decides whether H joins a cluster according to the size of the current objective function.
Algorithm: clustering algorithm for histogram constraint equalization [pseudocode given in the source as images].
Finally, Laplace noise is added to the merged clusters to realize the final differential privacy.

Claims (1)

1. A histogram data publishing method for detrending-analysis differential privacy protection, characterized by comprising the following steps:
First, detrending histogram processing:
first, the detrended correlation coefficient is calculated to obtain the overall orderedness of the histogram sequence; second, the disordered histogram sequence is divided into several subsequences and adjusted into order: by comparing the detrended correlation coefficient with that of the minimum ordered subsequence, it is judged whether each subsequence conforms to the ascending or descending order of the whole sequence, and the subsequences that do not are adjusted, yielding an overall ordered histogram sequence;
Second, detrending analysis of the histogram sequence:
detrended cross-correlation analysis (DCCA) is an effective model for measuring the cross-correlation of non-stationary time series: first, the profiles of the two time series are computed; second, the n histograms are divided into [n/S] disjoint intervals of length S, characterized by small jumps between adjacent intervals; the local trend is fitted and removed with a quadratic least-squares fitting function, and the local detrended covariance function is computed; fluctuation functions corresponding to different scales are obtained; finally, the fluctuation-function data indicate whether the sequences are cross-correlated, anti-correlated, or uncorrelated, and, borrowing this idea, a detrending-analysis method is adopted to judge the relation between the histogram sequences; the orderedness of the sequence can thus be judged by detrending analysis of the overall dispersion, with the specific process as follows: first, the histogram counts are regarded as an ordered sequence, with the counts subjected to an overall detrended correlation analysis, and the two time-based series are defined as the front and back halves of the bucket-count sequence; the relation between the series is judged by the fluctuation coefficient α: when α = 0.5 the series are uncorrelated, an independent random process in which the current state does not affect the future state; when α < 0.5 the time series is anti-correlated, i.e., the trends of a time period and the next are opposite; the overall sequence dispersion analysis, i.e., the judgment of whether the sequence is ordered, is obtained by comparing the scale exponents of the fluctuation functions of the subsequences; when the overall fluctuation exponent of the data in the sequence is greater than the threshold, the dispersion of the histogram sequence is considered too large and the sequence needs adjustment, i.e., it does not satisfy the ordering constraint; from the anti-correlation that holds when the fluctuation-function scale exponent is below 0.5 in detrending analysis, the histogram detrended correlation threshold θ satisfies θ < 0.5^Ns, where Ns = N/L, N is the total data length, and L is the minimum ordered sequence length;
Third, detrending ordered adjustment of the histogram:
the overall orderedness of the histogram sequence is obtained from the detrending analysis of the overall data, and the histogram sequences that do not satisfy the ordering constraint are adjusted; the minimum ordered sequence (the initial most-ordered sequence) is defined as the shortest ordered run, ascending or descending, in the original bucket counts;
the histogram sequence is divided into several subsequences according to the minimum ordered sequence; all subsequences in the histogram sequence are traversed and compared with the minimum ordered subsequence, and a traversed subsequence is adjusted when an anti-correlation relationship exists between the two sequences, with the specific algorithm as follows:
defining the similarity between two subsequences, which is measured by the product of the Euclidean distance of elements in the sequence and the difference value of the length of the sequence, as follows:
dis(Ci, Cj) = sqrt( Σ from k = 1 to n of (h_i(k) − h_j(k))² ),
Wd = L(Ci) − L(Cj),
where Ci and Cj represent any two subsequences of the histogram sequence, h_i and h_j respectively represent the bucket counts in Ci and Cj, n is the number of buckets in the subsequence, and L(Ci), L(Cj) represent the lengths of subsequences Ci and Cj, i.e., the number of histograms included in each subsequence;
the similarity of the subsequences is then Ops(Ci, Cj) = Wd · dis(Ci, Cj),
Dispersion descending sampling sorting algorithm based on trend analysis.
Input: original histogram H.
Output: approximately ordered histogram sequence H′.
The method comprises the following steps:
1) finding a minimum ordered sequence with the length of L from the original data;
2) dividing the original sequence according to L into Ns = N/L unrelated equal-length subsequences;
3) in order to prevent the loss of the terminal information, the same operation is carried out on the reverse order of the sequence;
4) subsequence extraction is performed by a differential privacy method, specifically extracting the corresponding subsequences with probability proportional to [equation image in source];
5) calculating the profile of each histogram sequence,
Y(i) = Σ from j = 1 to i of ( H(j) − H̄ ),
where H(j) is the jth data point in the sequence and H̄ is the average of the sequence;
6) within each interval v, fitting the data by least squares; the time series of contour values in the interval after the trend is filtered out is denoted Y(i), and the difference between the original sequence and the fitted value is
error(i) = Y(i) − Ps(i), where Ps(i) is the quadratic fitting function;
7) calculating the root mean square fluctuation of the cumulative detrended time series,
F(n) = sqrt( (1/N) Σ from i = 1 to N of error(i)² );
8) in general, F(n) increases as n becomes larger; the slope of log2 F(n) against log2 n determines the scaling exponent (self-affine parameter) β, which is a Hurst exponent; if the curve of the log-log plot is a straight line, the self-similarity can be expressed by F(n) ∝ n^β;
9) reordering and adjusting the sequences whose fluctuation coefficient α ≤ 0.5;
10) repeating steps 1) to 9) in a loop until the final sequence is obtained;
Fourth, histogram clustering:
first, all initially clustered buckets are clustered with a clustering algorithm; a similarity preference penalty constraint and an outlier influence balancing constraint are added to the clustering function to balance the influence of similar data and outlier data on clustering, and the clustered outliers are clustered a second time by computing their outlier similarity;
1) histogram bucket clustering
in order to balance the influence of outlier data in the clustering process, an outlier function is added to the fuzzy clustering objective function: the inflation of the objective function caused by outliers is corrected, and the oversized values caused by singular values are penalized; conversely, the similarity preference penalty constrains the dispersion: to reduce the special case in which identical data all gather together, a similarity penalty factor is added, which grows with the amount of similar data contained in the set; the choice of the objective function affects the clustering effect, and adding a similarity penalty constraint and an outlier weighting constraint to the clustering objective function helps balance the clustering results of outliers and similar points, so as to obtain a better clustering effect; the current clustering objective function consists of three parts: an error function, a similarity constraint, and an outlier contribution balance; data partition: H is the original histogram bucket counts, Ci is a partition of the original data set, i.e., C1 = {H1, H2, …, Hi}, C2 = {Hi+1, …, Hi+n}, …, Cj = {Hi+n+1, …}, where Hi ∈ H;
outlier contribution equalization constraints
first, since the influence of adjacent data is a key measure of outlierness, the neighboring histogram data of the bucket-merged data set after data clustering are judged by their relation to the adjacent data: histogram neighbor data refers to histograms having a front-back adjacent relationship after sorting, a relation represented as S(Hi, Hj) = {Hi : |Hi − Hj| < ε}, where Hi ∈ H;
Histogram neighbor set: the set of all histograms with a front-back adjacent relationship that satisfy the adjacency relation,
N(Hj) = {Hj | S(Hj, Hi) = true, Hj ∈ H\Hi}, where S(Hj, Hi) represents that histograms Hi and Hj are neighbors; to reduce the overhead of bucket merging, the histogram neighbor set is measured mainly by the difference of the bucket counts;
histogram weighted distance: Hi ∈ H, X_Hi is the bucket count of the histogram, and wij is the histogram outlier contribution with 0 < wij < 1; the weighted distance between histogram buckets Hi and Hj is then
d(Hi, Hj) = wij · |X_Hi − X_Hj|,
where wij is derived from wij − x′, x′ representing the mean value in the histogram cluster; the farther a count lies from the mean, the greater its outlier contribution, and from the above it can be seen that a larger wij evidences a greater degree of outlierness;
histogram neighborhood distance: the neighborhood distance of a histogram bucket is the average of the weighted distances between the histogram and all histograms in its neighborhood, i.e.,
dist(Hj) = (1/|N(Hj)|) Σ over Hi ∈ N(Hj) of d(Hj, Hi),
where N(Hj) denotes the histograms in the neighborhood set;
to eliminate the influence of extreme values in the neighborhood on the neighborhood-distance computation, a trimmed-average method is adopted: the distances to the extreme values in the neighborhood are removed, and the average distance between the histogram and its remaining neighborhood is then computed:
[equation image: trimmed average neighborhood distance]
Histogram neighborhood outlier coefficient: the neighborhood distance of the histogram is compared with those of its neighbors to derive the degree of deviation of the histogram in the neighborhood space, i.e., the local outlier coefficient of the histogram with respect to the partition Ci to be aggregated:
[equation image: local outlier coefficient]
then the outlier balance constraint is:
[equation image: outlier balancing constraint],
where the weighting factor [equation image] is the ratio of the number of histograms in the neighborhood set to the total number of histograms;
similar preference penalty constraints
When a large number of similar or identical bucket-count values exist in a data set, a difference at any point can prevent the data from being clustered, leaving a large number of outliers; the concept of discrete entropy is applied to the similar buckets within a cluster to reduce the negative influence of similar data buckets on outliers, specifically as follows:
when the dispersion of the data Hi+1, …, Hj within a data set partition Ci = {H1, H2, …, Hn} is smaller than a certain value, i.e., |Hi − Hj| < δ, the buckets are similar; when Count(|Hi − Hj| < δ) > δ1, where δ1 represents a small value, the risk of outliers generated by the data increases significantly, and under the conventional adaptive clustering function Hi+1 would become an outlier; to reduce such cases, they are penalized with a dispersion constraint, and the ratio between data count values can effectively indicate the dispersion in the data set:
[equation image: ratio of count values], where i > j and X_Hi, X_Hj ∈ Ci;
The information entropy can effectively indicate the discrete degree of the data, so the discrete degree of the clusters is as follows:
E(Ci) = − Σ over x of P(x) log P(x);
the larger the ratio, the lower the similarity; because the clustering objective function is to be minimized, the similarity preference penalty constraint can only act as a penalty in the positive direction;
clustering objective function design
Through the analysis, in order to reduce the appearance of outliers and balance the privacy and availability of data distribution, the objective function is designed as follows:
[equation image: clustering objective function],
where err represents the clustering error, lap represents the Laplace noise, and λ1, λ2 are adaptive weight coefficients; the objective function requires that the formed cluster set have both the minimum within-class distance and the minimum ability to generate outliers for the formed set, measured by the outlier contribution rate;
2) histogram outlier data micro-clustering
for the outlier data still existing after the outlier balancing is added, the histogram is micro-clustered into Ci according to the similarity of the outlier data, and data are published after noise is added to the clusters formed by the clustering; first, to measure the difference between different outlier data sets, an outlier partition similarity ops′ is introduced; if Ck ⊆ X, k = 1, 2, …, K, K < n, satisfies: 1) Ck ≠ ∅; 2) Ck1 ∩ Ck2 = ∅ (k1 ≠ k2); 3) C1 ∪ C2 ∪ … ∪ CK = H, then {C1, C2, …, CK} forms a partition of X; Cd = {XOd, X − XOd} denotes a data partition containing the outlier data set XOd, and similarly Cs = {XOs, X − XOs} denotes another outlier partition; their similarity ops′ can be expressed as:
[equation image: outlier partition similarity ops′(Cd, Cs)]
where fsup is the support degree of Ck, fcon is the confidence degree of the two sets, i.e., the degree of data correctness, finc represents the inclusion degree of the two sets, measured by the amount of data they have in common, and cis = card(XOd ∩ XOs), card denoting the cardinality of a set; a larger ops′(Cs, Cd) indicates that the overall trends of the outlier sets XOd and XOs are more consistent; the support, inclusion, and confidence degrees express the closeness from different angles: a larger support degree means XOs and XOd are more similar overall, the inclusion degree expresses how correctly XOs reflects XOd, and the confidence degree expresses how correct XOs itself is; obviously 0 ≤ ops′(Cd, Cs) ≤ 1, and ops′(Cd, Cs) = 1 if and only if XOd = XOs;
Fifth, the clustering algorithm with histogram similarity constraints and outlier balancing:
given the sorted histograms, greedy partitioning from left to right without a prescribed number of groups is the basic idea of the clustering; only the clustering objective function is considered during clustering, and the clustering process is the process of selecting the minimum objective function, computed in three cases:
1) when H is merged with the current cluster, the objective function is
Figure FDA0002958556020000082
2) When H is not merged with the current cluster but with the next cluster, the objective function is
Figure FDA0002958556020000083
3) When H is clustered separately to form outlier data, err (C) is formed because H is clustered separately and there is no reconstruction erroriU H) is equal to 0, so the objective function is
[Formula image FDA0002958556020000091]
Each bucket count is examined in turn, and whether the current bucket is merged into a cluster is decided by comparing the magnitudes of the objective values above. The specific algorithm is as follows (a minimal code sketch of the greedy loop is given after the algorithm figures):
The algorithm: similarity-constrained, outlier-balanced clustering of the histogram
[Algorithm pseudocode images FDA0002958556020000092 and FDA0002958556020000101]
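(Since the pseudocode is preserved only as images, here is a minimal sketch of the greedy left-to-right loop described above. It simplifies the objective to reconstruction error plus expected Laplace error and adds a hypothetical outlier_penalty so that case 3 differs from case 2; both simplifications are assumptions.)

```python
def greedy_cluster(sorted_counts, epsilon, outlier_penalty=1.0):
    """Greedy left-to-right histogram clustering (sketch).

    For each bucket h, three objective values are compared -- merge h
    into the current cluster (case 1), close the current cluster and
    start a new one with h (case 2), or set h aside as outlier data
    (case 3, where err is 0 and only noise plus a hypothetical outlier
    penalty remain) -- and the minimum wins.
    """
    def err(cluster):
        mean = sum(cluster) / len(cluster)
        return sum((c - mean) ** 2 for c in cluster)

    if not sorted_counts:
        return [], []
    lap = 2.0 / epsilon ** 2            # expected squared Laplace error
    clusters, outliers = [], []
    current = [sorted_counts[0]]
    for h in sorted_counts[1:]:
        f_merge = err(current + [h]) + lap                 # case 1
        f_next = err(current) + 2 * lap                    # case 2
        f_alone = err(current) + lap + outlier_penalty     # case 3
        best = min(f_merge, f_next, f_alone)
        if best == f_merge:
            current.append(h)
        elif best == f_next:
            clusters.append(current)
            current = [h]
        else:
            outliers.append(h)
    clusters.append(current)
    return clusters, outliers
```

A larger outlier_penalty makes the loop prefer merging over isolating buckets, mirroring the outlier-balancing intent of the objective.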
Finally, Laplace noise is added to the merged clusters to achieve the final differential privacy guarantee.
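(A minimal sketch of this final step, assuming each merged cluster is published as its noisy mean: averaging |C| unit-sensitivity counts reduces the query sensitivity to 1/|C|, so Laplace noise of scale 1/(ε·|C|) suffices. The negative-value clamp is an assumption.)

```python
import numpy as np

def publish(clusters, epsilon):
    """Publish each cluster as a noisy mean (sketch).

    Merging |C| buckets and publishing their common mean lets the
    Laplace scale shrink to 1/(epsilon * |C|) for a count query of
    sensitivity 1 -- the usual accounting for grouped histogram release.
    """
    rng = np.random.default_rng()
    noisy = []
    for cluster in clusters:
        mean = float(np.mean(cluster))
        noise = rng.laplace(0.0, 1.0 / (epsilon * len(cluster)))
        noisy.extend([max(mean + noise, 0.0)] * len(cluster))  # clamp negatives
    return noisy
```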
CN201810228544.1A 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection Active CN108446568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810228544.1A CN108446568B (en) 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810228544.1A CN108446568B (en) 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection

Publications (2)

Publication Number Publication Date
CN108446568A CN108446568A (en) 2018-08-24
CN108446568B (en) 2021-04-13

Family

ID=63195920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810228544.1A Active CN108446568B (en) 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection

Country Status (1)

Country Link
CN (1) CN108446568B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492047A (en) * 2018-11-22 2019-03-19 河南财经政法大学 A kind of dissemination method of the accurate histogram based on difference privacy
CN109886030B (en) * 2019-01-29 2021-06-11 南京邮电大学 Privacy minimum exposure method facing service combination
CN110443063B (en) * 2019-06-26 2023-03-28 电子科技大学 Adaptive privacy-protecting federal deep learning method
CN110674830B (en) * 2019-12-06 2020-05-19 数字广东网络建设有限公司 Image privacy identification method and device, computer equipment and storage medium
CN111104922B (en) * 2019-12-30 2022-03-08 深圳纹通科技有限公司 Feature matching algorithm based on ordered sampling
CN111259442B (en) * 2020-01-15 2022-04-29 广西师范大学 Differential privacy protection method for decision tree under MapReduce framework
CN112035880B (en) * 2020-09-10 2024-02-09 辽宁工业大学 Track privacy protection service recommendation method based on preference perception
CN112307078B (en) * 2020-09-29 2022-04-15 安徽工业大学 Data stream differential privacy histogram publishing method based on sliding window
CN112560984B (en) * 2020-12-25 2022-04-05 广西师范大学 Differential privacy protection method for self-adaptive K-Nets clustering
CN112667712B (en) * 2020-12-31 2023-03-17 安徽工业大学 Grouped accurate histogram data publishing method based on differential privacy
CN115811726B (en) * 2023-01-20 2023-04-25 武汉大学 Privacy protection method and system for dynamic release of mobile terminal position data
CN116090916B (en) * 2023-04-10 2023-06-16 淄博海草软件服务有限公司 Early warning system for enterprise internal purchase fund accounting

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
CN107766740A (en) * 2017-10-20 2018-03-06 辽宁工业大学 A kind of data publication method based on difference secret protection under Spark frameworks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809408B (en) * 2015-05-08 2017-11-28 中国科学技术大学 A kind of histogram dissemination method based on difference privacy
US10628608B2 (en) * 2016-06-29 2020-04-21 Sap Se Anonymization techniques to protect data
CN106991335B (en) * 2017-02-20 2020-02-07 美达科林(南京)医药科技有限公司 Data publishing method based on differential privacy protection

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
CN107766740A (en) * 2017-10-20 2018-03-06 辽宁工业大学 A kind of data publication method based on difference secret protection under Spark frameworks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Collaborative Differentially Private Outlier Detection for Categorical Data; Hafiz Asif et al.; 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC); 2017-01-09; full text *
A data publishing method based on differential privacy; Ma Yuelei et al.; Journal of Beijing Information Science and Technology University; 2016-06-30; Vol. 31, No. 3; full text *

Also Published As

Publication number Publication date
CN108446568A (en) 2018-08-24

Similar Documents

Publication Publication Date Title
CN108446568B (en) Histogram data publishing method for trend analysis differential privacy protection
Ramsay et al. Applied functional data analysis: methods and case studies
Lafferty et al. Rodeo: Sparse, greedy nonparametric regression
CN109657891B (en) Load characteristic analysis method based on self-adaptive k-means + + algorithm
CN111737744B (en) Data publishing method based on differential privacy
CN111291822B (en) Equipment running state judging method based on fuzzy clustering optimal k value selection algorithm
CN108509843A (en) A kind of face identification method of the Huber constraint sparse codings based on weighting
CN110119540B (en) Multi-output gradient lifting tree modeling method for survival risk analysis
Sun et al. Nearest neighbors-based adaptive density peaks clustering with optimized allocation strategy
CN110909792A (en) Clustering analysis method based on improved K-means algorithm and new clustering effectiveness index
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field
Chu Age-distribution dynamics and aging indexes
CN116167078A (en) Differential privacy synthetic data publishing method based on maximum weight matching
Rattray A model-based distance for clustering
CN114692565A (en) Method, system and equipment for detecting quality of multi-characteristic-parameter high-speed board card in design stage
CN114549062A (en) Consumer preference prediction method, system, electronic equipment and storage product
CN111832942A (en) Criminal transformation quality assessment system based on machine learning
Wan et al. Graph clustering: block-models and model free results
Ichinose et al. Online kernel-based quantile regression using Huberized pinball loss
Liu et al. Approximate conditional gradient descent on multi-class classification
KR102198459B1 (en) Clustering method and system for financial time series with co-movement relationship
CN111488903A (en) Decision tree feature selection method based on feature weight
Moeinzadeh et al. Combination of harmony search and linear discriminate analysis to improve classification
CN113252586B (en) Hyperspectral image reconstruction method, terminal equipment and computer readable storage medium
Tian et al. Semiparametric quantile modelling of hierarchical data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant