CN108446568B - Histogram data publishing method for trend analysis differential privacy protection - Google Patents


Publication number
CN108446568B
CN108446568B
Authority
CN
China
Prior art keywords
histogram
sequence
data
clustering
outlier
Prior art date
Legal status
Active
Application number
CN201810228544.1A
Other languages
Chinese (zh)
Other versions
CN108446568A (en)
Inventor
高岭
杨旭东
罗昭
毛勇
孙骞
王帆
Current Assignee
Northwestern University
Original Assignee
Northwestern University
Priority date
Filing date
Publication date
Application filed by Northwestern University filed Critical Northwestern University
Priority to CN201810228544.1A priority Critical patent/CN108446568B/en
Publication of CN108446568A publication Critical patent/CN108446568A/en
Application granted granted Critical
Publication of CN108446568B publication Critical patent/CN108446568B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60: Protecting data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/23: Clustering techniques


Abstract

A method for judging the trend of a signal sequence is introduced into the detection of abnormal histogram distributions: a large number of outliers makes the data distribution fluctuate strongly and reduces its stationarity, so the histogram bucket-count distribution is treated as a continuous digital signal and outlier detection is performed from this viewpoint. Meanwhile, because the clustering objective function of traditional methods can produce a large number of outliers, an outlier balancing constraint and a similarity penalty constraint are added to balance the influence of similar buckets and outlier buckets on clustering, reducing the occurrence of outliers; the remaining outlier data are then micro-clustered based on outlier similarity.

Description

Histogram data publishing method for trend analysis differential privacy protection
Technical Field
The invention belongs to the technical field of computer information security, and particularly relates to a histogram data publishing method for trend analysis differential privacy protection.
Background
The histogram-based data publishing method is the most common data publishing mode at present: it vividly displays the data distribution, and the statistical result provides a basis for answering counting queries. A histogram divides the data table into several disjoint subsets according to the distinct values of one or more attributes, forming independent buckets, and identifies each subset (bucket) with a statistical value; the width of each bucket represents a query range, which enables range counting queries.
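For illustration, a minimal Python sketch of such a bucketed histogram answering a range counting query follows; the attribute (age), the bucket width, and the value bounds are assumptions made for the example.

```python
from collections import Counter

def build_histogram(ages, width=10, lo=0, hi=100):
    """Partition records into disjoint equal-width buckets; summing a run
    of bucket counts answers a range counting query over that range."""
    buckets = Counter()
    for a in ages:
        if lo <= a < hi:
            buckets[(a - lo) // width] += 1
    return [buckets.get(k, 0) for k in range((hi - lo) // width)]

# A range count over [20, 40) is the sum of buckets 2 and 3
h = build_histogram([23, 25, 37, 41, 68])
assert sum(h[2:4]) == 3
```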
In the publishing process of the histogram, in order to satisfy differential privacy and improve the usability of the published data, a re-partitioning idea is usually adopted: adjacent similar buckets are merged by clustering, the histogram is reconstructed, and Laplace noise is added for differential privacy protection. However, as the number of outliers in the raw histogram bucket counts increases, the global sensitivity grows, the merging probability between adjacent buckets drops, and the privacy-protection effect deteriorates; histogram reconstruction has little privacy-protection effect on such points. To solve this problem, analyzing and processing outliers is necessary.
Conventionally, an outlier is defined as an exceptionally small or large value in the data set, i.e., the absolute difference between the outlier and the other values is much larger than the absolute differences among the normal values. However, this approach is time-consuming, and the accuracy of the conventional outlier judgment is not high enough. Judging outliers efficiently and accurately is of great significance for the clustering of histogram data. Meanwhile, no prior work has provided a good method for processing the identified outliers, and how to cluster outliers is also a key problem for the differential-privacy utility of histogram data publishing.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention aims to provide a histogram data publishing method for trend-analysis differential privacy protection, which introduces a method for judging the trend of a signal sequence into the detection of abnormal histogram distributions: a large number of outliers makes the data distribution fluctuate strongly and reduces its stationarity, so the histogram bucket-count distribution is treated as a continuous digital signal and outlier detection is performed from this viewpoint. Meanwhile, because the clustering objective function of traditional methods can produce a large number of outliers, an outlier balancing constraint and a similarity penalty constraint are added to balance the influence of similar buckets and outlier buckets on clustering, reducing the occurrence of outliers; the remaining outlier data are then micro-clustered based on outlier similarity.
Definition of differential privacy: given a histogram publishing method A, A satisfies ε-differential privacy if, for neighboring histograms H and H′ and any possible output Ĥ, the following inequality holds:
Pr[A(H) = Ĥ] ≤ exp(ε) × Pr[A(H′) = Ĥ]
Global sensitivity: let f be a query; the global sensitivity of f is
Δf = max over neighboring H, H′ of ‖f(H) − f(H′)‖₁
Implementation of differential privacy: the Laplace mechanism
A(H) = f(H) + Lap(Δf/ε),
and the exponential mechanism, which outputs a result r with probability proportional to
exp( ε·q(H, r) / (2Δq) ),
where q is the score function and Δq its global sensitivity.
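As an illustration of the Laplace mechanism just defined, a minimal Python sketch follows; the function name is illustrative, and the default unit sensitivity reflects the standard fact that a histogram counting query has global sensitivity 1.

```python
import numpy as np

def laplace_mechanism(counts, epsilon, sensitivity=1.0):
    """Add Laplace noise with scale sensitivity/epsilon to each bucket
    count, so that the released counts satisfy epsilon-differential
    privacy for the given global sensitivity."""
    scale = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale, size=len(counts))
    return np.asarray(counts, dtype=float) + noise

# Example: protect a small histogram with privacy budget epsilon = 0.5
noisy = laplace_mechanism([12, 30, 7, 45], epsilon=0.5)
```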
First, the histogram is sorted through the detrending ordering process; then the histogram is clustered using a clustering algorithm with balanced outlier and similarity constraints; finally, Laplace noise is added to complete the differential privacy protection.
In order to achieve the purpose, the invention adopts the technical scheme that:
A histogram data publishing method for detrending-analysis differential privacy protection comprises the following steps:
1. Detrending histogram processing:
First, the overall orderedness of the histogram bucket counts is judged through the detrended correlation coefficient; second, the disordered histogram sequence is divided into several subsequences and adjusted into order: by comparing the detrended correlation coefficient with that of the minimum ordered subsequence, it is judged whether each subsequence conforms to the ascending or descending order of the whole sequence, and the subsequences that do not are adjusted, yielding an overall ordered histogram sequence;
2. Histogram sequence detrending analysis:
Detrended cross-correlation analysis (DCCA) is an effective model for measuring the cross-correlation of non-stationary time series. First, the profiles of the two time series are computed; second, the data are divided into [N/s] disjoint intervals of length s, characterized by small jumps between adjacent intervals; the local trend is fitted and removed with a quadratic least-squares fitting function, and the local detrended covariance function is computed; repeating this process yields fluctuation functions for different scales s. Finally, the fluctuation-function data indicate whether the sequences are cross-correlated, anti-correlated, or uncorrelated. Borrowing this idea, detrending analysis is adopted to judge the orderedness of the histogram sequence.
From the above analysis, the orderedness of a sequence can be judged by detrending analysis of its overall dispersion. First, the histogram counts are regarded as an ordered sequence and subjected to an overall detrended correlation analysis; the two time-based series are defined as the front and back halves of the bucket-count sequence. The judgment is made by the fluctuation coefficient α: α = 0.5 indicates that the series are uncorrelated, an independent random process in which the current state does not affect the future state; α < 0.5 indicates that the time series is anti-correlated, i.e., the trends of a time period and the next are opposite. The overall sequence dispersion analysis, i.e., the judgment of whether the sequence is ordered, is obtained by comparing the scale exponents of the fluctuation functions of the subsequences. When the overall fluctuation exponent of the data in the sequence is greater than the threshold, the dispersion of the histogram sequence is considered too large and the sequence needs adjustment, i.e., it does not satisfy the ordering constraint. From the anti-correlation that holds when the scale exponent of the fluctuation function is below 0.5 in detrending analysis, the histogram detrended correlation threshold θ is derived:
θ < 0.5^Ns,
where Ns = N/L, N is the total data length, and L is the minimum ordered sequence length;
3. histogram detrending ordered adjustment:
and obtaining the integral ordering degree of the histogram sequence according to trend-removing analysis of the integral data, adjusting the histogram sequence which does not meet the ordering constraint, and defining a minimum ordered sequence, namely, an initial most ordered sequence, namely, an ordered, ascending or descending sequence with the shortest length in the counting of the original buckets.
By dividing the histogram sequence into a plurality of sub-sequences according to the minimum ordered sequence, the sub-sequences in all the histogram sequences are facilitated and compared with the minimum ordered sub-sequence, and the traversed sub-sequences are adjusted when an anti-correlation relationship exists between the two sequences, and the specific algorithm is as follows:
Sequence similarity is defined and measured by the product of the Euclidean distance between corresponding elements of the sequences and the difference of the sequence lengths, as follows:
dis(Ci, Cj) = sqrt( Σ from k = 1 to n of (h_i(k) − h_j(k))² ),
Wd = L(Ci) − L(Cj),
therefore the sequence similarity is Ops(Ci, Cj) = Wd · dis(Ci, Cj) (sketched below),
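A minimal Python sketch of this similarity measure, under the assumption that subsequences of unequal length are compared over their common prefix, since the source does not specify the alignment:

```python
import math

def sequence_similarity(ci, cj):
    """Ops(Ci, Cj) = Wd * dis(Ci, Cj): the length difference times the
    Euclidean distance over paired bucket counts (truncated to the
    shorter subsequence, an assumption)."""
    n = min(len(ci), len(cj))
    dis = math.sqrt(sum((a - b) ** 2 for a, b in zip(ci[:n], cj[:n])))
    wd = len(ci) - len(cj)
    return wd * dis
```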
Dispersion descending sampling sorting based on trend analysis. Input: original histogram H. Output: approximately ordered histogram sequence H′. The steps are as follows:
1) finding a minimum ordered sequence with the length of L from the original data;
2) dividing the original sequence according to L into Ns = N/L disjoint equal-length subsequences;
3) performing the same operation on the reverse order of the sequence, to prevent the loss of end information;
4) extracting the corresponding subsequences with probability proportional to [equation image in source];
5) calculating the profile of each histogram sequence,
Y(i) = Σ from j = 1 to i of ( H(j) − H̄ ),
where H(j) is the jth data point in the sequence and H̄ is the average of the sequence;
6) within each interval v, fitting the data by least squares; the detrended series is denoted Ys(i) and represents the difference between the original sequence and the fitted value, i.e., Ys(i) = Y(i) − Ps(i), where Ps(i) is the quadratic fitting function;
7) calculating the root mean square fluctuation of the cumulative detrended time series,
F(n) = sqrt( (1/N) Σ from i = 1 to N of Ys(i)² ).
In general, F(n) increases as n becomes larger; the slope of log F(n) against log n determines
8) the scaling exponent (self-affine parameter) α, a Hurst exponent; if the curve of the log-log plot is a straight line, the self-similarity can be expressed as F(n) ∝ n^α (see the sketch after this list);
9) reordering and adjusting the sequences whose fluctuation coefficient α ≤ 0.5;
10) repeating the above process in a loop until the final sequence is obtained;
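A minimal Python sketch of steps 5) through 9) on a bucket-count sequence follows; the choice of window sizes, the handling of leftover data at the end of the sequence, and the function names are assumptions made for the example.

```python
import numpy as np

def fluctuation_exponent(counts, window):
    """Estimate the scaling exponent alpha of a bucket-count sequence:
    build the cumulative profile Y(i), remove a quadratic local trend
    Ps(i) in each window, compute the fluctuation F(n), and read alpha
    off the slope of log F(n) against log n."""
    h = np.asarray(counts, dtype=float)
    profile = np.cumsum(h - h.mean())          # step 5): Y(i)
    logs_n, logs_f = [], []
    for n in (window, 2 * window, 4 * window):  # assumed window sizes
        if n > len(profile):
            break
        mean_sq = []
        for v in range(len(profile) // n):
            seg = profile[v * n:(v + 1) * n]
            x = np.arange(n)
            coef = np.polyfit(x, seg, 2)        # step 6): quadratic fit Ps
            resid = seg - np.polyval(coef, x)   # Ys = Y - Ps
            mean_sq.append(np.mean(resid ** 2))
        logs_n.append(np.log(n))
        logs_f.append(np.log(np.sqrt(np.mean(mean_sq))))  # step 7): F(n)
    # step 8): alpha is the slope of log F(n) versus log n
    return np.polyfit(logs_n, logs_f, 1)[0]

# alpha <= 0.5 flags a subsequence for reordering (step 9)
```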
4. Histogram clustering
All initially clustered buckets are clustered with a clustering algorithm; a similarity preference penalty constraint and an outlier influence balancing constraint are added to the clustering function to balance the influence of similar data and outlier data on clustering, and the outliers remaining after clustering are clustered a second time based on outlier similarity;
1) histogram bucket clustering
In order to balance the influence of outlier data in the clustering process, an outlier function is added to the fuzzy clustering objective function: the inflation of the objective function caused by outliers is corrected, and the oversized values caused by singular points are penalized. Conversely, the similarity preference penalty constrains the dispersion: to reduce the special case in which identical data all gather together, a similarity penalty factor is added, which grows with the amount of similar data in the set. The choice of objective function affects the clustering result, and adding a similarity penalty constraint and an outlier weighting constraint to the clustering objective function helps balance the clustering of outliers and of similar points, yielding a better clustering effect. The clustering objective function therefore consists of three parts: an error function, a similarity constraint, and an outlier contribution balance. Data partition: H is the original histogram bucket counts, and Ci is a partition of the original data set, i.e., C1 = {H1, H2, …, Hi}, C2 = {Hi+1, …, Hi+n}, …, Cj = {Hi+n+1, …}, where Hi ∈ H;
outlier contribution equalization constraints
First, since the influence of neighborhood data is a key measure of outlierness, the neighborhood of a bucket in the merged data set after clustering is judged by its relation to the neighboring data: histogram neighbors are histograms with a front-back adjacency after sorting, denoted S(Hi, Hj) = {Hi : |Hi − Hj| < ε}, where Hi ∈ H;
Histogram neighborhood set: the set of all histograms with a front-back adjacency that satisfy the histogram neighborhood relation,
N(Hj) = {Hj | S(Hj, Hi) = true, Hj ∈ H\Hi}, where S(Hj, Hi) indicates that histograms Hi and Hj are neighbors; to reduce the cost of bucket merging, the histogram neighborhood set is measured mainly by the difference of the bucket counts;
Histogram weighted distance: Hi ∈ H, X_Hi is the bucket count of the histogram, and wij is the histogram outlier contribution with 0 < wij < 1; the weighted distance between histogram buckets Hi and Hj is then
d(Hi, Hj) = wij · |X_Hi − X_Hj|,
where wij is derived from w − X̄, X̄ representing the histogram mean; the farther a count lies from the mean, the greater its outlier contribution, so a larger w evidences a greater degree of outlierness;
Histogram neighborhood distance: the neighborhood distance of a histogram bucket is the average of the weighted distances between the histogram and all histograms in its neighborhood, i.e.,
dist(Hj) = (1/|N(Hj)|) Σ over Hi ∈ N(Hj) of d(Hj, Hi),
where N(Hj) denotes the histograms in the neighborhood set.
To eliminate the influence of extreme values in the neighborhood on the neighborhood-distance computation, a trimmed-average method is adopted: the distances to the extreme values in the neighborhood are removed, and the average distance between the histogram and its remaining neighborhood is then computed:
[equation image: trimmed average neighborhood distance]
Histogram neighborhood outlier coefficient: the neighborhood distance of the histogram is compared with those of its neighbors to obtain the degree of deviation of the histogram in the neighborhood space, i.e., the local outlier coefficient of the histogram with respect to the partition Ci to be aggregated:
[equation image: local outlier coefficient]
The outlier balancing constraint is then:
[equation image: outlier balancing constraint],
where the weighting factor [equation image] is the ratio of the number of histograms in the neighborhood set to the total number of histograms (a sketch of these neighborhood quantities follows);
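To make the neighborhood definitions above concrete, a minimal Python sketch follows; it uses the unweighted count difference as the distance (i.e., it takes wij = 1) and plain rather than trimmed averages, both simplifying assumptions, and the names are illustrative.

```python
import numpy as np

def local_outlier_coefficients(counts, eps):
    """Neighbors of bucket j are buckets whose counts differ by less than
    eps; its neighborhood distance is the mean count difference to those
    neighbors; its outlier coefficient compares that distance with the
    average neighborhood distance of its neighbors."""
    h = np.asarray(counts, dtype=float)
    neigh = [[i for i in range(len(h)) if i != j and abs(h[i] - h[j]) < eps]
             for j in range(len(h))]
    nd = [np.mean([abs(h[j] - h[i]) for i in neigh[j]]) if neigh[j] else 0.0
          for j in range(len(h))]
    coeffs = []
    for j in range(len(h)):
        denom = np.mean([nd[i] for i in neigh[j]]) if neigh[j] else 1.0
        coeffs.append(nd[j] / denom if denom > 0 else 0.0)
    return coeffs
```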
similar preference penalty constraints
When a large number of similar or identical bucket-count values exist in a data set, a difference at any point can prevent the data from being clustered, leaving a large number of outliers. The concept of discrete entropy is applied to the similar buckets within a cluster to reduce the negative influence of similar data buckets on outliers, specifically as follows:
when the dispersion of the data Hi+1, …, Hj within a data set partition Ci = {H1, H2, …, Hn} is smaller than a certain value, i.e., |Hi − Hj| < δ, the buckets are similar; when Count(|Hi − Hj| < δ) > δ1 (δ1 a small value), the risk of the data becoming outliers increases significantly, and under the conventional adaptive clustering function Hi+1 would become an outlier. To reduce such cases, they are penalized with a dispersion constraint; the ratio between data count values can effectively indicate the dispersion in the data set:
[equation image: ratio of count values], where i > j and Xi, Xj ∈ Ci.
The information entropy can effectively indicate the degree of dispersion of the data, so the dispersion of a cluster is:
E(Ci) = − Σ over x of P(x) log P(x).
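A short Python sketch of the cluster dispersion measured by discrete entropy as used above; estimating P(x) by the relative frequency of each count value within the cluster is an assumption.

```python
import math
from collections import Counter

def cluster_entropy(bucket_counts):
    """Discrete entropy E = -sum P(x) log P(x) over the distinct count
    values of one cluster; higher entropy means higher dispersion."""
    total = len(bucket_counts)
    freq = Counter(bucket_counts)
    return -sum((c / total) * math.log(c / total) for c in freq.values())
```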
The larger the ratio, the lower the similarity; and because the clustering objective function is to be minimized, the similarity preference penalty constraint can only act as a penalty in the positive direction;
clustering objective function design
Through the above analysis, in order to reduce the appearance of outliers and balance the privacy and availability of the published data, the objective function is designed as:
[equation image: clustering objective function combining the clustering error, the similarity penalty, and the outlier balance],
where λ1 and λ2 are adaptive weight coefficients; the objective function should ensure that the formed clusters not only have the smallest within-class distance but also the smallest ability to generate outliers, measured by the outlier contribution rate;
2) Histogram outlier data micro-clustering
For the outlier data that still exist after the outlier balancing is added, the histogram is micro-clustered into Ci according to the similarity of the outlier data, and data are published after noise is added to the clusters formed. First, to measure the difference between different outlier data sets, an outlier partition similarity OPS is introduced. If Ck ⊆ X, k = 1, 2, …, K, K < n, satisfies: 1) Ck ≠ ∅; 2) Ck1 ∩ Ck2 = ∅ (k1 ≠ k2); 3) C1 ∪ C2 ∪ … ∪ CK = H, then {C1, C2, …, CK} constitutes a partition of X. Cd = {XOd, X − XOd} denotes a data partition containing the outlier data set XOd; similarly, Cs = {XOs, X − XOs} denotes another outlier partition. Their similarity OPS can be expressed as:
[equation image: outlier partition similarity ops(Cd, Cs)]
where fsup is the support degree of Ck, fcon is the confidence degree of Ck, finc is the inclusion degree of Ck, and cis = card(XOd ∩ XOs), card denoting the cardinality of a set. A larger ops(Cs, Cd) indicates that the overall trends of the outlier sets XOd and XOs are more consistent. The support, inclusion, and confidence degrees express closeness from different angles: a larger support degree means XOs and XOd are more similar overall, the inclusion degree expresses how correctly XOs reflects XOd, and the confidence degree expresses how correct XOs itself is. Obviously 0 ≤ ops(Cd, Cs) ≤ 1, and ops(Cd, Cs) = 1 if and only if XOd = XOs.
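Since the exact combination of fsup, finc, and fcon appears only as an equation image in the source, the following Python sketch substitutes plain set-overlap ratios for the three degrees (an assumption); it does preserve the stated property that the similarity equals 1 if and only if XOd = XOs.

```python
def partition_similarity(xo_d, xo_s):
    """Approximate ops(Cd, Cs) from the overlap of two outlier sets,
    averaging stand-ins for the support, inclusion, and confidence
    degrees built from c = card(XOd intersect XOs)."""
    d, s = set(xo_d), set(xo_s)
    if not d and not s:
        return 1.0
    c = len(d & s)
    f_sup = c / len(d | s)             # overall overlap
    f_inc = c / len(d) if d else 0.0   # how much of XOd is reflected
    f_con = c / len(s) if s else 0.0   # how correct XOs itself is
    return (f_sup + f_inc + f_con) / 3.0
```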
5. Clustering algorithm with histogram similarity constraints and outlier balancing:
Given the sorted histograms, greedy partitioning from left to right without a prescribed number of groups is the basic idea of the clustering. The only criterion during clustering is the clustering objective function: clustering is the process of selecting the minimum objective function, computed in three cases:
1) when H is merged with the current cluster, the objective function is
[equation image];
2) when H is not merged with the current cluster but with the next cluster, the objective function is
[equation image];
3) when H is clustered separately and forms outlier data, err(Ci ∪ H) = 0 because H is clustered alone and there is no reconstruction error, so the objective function is
[equation image];
Clustering mainly decides whether H joins a cluster according to the size of the current objective function, as sketched below.
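A minimal Python sketch of this greedy left-to-right selection follows; the objective function is passed in as a parameter since its exact form is given above only as equation images, and folding case 2) into "open a new cluster" is a simplifying assumption (the source's case 2) defers H to the next cluster).

```python
def greedy_cluster(buckets, objective):
    """Scan bucket counts left to right; at each step keep whichever of
    'merge H into the current cluster' or 'start a new cluster with H'
    (which also covers a singleton outlier cluster) minimizes the
    supplied objective over the partition built so far."""
    clusters = []
    for h in buckets:
        candidates = [clusters + [[h]]]  # new cluster / singleton outlier
        if clusters:
            candidates.append(clusters[:-1] + [clusters[-1] + [h]])  # merge
        clusters = min(candidates, key=objective)
    return clusters

# Toy objective: within-cluster ranges plus a per-cluster penalty,
# standing in for the error / similarity / outlier terms above.
parts = greedy_cluster([3, 4, 5, 40, 41, 200],
                       lambda cs: sum(max(c) - min(c) for c in cs) + 10 * len(cs))
# parts == [[3, 4, 5], [40, 41], [200]]
```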
Algorithm: clustering algorithm for histogram constraint equalization [pseudocode given in the source as images].
Finally, Laplace noise is added to the merged clusters to realize the final differential privacy.
The invention has the beneficial effects that:
the method for judging the signal sequence trend is introduced into the judgment of abnormal distribution of the histogram, a large number of outliers can cause the problem that the data distribution has high volatility and low stability, and the counting distribution condition of the histogram barrel is regarded as continuous digital signals from the angle to perform data outliers. Meanwhile, aiming at a clustering target function which can cause a large number of outliers in the traditional method, an outlier balancing constraint and a similar punishment constraint are added to balance the influence of similar bucket and outlier bucket data on clustering, so that the occurrence of outliers is reduced; and carrying out outlier data micro-clustering on the outlier data based on the outlier similarity.
Drawings
FIG. 1 is a diagram of a differential privacy protection algorithm architecture;
FIG. 2 is a clustering research framework for outlier balancing and constraints;
FIG. 3 is a differential privacy preserving data distribution algorithm;
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Based on the invention, the method mainly comprises the following implementation steps:
1) detrending histogram ordering process;
2) adaptive histogram clustering with outlier balancing and constraints;
3) adding noise to the clustered data;
4) obtaining the histogram after differential privacy protection.
Detrending histogram processing
An ordered histogram sequence has a very important influence on the clustering of histogram buckets: publishing histograms whose similar bucket counts are clustered helps reduce the reconstruction error. The advantages of trend analysis are mainly that, on one hand, the amount of computation is reduced relative to discrete analysis of computed differences and, on the other hand, adjustment time is saved in the ordering process. Histogram ordering therefore borrows the idea of trend analysis. First, the overall orderedness of the histogram bucket counts is judged through the detrended correlation coefficient; second, the disordered histogram sequence is divided into several subsequences and adjusted: by comparing the detrended correlation coefficient with that of the minimum ordered subsequence, it is judged whether each subsequence conforms to the ascending or descending order of the whole sequence, and the unsatisfying subsequences are adjusted. Finally, an overall ordered histogram sequence is obtained.
Histogram series detrending analysis
Detrended cross-correlation analysis (DCCA) is an effective model for measuring the cross-correlation of non-stationary time series. First, the profiles of the two time series are calculated; second, the data are divided into [N/s] disjoint intervals of length s, characterized by small jumps between adjacent intervals; the local trend is fitted and removed with a quadratic least-squares fitting function, and the local detrended covariance function is computed; repeating this process yields the fluctuation functions for different scales s. Finally, the fluctuation-function data indicate whether the sequences are cross-correlated, anti-correlated, or uncorrelated. Borrowing this idea, detrending analysis is adopted to judge the orderedness of the histogram sequence.
From the above analysis, the orderedness of a sequence can be judged by detrending analysis of its overall dispersion. The histogram counts are first regarded as an ordered sequence, with an overall detrended correlation analysis performed on the counts. The two time-based series are defined as the front and back halves of the bucket-count sequence and judged by the fluctuation coefficient: α = 0.5 means the series are uncorrelated, an independent random process in which the current state does not affect the future state, such as white noise; α < 0.5 means the time series is anti-correlated, i.e., the trends of a time period and the next are opposite. For example, 3, 4, 5 and 3, 4, 6 are uncorrelated; 4, 3, 6 and 3, 4, 5 are also uncorrelated. The overall sequence dispersion analysis, i.e., the judgment of whether the sequence is ordered, is obtained by comparing the scale exponents of the fluctuation functions of the subsequences. When the overall fluctuation exponent of the data in the sequence is greater than the threshold, the dispersion of the histogram sequence is considered too large and the sequence needs adjustment, i.e., it does not satisfy the ordering constraint. From the anti-correlation that holds when the scale exponent of the fluctuation function is below 0.5 in detrending analysis, the histogram detrended correlation threshold θ is derived:
θ < 0.5^Ns,
where Ns = N/L, N is the total data length, and L is the minimum ordered sequence length.
Histogram detrending ordered adjustment
The overall orderedness of the histogram sequence is obtained from the detrending analysis of the overall data, and the histogram sequences that do not satisfy the ordering constraint are adjusted.
The minimum ordered sequence (the initial most-ordered sequence) is defined as the shortest ordered run, ascending or descending, in the original bucket counts.
The histogram sequence is divided into several subsequences according to the minimum ordered sequence; all subsequences in the histogram sequence are traversed and compared with the minimum ordered subsequence, and a traversed subsequence is adjusted when an anti-correlation relationship exists between the two sequences. The specific algorithm is as follows:
Sequence similarity is defined and measured by the product of the Euclidean distance between corresponding elements of the sequences and the difference of the sequence lengths, as follows:
dis(Ci, Cj) = sqrt( Σ from k = 1 to n of (h_i(k) − h_j(k))² ),
Wd = L(Ci) − L(Cj),
therefore the sequence similarity is Ops(Ci, Cj) = Wd · dis(Ci, Cj).
Algorithm 1: dispersion descending sampling sorting based on trend analysis.
Input: original histogram H.
Output: approximately ordered histogram sequence H′.
① finding the minimum ordered sequence with length L from the original data;
② dividing the original sequence according to L into Ns = N/L disjoint equal-length subsequences;
③ performing the same operation on the reverse order of the sequence, to prevent the loss of end information;
④ extracting the corresponding subsequences with probability proportional to [equation image in source];
⑤ calculating the profile of each histogram sequence,
Y(i) = Σ from j = 1 to i of ( H(j) − H̄ ),
where H(j) is the jth data point in the sequence and H̄ is the average of the sequence;
⑥ within each interval v, fitting the data by least squares; the detrended time series is denoted Ys(i) and represents the difference between the original sequence and the fitted value, i.e., Ys(i) = Y(i) − Ps(i), where Ps(i) is the quadratic fitting function;
⑦ calculating the root mean square fluctuation of the cumulative detrended time series,
F(n) = sqrt( (1/N) Σ from i = 1 to N of Ys(i)² ).
In general, F(n) increases as n increases; the slope of log F(n) against log n determines the scaling exponent (self-affine parameter) α, which is a Hurst exponent. When the curve of the log-log plot is a straight line, the self-similarity can be expressed by F(n) ∝ n^α;
⑧ reordering and adjusting the sequences whose fluctuation coefficient α ≤ 0.5;
⑨ repeating the above process in a loop until the final sequence is obtained.
Histogram clustering
Traditional bucket clustering only measures the error after clustering and does not balance the influence of outliers on privacy. The clustering algorithm adopted here first clusters all initially clustered buckets with a clustering function to which a similarity preference penalty constraint and an outlier influence balancing constraint are added, balancing the influence of similar data and outlier data on clustering; it then clusters the outliers remaining after clustering based on the outlier similarity, achieving a better privacy-protection effect.
1) Histogram bucket clustering
In order to balance the influence of outlier data in the clustering process, an outlier function is added to the fuzzy clustering objective function: the inflation of the objective function caused by outliers is corrected, and the oversized values caused by singular values are penalized (conversely, the similarity preference penalty constrains the dispersion: to reduce the special case in which identical data all gather together, a similarity penalty factor is added, which grows with the amount of similar data in the set).
The choice of objective function affects the clustering effect. Adding a similarity penalty constraint and an outlier weighting constraint to the clustering objective function helps balance the clustering results of outliers and similar points, so as to obtain a better clustering effect. The current clustering objective function consists of three parts: an error function, a similarity constraint, and an outlier contribution balance.
Data partition: H is the original histogram bucket counts, Ci is a partition of the original data set, i.e., C1 = {H1, H2, …, Hi}, C2 = {Hi+1, …, Hi+n}, …, Cj = {Hi+n+1, …}, where Hi ∈ H.
Outlier contribution equalization constraints
Because a large difference may result in too many outliers and privacy leakage, an outlier contribution balancing constraint is proposed to balance the relationship between outliers and reconstruction errors.
First, since the influence of neighborhood data is a key measure of outlierness, the influence of the clustered data on the bucket-merged data set is judged by its relation to the neighborhood data.
Histogram neighborhood: histogram neighbors are histograms with a front-back adjacency after sorting, denoted S(Hi, Hj) = {Hi : |Hi − Hj| < ε}, where Hi ∈ H. Histogram neighborhood set: the set of all histograms with a front-back adjacency that satisfy the histogram neighborhood relation,
N(Hj) = {Hj | S(Hj, Hi) = true, Hj ∈ H\Hi}, where S(Hj, Hi) represents the neighborhood relation of histograms Hi and Hj; to reduce the expense of bucket merging, the histogram neighborhood set is measured mainly by the difference of the bucket counts.
Histogram weighted distance: Hi ∈ H, X_Hi is the bucket count of the histogram, and wij is the histogram outlier contribution with 0 < wij < 1; the weighted distance between histogram buckets Hi and Hj is then
d(Hi, Hj) = wij · |X_Hi − X_Hj|,
where wij is derived from w − x′, x′ representing the mean of the histogram; the farther a count lies from the mean, the greater its outlier contribution, from which it can be seen that a larger w evidences a greater degree of outlierness.
Histogram neighborhood distance: the neighborhood distance of a histogram bucket is the average of the weighted distances between the histogram and all histograms in its neighborhood, i.e.,
dist(Hj) = (1/|N(Hj)|) Σ over Hi ∈ N(Hj) of d(Hj, Hi),
where N(Hj) denotes the histograms in the neighborhood set.
To eliminate the influence of extreme values in the neighborhood on the neighborhood-distance computation, a trimmed-average method is adopted: the distances to the extreme values in the neighborhood are removed, and the average distance between the histogram and its remaining neighborhood is then computed:
[equation image: trimmed average neighborhood distance]
Histogram neighborhood outlier coefficient: the neighborhood distance of the histogram is compared with those of its neighbors to obtain the degree of deviation of the histogram in the neighborhood space, i.e., the local outlier coefficient of the histogram with respect to the partition Ci to be aggregated:
[equation image: local outlier coefficient]
Then the outlier balance constraint is:
[equation image: outlier balancing constraint],
where the weighting factor [equation image] is the ratio of the number of histograms in the neighborhood set to the total number of histograms.
Similar preference penalty constraints
When there are a large number of similar or identical bucket-count values in a data set, a difference at any one point can cause the data to be non-clusterable, resulting in a large number of outliers. The concept of discrete entropy is applied to the similar buckets within a cluster to reduce the negative influence of similar data buckets on outliers, specifically as follows:
when the dispersion of the data Hi+1, …, Hj within a data set partition Ci = {H1, H2, …, Hn} is smaller than a certain value, i.e., |Hi − Hj| < δ, the buckets are similar; when Count(|Hi − Hj| < δ) > δ1 (δ1 a small value), the risk of the data becoming outliers increases significantly, and under the conventional adaptive clustering function Hi+1 would become an outlier. So, to reduce this, such situations are penalized with a dispersion constraint; the ratio between the data count values can effectively indicate the dispersion in the data set:
[equation image: ratio of count values], where i > j and Xi, Xj ∈ Ci.
The information entropy can effectively indicate the degree of dispersion of the data, so the dispersion of a cluster is:
E(Ci) = − Σ over x of P(x) log P(x).
The larger the ratio, the lower the similarity; because the clustering objective function is to be minimized, the similarity preference penalty constraint can only act as a penalty in the positive direction.
Clustering objective function design
Through the analysis, in order to reduce the appearance of outliers and balance the privacy and availability of data distribution, the objective function is designed as follows:
[equation image: clustering objective function combining the clustering error, the similarity penalty, and the outlier balance],
where λ1 and λ2 are adaptive weight coefficients.
The objective function should ensure that the formed clusters not only have the smallest within-class distance but also the smallest ability to generate outliers, measured by the outlier contribution rate.
Histogram outlier data micro-clustering
When outlier data still exist after the outlier balancing is added, privacy protection by adding noise alone is far from enough. Here, the histogram is micro-clustered into Ci according to the outlier data similarity, and data are published after noise is added to the clusters formed by the clustering.
First, to measure the difference between different outlier data sets, an outlier partition similarity OPS is introduced. If Ck ⊆ X, k = 1, 2, …, K, K < n, satisfies: 1) Ck ≠ ∅; 2) Ck1 ∩ Ck2 = ∅ (k1 ≠ k2); 3) C1 ∪ C2 ∪ … ∪ CK = H, then {C1, C2, …, CK} constitutes a partition of X. Cd = {XOd, X − XOd} denotes a data partition containing the outlier data set XOd; similarly, Cs = {XOs, X − XOs} denotes another outlier partition. Their similarity OPS can be expressed as:
[equation image: outlier partition similarity ops(Cd, Cs)]
where fsup is the support degree of Ck, fcon is the confidence degree of Ck, finc is the inclusion degree of Ck, and cis = card(XOd ∩ XOs), card denoting the cardinality of a set. A larger ops(Cs, Cd) indicates that the overall trends of the outlier sets XOd and XOs are more consistent. The support, inclusion, and confidence degrees express the closeness from different angles: a larger support degree means XOs and XOd are more similar overall, the inclusion degree expresses how correctly XOs reflects XOd, and the confidence degree expresses how correct XOs itself is. Obviously 0 ≤ ops(Cd, Cs) ≤ 1, and ops(Cd, Cs) = 1 if and only if XOd = XOs.
Clustering algorithm with histogram similarity constraints and outlier balancing:
Given the sorted histograms, greedy partitioning from left to right without a prescribed number of groups is the basic idea of the clustering. The only criterion during clustering is the clustering objective function; the clustering process is the process of selecting the minimum objective function, computed in three cases:
1) when H is merged with the current cluster, the objective function is
[equation image];
2) when H is not merged with the current cluster but with the next cluster, the objective function is
[equation image];
3) when H is clustered separately and forms outlier data, err(Ci ∪ H) = 0 because H is clustered alone and there is no reconstruction error, so the objective function is
[equation image];
Clustering mainly decides whether H joins a cluster according to the size of the current objective function.
Algorithm: clustering algorithm for histogram constraint equalization [pseudocode given in the source as images].
Finally, Laplace noise is added to the merged clusters to realize the final differential privacy.

Claims (1)

1. A histogram data publishing method for detrending-analysis differential privacy protection, characterized by comprising the following steps:
First, detrending histogram processing:
first, the detrended correlation coefficient is calculated to obtain the overall orderedness of the histogram sequence; second, the disordered histogram sequence is divided into several subsequences and adjusted into order: by comparing the detrended correlation coefficient with that of the minimum ordered subsequence, it is judged whether each subsequence conforms to the ascending or descending order of the whole sequence, and the subsequences that do not are adjusted, yielding an overall ordered histogram sequence;
Second, detrending analysis of the histogram sequence:
detrended cross-correlation analysis (DCCA) is an effective model for measuring the cross-correlation of non-stationary time series: first, the profiles of the two time series are computed; second, the n histograms are divided into [n/S] disjoint intervals of length S, characterized by small jumps between adjacent intervals; the local trend is fitted and removed with a quadratic least-squares fitting function, and the local detrended covariance function is computed; fluctuation functions corresponding to different scales are obtained; finally, the fluctuation-function data indicate whether the sequences are cross-correlated, anti-correlated, or uncorrelated, and, borrowing this idea, a detrending-analysis method is adopted to judge the relation between the histogram sequences; the orderedness of the sequence can thus be judged by detrending analysis of the overall dispersion, with the specific process as follows: first, the histogram counts are regarded as an ordered sequence, with the counts subjected to an overall detrended correlation analysis, and the two time-based series are defined as the front and back halves of the bucket-count sequence; the relation between the series is judged by the fluctuation coefficient α: when α = 0.5 the series are uncorrelated, an independent random process in which the current state does not affect the future state; when α < 0.5 the time series is anti-correlated, i.e., the trends of a time period and the next are opposite; the overall sequence dispersion analysis, i.e., the judgment of whether the sequence is ordered, is obtained by comparing the scale exponents of the fluctuation functions of the subsequences; when the overall fluctuation exponent of the data in the sequence is greater than the threshold, the dispersion of the histogram sequence is considered too large and the sequence needs adjustment, i.e., it does not satisfy the ordering constraint; from the anti-correlation that holds when the fluctuation-function scale exponent is below 0.5 in detrending analysis, the histogram detrended correlation threshold θ satisfies θ < 0.5^Ns, where Ns = N/L, N is the total data length, and L is the minimum ordered sequence length;
Third, detrending ordered adjustment of the histogram:
the overall orderedness of the histogram sequence is obtained from the detrending analysis of the overall data, and the histogram sequences that do not satisfy the ordering constraint are adjusted; the minimum ordered sequence (the initial most-ordered sequence) is defined as the shortest ordered run, ascending or descending, in the original bucket counts;
the histogram sequence is divided into several subsequences according to the minimum ordered sequence; all subsequences in the histogram sequence are traversed and compared with the minimum ordered subsequence, and a traversed subsequence is adjusted when an anti-correlation relationship exists between the two sequences, with the specific algorithm as follows:
defining the similarity between two subsequences, which is measured by the product of the Euclidean distance of elements in the sequence and the difference value of the length of the sequence, as follows:
dis(Ci, Cj) = sqrt( Σ from k = 1 to n of (h_i(k) − h_j(k))² ),
Wd = L(Ci) − L(Cj),
where Ci and Cj represent any two subsequences of the histogram sequence, h_i and h_j respectively represent the bucket counts in Ci and Cj, n is the number of buckets in the subsequence, and L(Ci), L(Cj) represent the lengths of subsequences Ci and Cj, i.e., the number of histograms included in each subsequence;
the similarity of the subsequences is then Ops(Ci, Cj) = Wd · dis(Ci, Cj),
Dispersion descending sampling sorting algorithm based on trend analysis.
Input: original histogram H.
Output: approximately ordered histogram sequence H′.
The method comprises the following steps:
1) finding a minimum ordered sequence with the length of L from the original data;
2) dividing the original sequence according to L into Ns = N/L unrelated equal-length subsequences;
3) in order to prevent the loss of the terminal information, the same operation is carried out on the reverse order of the sequence;
4) subsequence extraction is performed by a differential privacy method, specifically extracting the corresponding subsequences with probability proportional to [equation image in source];
5) calculating the profile of each histogram sequence,
Y(i) = Σ from j = 1 to i of ( H(j) − H̄ ),
where H(j) is the jth data point in the sequence and H̄ is the average of the sequence;
6) within each interval v, fitting the data by least squares; the time series of contour values in the interval after the trend is filtered out is denoted Y(i), and the difference between the original sequence and the fitted value is
error(i) = Y(i) − Ps(i), where Ps(i) is the quadratic fitting function;
7) calculating the root mean square fluctuation of the cumulative detrended time series,
F(n) = sqrt( (1/N) Σ from i = 1 to N of error(i)² );
8) in general, F(n) increases as n becomes larger; the slope of log2 F(n) against log2 n determines the scaling exponent (self-affine parameter) β, which is a Hurst exponent; if the curve of the log-log plot is a straight line, the self-similarity can be expressed by F(n) ∝ n^β;
9) reordering and adjusting the sequences whose fluctuation coefficient α ≤ 0.5;
10) repeating steps 1) to 9) in a loop until the final sequence is obtained;
Fourth, histogram clustering:
first, all initially clustered buckets are clustered with a clustering algorithm; a similarity preference penalty constraint and an outlier influence balancing constraint are added to the clustering function to balance the influence of similar data and outlier data on clustering, and the clustered outliers are clustered a second time by computing their outlier similarity;
1) histogram bucket clustering
in order to balance the influence of outlier data in the clustering process, an outlier function is added to the fuzzy clustering objective function: the inflation of the objective function caused by outliers is corrected, and the oversized values caused by singular values are penalized; conversely, the similarity preference penalty constrains the dispersion: to reduce the special case in which identical data all gather together, a similarity penalty factor is added, which grows with the amount of similar data contained in the set; the choice of the objective function affects the clustering effect, and adding a similarity penalty constraint and an outlier weighting constraint to the clustering objective function helps balance the clustering results of outliers and similar points, so as to obtain a better clustering effect; the current clustering objective function consists of three parts: an error function, a similarity constraint, and an outlier contribution balance; data partition: H is the original histogram bucket counts, Ci is a partition of the original data set, i.e., C1 = {H1, H2, …, Hi}, C2 = {Hi+1, …, Hi+n}, …, Cj = {Hi+n+1, …}, where Hi ∈ H;
outlier contribution equalization constraints
first, since the influence of adjacent data is a key measure of outlierness, the neighboring histogram data of the bucket-merged data set after data clustering are judged by their relation to the adjacent data: histogram neighbor data refers to histograms having a front-back adjacent relationship after sorting, a relation represented as S(Hi, Hj) = {Hi : |Hi − Hj| < ε}, where Hi ∈ H;
Histogram neighbor set: the set of all histograms with a front-back adjacent relationship that satisfy the adjacency relation,
N(Hj) = {Hj | S(Hj, Hi) = true, Hj ∈ H\Hi}, where S(Hj, Hi) represents that histograms Hi and Hj are neighbors; to reduce the overhead of bucket merging, the histogram neighbor set is measured mainly by the difference of the bucket counts;
histogram weighted distance: Hi ∈ H, X_Hi is the bucket count of the histogram, and wij is the histogram outlier contribution with 0 < wij < 1; the weighted distance between histogram buckets Hi and Hj is then
d(Hi, Hj) = wij · |X_Hi − X_Hj|,
where wij is derived from wij − x′, x′ representing the mean value in the histogram cluster; the farther a count lies from the mean, the greater its outlier contribution, and from the above it can be seen that a larger wij evidences a greater degree of outlierness;
histogram neighborhood distance: the neighborhood distance of a histogram bucket is the average of the weighted distances between the histogram and all histograms in its neighborhood, i.e.,
dist(Hj) = (1/|N(Hj)|) Σ over Hi ∈ N(Hj) of d(Hj, Hi),
where N(Hj) denotes the histograms in the neighborhood set;
to eliminate the influence of extreme values in the neighborhood on the neighborhood-distance computation, a trimmed-average method is adopted: the distances to the extreme values in the neighborhood are removed, and the average distance between the histogram and its remaining neighborhood is then computed:
[equation image: trimmed average neighborhood distance]
Histogram neighborhood outlier coefficient: the neighborhood distance of the histogram is compared with those of its neighbors to derive the degree of deviation of the histogram in the neighborhood space, i.e., the local outlier coefficient of the histogram with respect to the partition Ci to be aggregated:
[equation image: local outlier coefficient]
then the outlier balance constraint is:
[equation image: outlier balancing constraint],
where the weighting factor [equation image] is the ratio of the number of histograms in the neighborhood set to the total number of histograms;
similar preference penalty constraints
When a large number of similar or identical bucket-count values exist in a data set, a difference at any point can prevent the data from being clustered, leaving a large number of outliers; the concept of discrete entropy is applied to the similar buckets within a cluster to reduce the negative influence of similar data buckets on outliers, specifically as follows:
when the dispersion of the data Hi+1, …, Hj within a data set partition Ci = {H1, H2, …, Hn} is smaller than a certain value, i.e., |Hi − Hj| < δ, the buckets are similar; when Count(|Hi − Hj| < δ) > δ1, where δ1 represents a small value, the risk of outliers generated by the data increases significantly, and under the conventional adaptive clustering function Hi+1 would become an outlier; to reduce such cases, they are penalized with a dispersion constraint, and the ratio between data count values can effectively indicate the dispersion in the data set:
[equation image: ratio of count values], where i > j and X_Hi, X_Hj ∈ Ci;
The information entropy can effectively indicate the discrete degree of the data, so the discrete degree of the clusters is as follows:
E(Ci) = − Σ over x of P(x) log P(x);
the larger the ratio, the lower the similarity; because the clustering objective function is to be minimized, the similarity preference penalty constraint can only act as a penalty in the positive direction;
clustering objective function design
Through the analysis, in order to reduce the appearance of outliers and balance the privacy and availability of data distribution, the objective function is designed as follows:
[equation image: clustering objective function],
where err represents the clustering error, lap represents the Laplace noise, and λ1, λ2 are adaptive weight coefficients; the objective function requires that the formed cluster set have both the minimum within-class distance and the minimum ability to generate outliers for the formed set, measured by the outlier contribution rate;
2) histogram outlier data micro-clustering
for the outlier data still existing after the outlier balancing is added, the histogram is micro-clustered into Ci according to the similarity of the outlier data, and data are published after noise is added to the clusters formed by the clustering; first, to measure the difference between different outlier data sets, an outlier partition similarity ops′ is introduced; if Ck ⊆ X, k = 1, 2, …, K, K < n, satisfies: 1) Ck ≠ ∅; 2) Ck1 ∩ Ck2 = ∅ (k1 ≠ k2); 3) C1 ∪ C2 ∪ … ∪ CK = H, then {C1, C2, …, CK} forms a partition of X; Cd = {XOd, X − XOd} denotes a data partition containing the outlier data set XOd, and similarly Cs = {XOs, X − XOs} denotes another outlier partition; their similarity ops′ can be expressed as:
[equation image: outlier partition similarity ops′(Cd, Cs)]
where fsup is the support degree of Ck, fcon is the confidence degree of the two sets, i.e., the degree of data correctness, finc represents the inclusion degree of the two sets, measured by the amount of data they have in common, and cis = card(XOd ∩ XOs), card denoting the cardinality of a set; a larger ops′(Cs, Cd) indicates that the overall trends of the outlier sets XOd and XOs are more consistent; the support, inclusion, and confidence degrees express the closeness from different angles: a larger support degree means XOs and XOd are more similar overall, the inclusion degree expresses how correctly XOs reflects XOd, and the confidence degree expresses how correct XOs itself is; obviously 0 ≤ ops′(Cd, Cs) ≤ 1, and ops′(Cd, Cs) = 1 if and only if XOd = XOs;
Fifth, the clustering algorithm with histogram similarity constraints and outlier balancing:
given the sorted histograms, greedy partitioning from left to right without a prescribed number of groups is the basic idea of the clustering; only the clustering objective function is considered during clustering, and the clustering process is the process of selecting the minimum objective function, computed in three cases:
1) when H is merged with the current cluster, the objective function is
Figure FDA0002958556020000082
2) When H is not merged with the current cluster but with the next cluster, the objective function is
Figure FDA0002958556020000083
3) When H is clustered separately to form outlier data, err (C) is formed because H is clustered separately and there is no reconstruction erroriU H) is equal to 0, so the objective function is
[Formula image FDA0002958556020000091]
Each bucket count is examined in turn, and whether the current bucket is merged into a cluster is decided by comparing the magnitudes of the objective values above. The specific algorithm is as follows (a minimal code sketch of the greedy loop is given after the algorithm figures):
The algorithm: similarity-constrained, outlier-balanced clustering of the histogram
[Algorithm pseudocode images FDA0002958556020000092 and FDA0002958556020000101]
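(Since the pseudocode is preserved only as images, here is a minimal sketch of the greedy left-to-right loop described above. It simplifies the objective to reconstruction error plus expected Laplace error and adds a hypothetical outlier_penalty so that case 3 differs from case 2; both simplifications are assumptions.)

```python
def greedy_cluster(sorted_counts, epsilon, outlier_penalty=1.0):
    """Greedy left-to-right histogram clustering (sketch).

    For each bucket h, three objective values are compared -- merge h
    into the current cluster (case 1), close the current cluster and
    start a new one with h (case 2), or set h aside as outlier data
    (case 3, where err is 0 and only noise plus a hypothetical outlier
    penalty remain) -- and the minimum wins.
    """
    def err(cluster):
        mean = sum(cluster) / len(cluster)
        return sum((c - mean) ** 2 for c in cluster)

    if not sorted_counts:
        return [], []
    lap = 2.0 / epsilon ** 2            # expected squared Laplace error
    clusters, outliers = [], []
    current = [sorted_counts[0]]
    for h in sorted_counts[1:]:
        f_merge = err(current + [h]) + lap                 # case 1
        f_next = err(current) + 2 * lap                    # case 2
        f_alone = err(current) + lap + outlier_penalty     # case 3
        best = min(f_merge, f_next, f_alone)
        if best == f_merge:
            current.append(h)
        elif best == f_next:
            clusters.append(current)
            current = [h]
        else:
            outliers.append(h)
    clusters.append(current)
    return clusters, outliers
```

A larger outlier_penalty makes the loop prefer merging over isolating buckets, mirroring the outlier-balancing intent of the objective.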
Finally, Laplace noise is added to the merged clusters to achieve the final differential privacy guarantee.
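(A minimal sketch of this final step, assuming each merged cluster is published as its noisy mean: averaging |C| unit-sensitivity counts reduces the query sensitivity to 1/|C|, so Laplace noise of scale 1/(ε·|C|) suffices. The negative-value clamp is an assumption.)

```python
import numpy as np

def publish(clusters, epsilon):
    """Publish each cluster as a noisy mean (sketch).

    Merging |C| buckets and publishing their common mean lets the
    Laplace scale shrink to 1/(epsilon * |C|) for a count query of
    sensitivity 1 -- the usual accounting for grouped histogram release.
    """
    rng = np.random.default_rng()
    noisy = []
    for cluster in clusters:
        mean = float(np.mean(cluster))
        noise = rng.laplace(0.0, 1.0 / (epsilon * len(cluster)))
        noisy.extend([max(mean + noise, 0.0)] * len(cluster))  # clamp negatives
    return noisy
```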
CN201810228544.1A 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection Active CN108446568B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810228544.1A CN108446568B (en) 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810228544.1A CN108446568B (en) 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection

Publications (2)

Publication Number Publication Date
CN108446568A CN108446568A (en) 2018-08-24
CN108446568B (en) 2021-04-13

Family

ID=63195920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810228544.1A Active CN108446568B (en) 2018-03-19 2018-03-19 Histogram data publishing method for trend analysis differential privacy protection

Country Status (1)

Country Link
CN (1) CN108446568B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492047A (en) * 2018-11-22 2019-03-19 河南财经政法大学 A kind of dissemination method of the accurate histogram based on difference privacy
CN109886030B (en) * 2019-01-29 2021-06-11 南京邮电大学 Privacy minimum exposure method facing service combination
CN110443063B (en) * 2019-06-26 2023-03-28 电子科技大学 Adaptive privacy-protecting federal deep learning method
CN110674830B (en) * 2019-12-06 2020-05-19 数字广东网络建设有限公司 Image privacy identification method and device, computer equipment and storage medium
CN111104922B (en) * 2019-12-30 2022-03-08 深圳纹通科技有限公司 Feature matching algorithm based on ordered sampling
CN111259442B (en) * 2020-01-15 2022-04-29 广西师范大学 Differential privacy protection method for decision tree under MapReduce framework
CN112035880B (en) * 2020-09-10 2024-02-09 辽宁工业大学 Track privacy protection service recommendation method based on preference perception
CN112307078B (en) * 2020-09-29 2022-04-15 安徽工业大学 Data stream differential privacy histogram publishing method based on sliding window
CN112560984B (en) * 2020-12-25 2022-04-05 广西师范大学 Differential privacy protection method for self-adaptive K-Nets clustering
CN112667712B (en) * 2020-12-31 2023-03-17 安徽工业大学 Grouped accurate histogram data publishing method based on differential privacy
CN115811726B (en) * 2023-01-20 2023-04-25 武汉大学 Privacy protection method and system for dynamic release of mobile terminal position data
CN116090916B (en) * 2023-04-10 2023-06-16 淄博海草软件服务有限公司 Early warning system for enterprise internal purchase fund accounting

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
CN107766740A (en) * 2017-10-20 2018-03-06 辽宁工业大学 A kind of data publication method based on difference secret protection under Spark frameworks

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104809408B (en) * 2015-05-08 2017-11-28 中国科学技术大学 A kind of histogram dissemination method based on difference privacy
US10628608B2 (en) * 2016-06-29 2020-04-21 Sap Se Anonymization techniques to protect data
CN106991335B (en) * 2017-02-20 2020-02-07 美达科林(南京)医药科技有限公司 Data publishing method based on differential privacy protection

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105138661A (en) * 2015-09-02 2015-12-09 西北大学 Hadoop-based k-means clustering analysis system and method of network security log
CN107766740A (en) * 2017-10-20 2018-03-06 辽宁工业大学 A kind of data publication method based on difference secret protection under Spark frameworks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Collaborative Differentially Private Outlier Detection for Categorical Data; Hafiz Asif et al.; 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC); 2017-01-09; full text *
A data publishing method based on differential privacy; Ma Yuelei et al.; Journal of Beijing Information Science and Technology University; 2016-06-30; Vol. 31, No. 3; full text *

Also Published As

Publication number Publication date
CN108446568A (en) 2018-08-24

Similar Documents

Publication Publication Date Title
CN108446568B (en) Histogram data publishing method for trend analysis differential privacy protection
Ramsay et al. Applied functional data analysis: methods and case studies
Lafferty et al. Rodeo: Sparse, greedy nonparametric regression
CN109657891B (en) Load characteristic analysis method based on self-adaptive k-means + + algorithm
CN111737744B (en) Data publishing method based on differential privacy
CN111291822B (en) Equipment running state judging method based on fuzzy clustering optimal k value selection algorithm
CN108509843A (en) A kind of face identification method of the Huber constraint sparse codings based on weighting
CN110119540B (en) Multi-output gradient lifting tree modeling method for survival risk analysis
Sun et al. Nearest neighbors-based adaptive density peaks clustering with optimized allocation strategy
CN110909792A (en) Clustering analysis method based on improved K-means algorithm and new clustering effectiveness index
CN111625578A (en) Feature extraction method suitable for time sequence data in cultural science and technology fusion field
Chu Age-distribution dynamics and aging indexes
CN116167078A (en) Differential privacy synthetic data publishing method based on maximum weight matching
Rattray A model-based distance for clustering
CN114692565A (en) Method, system and equipment for detecting quality of multi-characteristic-parameter high-speed board card in design stage
CN114549062A (en) Consumer preference prediction method, system, electronic equipment and storage product
CN111832942A (en) Criminal transformation quality assessment system based on machine learning
Wan et al. Graph clustering: block-models and model free results
Ichinose et al. Online kernel-based quantile regression using Huberized pinball loss
Liu et al. Approximate conditional gradient descent on multi-class classification
KR102198459B1 (en) Clustering method and system for financial time series with co-movement relationship
CN111488903A (en) Decision tree feature selection method based on feature weight
Moeinzadeh et al. Combination of harmony search and linear discriminate analysis to improve classification
CN113252586B (en) Hyperspectral image reconstruction method, terminal equipment and computer readable storage medium
Tian et al. Semiparametric quantile modelling of hierarchical data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant