CN113254988B

CN113254988B - High-dimensional sensitive data privacy classified protection publishing method, system, medium and equipment

Info

Publication number: CN113254988B
Application number: CN202110446261.6A
Authority: CN
Inventors: 赵兴文; 洪意阳; 李晖; 朱辉; 寇笑语
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-04-25
Filing date: 2021-04-25
Publication date: 2022-10-14
Anticipated expiration: 2041-04-25
Also published as: CN113254988A

Abstract

The invention belongs to the technical field of secure data publishing and discloses a method, system, medium and device for hierarchical privacy protection and release of high-dimensional sensitive data. The method comprises the following steps: the publisher collects data and selects a total privacy protection parameter ε; after collection, the publisher preprocesses the data; the publisher evaluates the privacy protection level of each data attribute; the publisher perturbs the data with privacy noise; a sparse matrix transformation method is used to obtain a k-dimensional low-dimensional matrix; the low-dimensional matrix is restored, and the matrix before mean-centering is then recovered; and the privacy-protected, noise-added results are assembled into a matrix and data table of m rows and n columns, which forms the privacy-protected version released externally. The method greatly improves the running efficiency of privacy-preserving release of massive high-dimensional sensitive data while accommodating attributes that require different degrees of privacy protection; at the same degree of privacy protection, it also improves data utility.

Description

Method, system, medium and device for hierarchical protection and release of high-dimensional sensitive data privacy

Technical Field

The invention belongs to the technical field of secure data publishing, and in particular relates to a method, system, medium and device for hierarchical privacy protection and release of high-dimensional sensitive data.

Background

With the arrival of the era of big data and cloud computing, high-dimensional data of all kinds is everywhere. Authorities in fields such as medical care, public welfare, finance and public security often need to release data for third parties to analyze and compile statistics on. However, the information these authorities release often contains extremely sensitive data. If the original information is released directly, the sensitive data of individual samples may well be exploited by third parties. When an attacker has strong background knowledge and malicious intent, leakage of this sensitive information can have unpredictable consequences.

Data release is an important means of information disclosure, so securing the release of sensitive data is an important measure. Traditional sensitive-data release can apply the classical Laplace mechanism to add noise directly, so that the information of individual samples is not leaked while the statistical results remain approximately correct. However, the data to be released in fields such as medical care, public welfare, finance and public security is usually high-dimensional and contains many samples. With the traditional Laplace noise-perturbation method, the large volume of data leads to excessive noise and therefore to heavy distortion; releasing such data results in poor data utility and reduces the credibility of the authorities' releases. Moreover, large data sets to be released often contain data with many different sensitivity requirements. Some attributes require strong confidentiality, even at the cost of some utility, while other attributes do not require strict confidentiality and should still be handled so as to preserve utility and reduce the error rate as far as possible. It is therefore of practical value to formulate different privacy protection schemes according to the privacy protection requirements of different attributes.

From the above analysis, the problems and defects of the prior art are as follows:

(1) The traditional privacy protection mechanism is usually the Laplace perturbation mechanism. When applied to massive high-dimensional data, this mechanism introduces excessive noise, and the problem of how to control data distortion urgently needs to be solved.

(2) Traditional privacy protection algorithms for massive high-dimensional data tend to run slowly on existing hardware. How to improve the processing efficiency of such algorithms under limited hardware resources also needs to be solved.

(3) Traditional privacy protection methods for data sets do not consider that different attributes require different degrees of privacy protection, so there is room for improvement in accommodating these differences between attributes.

(4) Existing privacy protection methods for massive high-dimensional data typically first convert the original data matrix into a low-dimensional projection matrix, then add perturbation noise to the projection matrix, and finally recover a privacy-protected data matrix of the same size as the original. Because most of these methods add the noise only after the projection matrix has been computed, they leave considerable room for improvement in balancing privacy protection and utility.

The difficulty in solving the above problems and defects is that a scheme must balance privacy and utility for the different attributes within a data set while also improving the overall running efficiency of privacy-preserving release of massive high-dimensional sensitive data sets.

The significance of solving the above problems and defects is that the invention greatly improves the running efficiency of privacy-preserving release of massive high-dimensional sensitive data while accommodating the different privacy protection requirements of the attributes in the data set. In addition, compared with traditional privacy protection methods, it improves data utility at the same degree of privacy protection.

Summary of the Invention

In view of the problems in the prior art, the invention provides a method, system, medium and device for hierarchical privacy protection and release of high-dimensional sensitive data, and in particular a method, system, medium and device for hierarchical privacy protection and release of high-dimensional sensitive data based on block sparse matrix transformation.

The invention is implemented as a method for hierarchical privacy protection and release of high-dimensional sensitive data, the method comprising:

The input is the original data, consisting of m data samples, each of dimension n, together with a privacy protection parameter ε that must be set. The output is a privacy-protected version of the data, also consisting of m data samples of dimension n; this perturbed, privacy-protected data is what is finally released to the public.

The method comprises seven stages: data collection, data preprocessing, data privacy protection level evaluation, data perturbation, data transformation, data restoration and data release.

Further, the method comprises the following steps:

Step 1, data collection: the publisher collects data and selects an appropriate total privacy protection parameter ε. Data collection is the basis for the subsequent steps, and the corresponding parameters are selected at the same time.

Step 2, data preprocessing: after data collection is complete, the publisher preprocesses the data. Each sample is scanned; if any of its n attribute values is null, that value is filled with 0, so that every one of the n attributes has a value. All the data is then arranged into a single matrix of m rows and n columns, that is, a matrix with m samples and dimension n. This step turns the original data from Step 1 into a data matrix.

Step 3, data privacy protection level evaluation: the publisher evaluates the privacy protection level of each data attribute and rearranges the attributes accordingly. The data set is divided into relatively low-dimensional block matrices, each of dimension p, and each block matrix is labeled with an overall sensitivity level. The data matrix is partitioned first so that Step 4 does not incur the long processing time of computing the covariance matrix of the full original matrix directly. Labeling the sensitivity level of each block prepares for the subsequent privacy budget allocation.

Step 4, data perturbation: the publisher adds Wishart privacy noise to the covariance matrix of each block matrix. This privacy-preserving noise perturbation protects the original information from being leaked.

Step 5, data transformation: a sparse matrix transformation method is used to obtain the eigenvector matrix and the corresponding diagonal matrix of eigenvalues, and the eigenvectors corresponding to the k largest eigenvalues are selected. The eigenvector matrix is transposed and multiplied by the mean-centered matrix to obtain a k-dimensional low-dimensional matrix, where k ≤ p. Using the sparse matrix transformation method further improves the processing efficiency of the dimensionality reduction substantially.

Step 6, data restoration: the low-dimensional matrix is restored, and the matrix before mean-centering is then recovered, returning the matrix to its initial form.

Step 7, data release: the m-row, p-column block matrices obtained by restoring all of the noise-perturbed low-dimensional matrices are concatenated to form a privacy-protected, noise-added matrix of m rows and n columns, and the corresponding attribute names are added as the table header. The columns are then rearranged to match the attribute order of the original data table, forming a complete data table, which is the privacy-protected version released externally. Building on Step 6, the corresponding data processing includes recovering the data matrix before mean-centering and removing the zero-filled columns from the last block. Finally, the header names are added to form the complete table for release.
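The preprocessing in Step 2 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: it assumes NumPy, assumes that a null attribute arrives as `None` or NaN, and the function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(samples, n):
    """Build the m-by-n data matrix, replacing null attribute values with 0.

    Assumption: a null value is represented as None or NaN in `samples`.
    """
    matrix = np.zeros((len(samples), n), dtype=float)
    for i, row in enumerate(samples):
        if len(row) != n:
            raise ValueError(f"sample {i} has {len(row)} attributes, expected {n}")
        for j, value in enumerate(row):
            # Null attributes keep the 0 they were initialized with.
            if value is not None and not np.isnan(value):
                matrix[i, j] = value
    return matrix

raw = [[1.0, None, 3.0], [4.0, 5.0, float("nan")]]
X = preprocess(raw, n=3)
```

Here `X` is the m-row, n-column matrix that the later steps operate on.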

Further, the data collection stage comprises a data gathering phase and a parameter selection phase.

In the data gathering phase, the data of each individual sample is collected, the specific values of the n attributes of each sample are determined, and the total number m of individual samples in the data set is updated in real time;

In the parameter selection phase, the privacy protection parameter ε represents the allocated privacy budget. The larger ε is, the weaker the privacy protection and the higher the data utility; the smaller ε is, the stronger the privacy protection and the lower the data utility. The value of ε therefore needs to be adjusted repeatedly against the actual privacy protection requirements until an appropriate total privacy protection parameter ε is determined.
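The trade-off governed by ε can be made concrete with the Laplace mechanism that the background section uses as its baseline. The sketch below is only an illustration of that baseline, not of the Wishart-based method of the invention; the function name and the default sensitivity of 1 are assumptions.

```python
def laplace_noise_std(epsilon, sensitivity=1.0):
    """Standard deviation of Laplace noise calibrated to a privacy budget.

    The Laplace mechanism draws noise with scale b = sensitivity / epsilon,
    and a Laplace distribution with scale b has standard deviation sqrt(2) * b.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return 2 ** 0.5 * scale

# A smaller budget means more noise (stronger privacy, lower utility).
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: noise std = {laplace_noise_std(eps):.3f}")
```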

Further, the data privacy protection level evaluation stage comprises an attribute sensitivity evaluation phase, an attribute rearrangement phase, a data set partitioning phase and a block sensitivity labeling phase.

In the attribute sensitivity evaluation phase, the sensitivity level of each attribute column is evaluated, and each column is labeled with a relative level of high, medium or low;

In the attribute rearrangement phase, after the sensitivity levels have been labeled, the columns of the matrix are arranged by the sensitivity level of each attribute: the higher an attribute's sensitivity level, the earlier it is placed, that is, the lower its column index. Once all attributes have been arranged by sensitivity level, they form a new matrix of m rows and n columns;

In the data set partitioning phase, the new m-row, n-column matrix formed by arranging all attributes by sensitivity level is divided into blocks using a dimension threshold p. If the last block has fewer than p columns, it is padded with the value 0 to form a p-dimensional padded matrix, where p ≤ 10;

In the block sensitivity labeling phase, each p-dimensional block matrix of m rows and p columns is labeled with a level. Because the attribute rearrangement and data set partitioning phases have already sorted and divided the attributes by high, medium and low sensitivity, the overall privacy protection level of each resulting block matrix is classified as high, medium or low.

At this point, the complete m-row, n-column matrix has been divided into several new block matrices X of m rows and p columns, and the block matrices have been classified as high, medium or low according to their privacy protection requirements.
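The four phases above can be sketched as a single function. This is an illustrative sketch assuming NumPy; the names are hypothetical. The text does not say how a block containing attributes of mixed levels is labeled, so the sketch assumes the block takes the level of its most sensitive column, which after sorting is always its first column.

```python
import numpy as np

LEVEL_RANK = {"high": 0, "medium": 1, "low": 2}

def partition_by_sensitivity(X, levels, p):
    """Sort columns by sensitivity, split into p-column blocks, zero-pad the last.

    Returns the column order (needed later to restore the original layout),
    the list of m-by-p blocks, and one level label per block.
    """
    m, n = X.shape
    # Stable sort: most sensitive attributes first, ties keep original order.
    order = sorted(range(n), key=lambda j: LEVEL_RANK[levels[j]])
    sorted_X = X[:, order]
    sorted_levels = [levels[j] for j in order]

    blocks, block_levels = [], []
    for start in range(0, n, p):
        block = sorted_X[:, start:start + p]
        pad = p - block.shape[1]
        if pad:
            # The last block may be short; pad it with zero columns.
            block = np.hstack([block, np.zeros((m, pad))])
        blocks.append(block)
        # Assumption: a block is labeled with its most sensitive column's level.
        block_levels.append(sorted_levels[start])
    return order, blocks, block_levels

X = np.arange(10, dtype=float).reshape(2, 5)
levels = ["low", "high", "medium", "high", "low"]
order, blocks, block_levels = partition_by_sensitivity(X, levels, p=2)
```

Keeping `order` is a design choice of this sketch: the release stage needs it to put the columns back in their original positions.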

Further, the data perturbation module comprises a mean-centering phase, a covariance matrix computation phase, a privacy budget allocation phase, a privacy noise sampling phase and a privacy noise addition phase.

In the mean-centering phase, the mean of each column is subtracted from every value in that column to form the mean-centered matrix X;

In the covariance matrix computation phase, the covariance matrix A of each p-dimensional block matrix is computed from its mean-centered matrix X [formula missing in source];

In the privacy budget allocation phase, privacy budgets are allocated according to the privacy protection level of each sensitive attribute. In differential privacy, the smaller the privacy parameter, the higher the level of privacy protection and the lower the data utility. The privacy budget is divided among the attributes so that high-, medium- and low-sensitivity attribute columns receive per-attribute budgets in the ratio 1:9:90. The budgets of all attribute columns in each block matrix are then summed to obtain the block matrix's privacy budget ε_i;
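A small sketch of the 1:9:90 allocation follows. The text gives only the ratio, so the normalization used here (per-attribute budgets sum to the total ε) is an assumption, as are the function name and the choice to give zero-padded columns no budget by simply leaving them out of the input.

```python
WEIGHTS = {"high": 1, "medium": 9, "low": 90}

def allocate_block_budgets(total_epsilon, block_column_levels):
    """Split a total budget across blocks from per-column sensitivity levels.

    `block_column_levels` lists, for each block, the levels of its real
    (non-padded) columns. Each column gets a share proportional to its
    weight; a block's budget is the sum of its columns' shares.
    Assumption: the shares are normalized to sum to `total_epsilon`.
    """
    total_weight = sum(WEIGHTS[lvl] for cols in block_column_levels for lvl in cols)
    return [
        total_epsilon * sum(WEIGHTS[lvl] for lvl in cols) / total_weight
        for cols in block_column_levels
    ]

budgets = allocate_block_budgets(
    1.0, [["high", "high"], ["medium", "low"], ["low"]]
)
```

With these inputs the block weights are 2, 99 and 90 out of 191, so the highly sensitive block receives the smallest budget and therefore the strongest protection.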

In the privacy noise sampling phase, for a block matrix of m rows and p columns, a positive definite matrix C of m rows and m columns is generated; C has m equal eigenvalues, all of which are set to [formula missing in source]. A noise sample matrix W of m rows and p columns is then drawn from the Wishart distribution W(m+1, C);

In the privacy noise addition phase, the noise sample matrix W is added to the covariance matrix A to form the noise-perturbed covariance matrix A', that is, A' = A + W.
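The perturbation A' = A + W can be sketched as below, with several assumptions stated up front because the formulas for A and for the eigenvalues of C are not reproduced in this text. The sketch assumes NumPy, assumes A = XᵀX / m for a mean-centered m-by-p block X, and takes the common eigenvalue of C as a caller-supplied hypothetical parameter `noise_scale`. The text describes C as m-by-m; the sketch instead assumes the noise has the same p-by-p shape as A so that the sum is defined, with p + 1 degrees of freedom by analogy with W(m+1, C).

```python
import numpy as np

def perturb_covariance(X_centered, noise_scale, rng, df=None):
    """Add Wishart noise to the covariance of a mean-centered m-by-p block.

    Assumptions (not confirmed by the text): A = X^T X / m, the scale
    matrix is noise_scale * I of size p-by-p, and df defaults to p + 1.
    """
    m, p = X_centered.shape
    A = X_centered.T @ X_centered / m
    df = p + 1 if df is None else df
    # If the rows of G are i.i.d. N(0, c*I), then G^T G ~ Wishart(df, c*I).
    G = rng.normal(scale=np.sqrt(noise_scale), size=(df, p))
    W = G.T @ G
    return A + W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X_centered = X - X.mean(axis=0)
A_noisy = perturb_covariance(X_centered, noise_scale=0.05, rng=rng)
```

Because W is positive semidefinite, A' stays symmetric and positive semidefinite, so its eigendecomposition in the transformation stage remains well defined.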

Further, the data restoration stage comprises a low-dimensional matrix restoration phase and a mean restoration phase.

In the low-dimensional matrix restoration phase, the eigenvector matrix is multiplied by the k-dimensional low-dimensional matrix to obtain the restored matrix; at this point the column means have not yet been added back;

In the mean restoration phase, the mean of the corresponding column of the original matrix is added to each element of the restored matrix to obtain the fully restored matrix, and the attribute columns previously filled with 0 in the last block are removed.
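The transformation and restoration stages together amount to projecting onto the top-k eigenvectors and mapping back. The sketch below assumes NumPy and uses a dense eigendecomposition (`np.linalg.eigh`) as a stand-in for the sparse matrix transformation method named in the text, which is not implemented here. With samples stored as rows, multiplying the mean-centered matrix by the eigenvector matrix is the row-wise equivalent of "transpose the eigenvector matrix and multiply". The function name is hypothetical.

```python
import numpy as np

def project_and_restore(X, A_noisy, k):
    """Reduce an m-by-p block to k dimensions and restore it to m-by-p.

    X is the block before mean-centering; A_noisy is its (perturbed)
    p-by-p covariance matrix.
    """
    means = X.mean(axis=0)
    X_centered = X - means
    eigvals, eigvecs = np.linalg.eigh(A_noisy)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    V_k = eigvecs[:, top]                        # p-by-k eigenvector matrix
    Y = X_centered @ V_k                         # m-by-k low-dimensional matrix
    X_restored = Y @ V_k.T + means               # restore, then add means back
    return Y, X_restored

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
A = np.cov(X, rowvar=False)   # noise-free here; the method would pass A'
Y, X_hat = project_and_restore(X, A, k=2)
```

When k equals p the restoration is exact up to floating-point error; for k < p the restored block is an approximation of the original.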

Another object of the invention is to provide a system for hierarchical privacy protection and release of high-dimensional sensitive data that applies the above method, the system comprising:

a data collection module, used by the publisher to collect data and select an appropriate total privacy protection parameter ε;

a data preprocessing module, used to preprocess the data after the publisher has finished collecting it: each sample is scanned, and if any of its n attribute values is null, that value is filled with 0 so that every one of the n attributes has a value; all the data is then arranged into a single matrix of m rows and n columns, that is, a matrix with m samples and dimension n;

a data privacy protection level evaluation module, used by the publisher to evaluate the privacy protection level of each data attribute and rearrange the attributes accordingly; the data set is divided into relatively low-dimensional block matrices, each of dimension p, and each block matrix is labeled with an overall sensitivity level;

a data perturbation module, used by the publisher to add Wishart privacy noise to the covariance matrix of each block matrix;

a data transformation module, used to obtain the eigenvector matrix and the corresponding diagonal matrix of eigenvalues by a sparse matrix transformation method and to select the eigenvectors corresponding to the k largest eigenvalues; the eigenvector matrix is transposed and multiplied by the mean-centered matrix to obtain a k-dimensional low-dimensional matrix, where k ≤ p;

a data restoration module, used to restore the low-dimensional matrix and then recover the matrix before mean-centering;

a data release module, used to concatenate the m-row, p-column block matrices obtained by restoring all of the noise-perturbed low-dimensional matrices into a privacy-protected, noise-added matrix of m rows and n columns, and to add the corresponding attribute names as the table header; the columns are rearranged to match the attribute order of the original data table, forming a complete data table, which is the privacy-protected version released externally.
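The reassembly performed by the data release module can be sketched as follows. The sketch assumes NumPy and assumes that the column order used when the attributes were sorted by sensitivity was recorded as a list `order`, where position j of the sorted matrix holds original column `order[j]`; the function name is hypothetical.

```python
import numpy as np

def assemble_release(blocks, order, n):
    """Concatenate restored m-by-p blocks and undo the sensitivity ordering.

    Zero-padded columns sit at the end of the last block, so keeping the
    first n columns of the concatenation removes them.
    """
    stacked = np.hstack(blocks)[:, :n]
    released = np.empty_like(stacked)
    # Column j of the sorted matrix goes back to original position order[j].
    released[:, order] = stacked
    return released

# Round trip on a toy matrix: sort, pad, split into blocks, then reassemble.
X = np.arange(10, dtype=float).reshape(2, 5)
order = [1, 3, 2, 0, 4]
padded = np.hstack([X[:, order], np.zeros((2, 1))])
blocks = [padded[:, i:i + 2] for i in range(0, 6, 2)]
table = assemble_release(blocks, order, n=5)
```

The attribute names from the original table header can then be attached to the columns of `table` in their original order.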

Another object of the invention is to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:

data collection: the publisher collects data and selects an appropriate total privacy protection parameter ε;

data preprocessing: after data collection is complete, the publisher preprocesses the data; each sample is scanned, and if any of its n attribute values is null, that value is filled with 0 so that every one of the n attributes has a value; all the data is then arranged into a single matrix of m rows and n columns, that is, a matrix with m samples and dimension n;

data privacy protection level evaluation: the publisher evaluates the privacy protection level of each data attribute and rearranges the attributes accordingly; the data set is divided into relatively low-dimensional block matrices, each of dimension p, and each block matrix is labeled with an overall sensitivity level;

data perturbation: the publisher adds Wishart privacy noise to the covariance matrix of each block matrix;

data transformation: a sparse matrix transformation method is used to obtain the eigenvector matrix and the corresponding diagonal matrix of eigenvalues, and the eigenvectors corresponding to the k largest eigenvalues are selected; the eigenvector matrix is transposed and multiplied by the mean-centered matrix to obtain a k-dimensional low-dimensional matrix, where k ≤ p;

data restoration: the low-dimensional matrix is restored, and the matrix before mean-centering is then recovered;

data release: all of the m-row, p-column block matrices that have been transformed, perturbed with privacy noise and restored are concatenated to form a privacy-protected, noise-added matrix of m rows and n columns, and the corresponding attribute names are added as the table header; the columns are rearranged to match the attribute order of the original data table, forming a complete data table, which is the privacy-protected version released externally.

Another object of the invention is to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:

data release: the m-row, p-column block matrices obtained by restoring all of the noise-perturbed low-dimensional matrices are concatenated to form a privacy-protected, noise-added matrix of m rows and n columns, and the corresponding attribute names are added as the table header; the columns are rearranged to match the attribute order of the original data table, forming a complete data table, which is the privacy-protected version released externally.

Another object of the invention is to provide an information data processing terminal used to implement the above system for hierarchical privacy protection and release of high-dimensional sensitive data.

Taking all of the above technical solutions together, the advantages and positive effects of the invention are as follows. The method for hierarchical privacy protection and release of high-dimensional sensitive data based on block sparse matrix transformation greatly improves the running efficiency of privacy-preserving release of massive high-dimensional sensitive data while accommodating the different privacy protection requirements of the attributes in the data set. In addition, compared with traditional privacy protection methods, it improves data utility at the same degree of privacy protection.

The invention makes full use of the fast dimensionality reduction offered by block sparse matrix transformation. In response to the security risks of direct data release and the different privacy protection levels required by different categories of attributes, it proposes a hierarchical privacy protection scheme for high-dimensional data based on block sparse matrix transformation. It also introduces differential privacy into the privacy-preserving release of high-dimensional sensitive data, which ensures graded, secure release of large-scale high-dimensional sensitive data and resists attacks by adversaries with strong background knowledge. Furthermore, the proposed block sparse matrix transformation method greatly increases the speed of privacy-preserving computation on massive data and reduces data distortion while providing the same degree of privacy protection as traditional methods. The invention is applicable to the hierarchical privacy-preserving release of massive high-dimensional sensitive data in fields such as medical care, public welfare, finance and public security.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the method for hierarchical privacy protection and release of high-dimensional sensitive data provided by an embodiment of the invention.

FIG. 2 is a structural block diagram of the system for hierarchical privacy protection and release of high-dimensional sensitive data provided by an embodiment of the invention.

In the figure: 1, data collection module; 2, data preprocessing module; 3, data privacy protection level evaluation module; 4, data perturbation module; 5, data transformation module; 6, data restoration module; 7, data release module.

FIG. 3 is a schematic diagram of an application scenario of the method provided by an embodiment of the invention.

FIG. 4 is a data processing flowchart for an example of the method provided by an embodiment of the invention.

FIG. 5 is a schematic diagram, provided by an embodiment of the invention, comparing the traditional differential privacy Laplace noise mechanism with the method of the invention.

图6是本发明实施例提供的对比实验示意图。FIG. 6 is a schematic diagram of a comparative experiment provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it.

In view of the problems in the prior art, the present invention provides a hierarchical privacy protection and release method, system, medium, and device for high-dimensional sensitive data, described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, the hierarchical privacy protection and release method for high-dimensional sensitive data provided by an embodiment of the present invention comprises the following steps:

S101, data acquisition: the publisher collects the data and selects an appropriate total privacy protection parameter ε;

S102, data preprocessing: after the publisher finishes collecting the data, data preprocessing is performed;

S103, data privacy protection level evaluation: the data publisher evaluates the privacy protection level of each data attribute;

S104, data perturbation: the publisher applies privacy noise perturbation to the data;

S105, data transformation: a k-dimensional low-dimensional matrix is obtained using a sparse matrix transform method;

S106, data restoration: the low-dimensional matrix is restored, and the matrix before mean-centering is then recovered;

S107, data release: a privacy-protected, noise-added matrix of m rows and n columns and the corresponding data table are formed, producing the privacy version of the data for external release.

As shown in FIG. 2, the hierarchical privacy protection and release system for high-dimensional sensitive data provided by an embodiment of the present invention comprises:

Data acquisition module 1, used by the publisher to collect the data and select an appropriate total privacy protection parameter ε;

Data preprocessing module 2, used to preprocess the data after the publisher finishes collecting it; each sample is scanned, and if any of its n attribute values is null, that value is filled with 0 so that every one of the n attributes has a value; all data are then arranged into a single matrix of m rows and n columns, that is, the matrix has m samples and dimension n;

Data privacy protection level evaluation module 3, used by the data publisher to evaluate the privacy protection level of each data attribute and then rearrange the attributes; the data set is divided into relatively low-dimensional block matrices, each of dimension p after blocking, and the overall sensitivity level of each block matrix is labeled;

Data perturbation module 4, used by the publisher to add Wishart privacy noise perturbation to the covariance matrix of each block matrix;

Data transformation module 5, used to obtain the eigenvector matrix and the corresponding diagonal eigenvalue matrix by the sparse matrix transform method, and to take the eigenvectors corresponding to the k largest values of the diagonal eigenvalue matrix; the eigenvector matrix is transposed and multiplied by the mean-centered matrix to obtain a k-dimensional low-dimensional matrix, where k ≤ p;

Data restoration module 6, used to restore the low-dimensional matrix and then recover the matrix before mean-centering;

Data release module 7, used to concatenate the m-row, p-column block matrices obtained by restoring all the privacy-noise-perturbed low-dimensional matrices into a privacy-protected, noise-added matrix of m rows and n columns, and to add the corresponding attribute names to the table header; the columns are then restored to the attribute order of the original data table to form a complete data table, producing the privacy version of the data for external release.

FIG. 3 is a schematic diagram of an application scenario of the hierarchical privacy protection and release method for high-dimensional sensitive data provided by an embodiment of the present invention.

The technical solutions of the present invention are further described below with reference to the embodiments.

The present invention addresses the unsatisfactory privacy protection and excessive computation time of existing methods for releasing massive high-dimensional sensitive data. It proposes a privacy protection release method for high-dimensional sensitive data based on block sparse matrix transformation, which achieves efficient processing when such data is released and a balance between privacy protection and data error.

As shown in FIG. 4, the privacy protection release method for high-dimensional sensitive data based on block sparse matrix transformation takes as input m raw data samples, each of dimension n, together with a privacy protection parameter ε that must be set. It outputs a privacy-protected version of m data samples, each of dimension n, and this perturbed data is what is finally released to the public. The whole process comprises seven stages: data acquisition, data preprocessing, data privacy protection level evaluation, data perturbation, data transformation, data restoration, and data release. Each stage is described below.

(1) Data acquisition stage, comprising two sub-stages: data collection and parameter selection.

Data collection stage: the data of each individual sample is collected, the specific values of the n attributes in each sample are determined, and the total number m of individual samples in the data set is updated in real time.

Parameter selection stage: the privacy protection parameter ε represents the allocated privacy budget. The larger ε is, the weaker the privacy protection and the higher the data availability; the smaller ε is, the stronger the privacy protection and the lower the data availability. The value of ε therefore has to be adjusted repeatedly, and the appropriate total privacy protection parameter ε is finally determined according to the privacy protection actually required.

(2) Data preprocessing stage: each sample is scanned, and if any of its n attribute values is null, that value is filled with 0 so that every one of the n attributes has a value. All data are then arranged into a single matrix of m rows and n columns, that is, the matrix has m samples and dimension n.
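The preprocessing step can be illustrated with a minimal sketch. It assumes that a null attribute is represented as `None` and that all attributes are numeric; the function name `preprocess` and the sample values are hypothetical.

```python
from typing import List, Optional, Sequence


def preprocess(samples: Sequence[Sequence[Optional[float]]], n: int) -> List[List[float]]:
    """Return an m x n matrix in which every null (None) attribute is replaced by 0."""
    matrix: List[List[float]] = []
    for sample in samples:
        if len(sample) != n:
            raise ValueError("every sample must have exactly n attributes")
        # Fill null values with 0 so that each of the n attributes has a value.
        matrix.append([0.0 if value is None else float(value) for value in sample])
    return matrix


# Hypothetical input: m = 2 samples with n = 3 attributes each.
raw = [[1.5, None, 3.0], [None, 2.0, 4.0]]
X = preprocess(raw, n=3)
```

Here `X` has one row per sample and one column per attribute, matching the m-row, n-column layout described above.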

(3) Data privacy protection level evaluation stage, comprising four sub-stages: attribute sensitivity level evaluation, attribute rearrangement, data set division, and block sensitivity level labeling.

Attribute sensitivity level evaluation stage: the sensitivity level of each attribute column is evaluated, and each column is labeled with a relative level of high, medium, or low.

Attribute rearrangement stage: once the sensitivity levels are labeled, the columns of the matrix are arranged by the sensitivity level of each attribute. The higher an attribute's sensitivity level, the earlier it is placed (the lower its dimension index). After all attributes have been arranged by sensitivity level, a new matrix of m rows and n columns is formed.
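The rearrangement can be sketched as a stable sort of column indices by sensitivity level. The level names `"high"`, `"medium"`, `"low"` and the function name `rearrange_columns` are assumptions made for illustration; the returned `order` list is kept so that the original attribute order can be restored at release time.

```python
from typing import Dict, List, Sequence, Tuple

# Lower rank means higher sensitivity, so these columns come first.
LEVEL_RANK: Dict[str, int] = {"high": 0, "medium": 1, "low": 2}


def rearrange_columns(
    matrix: Sequence[Sequence[float]], levels: Sequence[str]
) -> Tuple[List[List[float]], List[int]]:
    """Sort columns by sensitivity level; order[new_pos] is the original column index."""
    # sorted() is stable, so columns of equal level keep their original relative order.
    order = sorted(range(len(levels)), key=lambda j: LEVEL_RANK[levels[j]])
    reordered = [[row[j] for j in order] for row in matrix]
    return reordered, order


# Hypothetical example with n = 4 attributes.
levels = ["low", "high", "medium", "high"]
matrix = [[10.0, 20.0, 30.0, 40.0]]
reordered, order = rearrange_columns(matrix, levels)
```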

Data set division stage: after the attributes are arranged by sensitivity level, the matrix dimension n is usually large, which severely slows the covariance matrix computation in the later data processing stages. Census data tables, for example, have very high dimension, often with hundreds of attribute categories, which makes the covariance computation costly. Dividing the n-dimensional matrix, once sorted by sensitivity level, into several lower-dimensional matrices is therefore important for the efficiency of the subsequent covariance processing. The dimension of each block matrix should preferably be kept within 10. Accordingly, once all attributes have been arranged by sensitivity level into a new matrix of m rows and n columns, the matrix is partitioned into blocks using a dimension threshold p (p ≤ 10). If the last block has fewer than p dimensions, it is padded with the value 0 to form a p-dimensional padded matrix.
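A minimal sketch of the blocking step follows. It assumes the matrix is non-empty and already sorted by sensitivity level; the function name `split_into_blocks` is hypothetical.

```python
from typing import List, Sequence


def split_into_blocks(matrix: Sequence[Sequence[float]], p: int) -> List[List[List[float]]]:
    """Split an m x n matrix into m x p column blocks, zero-padding the last block."""
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(matrix[0])
    blocks: List[List[List[float]]] = []
    for start in range(0, n, p):
        block = [list(row[start:start + p]) for row in matrix]
        pad = p - len(block[0])
        if pad:
            # The last block has fewer than p columns: fill the rest with 0.
            block = [row + [0.0] * pad for row in block]
        blocks.append(block)
    return blocks


# Hypothetical example: n = 5 columns split with threshold p = 2.
matrix = [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
blocks = split_into_blocks(matrix, p=2)
```

With n = 5 and p = 2 this yields three blocks, the last of which carries one padding column.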

Block sensitivity level labeling stage: each p-dimensional block matrix of m rows and p columns is given a level label. Because the two preceding stages (attribute rearrangement and data set division) have already sorted and partitioned the attributes by high, medium, and low sensitivity, the block matrices can readily be assigned one of the same three levels; that is, the overall privacy protection degree of each block matrix is classified as high, medium, or low.

(4) Data perturbation stage, comprising five sub-stages: data mean-centering, covariance matrix calculation, privacy budget parameter allocation, privacy noise extraction, and data privacy noise addition.

Data mean-centering stage: to avoid the influence of differing units of measurement, the mean of its column is subtracted from each value, forming a mean-centered matrix X.
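Mean-centering a block can be sketched as follows; the function name `center_columns` is hypothetical, and the column means are returned because the restoration stage needs them again.

```python
from typing import List, Sequence, Tuple


def center_columns(block: Sequence[Sequence[float]]) -> Tuple[List[List[float]], List[float]]:
    """Subtract each column's mean from its entries; return the centered block and the means."""
    m = len(block)
    means = [sum(column) / m for column in zip(*block)]
    centered = [[value - mu for value, mu in zip(row, means)] for row in block]
    return centered, means


# Hypothetical 2 x 2 block.
block = [[1.0, 10.0], [3.0, 30.0]]
centered, means = center_columns(block)
```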

Privacy budget parameter allocation stage: different privacy protection budgets are allocated according to the privacy protection level of each sensitive attribute. In differential privacy, the smaller the privacy protection parameter, the higher the corresponding privacy protection level and the lower the data availability. In this stage, therefore, the privacy budget is apportioned to the attributes in the ratio 1:9:90 for high-, medium-, and low-sensitivity attribute columns, respectively. The privacy budgets of all attribute columns in each block matrix are then summed to obtain the privacy budget ε_i of that block matrix.
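The allocation can be sketched as below. The text gives the 1:9:90 ratio but not how it is normalised, so this sketch assumes the per-attribute budgets are scaled so that they sum to the total ε; that normalisation, the level names, and the function name `allocate_budgets` are assumptions, and padding columns are assumed to receive no budget.

```python
from typing import Dict, List, Sequence, Tuple

# Relative weights for high-, medium-, and low-sensitivity columns (ratio 1:9:90).
WEIGHTS: Dict[str, float] = {"high": 1.0, "medium": 9.0, "low": 90.0}


def allocate_budgets(
    levels: Sequence[str], total_epsilon: float, p: int
) -> Tuple[List[float], List[float]]:
    """Return per-attribute budgets and per-block budgets epsilon_i.

    Assumption: the weights are normalised so the attribute budgets sum to total_epsilon.
    """
    weights = [WEIGHTS[level] for level in levels]
    total_weight = sum(weights)
    per_attribute = [total_epsilon * w / total_weight for w in weights]
    # epsilon_i is the sum of the budgets of the attribute columns in block i.
    per_block = [sum(per_attribute[s:s + p]) for s in range(0, len(per_attribute), p)]
    return per_attribute, per_block


# Hypothetical example: four attributes already sorted by level, blocks of p = 2.
levels = ["high", "high", "medium", "low"]
per_attribute, per_block = allocate_budgets(levels, total_epsilon=1.01, p=2)
```

In this example the high-sensitivity block receives a much smaller ε_i than the block holding the medium- and low-sensitivity columns, which corresponds to stronger protection for the more sensitive attributes.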

Privacy noise extraction stage: because a covariance matrix is symmetric and positive semi-definite, the perturbation noise added to it must also satisfy these two properties. Specifically, for a block matrix of m rows and p columns, a positive definite matrix C of m rows and m columns is first generated, whose m eigenvalues are all equal and are set to a common value (the formula for this value is not reproduced in this text).

A noise sample matrix W of m rows and p columns is then drawn from the Wishart distribution W(m+1, C). This ensures that the whole noise-perturbed block matrix satisfies (ε, 0)-differential privacy: for adjacent block matrices (two block matrices that differ in only one element), the outputs after noise perturbation are nearly indistinguishable. Even an attacker who knows everything except one element cannot learn that element's exact value, which gives strong privacy protection.

Data privacy noise addition stage: the noise sample matrix W from this module is added to the covariance matrix A to form the noise-perturbed covariance matrix A', that is, A' = A + W.
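The noise extraction and addition stages can be sketched as below, under several loudly stated assumptions. The covariance is normalised by 1/m, the noise is drawn as a p x p matrix so that it can be added to the p x p covariance matrix, and the common eigenvalue of the scale matrix (`scale`) and the degrees of freedom (`df`) are left as parameters because the formula for the eigenvalue is not available in this text. The function names are hypothetical. The sampler uses the standard fact that a sum of `df` outer products of independent N(0, scale·I) vectors follows a Wishart distribution with `df` degrees of freedom and scale matrix scale·I.

```python
import random
from typing import List, Sequence


def covariance(centered: Sequence[Sequence[float]]) -> List[List[float]]:
    """Covariance of a mean-centered m x p block, normalised by 1/m (an assumption)."""
    m, p = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / m for b in range(p)] for a in range(p)]


def sample_wishart(p: int, df: int, scale: float, rng: random.Random) -> List[List[float]]:
    """Draw a p x p sample from Wishart(df, scale * I) as a sum of df Gaussian outer products."""
    W = [[0.0] * p for _ in range(p)]
    sd = scale ** 0.5
    for _ in range(df):
        z = [rng.gauss(0.0, sd) for _ in range(p)]
        for a in range(p):
            for b in range(p):
                W[a][b] += z[a] * z[b]
    return W


def add_noise(A: Sequence[Sequence[float]], W: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return A' = A + W."""
    return [[a + w for a, w in zip(row_a, row_w)] for row_a, row_w in zip(A, W)]


# Hypothetical example; df and scale are illustrative values only.
rng = random.Random(0)
centered = [[-1.0, -10.0], [1.0, 10.0]]
A = covariance(centered)
W = sample_wishart(p=2, df=3, scale=0.01, rng=rng)
A_noisy = add_noise(A, W)
```

Because each outer product is symmetric and positive semi-definite, so is their sum, which is why Wishart noise preserves the two properties required of a covariance matrix.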

(5) Data transformation stage:

A projection operator is constructed using a sparse matrix transform algorithm based on eigenvector estimation. The orthogonal eigenvector matrix of the sparse matrix transform can be expressed as the product of a finite sequence of Givens rotations. The sparse matrix transform is carried out as follows:

for h = 1 to H:

① Choose the index pair (i, j) that maximizes the selection criterion (formula not reproduced in this text).

② Compute the rotation angle (formula not reproduced in this text).

③ The sparse matrix G_h of each round applies a Givens rotation to the coordinate pair (i_h, j_h): G_h = I + Θ(i_h, j_h, θ_h), where I is the identity matrix.

④ Compute the quantity given by a formula that is not reproduced in this text.

⑤ Compute the quantity given by a further formula that is not reproduced in this text.
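The idea of building an eigenvector matrix from successive Givens rotations can be illustrated with the classical Jacobi eigenvalue method. This is a stand-in technique, not the sparse matrix transform of the present method: the selection criterion of step ① and the angle formula of step ② are not available in this text, so the sketch picks the largest off-diagonal entry and the angle that zeroes it. The rotation sign convention and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(A: Sequence[Sequence[float]], B: Sequence[Sequence[float]]) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A: Sequence[Sequence[float]]) -> Matrix:
    return [list(col) for col in zip(*A)]


def givens(n: int, i: int, j: int, theta: float) -> Matrix:
    """Identity matrix modified in rows/columns i and j by a plane rotation."""
    G = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    G[i][i], G[j][j], G[i][j], G[j][i] = c, c, s, -s
    return G


def jacobi_eigen(S: Sequence[Sequence[float]], max_rotations: int = 500,
                 tol: float = 1e-12) -> Tuple[List[float], Matrix]:
    """Diagonalise a symmetric matrix; return eigenvalues and E (eigenvectors as columns)."""
    n = len(S)
    A = [[float(v) for v in row] for row in S]
    E = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    pairs = [(r, c) for r in range(n) for c in range(r + 1, n)]
    for _ in range(max_rotations):
        if not pairs:
            break
        # Stand-in selection rule: largest off-diagonal magnitude.
        i, j = max(pairs, key=lambda rc: abs(A[rc[0]][rc[1]]))
        if abs(A[i][j]) < tol:
            break
        # Angle that makes the (i, j) entry of G^T A G equal to zero.
        theta = 0.5 * math.atan2(2.0 * A[i][j], A[j][j] - A[i][i])
        G = givens(n, i, j, theta)
        A = matmul(transpose(G), matmul(A, G))  # update the matrix being diagonalised
        E = matmul(E, G)                        # accumulate the product of rotations
    return [A[d][d] for d in range(n)], E


# Small symmetric example with known eigenvalues 1 and 3.
eigenvalues, E = jacobi_eigen([[2.0, 1.0], [1.0, 2.0]])
```

Stopping after a fixed number H of rotations, rather than iterating to convergence, is what keeps the resulting transform sparse; the sketch exposes that limit as `max_rotations`.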

Each G_h is obtained iteratively, and the number of iterations H can be chosen by cross-validation. The sparse matrix transform method thus yields the eigenvector matrix and, from it, the eigenvalues (the defining formulas are not reproduced in this text). The eigenvectors are then sorted in descending order of their corresponding eigenvalues, and the cumulative contribution rate of the eigenvalue sum is computed. The k largest eigenvalues whose cumulative contribution rate exceeds 95% form Λ = {λ₁, λ₂, ..., λ_k}, and the corresponding eigenvectors form the eigenvector matrix E_k; principal components that rank earlier generally carry more information. The eigenvector matrix is then transposed and multiplied by the mean-centered matrix to obtain a k-dimensional low-dimensional matrix (k ≤ p).
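The selection of k and the projection can be sketched as follows. The sketch assumes samples are stored as rows, so the projection is written as Y = X·E_k, which is the transpose of the product described above; it also assumes the eigenvalues sum to a positive number. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_top_k(eigenvalues: Sequence[float], eigenvectors: Sequence[Sequence[float]],
                 threshold: float = 0.95) -> Tuple[int, List[List[float]]]:
    """Keep the largest eigenvalues until their cumulative share exceeds the threshold.

    eigenvectors holds one eigenvector per column; the result E_k is p x k.
    """
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    total = sum(eigenvalues)
    running, k = 0.0, 0
    for index in order:
        running += eigenvalues[index]
        k += 1
        if running / total > threshold:
            break
    E_k = [[row[index] for index in order[:k]] for row in eigenvectors]
    return k, E_k


def project(centered: Sequence[Sequence[float]], E_k: Sequence[Sequence[float]]) -> List[List[float]]:
    """Y = X E_k: map each mean-centered sample (row) to k dimensions."""
    return [[sum(x * e for x, e in zip(row, col)) for col in zip(*E_k)] for row in centered]


# Hypothetical example: p = 4 with the identity as eigenvector matrix.
eigenvalues = [1.0, 8.0, 0.2, 0.8]
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
k, E_k = select_top_k(eigenvalues, identity)
Y = project([[1.0, 2.0, 3.0, 4.0]], E_k)
```

In this example the cumulative shares are 0.80, 0.90, and 0.98, so three components are kept.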

(6) Data restoration stage, comprising two sub-stages: low-dimensional matrix restoration and mean restoration.

Low-dimensional matrix restoration stage: the eigenvector matrix is multiplied by the k-dimensional low-dimensional matrix to obtain the restored matrix. At this point the column means have not yet been added back to the restored matrix.

Mean restoration stage: the mean of the corresponding column of the original matrix is added to each element of the restored matrix, giving the matrix with its means restored.
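Both restoration sub-stages can be sketched in one function. As an assumption, samples are stored as rows and E_k is a p x k matrix with eigenvectors as columns, so the restored block is Y·E_kᵀ plus the column means; the function name `restore` is hypothetical.

```python
from typing import List, Sequence


def restore(Y: Sequence[Sequence[float]], E_k: Sequence[Sequence[float]],
            means: Sequence[float]) -> List[List[float]]:
    """Map m x k low-dimensional rows back to p dimensions and add the column means."""
    return [
        # Row e_row of E_k holds the k coefficients for one original attribute.
        [sum(y * e for y, e in zip(y_row, e_row)) + mu for e_row, mu in zip(E_k, means)]
        for y_row in Y
    ]


# Hypothetical example with p = 3 attributes reduced to k = 2.
E_k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
Y = [[2.0, -1.0]]
means = [10.0, 20.0, 30.0]
restored = restore(Y, E_k, means)
```

The third attribute is not represented in the two retained components here, so it is restored to its column mean.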

(7) Data release stage: the m-row, p-column block matrices obtained by restoring all the privacy-noise-perturbed low-dimensional matrices are concatenated into a privacy-protected, noise-added matrix of m rows and n columns, and the corresponding attribute names are added to the table header. The columns are then restored to the attribute order of the original data table to form a complete data table. This yields the privacy version of the data for external release.
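The release step can be sketched as concatenating the restored blocks, dropping the zero-padding columns of the last block, and undoing the sensitivity-based column permutation. The argument `order`, in which `order[new_pos]` is the original column index, and the function name `publish` are assumptions made for illustration.

```python
from typing import List, Sequence, Union

Cell = Union[str, float]


def publish(blocks: Sequence[Sequence[Sequence[float]]], n: int,
            order: Sequence[int], names: Sequence[str]) -> List[List[Cell]]:
    """Concatenate m x p blocks, drop padding, restore column order, and add a header."""
    m = len(blocks[0])
    # Join the blocks side by side and keep only the first n (unpadded) columns.
    merged = [[value for block in blocks for value in block[r]][:n] for r in range(m)]
    table = [[0.0] * n for _ in range(m)]
    for new_pos, original_col in enumerate(order):
        for r in range(m):
            table[r][original_col] = merged[r][new_pos]
    header: List[Cell] = list(names)
    return [header] + [list(row) for row in table]


# Hypothetical example: n = 3 attributes, p = 2, so the last block has one padding column.
blocks = [[[20.0, 30.0]], [[10.0, 0.0]]]
published = publish(blocks, n=3, order=[1, 2, 0], names=["a", "b", "c"])
```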

FIG. 5 compares the traditional differential privacy Laplace noise mechanism with the hierarchical privacy protection and release method for high-dimensional sensitive data based on block sparse matrix transformation provided by the present invention. The approximate error rates of the two methods are compared with the privacy protection budget of each attribute held equal across them.

FIG. 6 shows a comparative experiment provided by an embodiment of the present invention, whose purpose is to examine the effect of blocking on the processing speed of the privacy computation. All other variables are held fixed and only one variable, whether the blocking scheme is used, is changed, so that the running efficiency of the two cases can be compared.

Table 1: Comparison of approximate error rates

| Privacy protection budget | Error rate of the traditional Laplace method | Error rate of the present invention |
|---|---|---|
| 0.1 | 1670% | 152% |
| 0.5 | 580% | 98% |
| 1 | 180% | 46% |
| 10 | 35% | 5% |
| 100 | 12% | 0.90% |

Table 2: Comparison of running times for block matrices

| Number of blocks | Running time of the block method of the present invention (10^-4 s) |
|---|---|
| 5 | 26 |
| 6 | 20 |
| 7 | 18 |
| 8 | 16 |
| 9 | 22 |
| 10 | 26 |

By contrast, the traditional unblocked matrix takes 1893×10^-4 seconds, roughly 73 to 118 times longer than the blocked runs in Table 2. The blocking scheme therefore greatly improves the processing efficiency of the privacy protection computation.

The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

The above are only specific embodiments of the present invention, and the protection scope of the present invention is not limited to them. Any modification, equivalent replacement, or improvement made by a person skilled in the art within the technical scope disclosed by the present invention, and within its spirit and principles, shall fall within the protection scope of the present invention.

Claims

1. A hierarchical privacy protection and release method for high-dimensional sensitive data, characterized by comprising:

receiving as input a total of m raw data samples, each of dimension n, with a privacy protection parameter ε to be set; outputting a privacy-protected version of m data samples, each of dimension n; and taking the perturbed data of the privacy-protected version as the data finally released to the public;

the hierarchical privacy protection and release method for high-dimensional sensitive data comprises seven stages: a data acquisition stage, a data preprocessing stage, a data privacy protection level evaluation stage, a data perturbation stage, a data transformation stage, a data restoration stage, and a data release stage;

the hierarchical privacy protection and release method for high-dimensional sensitive data comprises the following steps:

step one, data acquisition: the publisher collects the data and selects an appropriate total privacy protection parameter ε;

step two, data preprocessing: after the publisher finishes collecting the data, data preprocessing is performed; each sample is scanned, and if any of its n attribute values is null, that value is filled with 0 so that every one of the n attributes has a value; all data are arranged into a single matrix of m rows and n columns, that is, the matrix has m samples and dimension n;

step three, data privacy protection level evaluation: the data publisher evaluates the privacy protection level of each data attribute and then rearranges the attributes; the data set is divided into relatively low-dimensional block matrices, each of dimension p after blocking, and the overall sensitivity level of each block matrix is labeled;

step four, data perturbation: the publisher adds Wishart privacy noise perturbation to the covariance matrix of each block matrix;

step five, data transformation: the eigenvector matrix and the corresponding diagonal eigenvalue matrix are obtained by a sparse matrix transform method, and the eigenvectors corresponding to the k largest values of the diagonal eigenvalue matrix are taken; the eigenvector matrix is transposed and multiplied by the mean-centered matrix to obtain a k-dimensional low-dimensional matrix, where k ≤ p;

step six, data restoration: the low-dimensional matrix is restored, and the matrix before mean-centering is then recovered;

step seven, data release: the m-row, p-column block matrices obtained by restoring all the privacy-noise-perturbed low-dimensional matrices are concatenated into a privacy-protected, noise-added matrix of m rows and n columns, and the corresponding attribute names are added to the table header; the columns are restored to the attribute order of the original data table to form a complete data table, producing the privacy version of the data for external release.

2. The hierarchical privacy protection and release method for high-dimensional sensitive data according to claim 1, wherein the data acquisition stage comprises a data collection stage and a parameter selection stage;

in the data collection stage, the data of each individual sample is collected, the specific values of the n attributes in each sample are determined, and the total number m of individual samples in the data set is updated in real time;

in the parameter selection stage, the privacy protection parameter ε represents the allocated privacy budget; the larger ε is, the weaker the privacy protection and the higher the data availability; the smaller ε is, the stronger the privacy protection and the lower the data availability; the value of ε is therefore adjusted repeatedly, and the appropriate total privacy protection parameter ε is finally determined according to the privacy protection actually required.

3. The hierarchical privacy protection and release method for high-dimensional sensitive data according to claim 1, wherein the data privacy protection level evaluation stage comprises an attribute sensitivity level evaluation stage, an attribute rearrangement stage, a data set division stage, and a block sensitivity level labeling stage;

in the attribute sensitivity level evaluation stage, the sensitivity level of each attribute column is evaluated, and each column is labeled with a relative level of high, medium, or low;

in the attribute rearrangement stage, after the sensitivity levels are labeled, the matrix is arranged by the sensitivity level of each attribute; the higher an attribute's sensitivity level, the earlier it is placed, that is, the lower its dimension index; after all attributes have been arranged by sensitivity level, a new matrix of m rows and n columns is formed;

in the data set division stage, after all attributes have been arranged by sensitivity level into a new matrix of m rows and n columns, the matrix is partitioned into blocks using a dimension threshold p; if the last block has fewer than p dimensions, it is padded with the value 0 to form a p-dimensional padded matrix, where p ≤ 10;

in the block sensitivity level labeling stage, each p-dimensional block matrix of m rows and p columns is given a level label; because the attribute rearrangement stage and the data set division stage have already sorted and partitioned the attributes by high, medium, and low sensitivity, the overall privacy protection degree of each block matrix is classified as high, medium, or low;

at this point, the complete matrix of m rows and n columns has been divided into a plurality of new block matrices X, each of m rows and p columns, and the block matrices are classified into three levels (high, medium, and low) according to their different privacy protection requirements.

4. The hierarchical privacy protection and release method for high-dimensional sensitive data according to claim 1, wherein the data perturbation stage comprises: a data mean-centering stage, a covariance matrix calculation stage, a privacy budget parameter allocation stage, a privacy noise extraction stage, and a data privacy noise addition stage;

in the data mean-centering stage, the mean of the column of the block matrix X in which each value lies is subtracted from that value, forming a mean-centered matrix X;

in the covariance matrix calculation stage, the covariance matrix A of the mean-centered matrix X is calculated for each p-dimensional block matrix (formula not reproduced in this text);

in the privacy budget parameter allocation stage, different privacy protection budget parameters are allocated according to the privacy protection level of each sensitive attribute; in differential privacy, the smaller the privacy protection parameter, the higher the corresponding privacy protection level and the lower the data availability; the privacy budget is apportioned to the attributes in the ratio 1:9:90 for high-, medium-, and low-sensitivity attribute columns, respectively; the privacy budgets of all attribute columns in each block matrix are summed to obtain the privacy budget ε_i of that block matrix;

in the privacy noise extraction stage, for a block matrix of m rows and p columns, a positive definite matrix C of m rows and m columns is generated; the m eigenvalues of the positive definite matrix C are all equal and are set to a common value (formula not reproduced in this text);

a noise sample matrix W of m rows and p columns is drawn from the Wishart distribution W(m+1, C);

in the data privacy noise addition stage, the noise sample matrix W is added to the covariance matrix A to form the noise-perturbed covariance matrix A', that is, A' = A + W.

5. The privacy-preserving and publishing method for high-dimensional sensitive data according to claim 1, wherein the data recovery phase comprises a low-dimensional matrix recovery phase and an averaging recovery phase;

in the low-dimensional matrix restoration stage, a feature vector matrix is multiplied by a k-dimensional low-dimensional matrix to obtain a restoration matrix, but the restoration matrix is not subjected to averaging;

and in the equalization restoration stage, the mean of the corresponding column of the original matrix is added to each element of the restored matrix to obtain the matrix as it was before equalization, and the columns previously padded with 0 in the last block are removed.
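As a non-limiting illustration, the two restoration stages may be sketched as follows. The sketch assumes a row-per-sample layout, in which the low-dimensional matrix is m x k, the eigenvector matrix is p x k, and restoration multiplies the low-dimensional matrix by the transposed eigenvector matrix; the helper names and the sample values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def restore_block(low_dim, eigvecs, col_means, n_padded=0):
    """Restore one m x p block from its m x k low-dimensional form."""
    # Low-dimensional matrix restoration: the result is still mean-centered.
    restored = matmul(low_dim, transpose(eigvecs))
    # Equalization restoration: add each column mean back.
    restored = [[v + mu for v, mu in zip(row, col_means)] for row in restored]
    # Remove the zero-padded columns (only relevant for the last block).
    if n_padded:
        restored = [row[:len(row) - n_padded] for row in restored]
    return restored


# Hypothetical last block: two real columns plus one zero-padded column.
block = [[1.0, 2.0, 0.0], [3.0, 6.0, 0.0]]
col_means = [2.0, 4.0, 0.0]
centered = [[v - mu for v, mu in zip(row, col_means)] for row in block]
# Assumed orthonormal eigenvector matrix with k = p (a permutation matrix).
eigvecs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
low_dim = matmul(centered, eigvecs)
restored = restore_block(low_dim, eigvecs, col_means, n_padded=1)
```

With k equal to p and an orthonormal eigenvector matrix the restoration is exact; with k smaller than p, or with eigenvectors taken from the noise-perturbed covariance matrix, it is only an approximation of the original block.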

6. A privacy-preserving and publishing system for high-dimensional sensitive data, which implements the privacy-preserving and publishing method for high-dimensional sensitive data according to any one of claims 1 to 5, wherein the privacy-preserving and publishing system for high-dimensional sensitive data comprises:

the data acquisition module is used for acquiring data by a publisher and selecting a proper total privacy protection parameter epsilon;

the data preprocessing module is used for preprocessing the data after the data acquisition of the publisher is finished; each sample is scanned, and if any of the n attribute values of the sample is a null value, that value is filled with 0 to ensure that each of the n attributes has a numerical value; all data are integrated and arranged into a matrix with m rows and n columns as a whole, namely the number of samples of the matrix is m, and the dimension is n;

the data privacy protection level evaluation module is used for evaluating the privacy protection level of the data attribute by the data publisher so as to rearrange the attribute; dividing the data set into relatively low-dimensional block matrixes, wherein after the data set is divided into blocks, the dimension of each block matrix is p, and the sensitivity level of the whole block matrix is marked;

the data disturbance module is used for adding Wishart privacy noise disturbance to the covariance matrix of the block matrix by the publisher;

the data transformation module is used for obtaining an eigenvector matrix and the corresponding diagonal eigenvalue matrix by using a sparse matrix transformation method, and taking the eigenvectors corresponding to the k largest eigenvalues; the eigenvector matrix is transposed and multiplied by the equalized matrix to obtain a k-dimensional low-dimensional matrix, wherein k is less than or equal to p;

the data recovery module is used for recovering the low-dimensional matrix so as to recover the matrix before equalization;

the data publishing module is used for splicing the block matrices with m rows and p columns, restored from all the low-dimensional matrices perturbed by privacy noise, into a privacy-protected, noise-added matrix with m rows and n columns, and adding the corresponding attribute names to the header of the data table; and restoring the attribute order of the original data table to form a complete data table, which constitutes the privacy version of the data for external release.
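As a non-limiting illustration, the null-value filling of the data preprocessing module and the division into p-dimensional block matrices may be sketched as follows. The sketch assumes that null values are represented by `None` and that the last block is padded with zero columns when n is not a multiple of p; the function names and sample values are hypothetical, and the rearrangement of attributes by sensitivity level is not shown.

```python
def preprocess(samples):
    """Fill null (None) attribute values with 0 so every attribute has a number."""
    return [[0 if v is None else v for v in row] for row in samples]


def split_into_blocks(matrix, p):
    """Split an m x n matrix into m x p blocks, zero-padding the last block.

    Returns the list of blocks and the number of padded columns, which must
    be removed again after restoration.
    """
    n = len(matrix[0])
    n_padded = (-n) % p  # columns needed to reach a multiple of p
    padded = [row + [0] * n_padded for row in matrix]
    blocks = [[row[j:j + p] for row in padded] for j in range(0, n + n_padded, p)]
    return blocks, n_padded


# Hypothetical data set: m = 2 samples, n = 5 attributes, two null values.
samples = [[1, None, 3, 4, 5], [None, 2, 3, 4, 5]]
matrix = preprocess(samples)
blocks, n_padded = split_into_blocks(matrix, p=2)
```

Returning the padded-column count alongside the blocks keeps the information needed to remove the padding from the last block after restoration.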

7. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of:

data acquisition, namely acquiring data by a publisher, and selecting a proper total privacy protection parameter epsilon;

data preprocessing, namely preprocessing the data after the data acquisition of the publisher is finished; each sample is scanned, and if any of the n attribute values of the sample is a null value, that value is filled with 0 to ensure that each of the n attributes has a numerical value; all data are integrated and arranged into a matrix with m rows and n columns as a whole, namely the number of samples of the matrix is m, and the dimension is n;

evaluating the privacy protection level of the data, and evaluating the privacy protection level of the data attribute by a data publisher so as to rearrange the attribute; dividing the data set into relatively low-dimensional block matrixes, wherein after the data set is divided into blocks, the dimension of each block matrix is p, and the sensitivity level of the whole block matrix is marked;

data disturbance: adding Wishart privacy noise disturbance to a covariance matrix of a block matrix by a publisher;

data transformation, namely obtaining an eigenvector matrix and the corresponding diagonal eigenvalue matrix by using a sparse matrix transformation method, and taking the eigenvectors corresponding to the k largest eigenvalues; the eigenvector matrix is transposed and multiplied by the equalized matrix to obtain a k-dimensional low-dimensional matrix, wherein k is less than or equal to p;

data recovery, recovering the low-dimensional matrix, and further recovering the matrix before equalization;

data publishing, namely splicing the block matrices with m rows and p columns, restored from all the low-dimensional matrices perturbed by privacy noise, into a privacy-protected, noise-added matrix with m rows and n columns, and adding the corresponding attribute names to the header of the data table; and restoring the attribute order of the original data table to form a complete data table, which constitutes the privacy version of the data for external release.
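As a non-limiting illustration, the data publishing step may be sketched as follows. The sketch assumes that the restored blocks are spliced side by side, that the zero-padded columns at the end of the last block are dropped, and that the attribute order is restored by looking up each original attribute name among the rearranged names; the function name `publish` and the attribute names are hypothetical.

```python
def publish(blocks, n_padded, permuted_names, original_names):
    """Splice restored blocks and rebuild the data table in its original order."""
    m = len(blocks[0])
    # Splice the m x p blocks side by side into one matrix.
    rows = [[v for block in blocks for v in block[i]] for i in range(m)]
    # Drop the zero-padded columns at the end of the last block.
    if n_padded:
        rows = [row[:len(row) - n_padded] for row in rows]
    # Map each original attribute name to its column in the rearranged matrix.
    position = {name: j for j, name in enumerate(permuted_names)}
    order = [position[name] for name in original_names]
    # The header row comes first, followed by the reordered data rows.
    table = [list(original_names)]
    table += [[row[j] for j in order] for row in rows]
    return table


# Hypothetical restored blocks (p = 2) for three attributes, one padded column.
blocks = [[[1, 2], [5, 6]], [[3, 0], [7, 0]]]
table = publish(
    blocks,
    n_padded=1,
    permuted_names=["salary", "age", "zip"],
    original_names=["age", "zip", "salary"],
)
```

The first row of the resulting table is the header; the remaining rows hold the noise-added values in the attribute order of the original data table.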

8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:

data recovery, namely recovering the low-dimensional matrix so as to recover the matrix before equalization;

data publishing, namely splicing the block matrices with m rows and p columns, restored from all the low-dimensional matrices perturbed by privacy noise, into a privacy-protected, noise-added matrix with m rows and n columns, and adding the corresponding attribute names to the header of the data table; and restoring the attribute order of the original data table to form a complete data table, which constitutes the privacy version of the data for external release.

9. An information data processing terminal, characterized in that the information data processing terminal is used for implementing the privacy-preserving and publishing system of high-dimensional sensitive data according to claim 6.