CN110443061B - Data encryption method and device

Info

Publication number: CN110443061B
Application number: CN201810413622.5A
Authority: CN
Inventors: 李梁; 周俊; 李小龙
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2018-05-03
Filing date: 2018-05-03
Publication date: 2023-06-20
Anticipated expiration: 2038-05-03
Also published as: CN110443061A

Abstract

The embodiment of the specification discloses a data encryption method and device, wherein the method comprises the following steps: acquiring an original data matrix; acquiring a first parameter for defining the differential privacy algorithm and a second parameter for representing the validity of the data after encryption; acquiring an intermediate data matrix, wherein the intermediate data matrix defines a plurality of points in the first dimension space, and the plurality of points defined by the intermediate data matrix are points obtained by respectively perturbing the plurality of points defined by the original data matrix, wherein the perturbation is an offset based on the first parameter and the second parameter; and multiplying the intermediate data matrix with a projection matrix to obtain an encrypted data matrix.

Description

Data encryption method and device

Technical Field

The embodiment of the specification relates to the technical field of Internet, in particular to a data encryption method and device.

Background

Under the requirement of internet big data modeling analysis, how to protect the privacy of users is a very important problem. In this context, differential privacy techniques are increasingly being used. Differential privacy is a formalized definition of data privacy security that ensures that information of individual data is not revealed while modeling analysis is performed on all data. Differential privacy is the most reasonable guarantee of individual privacy security under the requirement of big data modeling analysis. For random encryption algorithms in differential privacy algorithms, the validity of the data is typically considered. General data validity considers that the performance of encrypted data on a specific index is approximately equal to the performance of the original data on the same index. However, there is a certain contradiction between the validity of data and the privacy security, and in general, the higher the validity of data is, the worse the privacy security is relatively; conversely, the better the privacy security, the worse the validity of the data. Thus, there is a need for a more efficient data encryption scheme that allows for more efficient balancing of data validity and privacy security.

Disclosure of Invention

Embodiments of the present disclosure aim to provide a more efficient data encryption method to address the deficiencies in the prior art.

To achieve the above object, one aspect of the present specification provides a data encryption method that implements a differential privacy algorithm, the method including: acquiring an original data matrix, wherein the original data matrix defines a plurality of points in a first dimension space, the number of the points corresponds to the number of users, each point corresponds to the feature vector of a user, and the number of dimensions of the first dimension space is the number of dimensions of the feature vectors; acquiring a first parameter for defining the differential privacy algorithm and a second parameter for representing the validity of the data after encryption; acquiring an intermediate data matrix, wherein the intermediate data matrix defines a plurality of points in the first dimension space, and the plurality of points defined by the intermediate data matrix are obtained by respectively perturbing the plurality of points defined by the original data matrix, the perturbation being an offset based on the first parameter and the second parameter; and multiplying the intermediate data matrix by a projection matrix to obtain an encrypted data matrix, the projection matrix being used to project the plurality of points defined by the intermediate data matrix to a corresponding plurality of points in a second dimension space, such that, for any two points in the second dimension space, the ratio of the Euclidean distance between them to the Euclidean distance between the corresponding two points in the first dimension space lies within a certain range, wherein the number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter.

In one embodiment, in the data encryption method, the acquiring the intermediate data matrix includes: singular value decomposition is carried out on the original data matrix so as to represent the original data matrix as a product of three matrices, wherein the number of diagonal elements of a diagonal matrix positioned in the middle in the product of the three matrices is equal to the dimension number of the second dimension space; determining a disturbance parameter based on the first parameter and the second parameter; shifting each diagonal element of the diagonal matrix based on the perturbation parameters; and calculating the product of the three matrices after the offset to serve as the intermediate data matrix.

In one embodiment, in the data encryption method, performing singular value decomposition on the original data matrix includes: performing a mean-removal operation on the value of each dimension of each point defined by the original data matrix, where the mean is the average of the values of the plurality of points in the same dimension; and performing singular value decomposition on the original data matrix after the mean-removal operation.

In one embodiment, in the above data encryption method, the encrypted data matrix defines a plurality of points in the second dimension space, and the plurality of points defined by the encrypted data matrix correspond to the plurality of points defined by the original data matrix, respectively, and a difference between a distance between two points of the plurality of points defined by the encrypted data matrix and a distance between corresponding two points of the plurality of points defined by the original data matrix is related to the disturbance parameter.

In one embodiment, in the data encryption method, the projection matrix is randomly drawn from a random matrix, each element of the random matrix is a random variable, and the random variables are mutually independent and identically distributed, wherein the random matrix satisfies: the expected value of the product of the transpose of the random matrix and the random matrix is the identity matrix.

In one embodiment, in the data encryption method, the second dimension space is an r-dimensional space, and the random variable follows a Gaussian distribution with an expected value of 0 and a variance of 1/r.

In one embodiment, in the data encryption method, the second dimension space is an r-dimensional space, and the random variable is uniformly distributed over an interval [the interval expression is not reproduced in this text].

In one embodiment, in the data encryption method, the second dimension space is an r-dimensional space, and the random variable follows a distribution that takes given values with given probability values [the values and probabilities are not reproduced in this text].

In one embodiment, in the above data encryption method, the differential privacy algorithm is an (ε, δ)-differential privacy algorithm, and the first parameter includes ε and δ.

In one embodiment, in the above data encryption method, the second parameter includes η and ν, where η represents the maximum relative error in the distance between a pair of points defined by the original data matrix after processing by the method, and ν represents the maximum probability of failure over multiple executions of the method; and determining the number of dimensions of the second dimension space and the value of the disturbance parameter based on the first parameter and the second parameter includes determining the number of dimensions of the second dimension space based on η and ν.

In one embodiment, in the above data encryption method, determining the number of dimensions of the second dimension space and the value of the disturbance parameter based on the first parameter and the second parameter includes determining the value of the disturbance parameter based on ε, δ, and the number of dimensions of the second dimension space.

Another aspect of the present specification provides a data encryption apparatus that implements a differential privacy algorithm, the apparatus comprising: a first acquisition unit configured to acquire an original data matrix, the original data matrix defining a plurality of points in a first dimension space, wherein the number of the points corresponds to the number of users, each point corresponds to the feature vector of a user, and the number of dimensions of the first dimension space is the number of dimensions of the feature vectors; a second acquisition unit configured to acquire a first parameter for defining the differential privacy algorithm and a second parameter for representing the validity of the data after encryption; a perturbation unit configured to acquire an intermediate data matrix, wherein the intermediate data matrix defines a plurality of points in the first dimension space, and the plurality of points defined by the intermediate data matrix are obtained by respectively perturbing the plurality of points defined by the original data matrix, the perturbation being an offset based on the first parameter and the second parameter; and a projection unit configured to multiply the intermediate data matrix by a projection matrix to obtain an encrypted data matrix, the projection matrix being used to project the plurality of points defined by the intermediate data matrix to a corresponding plurality of points in a second dimension space, such that, for any two points in the second dimension space, the ratio of the Euclidean distance between them to the Euclidean distance between the corresponding two points in the first dimension space lies within a certain range, wherein the number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter.

In one embodiment, in the above data encryption apparatus, the perturbation unit further includes the following subunits: a decomposition subunit configured to perform singular value decomposition on the original data matrix to represent the original data matrix as a product of three matrices, where the number of diagonal elements of the diagonal matrix located in the middle of the product of the three matrices is equal to the number of dimensions of the second dimension space; a determining subunit configured to determine a disturbance parameter based on the first parameter and the second parameter; an offset subunit configured to offset each diagonal element of the diagonal matrix based on the disturbance parameter; and a calculating subunit configured to calculate the product of the three matrices after the offset as the intermediate data matrix.

In one embodiment, in the above data encryption apparatus, the decomposition subunit is further configured to: perform a mean-removal operation on the value of each dimension of each point defined by the original data matrix, where the mean is the average of the values of the plurality of points in the same dimension; and perform singular value decomposition on the original data matrix after the mean-removal operation.

In the data encryption scheme according to the embodiments of the present specification, perturbing the original data based on the differential privacy parameters and the data validity parameters balances the tension between security and validity and provides rigorous quantitative guarantees for both. In addition, the data is projected with a projection matrix based on the J-L (Johnson-Lindenstrauss) lemma, which further ensures data validity.

Drawings

The embodiments of the present specification may be further clarified by describing the embodiments of the present specification with reference to the accompanying drawings:

FIG. 1 shows an application scenario of a data encryption method according to an embodiment of the present specification;

FIG. 2 shows a flow chart of a data encryption method according to an embodiment of the present description;

FIG. 3 shows a flow chart of a method of acquiring an intermediate data matrix;

FIG. 4 shows a process of singular value decomposition of matrix A;

FIG. 5 shows a reduction process for a diagonal matrix in singular value decomposition; and

FIG. 6 shows a data encryption apparatus 600 according to an embodiment of the present specification.

Detailed Description

Embodiments of the present specification will be described below with reference to the accompanying drawings.

FIG. 1 shows an application scenario of a data encryption method according to an embodiment of the present specification. The scenario includes a plurality of data providers 11 and a data processor 12. The data providers 11 are, for example, shopping websites, social apps, and the like, each of which has its own user group and feature data of that user group. The data processor 12, such as Ant Financial, typically has a large data processing capacity. Each of the data providers 11 provides the encrypted feature data of its users to the data processor 12, so that the data processor 12 models and analyzes the encrypted data of each data provider 11 separately. In this scenario, the use of differential privacy techniques preserves user privacy while preserving data availability. Since the data processor 12 processes the data of the different data providers 11 separately, each data provider 11 can encrypt its original data locally and then send the encrypted data to the data processor 12, and no part of the original data is disclosed during encryption. After receiving the encrypted data, the data processor 12 can perform modeling analysis on it but cannot obtain any information about the original data.

In the server of the data provider 11, a perturbation is first applied to the original data matrix based on the parameters of the differential privacy algorithm and the data validity parameters, thereby obtaining an intermediate data matrix; the intermediate data matrix is then multiplied by a projection matrix, thereby obtaining an encrypted data matrix. The projection matrix is a randomly obtained projection matrix satisfying the J-L lemma (Johnson-Lindenstrauss lemma). Through this processing, the resulting encrypted data matrix meets both the differential privacy criterion and the data validity criterion, so that data validity and privacy security are effectively balanced.

FIG. 2 shows a flowchart of a data encryption method according to an embodiment of the present specification. In one embodiment, the method is performed at the server side of the data provider; however, the method is not limited to this and may also be performed, for example, at the server side of a third party or at the server side of the data processor. The method implements a differential privacy algorithm and includes the following steps: in step S21, an original data matrix is acquired, where the original data matrix defines a plurality of points in a first dimension space, the number of the points corresponds to the number of users, each point corresponds to the feature vector of a user, and the number of dimensions of the first dimension space is the number of dimensions of the feature vectors; in step S22, a first parameter for defining the differential privacy algorithm and a second parameter for representing the validity of the data after encryption are acquired; in step S23, an intermediate data matrix is acquired, wherein the intermediate data matrix defines a plurality of points in the first dimension space, and the plurality of points defined by the intermediate data matrix are obtained by respectively perturbing the plurality of points defined by the original data matrix, the perturbation being an offset based on the first parameter and the second parameter; and in step S24, the intermediate data matrix is multiplied by a projection matrix to obtain an encrypted data matrix, the projection matrix being used to project the plurality of points defined by the intermediate data matrix to a corresponding plurality of points in a second dimension space, such that, for any two points in the second dimension space, the ratio of the Euclidean distance between them to the Euclidean distance between the corresponding two points in the first dimension space lies within a certain range, wherein the number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter.

First, in step S21, an original data matrix is acquired, where the original data matrix defines a plurality of points in a first dimension space, the number of the points corresponds to the number of users, each point corresponds to the feature vector of a user, and the number of dimensions of the first dimension space is the number of dimensions of the feature vectors. For example, a matrix A of n rows and d columns is pre-stored at the server side of the data provider, and the pre-stored matrix A is acquired as the original data matrix. In another embodiment, in response to a query request from the server side of the data processor, the server side of the data provider randomly selects d feature values for each of n users from the stored user data and forms a matrix A of n rows and d columns as the original data matrix.

In an original data matrix of, for example, n rows and d columns, n represents, for example, the number of users, which is on the order of millions, tens of millions, etc., and d is, for example, the number of features of each user. For example, when the data provider is a shopping website, the features of each user include gender, age, address, kind of purchased goods, price of purchased goods, shopping period, and the like, and the number of features is on the order of thousands, tens of thousands, etc. Each element $A_{ij}$ of the matrix A represents the feature value of the j-th feature of the i-th user, where $1 \le i \le n$ and $1 \le j \le d$. The original data matrix A can be understood as defining n points in a d-dimensional space: each row of A can be regarded as the feature vector of one user, and the dimension of the d-dimensional space is the dimension of the user feature vector, i.e., the number of columns of A. It will be appreciated that the original data matrix is not limited to a matrix of n rows and d columns; it may be, for example, a matrix of d rows and n columns.

In step S22, a first parameter defining the differential privacy algorithm and a second parameter representing the validity of the data after encryption are acquired. In the embodiments of the present specification, various differential privacy algorithms may be selected according to the requirements of the specific scenario, such as an (ε, δ)-differential privacy algorithm, an ε-differential privacy algorithm, a random differential privacy algorithm, and the like. The parameters of the selected differential privacy algorithm are acquired accordingly. For example, an algorithm $\mathcal{A}(X)$ is an (ε, δ)-differential privacy algorithm if, for all inputs $X$ and $X'$ that differ in only one user feature and for all sets $S$ of possible outputs, the condition shown in the following formula (1) holds:

$$\Pr[\mathcal{A}(X) \in S] \le e^{\varepsilon}\,\Pr[\mathcal{A}(X') \in S] + \delta \tag{1}$$

where ε quantifies the maximum relative error in the likelihood that a feature of a single data record has its privacy compromised, and δ quantifies the percentage of the overall records whose privacy may be compromised. The smaller the values of ε and δ, the higher the privacy security. The values of ε and δ may be determined according to the specific application scenario. For example, in a scenario where the original data matrix represents a plurality of feature values of n users, ε may be set to a magnitude less than ln(10), and δ may be set to a magnitude less than 0.1.
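As a concrete reading of the (ε, δ) condition, the following minimal Python sketch checks the inequality Pr[A(X) ∈ S] ≤ e^ε · Pr[A(X') ∈ S] + δ for every output set S of a toy mechanism. The mechanism used here (binary randomized response) and all function names are illustrative assumptions; they are not part of the method described in this specification.

```python
import math
from itertools import combinations

def randomized_response_dist(bit, eps):
    """Output distribution of binary randomized response (illustrative only).

    The true bit is reported with probability e^eps / (1 + e^eps),
    otherwise the flipped bit is reported.
    """
    p_true = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p_true, 1 - bit: 1.0 - p_true}

def satisfies_dp(dist_x, dist_x2, eps, delta, tol=1e-12):
    """Check the (eps, delta) inequality for every output set S, in both directions."""
    outputs = sorted(set(dist_x) | set(dist_x2))
    bound = math.exp(eps)
    for k in range(len(outputs) + 1):
        for S in combinations(outputs, k):
            p = sum(dist_x.get(o, 0.0) for o in S)
            q = sum(dist_x2.get(o, 0.0) for o in S)
            if p > bound * q + delta + tol or q > bound * p + delta + tol:
                return False
    return True

eps = math.log(3.0)                      # a value below ln(10)
d0 = randomized_response_dist(0, eps)    # neighbouring inputs: bit 0 ...
d1 = randomized_response_dist(1, eps)    # ... and bit 1
holds = satisfies_dp(d0, d1, eps, delta=0.0)
fails_for_smaller_eps = not satisfies_dp(d0, d1, eps / 2, delta=0.0)
```

In this toy case the condition holds for the ε the mechanism was built with and fails for a stricter ε unless δ is increased, which illustrates why smaller ε and δ correspond to stronger privacy.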

Parameters for representing the validity of the encrypted data may be determined according to a specific application scenario. For example, in one embodiment, the parameters used to represent the validity of the encrypted data include η and ν, where η represents the maximum relative error in the distance between the pairs of points defined by the matrix of raw data after processing by the method and ν represents the maximum probability of error in the execution of the method, wherein the smaller the values of η and ν, the higher the data validity. Values of η and ν may be selected according to a specific application scenario, for example, in a scenario in which the original data matrix represents a plurality of eigenvalues of n users, both η and ν may be taken on the order of less than 0.1. The parameters indicating the validity of the encrypted data are not limited to η and ν described above, and for example, in a scene where gaussian disturbance is added to the original data matrix to encrypt, the validity of the encrypted data may be indicated by the variance σ of the gaussian disturbance, that is, the smaller the σ is, the better the validity of the data is.

In step S23, an intermediate data matrix is acquired, wherein the intermediate data matrix defines a plurality of points in the first dimension space, and the plurality of points defined by the intermediate data matrix are points obtained by respectively perturbing the plurality of points defined by the original data matrix, wherein the perturbation is an offset based on the first parameter and the second parameter.

In one embodiment, the intermediate data matrix is obtained by the following steps as shown in FIG. 3:

In step S31, singular value decomposition is performed on the original data matrix to represent the original data matrix as a product of three matrices, where the number of diagonal elements of the diagonal matrix located in the middle of the product of the three matrices is equal to the number of dimensions of the second dimension space.

As known to those skilled in the art, and referring to the process of singular value decomposition of matrix A shown in FIG. 4, given an arbitrary matrix $A \in \mathbb{R}^{m \times n}$ of m rows and n columns, the singular value decomposition of A can be expressed as the product of three matrices: $A = U \Sigma V^{T}$, where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix. The non-zero values on the main diagonal of the diagonal matrix Σ are called the singular values of matrix A; they are arranged from large to small along the main diagonal, which FIG. 4 shows schematically by the size of the squares.

Referring to the process of reducing the diagonal matrix in the singular value decomposition shown in FIG. 5, a singular value on the diagonal that is smaller than a predetermined threshold is of little importance to the initial matrix A. Thus, by taking the number of retained singular values as r and discarding the smaller singular values after the r-th singular value, the singular value decomposition of matrix A can be equivalently expressed as $A = U \Sigma V^{T}$, where $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix whose diagonal entries are the retained singular values. This reduction process reduces the amount of computation while maintaining data validity.
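The reduction described above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function name `truncated_svd`, the matrix sizes, and the synthetic rank-2 data are hypothetical and chosen only so that keeping r = 2 singular values reproduces the matrix.

```python
import numpy as np

def truncated_svd(A, r):
    """Keep only the r largest singular values and the matching singular vectors."""
    # np.linalg.svd returns singular values sorted from large to small.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

rng = np.random.default_rng(0)
# Hypothetical data: 6 points in 4 dimensions that lie in a 2-dimensional subspace,
# so only 2 singular values are non-negligible.
A = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 4))

U_r, s_r, Vt_r = truncated_svd(A, r=2)
A_r = U_r @ np.diag(s_r) @ Vt_r          # product of the three reduced matrices
reconstructs = np.allclose(A, A_r)
```

For real data the discarded singular values are small rather than exactly zero, so the reduced product approximates A instead of reproducing it.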

In this embodiment, the original data matrix is decomposed into a product of three matrices by singular value decomposition, wherein the diagonal matrix in the middle is the diagonal matrix after reduction, and the number of its diagonal elements is the number of dimensions of the second dimension space. The second dimension space is the space obtained by projecting the intermediate data matrix with the projection matrix described later. The number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter. In one embodiment, the second parameter includes η and ν, and the number of dimensions of the second dimension space is determined based on η and ν. In one embodiment, letting the number of dimensions of the second dimension space be r, the value of r is obtained based on the following formula (2):

[Formula (2) is not reproduced in this text.]

Formula (2) gives the order of magnitude of r [the order-of-magnitude expression is not reproduced in this text].

In one embodiment, the value of r is more specifically defined based on the following formula (3):

[Formula (3) is not reproduced in this text.]

In one embodiment, performing singular value decomposition on the original data matrix includes: performing a de-averaging operation on the value of each dimension of each point defined by the original data matrix, where the average is the average of the values of the plurality of points in the same dimension; and performing singular value decomposition on the original data matrix after the de-averaging operation. For example, the original data matrix is a matrix of n rows and d columns, where n represents the number of users and d represents the number of features of the users, and the de-averaging operation is the operation shown in formula (4), where A represents the original data matrix:

$$A \leftarrow A - \frac{1}{n}\,\mathbf{1}\mathbf{1}^{T} A \tag{4}$$

where $\mathbf{1}$ in formula (4) denotes the all-ones column vector.

The calculation process can be simplified by performing the de-averaging operation as shown in the formula (4), which is equivalent to shifting the feature vector of each user to the vicinity of the origin of the first dimension space. It will be appreciated that this de-averaging operation is not necessary and that the singular value decomposition may be performed on the original data matrix as well without de-averaging.
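The de-averaging step can be sketched in NumPy as follows, assuming the operation takes the usual centering form A − (1/n)·1·1ᵀ·A with 1 the all-ones column vector of length n. The function name `remove_mean` and the sample values are hypothetical.

```python
import numpy as np

def remove_mean(A):
    """Subtract from every row the per-column mean, written with the all-ones vector."""
    n = A.shape[0]
    ones = np.ones((n, 1))                 # all-ones column vector
    return A - (ones @ ones.T @ A) / n     # A - (1/n) * 1 * 1^T * A

# Hypothetical data: 3 users (rows) with 2 features (columns).
A = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
A_centered = remove_mean(A)
```

The same result is obtained with `A - A.mean(axis=0)`; the explicit form is shown only because it mirrors the all-ones-vector notation.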

In step S32, a disturbance parameter is determined based on the first parameter and the second parameter. The disturbance parameter determines the offset applied to the diagonal elements of the diagonal matrix Σ from step S31, so that the original data matrix (or the de-averaged original data matrix) is perturbed through the offset of the diagonal elements. In one embodiment, for the de-averaged original data matrix, letting the disturbance parameter be w, w can be determined by the following formula (5):

[Formula (5) is not reproduced in this text.]

where r is the number of dimensions of the second dimension space determined by the above formula (2) or (3).

Formula (5) gives the order of magnitude of w [the order-of-magnitude expression is not reproduced in this text].

In one embodiment, for the de-averaged original data matrix, w may be more specifically determined by the following formula (6):

[Formula (6) is not reproduced in this text.]

For an original data matrix that has not been de-averaged, the value of w may be similarly obtained, e.g., based on ε, δ, η, and ν, so that the processing of the original data matrix may satisfy differential privacy while preserving data validity.

In step S33, each diagonal element of the diagonal matrix is shifted based on the disturbance parameter. The specific offset operation may be as shown in the following formula (7):

$$U\,\sqrt{\Sigma^{2} + w^{2} I_{n \times d}}\;V^{T} \tag{7}$$

where $I_{n \times d}$ denotes the n×d identity matrix. The process shown in formula (7) first squares the non-zero entries of Σ, then adds the square of w, and then takes the square root of the result.

In step S34, the product of the three matrices after the offset is calculated as the intermediate data matrix. The product of the above formula (7) is calculated by shifting Σ with respect to the disturbance parameter, that is, shifting each term of the original data matrix (or the raw data matrix subjected to the de-averaging process) with respect to the disturbance parameter, thereby obtaining a disturbed original data matrix, that is, an intermediate data matrix. Since the intermediate data matrix is only obtained by shifting the matrix elements of the original data matrix, the intermediate data matrix still defines a plurality of points in the first dimension space, i.e. in case the original data matrix is an n x d matrix, the intermediate data matrix is also an n x d dimensional matrix. And, since each characteristic value of each user is subjected to an offset related to the disturbance parameter for each user, the vector sum of the offsets of all the characteristics of each user in the first dimension space is the offset of the point corresponding to the user, and the offset of the point of the user is also related to the disturbance parameter. That is, the plurality of points defined by the intermediate data matrix are points obtained by respectively perturbing the plurality of points defined by the original data matrix, wherein the perturbation is an offset based on the first parameter and the second parameter.
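Steps S31 to S34 can be sketched in NumPy as follows. This is a hedged illustration, not the definitive implementation: it assumes that each singular value σ is shifted to sqrt(σ² + w²), and it takes the disturbance parameter w as a plain input because its defining formulas are not reproduced here. The function name and sample sizes are hypothetical.

```python
import numpy as np

def perturb_singular_values(A, w):
    """Shift every singular value of A and rebuild the n x d intermediate matrix.

    Assumption: the shift has the form sigma -> sqrt(sigma^2 + w^2).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # step S31
    s_shifted = np.sqrt(s**2 + w**2)                   # step S33 (assumed form)
    return U @ np.diag(s_shifted) @ Vt                 # step S34

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 3))          # hypothetical: 8 users, 3 features
A = A - A.mean(axis=0)               # optional de-averaging
w = 0.5                              # hypothetical value of the disturbance parameter
B = perturb_singular_values(A, w)    # intermediate data matrix, same shape as A
```

Under this assumed form, the intermediate matrix satisfies BᵀB = AᵀA + w²·I, so the perturbation enlarges every direction of the data by an amount controlled by w while keeping the matrix n × d.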

It will be appreciated that in the embodiments of the present disclosure, the method of acquiring the intermediate data matrix is not limited to the above-described singular value decomposition method, and for example, the intermediate data matrix may be acquired by applying a random gaussian disturbance or a random laplace disturbance to each matrix element of the original data matrix.
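The alternative of perturbing each matrix element directly can be sketched as follows. The function name `add_noise` and the noise scale are hypothetical; how the scale should be calibrated to the privacy parameters is not specified here and is left as an input.

```python
import numpy as np

def add_noise(A, scale, kind="gaussian", seed=None):
    """Add independent random noise to every element of A (illustrative only)."""
    rng = np.random.default_rng(seed)
    if kind == "gaussian":
        noise = rng.normal(loc=0.0, scale=scale, size=A.shape)
    elif kind == "laplace":
        noise = rng.laplace(loc=0.0, scale=scale, size=A.shape)
    else:
        raise ValueError(f"unknown noise kind: {kind!r}")
    return A + noise

A = np.zeros((4, 3))                          # hypothetical 4 x 3 original data matrix
B_gauss = add_noise(A, scale=0.1, kind="gaussian", seed=0)
B_laplace = add_noise(A, scale=0.1, kind="laplace", seed=0)
```

Either variant yields an intermediate data matrix with the same shape as the original, which can then be multiplied by the projection matrix in the same way.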

Returning again to FIG. 2, in step S24, the intermediate data matrix is multiplied by a projection matrix to obtain an encrypted data matrix, the projection matrix being used to project the plurality of points defined by the intermediate data matrix to a corresponding plurality of points in a second dimension space, such that, for any two points in the second dimension space, the ratio of the Euclidean distance between them to the Euclidean distance between the corresponding two points in the first dimension space lies within a certain range, wherein the number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter.

For example, when the original data matrix is a matrix a of n rows and d columns, where n represents the number of users and d is the feature number of each user, as described above, the intermediate data matrix B is also a matrix of n rows and d columns. Thus, the projection matrix M can be determined as a matrix of d rows and r columns with respect to the intermediate data matrix. Where r is the dimension of the second dimension space to which the projection matrix M projects the intermediate data matrix, which is obtained based on the first and second parameters, which in one embodiment may be determined by the foregoing equations (2) or (3).

By right-multiplying the intermediate data matrix B by the projection matrix M, an encrypted data matrix of n rows and r columns is obtained, which can be understood as projecting n points of the d-dimensional space to n points of the r-dimensional space. It will be appreciated that the projection is not limited to right multiplication. For example, when the data matrix is a matrix of d rows and n columns, the projection matrix may be a matrix of r rows and d columns, and left-multiplying the data matrix by the projection matrix yields an encrypted data matrix of r rows and n columns; this process can likewise be understood as projecting n points of the d-dimensional space to n points of the r-dimensional space.
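The two orientations can be compared with a short NumPy sketch. The sizes and the random stand-in for the intermediate data matrix are hypothetical; the projection matrix entries are drawn from a Gaussian distribution with expected value 0 and variance 1/r, as in the Gaussian embodiment mentioned earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 6, 5, 3                       # hypothetical sizes: users, features, target dimension

B = rng.normal(size=(n, d))             # stand-in for the n x d intermediate data matrix
M = rng.normal(loc=0.0, scale=np.sqrt(1.0 / r), size=(d, r))   # entries ~ N(0, 1/r)

C_rows = B @ M          # users as rows:    (n x d) @ (d x r) -> n x r
C_cols = M.T @ B.T      # users as columns: (r x d) @ (d x n) -> r x n
```

The two results are transposes of each other, so the choice of orientation only changes how the points are stored, not the projected points themselves.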

The projection matrix M satisfies the J-L (Johnson-Lindenstrauss) lemma, i.e., after the projection matrix M projects n points of the d-dimensional space to n points of the r-dimensional space, the corresponding points in the two spaces satisfy the following formula (8):

$$(1-\lambda_{JL})\,\|x-y\|_2^2 \;\le\; \|xM-yM\|_2^2 \;\le\; (1+\lambda_{JL})\,\|x-y\|_2^2 \qquad (8)$$

where $\lambda_{JL}$ is a predetermined small real number, e.g., $0 < \lambda_{JL} < 0.1$; x and y are two points in the d-dimensional space; $\|x-y\|_2^2$ is the square of the Euclidean distance between point x and point y in the d-dimensional space; xM and yM are the two points in the r-dimensional space to which x and y are respectively projected by the projection matrix; and $\|xM-yM\|_2^2$ is the square of the Euclidean distance between point xM and point yM in the r-dimensional space.

As can be seen from formula (8), the squared Euclidean distance between points xM and yM differs from the squared Euclidean distance between points x and y by a factor within $1 \pm \lambda_{JL}$. Thus, the ratio of the Euclidean distance between the points xM and yM to the Euclidean distance between the points x and y is within a certain range. Because $\lambda_{JL}$ takes a small value, e.g., $\lambda_{JL} = 0.05$, the Euclidean distance between point xM and point yM can be considered approximately unchanged from the Euclidean distance between point x and point y. By generating a projection matrix M that satisfies the J-L lemma and using it to project a plurality of points in the d-dimensional space to a plurality of points in the r-dimensional space, the points in the d-dimensional space are encrypted by the projection matrix M. At the same time, because the J-L lemma guarantees data validity, the analysis result for the points in the d-dimensional space can be obtained by learning from the points in the r-dimensional space. In one embodiment, r < d, so that the projection matrix M also performs dimension reduction on the points in the d-dimensional space, thereby reducing the computational complexity.
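A check of formula (8) for a given set of points can be sketched as follows. This is a minimal illustration, assuming NumPy and assuming that formula (8) bounds the ratio of squared distances between 1 − λ_JL and 1 + λ_JL; the function names and the sample points are hypothetical.

```python
import numpy as np

def squared_distance_ratios(X, M):
    """Return ||xM - yM||^2 / ||x - y||^2 for every pair of distinct rows of X."""
    Y = X @ M
    i, j = np.triu_indices(X.shape[0], k=1)      # all index pairs with i < j
    before = np.sum((X[i] - X[j]) ** 2, axis=1)
    after = np.sum((Y[i] - Y[j]) ** 2, axis=1)
    return after / before

def satisfies_formula_8(X, M, lam):
    """True if every pairwise ratio lies in [1 - lam, 1 + lam]."""
    ratios = squared_distance_ratios(X, M)
    return bool(np.all((ratios >= 1 - lam) & (ratios <= 1 + lam)))

# Sanity check with a plane rotation, which preserves distances exactly.
theta = 0.3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
ok = satisfies_formula_8(points, rotation, lam=0.05)
```

The rows of `X` must be pairwise distinct, otherwise a ratio would divide by zero.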

In one embodiment, the projection matrix M satisfying the J-L lemma may be a projection matrix satisfying $M^T M = I$, i.e., M is an orthogonal matrix, where I is an identity matrix. For example, when n=3 and r=3, i.e., M is a 3×3 matrix, M may be an orthogonal matrix as shown below:

When 3 points (for example, points representing the feature vectors of 3 users) in a d-dimensional space (for example, d=5) are projected to 3 points in an r=3-dimensional space through the orthogonal matrix, for example by multiplying the orthogonal matrix with an original data matrix (a 3×5 matrix), the distance between every two points in the 3-dimensional space is basically unchanged from the distance between the corresponding two points in the d-dimensional space, that is, the J-L lemma is satisfied. However, because an arbitrarily chosen fixed real matrix is adopted as the projection matrix M, the projection matrix M itself does not have any randomness and thus does not contribute to the security of the differential privacy algorithm. Here, the projection matrix is not limited to a square matrix; for example, M may also be a 3×2 matrix or the like, as long as it satisfies $M^T M = I$.
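The condition M^T M = I can be checked numerically. The sketch below is only an illustration, assuming NumPy; the rotation matrix and the two sample points are hypothetical and are not the matrix of the embodiment.

```python
import numpy as np

# A fixed 3 x 3 rotation about the third axis: a deterministic orthogonal matrix.
theta = np.pi / 6
c, s = np.cos(theta), np.sin(theta)
M_square = np.array([[c,  -s,  0.0],
                     [s,   c,  0.0],
                     [0.0, 0.0, 1.0]])
M_tall = M_square[:, :2]            # 3 x 2 matrix with orthonormal columns

square_ok = np.allclose(M_square.T @ M_square, np.eye(3))
tall_ok = np.allclose(M_tall.T @ M_tall, np.eye(2))

x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.5, 4.0])
dist_before = np.linalg.norm(x - y)
dist_square = np.linalg.norm(x @ M_square - y @ M_square)   # unchanged
# With fewer columns than rows, a distance can stay equal or shrink, never grow.
dist_tall = np.linalg.norm(x @ M_tall - y @ M_tall)
```

Both matrices satisfy M^T M = I, but only the square one leaves this particular distance exactly unchanged.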

In one embodiment, the projection matrix M satisfying the J-L lemma may be obtained randomly from a random matrix Q, where the elements of the random matrix Q are random variables that are independent of each other and identically distributed. The random matrix satisfies the condition that the expected value of the product of its transpose and itself is the identity matrix, i.e., $E(Q^T Q) = I$. For example, the random matrix Q may be represented by random variables $f_{ij}(x)$ (i = 1, 2, 3; j = 1, 2, 3) as follows:

$$Q = \begin{pmatrix} f_{11}(x) & f_{12}(x) & f_{13}(x) \\ f_{21}(x) & f_{22}(x) & f_{23}(x) \\ f_{31}(x) & f_{32}(x) & f_{33}(x) \end{pmatrix}$$

where the $f_{ij}(x)$ are independent and identically distributed random variables, and $E(Q^T Q) = I$. When calculating the projection matrix M, a random value of x is acquired independently for each $f_{ij}(x)$ within a predetermined range, e.g., in [0, 1], and the value of $f_{ij}(x)$ is then computed from the function $f_{ij}(x)$ to obtain each element of the projection matrix M. The random matrix Q is not limited to a square matrix here; for example, Q may also be a 3×2 matrix or the like, as long as it satisfies $E(Q^T Q) = I$. Because the projection matrix M is obtained randomly from the random matrix Q, the randomness of the projection matrix M further increases the security of the differential privacy algorithm.

In one embodiment, as an example of the random matrix Q described above, for an r-dimensional second dimension space, $f_{ij}(x)$ follows a Gaussian distribution with an expected value of 0 and a variance of 1/r, i.e., $f_{ij}(x) \sim N(0, 1/r)$. In other words, $f_{ij}(x)$ is the inverse function of the Gaussian cumulative distribution function, where x takes values in [0, 1] and represents the cumulative probability (the probability integral from $-\infty$ up to that value) of each value of $f_{ij}(x)$. When calculating the projection matrix M, a random value in [0, 1] is acquired independently for each $f_{ij}(x)$ as the value of x, and $f_{ij}(x)$ is then computed from its expression to obtain each element of the projection matrix M.
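The inverse-function sampling described above can be sketched with the Python standard library. This is a minimal illustration, not the implementation of the embodiment: the function name, the sizes, and the clamping constant `tiny` are assumptions, introduced because `NormalDist.inv_cdf` is not defined at exactly 0 or 1.

```python
import math
import random
from statistics import NormalDist

def gaussian_projection_matrix(d, r, seed=None):
    """Draw a d x r matrix whose entries are i.i.d. N(0, 1/r) via the inverse CDF."""
    rng = random.Random(seed)
    dist = NormalDist(mu=0.0, sigma=math.sqrt(1.0 / r))
    tiny = 1e-12                     # ASSUMPTION: keeps x strictly inside (0, 1)
    matrix = []
    for _ in range(d):
        row = []
        for _ in range(r):
            x = min(max(rng.random(), tiny), 1.0 - tiny)   # independent x per element
            row.append(dist.inv_cdf(x))                    # value of f_ij(x)
        matrix.append(row)
    return matrix

M = gaussian_projection_matrix(d=200, r=50, seed=0)
entries = [value for row in M for value in row]
sample_mean = sum(entries) / len(entries)
sample_var = sum((value - sample_mean) ** 2 for value in entries) / len(entries)
```

With enough entries, the sample mean should be close to 0 and the sample variance close to 1/r.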

In one embodiment, as an example of the random matrix Q described above, for an r-dimensional second dimension space, $f_{ij}(x)$ is uniformly distributed on an interval [interval formula not reproduced]. From the probability distribution of this variable, the inverse function of its cumulative distribution function can be obtained, i.e., the expression of $f_{ij}(x)$ in terms of x, where x takes values in [0, 1] and represents the cumulative probability of each value of $f_{ij}(x)$. By independently acquiring a random value in [0, 1] for each $f_{ij}(x)$ as the value of x and computing $f_{ij}(x)$ from its expression, each element of the projection matrix M can likewise be obtained.
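For a uniform distribution the inverse function has a simple closed form. The sketch below assumes a symmetric interval [-a, a]; the half-width `a` is a hypothetical parameter, and the choice a = sqrt(3/r) is an assumption made here so that the variance a²/3 equals 1/r, matching the Gaussian example. The function names are also illustrative.

```python
import math
import random

def uniform_inverse_cdf(x, a):
    """Inverse CDF of the uniform distribution on [-a, a], for x in [0, 1]."""
    return a * (2.0 * x - 1.0)

def uniform_projection_matrix(d, r, a, seed=None):
    """Draw a d x r matrix with one independent x per element."""
    rng = random.Random(seed)
    return [[uniform_inverse_cdf(rng.random(), a) for _ in range(r)]
            for _ in range(d)]

r = 50
a = math.sqrt(3.0 / r)               # ASSUMPTION: gives variance a^2 / 3 = 1 / r
M = uniform_projection_matrix(d=200, r=r, a=a, seed=0)
entries = [value for row in M for value in row]
sample_mean = sum(entries) / len(entries)
sample_var = sum((value - sample_mean) ** 2 for value in entries) / len(entries)
```

Any other interval can be substituted by changing `a` or by replacing `uniform_inverse_cdf`.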

In one embodiment, as an example of the random matrix Q, where the second dimension space is an r-dimensional space, $f_{ij}(x)$ follows a distribution that takes the values [formula not reproduced], 0 and [formula not reproduced] with respective probabilities [formula not reproduced]. Here, $f_{ij}(x)$ is a discrete random variable, and each element of the projection matrix M can likewise be obtained with reference to the above.
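The same inverse-function approach extends to a discrete random variable. The sketch below is illustrative only: the three values sqrt(3/r), 0, -sqrt(3/r) and the probabilities 1/6, 2/3, 1/6 are assumptions (a common sparse choice with variance 1/r) and are not confirmed by the text; the function name is hypothetical.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate

def discrete_inverse_cdf(x, values, probs):
    """Map x in [0, 1] to one of `values` using the cumulative sums of `probs`."""
    cumulative = list(accumulate(probs))
    index = bisect_right(cumulative, x)
    return values[min(index, len(values) - 1)]    # guard against rounding near x = 1

r = 50
# ASSUMPTION: illustrative values and probabilities, chosen so the variance is 1 / r.
values = (math.sqrt(3.0 / r), 0.0, -math.sqrt(3.0 / r))
probs = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)

rng = random.Random(0)
M = [[discrete_inverse_cdf(rng.random(), values, probs) for _ in range(r)]
     for _ in range(200)]
entries = [value for row in M for value in row]
sample_var = sum(value ** 2 for value in entries) / len(entries)   # the mean is 0 by symmetry
```

Most entries of such a matrix are zero, which can make the multiplication cheaper.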

After the encrypted data matrix is obtained, the data provider may send the encrypted data matrix to a data processor for modeling analysis of the encrypted data matrix. The encrypted data matrix defines a plurality of points in the second dimension space, and the plurality of points defined by the encrypted data matrix correspond respectively to the plurality of points defined by the original data matrix. The difference between the distance between two points of the plurality of points defined by the encrypted data matrix and the distance between the corresponding two points of the plurality of points defined by the original data matrix is related to the perturbation parameter. In one embodiment, in the case where singular value decomposition is performed on the mean-removed original data matrix to obtain the intermediate data matrix, this difference is approximately $w^2$, where the value of w is determined by formula (5) or (6) above.

Fig. 6 shows a data encryption apparatus 600 according to an embodiment of the present specification. The apparatus implements a differential privacy algorithm and comprises:

a first obtaining unit 61 configured to obtain an original data matrix, where the original data matrix defines a plurality of points in a first dimension space, the number of the plurality of points corresponds to the number of users, each point corresponds to the feature vector of a user, and the number of dimensions of the first dimension space is the number of dimensions of the feature vector;

a second acquisition unit 62 configured to acquire a first parameter for defining the differential privacy algorithm and a second parameter for representing the validity of the data after encryption;

a perturbation unit 63 configured to acquire an intermediate data matrix, wherein the intermediate data matrix defines a plurality of points in the first dimension space, and the plurality of points defined by the intermediate data matrix are points obtained by perturbing the plurality of points defined by the original data matrix, respectively, wherein the perturbation is an offset based on the first parameter and the second parameter; and

a projection unit 64 configured to multiply the intermediate data matrix by a projection matrix to obtain an encrypted data matrix, the projection matrix being used for: projecting the plurality of points defined by the intermediate data matrix into a corresponding plurality of points in a second dimension space, such that the ratio of the Euclidean distance between any two points in the second dimension space to the Euclidean distance between the corresponding two points in the first dimension space is within a certain range, wherein the number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter.

In one embodiment, in the above data encryption apparatus, the perturbation unit 63 further includes the following sub-units:

a decomposition subunit 631 configured to perform singular value decomposition on the original data matrix to represent the original data matrix as a product of three matrices, where the number of diagonal elements of a diagonal matrix located in the middle in the product of three matrices is equal to the number of dimensions of the second dimension space;

a determining subunit 632 configured to determine a disturbance parameter based on the first and second parameters;

an offset subunit 633 configured to offset each diagonal element of the diagonal matrix based on the perturbation parameter; and

a calculating subunit 634 configured to calculate a product of the three matrices after the offset as the intermediate data matrix.

In one embodiment, in the above data encryption apparatus, the decomposition subunit 631 is further configured to perform a mean removing operation on a value of each dimension of each point defined by the original data matrix, where the mean is an average value of values of the plurality of points in the same dimension; and performing singular value decomposition on the original data matrix after the mean value removing operation.
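The work of the decomposition, offset and calculating subunits can be sketched as follows, assuming NumPy. The offset rule `quadrature_offset`, which maps each singular value s to sqrt(s² + w²), is a hypothetical choice used only for illustration and is not taken from the text; the perturbation parameter `w` is passed in directly rather than derived from the first and second parameters. The sketch keeps every singular value of the reduced decomposition and does not model the stated relationship between the number of diagonal elements and the dimension of the second space.

```python
import numpy as np

def quadrature_offset(s, w):
    # ASSUMPTION: hypothetical offset rule, used only for illustration.
    return np.sqrt(s ** 2 + w ** 2)

def perturb_by_svd(A, w, offset=quadrature_offset):
    """Remove the per-dimension mean, offset the singular values, and rebuild."""
    centered = A - A.mean(axis=0)                 # mean removal for each dimension
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    s_shifted = offset(s, w)                      # offset each diagonal element
    return U @ np.diag(s_shifted) @ Vt            # product of the three matrices

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                       # stand-in original data matrix
B = perturb_by_svd(A, w=2.0)                      # stand-in intermediate data matrix
```

A different offset rule can be supplied through the `offset` argument without changing the rest of the sketch.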

Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Those of ordinary skill in the art may implement the described functionality using different approaches for each particular application, but such implementation is not to be considered as beyond the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention and is not intended to limit the scope of the invention or to limit the invention to the particular embodiments. Any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. A method of data encryption, the method implementing a differential privacy algorithm, comprising:

acquiring an original data matrix, wherein the original data matrix defines a plurality of points of a first dimension space, the number of the points corresponds to the number of users, the points correspond to feature vectors of the users, and the dimension number of the first dimension space is the dimension number of the feature vectors;

acquiring a first parameter for defining the privacy security of the data after encryption by the differential privacy algorithm, and a second parameter for representing the validity of the data after encryption;

acquiring an intermediate data matrix, wherein the intermediate data matrix defines a plurality of points in a first dimension space, and the plurality of points defined by the intermediate data matrix are points obtained by respectively perturbing the plurality of points defined by the original data matrix, wherein the perturbation is an offset based on the first parameter and the second parameter; and

multiplying the intermediate data matrix by a projection matrix to obtain an encrypted data matrix, the projection matrix being used for: projecting the plurality of points defined by the intermediate data matrix into a corresponding plurality of points in a second dimension space, such that the ratio of the Euclidean distance between any two points in the second dimension space to the Euclidean distance between the corresponding two points in the first dimension space is within a preset range, wherein the number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter.

2. The data encryption method of claim 1, wherein the obtaining the intermediate data matrix comprises:

singular value decomposition is carried out on the original data matrix so as to represent the original data matrix as a product of three matrices, wherein the number of diagonal elements of a diagonal matrix positioned in the middle in the product of the three matrices is equal to the dimension number of the second dimension space;

determining a disturbance parameter based on the first parameter and the second parameter;

shifting each diagonal element of the diagonal matrix based on the perturbation parameters; and

calculating the product of the three matrices after the shift as the intermediate data matrix.

3. The data encryption method according to claim 2, wherein performing singular value decomposition on the original data matrix includes performing a de-averaging operation on values of each dimension of each point defined by the original data matrix, wherein the average is an average value of values of the plurality of points in the same dimension; and performing singular value decomposition on the original data matrix after the mean value removing operation.

4. The data encryption method according to claim 2, wherein the encrypted data matrix defines a plurality of points in a second dimension space, and the plurality of points defined by the encrypted data matrix correspond to the plurality of points defined by the original data matrix, respectively, and a difference between a distance between two points of the plurality of points defined by the encrypted data matrix and a distance between corresponding two points of the plurality of points defined by the original data matrix is correlated with the disturbance parameter.

5. The data encryption method of claim 1, wherein the projection matrix is randomly obtained from a random matrix, each element of the random matrix being a random variable, the random variables being independent of each other and having the same distribution, wherein the random matrix satisfies: the expected value of the product of the transpose of the random matrix and the random matrix is the identity matrix.

6. The data encryption method according to claim 5, wherein the second dimension space is an r-dimensional space, and the random variable follows a Gaussian distribution with an expected value of 0 and a variance of 1/r.

7. The data encryption method according to claim 5, wherein the second dimension space is an r-dimensional space, and the random variable is uniformly distributed on an interval [interval formula not reproduced].

8. The data encryption method according to claim 5, wherein the second dimension space is an r-dimensional space, and the random variable follows a distribution that takes the values [formula not reproduced], 0 and [formula not reproduced] with respective probabilities [formula not reproduced].

9. The data encryption method according to claim 1, wherein the differential privacy algorithm is an (ε, δ)-differential privacy algorithm, and the first parameter comprises ε and δ.

10. The data encryption method of claim 9, wherein the second parameter includes η and v, wherein η represents the maximum relative error in the distance between pairs of points defined by the original data matrix after processing by the method, v represents the maximum probability that the method fails over multiple executions, and determining the number of dimensions of the second dimension space and the value of the perturbation parameter based on the first and second parameters includes determining the number of dimensions of the second dimension space based on η and v.

11. The data encryption method of claim 10, wherein determining the number of dimensions of a second dimension space and the value of the perturbation parameter based on the first and second parameters comprises determining the value of the perturbation parameter based on epsilon, delta, and the number of dimensions of the second dimension space.

12. A data encryption device that implements a differential privacy algorithm, comprising:

a first acquisition unit configured to acquire an original data matrix, the original data matrix defining a plurality of points of a first dimension space, wherein the number of the plurality of points corresponds to the number of users, each point corresponds to the feature vector of a user, and the number of dimensions of the first dimension space is the number of dimensions of the feature vector;

a second acquisition unit configured to acquire a first parameter for defining the privacy security of the data after encryption by the differential privacy algorithm, and a second parameter for representing the validity of the data after encryption;

a perturbation unit configured to acquire an intermediate data matrix, wherein the intermediate data matrix defines a plurality of points in a first dimension space, and the plurality of points defined by the intermediate data matrix are points obtained by perturbing the plurality of points defined by the original data matrix, respectively, wherein the perturbation is an offset based on the first parameter and the second parameter; and

a projection unit configured to multiply the intermediate data matrix by a projection matrix to obtain an encrypted data matrix, the projection matrix being used for: projecting the plurality of points defined by the intermediate data matrix into a corresponding plurality of points in a second dimension space, such that the ratio of the Euclidean distance between any two points in the second dimension space to the Euclidean distance between the corresponding two points in the first dimension space is within a preset range, wherein the number of dimensions of the second dimension space is obtained based on the first parameter and the second parameter.

13. The data encryption device of claim 12, wherein the perturbation unit further comprises the following sub-units:

a decomposition subunit configured to perform singular value decomposition on the original data matrix to represent the original data matrix as a product of three matrices, where the number of diagonal elements of a diagonal matrix located in the middle in the product of three matrices is equal to the number of dimensions of the second dimension space;

a determining subunit configured to determine a disturbance parameter based on the first parameter and the second parameter;

an offset subunit configured to offset each diagonal element of the diagonal matrix based on the perturbation parameter; and

a calculating subunit configured to calculate the product of the three matrices after the offset as the intermediate data matrix.

14. The data encryption device of claim 13, wherein the decomposition subunit is further configured to perform a de-averaging operation on the value of each dimension of each point defined by the raw data matrix, wherein the average is an average of the values of the plurality of points in the same dimension; and performing singular value decomposition on the original data matrix after the mean value removing operation.

15. The data encryption device of claim 13, wherein the encrypted data matrix defines a plurality of points in a second dimension space, and the plurality of points defined by the encrypted data matrix correspond to the plurality of points defined by the original data matrix, respectively, and a difference between a distance between two points of the plurality of points defined by the encrypted data matrix and a distance between corresponding two points of the plurality of points defined by the original data matrix is related to the perturbation parameter.

16. The data encryption device of claim 12, wherein the projection matrix is randomly obtained from a random matrix, each element of the random matrix being a random variable, the random variables being independent of each other and having the same distribution, wherein the random matrix satisfies: the expected value of the product of the transpose of the random matrix and the random matrix is the identity matrix.

17. The data encryption device according to claim 16, wherein the second dimension space is an r-dimensional space, and the random variable follows a Gaussian distribution with an expected value of 0 and a variance of 1/r.

18. The data encryption device of claim 16, wherein the second dimension space is an r-dimensional space, and the random variable is uniformly distributed on an interval [interval formula not reproduced].

19. The data encryption device of claim 16, wherein the second dimension space is an r-dimensional space, and the random variable follows a distribution that takes the values [formula not reproduced], 0 and [formula not reproduced] with respective probabilities [formula not reproduced].

20. The data encryption device of claim 12, wherein the differential privacy algorithm is an (ε, δ)-differential privacy algorithm, and the first parameter comprises ε and δ.

21. The data encryption device of claim 20, wherein the second parameter comprises η and v, wherein η represents the maximum relative error in the distance between pairs of points defined by the original data matrix after processing by the device, v represents the maximum probability that the device fails over multiple executions, and determining the number of dimensions of the second dimension space and the value of the perturbation parameter based on the first and second parameters comprises determining the number of dimensions of the second dimension space based on η and v.

22. The data encryption device of claim 21, wherein determining the number of dimensions of a second dimension space and the value of the perturbation parameter based on the first and second parameters comprises determining the value of the perturbation parameter based on epsilon, delta, and the number of dimensions of the second dimension space.