CN107766742B - Multi-correlation differential privacy matrix decomposition method under non-independent same-distribution environment - Google Patents


Info

Publication number
CN107766742B
Authority
CN
China
Prior art keywords
matrix
user
item
correlation
factor matrix
Prior art date
Legal status
Active
Application number
CN201711065040.4A
Other languages
Chinese (zh)
Other versions
CN107766742A (en)
Inventor
李先贤
傅星珵
王利娥
刘鹏
褚宏光
Current Assignee
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date
Filing date
Publication date
Application filed by Guangxi Normal University filed Critical Guangxi Normal University
Priority to CN201711065040.4A
Publication of CN107766742A
Application granted
Publication of CN107766742B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/60 Protecting data
    • G06F 21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Algebra (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-correlation differential privacy matrix decomposition method for the non-independent and identically distributed (non-i.i.d.) environment. The method takes into account the multiple correlations among other attributes of the data and uses a correlation target perturbation mechanism to introduce these correlation properties into the model objective function, ensuring both the security and the effectiveness of the prediction result. The method mainly comprises two parts: computation of a correlation noise matrix, such that the generated random noise matrix guarantees that the prediction result satisfies differential privacy under the non-i.i.d. assumption, and a correlation differential privacy matrix decomposition training process that introduces the multi-correlation of the other attributes and adds the random noise matrices. Under the condition of guaranteeing data privacy, the invention improves prediction accuracy as much as possible in order to offset the accuracy loss caused by privacy protection.

Description

Multi-correlation differential privacy matrix decomposition method under non-independent same-distribution environment
Technical Field
The invention relates to the technical field of data privacy protection, in particular to a multi-correlation differential privacy matrix decomposition method in a non-independent and same-distribution environment.
Background
The recommendation system is widely used in today's society, especially in the Internet industry, and matrix decomposition is a popular collaborative filtering method for building recommendation systems. In a collaborative filtering recommendation system, users' scores on items may reveal personal privacy: personal preferences (scoring data) can be exploited to infer a user's health condition, political tendency, or even true identity. The scores in the raw scoring data are therefore sensitive, the scoring matrix contains the users' private information, and using it without protection carries a risk of privacy leakage, a fact that is now well recognized by researchers in the field.
Researchers have by now proposed many anonymity-based protection models and, by combining them with the differential privacy model, differential privacy matrix decomposition models for both trusted and untrusted recommendation systems. However, both matrix decomposition and differential privacy are built on the assumption that the data set is independently and identically distributed, whereas data in real scenarios are often correlated. On real data, matrix decomposition therefore suffers a loss of recommendation accuracy, and the correlations between data cause the original privacy protection capability to be lost.
Since non-independently and identically distributed (non-i.i.d.) data with correlation characteristics are closer to reality and of greater research value, research on correlated data is a current hot topic. Most existing privacy protection research is based on the assumption of independent and identically distributed data and does not take the associations between individuals into account; compared with i.i.d. data, non-i.i.d. data with complex associations are both more valuable and more challenging. For matrix decomposition on non-i.i.d. data, the main problems are the following:
(1) Correlations exist between users and between items; when the traditional differential privacy model is applied to such non-independently distributed scoring data, it adds too much noise, which greatly reduces data utility;
(2) The correlation properties between users and between items strengthen an attacker's background knowledge, but they can also be supplied to matrix decomposition as auxiliary information to improve prediction accuracy. Traditional matrix decomposition methods, however, do not take these correlation properties into account;
(3) Once the respective correlations of users and items are introduced to improve the utility of matrix decomposition, the traditional differential privacy mechanism can no longer guarantee privacy, so a new differential privacy mechanism is needed to ensure that privacy is not leaked.
Disclosure of Invention
The invention aims to solve the problem that the conventional differential privacy matrix decomposition loses the original privacy protection capability when the non-independent and identically distributed data are faced, and provides a multi-correlation differential privacy matrix decomposition method under the non-independent and identically distributed environment.
In order to solve the problems, the invention is realized by the following technical scheme:
the method for decomposing the multi-correlation differential privacy matrix in the non-independent same-distribution environment specifically comprises the following steps:
step 1, preprocessing the attribute spaces of the users and the items, and calculating a user correlation coefficient matrix and an item correlation coefficient matrix respectively;
step 2, based on the differential privacy model, generating random noise matrices that obey the Laplace distribution for the objective function of the matrix decomposition into which the multiple correlations are introduced; namely:
step 2.1, calculating the value ranges of the user correlation coefficients, the item correlation coefficients and the scoring data, namely the difference between the maximum and minimum values, and calculating the sensitivity of the user factor matrix and the sensitivity of the item factor matrix from these ranges;
step 2.2, calculating random numbers that obey the Laplace distribution from the sensitivity of the user factor matrix and the sensitivity of the item factor matrix respectively, and uniformly and randomly generating a group of random numbers such that the L1 norm of the group, viewed as a vector, is exactly equal to the Laplace-distributed random number just obtained, thereby obtaining a user random noise matrix and an item random noise matrix;
step 3, training the objective function with the stochastic gradient descent method to realize the correlation differential privacy matrix decomposition;
step 3.1, uniformly and randomly selecting vectors of random numbers from the L1 norm sphere to construct a user factor matrix and an item factor matrix, wherein the user factor matrix is a d × n matrix, the item factor matrix is a d × m matrix, n is the number of users, m is the number of items, and d is the decomposition dimension;
step 3.2, judging whether the iteration is finished, namely whether the current iteration count has reached the set maximum number of iterations; if not, continuing downwards; if so, executing step 3.6;
step 3.3, calculating the Error matrix of this iteration:
Error = R - U^T * V
wherein R represents the users' item scoring matrix, U represents the current user factor matrix, V represents the current item factor matrix, and T denotes transposition;
step 3.4, traversing each row of the scoring matrix R, calculating the partial derivative of the objective function for each row with respect to the current user factor matrix U, and updating the user factor matrix U by adding to each user column of U the partial derivative of the corresponding row;
step 3.5, traversing each column of the scoring matrix R, calculating the partial derivative of the objective function for each column with respect to the current item factor matrix V, and updating the item factor matrix V by adding to each item column of V the partial derivative of the corresponding column;
step 3.6, repeating steps 3.2 to 3.5 until the iteration is finished; when the iteration is finished, calculating and outputting the prediction matrix R':
R' = U^T * V
wherein U represents the current user factor matrix, V represents the current item factor matrix, and T denotes transposition.
In step 1, the correlation coefficient Jaccard(X, Y) of 2 users or of 2 items is calculated using the Jaccard similarity distance as:
Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
where |X ∩ Y| represents the number of common attributes of the 2 users or items, and |X ∪ Y| represents the number of all attributes of the 2 users or items.
In step 2.1, the sensitivity USens of the user factor matrix and the sensitivity VSens of the item factor matrix are given by the two sensitivity formulas of the correlation target perturbation mechanism (equation images, not reproduced in this text), in which RRange denotes the value range of the scoring data, URange denotes the value range of the user correlation coefficients, VRange denotes the value range of the item correlation coefficients, Δ^U_{i,o} denotes the correlation coefficient between user i and user o, Δ^V_{j,w} denotes the correlation coefficient between item j and item w, o ∈ [n]_{-i} indicates that user o belongs to the set of users 1 to n excluding user i, w ∈ [m]_{-j} indicates that item w belongs to the set of items 1 to m excluding item j, n is the number of users, and m is the number of items.
In step 2.2, the i-th column vector n^U_i of the user random noise matrix is drawn as:
n^U_i ~ Lap(USens / ε)
and the j-th column vector n^V_j of the item random noise matrix is drawn as:
n^V_j ~ Lap(VSens / ε)
where USens represents the sensitivity of the user factor matrix, VSens represents the sensitivity of the item factor matrix, ε represents the set privacy budget, and Lap(·) represents the probability density function of the Laplace distribution.
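For reference, the Laplace probability density used here has the standard form below, with the scale parameter set to the sensitivity divided by the privacy budget (USens/ε for the user noise and VSens/ε for the item noise):

```latex
\mathrm{Lap}(x \mid b) \;=\; \frac{1}{2b}\,\exp\!\Bigl(-\frac{|x|}{b}\Bigr),
\qquad b = \frac{\mathrm{USens}}{\epsilon} \ \ \text{or}\ \ b = \frac{\mathrm{VSens}}{\epsilon}.
```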
In step 3.4, the partial derivative of the objective function with respect to the i-th column of the user factor matrix U is given by the gradient formula of the method (equation image, not reproduced in this text), in which v_j represents the column vector of the j-th item in the item factor matrix V, u_i represents the column vector of the i-th user in the user factor matrix U, r_ij represents the score of the i-th user for the j-th item, λ is the user regularization parameter, u_l represents the column vector of the l-th user in the user factor matrix U, Δ^U_{i,l} represents the correlation coefficient between the i-th user and the l-th user, and n^U_i represents the i-th column vector of the user random noise matrix; the formula also involves the set of (user, item) pairs that have scores in the scoring matrix R, the number M of such pairs, and the set of items j scored by the i-th user; l ∈ [n]_{-i} indicates that user l belongs to the set of users 1 to n excluding user i, and T indicates transposition.
In step 3.5, the partial derivative of the objective function with respect to the j-th column of the item factor matrix V is given by the corresponding gradient formula (equation image, not reproduced in this text), in which v_j represents the column vector of the j-th item in the item factor matrix V, u_i represents the column vector of the i-th user in the user factor matrix U, r_ij represents the score of the i-th user for the j-th item, μ is the item regularization parameter, v_k represents the column vector of the k-th item in the item factor matrix V, Δ^V_{j,k} represents the correlation coefficient between the j-th item and the k-th item, and n^V_j represents the j-th column vector of the item random noise matrix; the formula also involves the set of (user, item) pairs that have scores in the scoring matrix R, the number M of such pairs, and the set of users i who scored item j; k ∈ [m]_{-j} indicates that item k belongs to the set of items 1 to m excluding item j, and T indicates transposition.
Compared with the prior art, and aiming at the application background of real data in recommendation systems, the invention improves the original privacy protection model according to the correlations present in non-independently and identically distributed data. The improved privacy protection model has the following characteristics:
(1) Because of the correlations among non-i.i.d. data, the improved privacy protection model must introduce the correlations between data as a factor in the recommendation system. Therefore, according to the data characteristics, the correlation matrices of the data are calculated and constructed, and a correlation-coefficient error regularization term is computed and introduced into the objective function of the matrix decomposition, which improves the prediction accuracy of the model.
(2) The invention protects data privacy with a differential privacy method and designs a new perturbation algorithm, the correlation target perturbation mechanism, to ensure that the new model still satisfies the differential privacy model while the data correlations are introduced into the matrix decomposition training process.
(3) Centering on correlation as auxiliary background knowledge, and fully considering both that an attacker can use this background knowledge to increase the probability of a successful attack and that matrix decomposition can use the correlations to improve prediction accuracy, the invention provides a multi-correlation differential privacy matrix decomposition method, so that, under the condition of guaranteeing data privacy, the accuracy loss caused by privacy protection is offset by improving the prediction accuracy as much as possible.
Drawings
FIG. 1 is a diagram of the model structure;
FIG. 2 is a flow chart of the correlation random noise matrix calculation;
FIG. 3 is a flow chart of the correlation differential privacy matrix decomposition training process.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings in conjunction with specific examples.
A method for decomposing a multi-correlation differential privacy matrix in a non-independent same distribution environment specifically comprises the following steps:
step 1, preprocessing data and calculating parameters needed by training a model.
The factor matrices U and V are first initialized, i.e. the column vectors of U and V are chosen uniformly at random from the L1 norm sphere. Then the correlation coefficient matrices Δ_U and Δ_V of the users and the items are calculated from the attribute spaces of the users and the items, respectively.
Step 2, calculating the noise matrices that must be added for the new model to satisfy differential privacy.
The scoring data used by the invention are the users' scores on items; besides the scores themselves, the attacker's background knowledge may include the degree of correlation between items and other auxiliary information.
First, the sensitivities USens and VSens of the user factor matrix and the item factor matrix are calculated from the original scoring matrix R, the user correlation coefficient matrix Δ_U and the item correlation coefficient matrix Δ_V, according to the correlation target perturbation mechanism and the privacy budget ε. Then random numbers obeying the Laplace distributions Lap(USens/ε) and Lap(VSens/ε) are selected uniformly at random and used as the L1 norms of the column vectors of the noise matrices for the two factor matrices U and V, respectively; for each column, a group of random numbers with exactly that L1 norm is generated uniformly at random and added to the noise matrix as a column vector, yielding the user random noise matrix N_U and the item random noise matrix N_V.
Step 3, the training process of the model.
Because the original scoring data R are sparse, the stochastic gradient descent method is adopted for training. In each iteration, for every element of the original scoring data that has a value, the error is calculated from the objective function of the model, the correlation coefficient matrices obtained in the previous step and the noise matrices, and U and V are updated by the gradient formulas. Finally, the inner-product matrix R' of U and V is computed and the result R' is output.
The objective function of the multi-correlation differential privacy matrix decomposition is:
(equation image, not reproduced in this text)
wherein Δ^U_{i,l} is the correlation coefficient of the i-th user and the l-th user, and Δ^V_{j,k} is the correlation coefficient of the j-th item (movie) and the k-th item (movie).
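A plausible rendering of that objective is sketched below in LaTeX. It assumes a squared-error loss over the set Ω of observed (user, item) scores, correlation regularizers weighted by λ and μ as used in steps 3.4 and 3.5, and linear noise terms scaled by 1/M; the exact weighting and normalization of the patent's equation are not confirmed by this text.

```latex
\min_{U,V}\;
\sum_{(i,j)\in\Omega}\bigl(r_{ij}-u_i^{\mathsf T}v_j\bigr)^{2}
\;+\;\lambda\sum_{i=1}^{n}\sum_{l\in[n]_{-i}}\Delta^{U}_{i,l}\,\lVert u_i-u_l\rVert^{2}
\;+\;\mu\sum_{j=1}^{m}\sum_{k\in[m]_{-j}}\Delta^{V}_{j,k}\,\lVert v_j-v_k\rVert^{2}
\;+\;\frac{1}{M}\Bigl(\sum_{i=1}^{n}\bigl(n^{U}_{i}\bigr)^{\mathsf T}u_i+\sum_{j=1}^{m}\bigl(n^{V}_{j}\bigr)^{\mathsf T}v_j\Bigr)
```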
the key steps and principles of the process of the invention are described in further detail below:
model structure
As shown in FIG. 1, the model structure of the multi-correlation differential privacy matrix decomposition based on non-independently and identically distributed data is as follows:
(1) The model is composed of two modules: a data preprocessing module and a correlation target perturbation mechanism module.
(2) The data preprocessing module preprocesses the original scores R and the attribute spaces of the users and items, and calculates the user and item correlation coefficient matrices Δ_U and Δ_V, respectively.
(3) The correlation target perturbation mechanism module comprises two sub-modules: correlation random noise matrix generation and correlation differential privacy matrix decomposition training. The user and item random noise matrices N_U and N_V are calculated from the original scores R and the correlation coefficient matrices Δ_U and Δ_V, respectively, and noise is then added during the matrix decomposition training process according to these random noise matrices.
Second, data preprocessing
The data preprocessing mainly calculates the correlation coefficient matrices of the users and the items. The correlation coefficients of users and of items are calculated from the data in the user and item attribute spaces; common calculation methods include the Jaccard similarity distance, the Pearson correlation coefficient, and the like.
TABLE 1 Users' ratings of movies (table given as an image, not reproduced in this text)
TABLE 2 attribute space of users in movie rating data
User Sex Age Occupation Zip-code
Alice F Under 18 K-12 student 48267
Bob M 56+ self-employed 70072
Cindy M 25-34 scientist 55117
Dale M 45-49 executive/managerial 02460
Eric F 50-55 homemaker 55117
TABLE 3 Attribute space for movies in movie rating data
Movie Genres
Toy Story Animation|Children's|Comedy
Jumanji Adventure|Children's|Fantasy
Grumpier Old Men Comedy|Romance
Waiting to Exhale Comedy|Drama
Father of the Bride Part II Comedy
Heat Thriller
The correlation coefficient matrix is calculated from the attribute values in the attribute space; since most attributes are non-numerical, the Jaccard similarity coefficient is used, with the following formula:
Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
where X and Y represent the attribute sets of user 1 and user 2, respectively; the Jaccard similarity coefficient is the ratio of the number of attributes common to both users to the number of attributes owned by the two users together. As shown in Table 2, Alice and Bob share no attribute value in the user attribute space, so their Jaccard similarity coefficient is 0; Cindy and Eric share only the zip code 55117, so their Jaccard similarity coefficient is 1/(4 + 4 - 1) = 1/7.
In the movie attribute space of Table 3 there is only one attribute, and it is set-valued, so the set of its values is used as the attribute set in the calculation; e.g. the Jaccard similarity coefficient of Toy Story and Jumanji, which share only the genre Children's, is 1/(3 + 3 - 1) = 1/5.
From the above formulas, the pairwise user-user and movie-movie Jaccard similarity coefficients can be obtained, i.e. the user-user correlation coefficient matrix and the item-item correlation coefficient matrix.
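As a concrete illustration, the following minimal Python sketch assembles the pairwise coefficients into the correlation coefficient matrices Δ_U and Δ_V. The dictionaries simply mirror Tables 2 and 3, and all function and variable names are illustrative, not taken from the patent.

```python
# Minimal sketch: Jaccard correlation coefficient matrices from attribute sets.
import numpy as np

users = {
    "Alice": {"F", "Under 18", "K-12 student", "48267"},
    "Bob":   {"M", "56+", "self-employed", "70072"},
    "Cindy": {"M", "25-34", "scientist", "55117"},
    "Dale":  {"M", "45-49", "executive/managerial", "02460"},
    "Eric":  {"F", "50-55", "homemaker", "55117"},
}
movies = {
    "Toy Story":                   {"Animation", "Children's", "Comedy"},
    "Jumanji":                     {"Adventure", "Children's", "Fantasy"},
    "Grumpier Old Men":            {"Comedy", "Romance"},
    "Waiting to Exhale":           {"Comedy", "Drama"},
    "Father of the Bride Part II": {"Comedy"},
    "Heat":                        {"Thriller"},
}

def jaccard(x: set, y: set) -> float:
    """|X ∩ Y| / |X ∪ Y|: shared attributes over all attributes of the pair."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def correlation_matrix(entities: dict) -> np.ndarray:
    """Pairwise Jaccard coefficients for every pair of users (or items)."""
    keys = list(entities)
    delta = np.zeros((len(keys), len(keys)))
    for a, ka in enumerate(keys):
        for b, kb in enumerate(keys):
            delta[a, b] = jaccard(entities[ka], entities[kb])
    return delta

delta_U = correlation_matrix(users)    # user correlation coefficient matrix
delta_V = correlation_matrix(movies)   # item (movie) correlation coefficient matrix
print(delta_U[2, 4])                   # Cindy vs Eric: one shared attribute -> 1/7
```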
Third, correlation target perturbation mechanism
Based on non-independently and identically distributed data, the invention provides a differential privacy matrix decomposition method that considers the multiple correlations among other attributes of the data; a correlation target perturbation mechanism introduces these correlation properties of the data into the model objective function, ensuring both the security and the effectiveness of the prediction result. The method mainly comprises two parts: computation of a correlation noise matrix, such that the generated random noise matrix guarantees that the prediction result satisfies differential privacy under the non-i.i.d. assumption, and a correlation differential privacy matrix decomposition training process that introduces the multi-correlation of the other attributes and adds the random noise matrices.
(1) Correlation random noise matrix
The correlation target perturbation mechanism is based on the differential privacy model and generates random noise matrices obeying the Laplace distribution for the objective function of the matrix decomposition into which the multiple correlations are introduced. Referring to FIG. 2, the detailed steps are as follows:
step 1, calculating the value ranges of the grading data, the user correlation coefficient and the project correlation coefficient respectively, namely the difference between the maximum value and the minimum value, and recording as RRange, URange and VRange.
Step 2, calculating the sensitivities USens and VSens of the user factor matrix and the item factor matrix respectively.
The sensitivity USens of the user factor matrix and the sensitivity VSens of the item factor matrix are given by the two sensitivity formulas of the correlation target perturbation mechanism (equation images, not reproduced in this text), in which RRange denotes the value range of the scoring data, URange denotes the value range of the user correlation coefficients, VRange denotes the value range of the item correlation coefficients, Δ^U_{i,o} denotes the correlation coefficient between user i and user o, Δ^V_{j,w} denotes the correlation coefficient between item j and item w, o ∈ [n]_{-i} indicates that user o belongs to the set of users 1 to n excluding user i, w ∈ [m]_{-j} indicates that item w belongs to the set of items 1 to m excluding item j, n is the number of users, and m is the number of items.
Step 3, calculating random numbers obeying the Laplace distribution from the obtained sensitivities, and uniformly and randomly generating a group of random numbers such that the L1 norm of the group, viewed as a vector, is exactly equal to the Laplace-distributed random number just obtained. The Laplace-distributed random vectors are drawn as follows: the i-th column vector n^U_i of the user random noise matrix is drawn as
n^U_i ~ Lap(USens / ε)
and the j-th column vector n^V_j of the item random noise matrix is drawn as
n^V_j ~ Lap(VSens / ε)
where USens represents the sensitivity of the user factor matrix, VSens represents the sensitivity of the item factor matrix, ε represents the set privacy budget, Lap(·) represents the probability density function of the Laplace distribution, and ~ indicates that the value of the random vector is proportional to that probability density function.
Step 4, returning the user random noise matrix N_U and the item random noise matrix N_V.
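A minimal sketch of this noise-matrix construction follows (Python with NumPy). It assumes the column magnitude is the absolute value of a Laplace(sensitivity/ε) draw and the direction is drawn uniformly from the unit L1 sphere; both details are assumptions about what the equation images above do not spell out in this text, and all names are illustrative.

```python
# Hypothetical sketch of the correlation random noise matrix generation.
import numpy as np

rng = np.random.default_rng(0)

def sample_l1_sphere(d: int) -> np.ndarray:
    """Draw a d-dimensional vector uniformly at random with L1 norm exactly 1."""
    e = rng.exponential(scale=1.0, size=d)
    signs = rng.choice([-1.0, 1.0], size=d)
    return signs * e / e.sum()

def noise_matrix(d: int, cols: int, sensitivity: float, epsilon: float) -> np.ndarray:
    """Each column's L1 norm is a Laplace(sensitivity/epsilon)-distributed magnitude."""
    N = np.empty((d, cols))
    for c in range(cols):
        magnitude = abs(rng.laplace(loc=0.0, scale=sensitivity / epsilon))
        N[:, c] = magnitude * sample_l1_sphere(d)
    return N

# e.g. N_U = noise_matrix(d, n, USens, eps) and N_V = noise_matrix(d, m, VSens, eps)
```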
(2) Correlation differential privacy matrix decomposition
The correlation differential privacy matrix decomposition is the training stage of the multi-correlation differential privacy matrix decomposition method; its objective function is optimized with the stochastic gradient descent method. In this stage, the correlation coefficient matrices and the random noise matrices calculated above are used so that the requirement of the differential privacy protection model is satisfied, while the correlations are used to improve prediction accuracy and thus offset the accuracy loss caused by protecting privacy. Referring to FIG. 3, the detailed steps of the training process are as follows:
step 1, uniformly and randomly selecting a vector formed by random numbers from an L1 norm sphere, and constructing factor matrixes U and V, wherein a user factor matrix is a matrix with the size of d multiplied by n, a project factor matrix is a matrix with the size of d multiplied by m, n is the number of users, m is the number of projects, and d is a decomposition dimension.
Step 2, judging whether the iteration is finished; if not, continuing downwards; if so, going to step 7.
Step 3, calculating the error matrix of this iteration by the formula:
Error = R - U^T * V
where R represents the original n × m user-item scoring matrix, U represents the user factor matrix, V represents the item factor matrix, and T represents the transpose.
Step 4, traversing each row of the original scoring matrix R and calculating the partial derivative of the objective function for that row with respect to U. The partial derivative for the i-th row is given by the gradient formula of the method (equation image, not reproduced in this text), in which v_j represents the column vector of the j-th item in the item factor matrix V, u_i represents the column vector of the i-th user in the user factor matrix U, r_ij represents the score of the i-th user for the j-th item, λ is the user regularization parameter, u_l represents the column vector of the l-th user in the user factor matrix U, Δ^U_{i,l} represents the correlation coefficient between the i-th user and the l-th user, and n^U_i represents the i-th column vector of the user random noise matrix; the formula also involves the set of (user, item) pairs that have scores in the scoring matrix R, the number M of such pairs, and the set of items j scored by the i-th user; l ∈ [n]_{-i} indicates that user l belongs to the set of users 1 to n excluding user i, and T represents the transpose.
Step 5, traversing each column of the original scoring matrix R and calculating the partial derivative of the objective function for that column with respect to V. The partial derivative for the j-th column is given by the corresponding gradient formula (equation image, not reproduced in this text), in which v_j represents the column vector of the j-th item in the item factor matrix V, u_i represents the column vector of the i-th user in the user factor matrix U, r_ij represents the score of the i-th user for the j-th item, μ is the item regularization parameter, v_k represents the column vector of the k-th item in the item factor matrix V, Δ^V_{j,k} represents the correlation coefficient between the j-th item and the k-th item, and n^V_j represents the j-th column vector of the item random noise matrix; the formula also involves the set of (user, item) pairs that have scores in the scoring matrix R, the number M of such pairs, and the set of users i who scored item j; k ∈ [m]_{-j} indicates that item k belongs to the set of items 1 to m excluding item j, and T represents the transpose.
Step 6, updating the corresponding columns of U and V with the partial derivatives obtained in steps 4 and 5, i.e. adding to each column u_i its partial derivative and to each column v_j its partial derivative (the update formulas are given as equation images, not reproduced in this text), where i ∈ [n] and j ∈ [m].
Step 7, repeating steps 2 to 6 until the iteration is completed, then calculating the prediction matrix R' = U^T * V and outputting R'.
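A schematic sketch of this training loop follows (Python with NumPy). The gradient expressions in the patent are given only as equation images, so the update below assumes the squared-error objective with correlation regularizers and linear noise terms sketched earlier, plus an explicit learning rate eta that the patent does not name; it illustrates the structure of steps 1 to 7 under those assumptions rather than reproducing the patent's exact formulas.

```python
# Schematic sketch of the correlation differential privacy matrix decomposition training.
import numpy as np

def train(R, mask, delta_U, delta_V, N_U, N_V, d=10,
          lam=0.1, mu=0.1, eta=0.01, iters=100, seed=0):
    """R: n x m score matrix with unobserved entries set to 0; mask: 0/1 indicator of observed entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    M = int(mask.sum())                                  # number of observed scores

    def sample_l1_sphere(dim):
        e = rng.exponential(size=dim)
        return rng.choice([-1.0, 1.0], size=dim) * e / e.sum()

    # Step 1: columns of U (d x n) and V (d x m) drawn from the L1 norm sphere.
    U = np.column_stack([sample_l1_sphere(d) for _ in range(n_users)])
    V = np.column_stack([sample_l1_sphere(d) for _ in range(n_items)])

    for _ in range(iters):                               # Step 2: fixed iteration budget
        E = mask * (R - U.T @ V)                         # Step 3: error on observed entries
        for i in range(n_users):                         # Step 4: user-side updates
            grad = (-2.0 * (V @ E[i, :])
                    + 2.0 * lam * (delta_U[i, :].sum() * U[:, i] - U @ delta_U[i, :])
                    + N_U[:, i] / M)                     # assumed gradient form
            U[:, i] -= eta * grad
        for j in range(n_items):                         # Step 5: item-side updates
            grad = (-2.0 * (U @ E[:, j])
                    + 2.0 * mu * (delta_V[j, :].sum() * V[:, j] - V @ delta_V[j, :])
                    + N_V[:, j] / M)                     # assumed gradient form
            V[:, j] -= eta * grad
    return U.T @ V                                       # Step 7: prediction matrix R'
```

With R an n × m scoring matrix and mask its 0/1 observation indicator, the returned matrix plays the role of the output R' = U^T * V described above.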
The invention realizes matrix decomposition that satisfies the differential privacy model on non-independently and identically distributed data, and can effectively improve prediction accuracy while guaranteeing security. When the recommendation system makes recommendations, the users' scoring data are protected and the recommendation accuracy of the system is improved.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims (6)

1. A multi-correlation differential privacy matrix decomposition method in a non-independent and identically distributed environment, characterized by comprising the following steps:
step 1, preprocessing the attribute spaces of the users and the items, and calculating a user correlation coefficient matrix and an item correlation coefficient matrix respectively;
step 2, based on the differential privacy model, generating a user random noise matrix and an item random noise matrix that obey the Laplace distribution for the objective function of the matrix decomposition into which the multiple correlations are introduced; namely:
step 2.1, calculating the value ranges of the user correlation coefficients, the item correlation coefficients and the scoring data, namely the difference between the maximum and minimum values, and calculating the sensitivity of the user factor matrix and the sensitivity of the item factor matrix from these ranges;
step 2.2, calculating random numbers that obey the Laplace distribution from the sensitivity of the user factor matrix and the sensitivity of the item factor matrix respectively, and uniformly and randomly generating a group of random numbers such that the L1 norm of the group, viewed as a vector, is exactly equal to the Laplace-distributed random number just obtained, thereby obtaining a user random noise matrix and an item random noise matrix;
step 3, training the objective function with the stochastic gradient descent method to realize the correlation differential privacy matrix decomposition;
step 3.1, uniformly and randomly selecting vectors of random numbers from the L1 norm sphere to construct a user factor matrix and an item factor matrix, wherein the user factor matrix is a d × n matrix, the item factor matrix is a d × m matrix, n is the number of users, m is the number of items, and d is the decomposition dimension;
step 3.2, judging whether the iteration is finished, namely whether the current iteration count has reached the set maximum number of iterations; if not, continuing downwards; if so, executing step 3.6;
step 3.3, calculating the Error matrix of this iteration:
Error = R - U^T * V
wherein R represents the users' item scoring matrix, U represents the current user factor matrix, V represents the current item factor matrix, and T denotes transposition;
step 3.4, traversing each row of the scoring matrix R, calculating the partial derivative of the objective function for each row with respect to the current user factor matrix U, and updating the user factor matrix U by adding to each user column of U the partial derivative of the corresponding row;
step 3.5, traversing each column of the scoring matrix R, calculating the partial derivative of the objective function for each column with respect to the current item factor matrix V, and updating the item factor matrix V by adding to each item column of V the partial derivative of the corresponding column;
step 3.6, repeating steps 3.2 to 3.5 until the iteration is finished; when the iteration is finished, calculating and outputting the prediction matrix R':
R' = U^T * V
wherein U represents the current user factor matrix, V represents the current item factor matrix, and T denotes transposition.
2. The method of claim 1, wherein in step 1, Jaccard's similarity distance is used to calculate the correlation coefficient Jaccard (X, Y) of 2 users or items as:
Jaccard(X, Y) = |X ∩ Y| / |X ∪ Y|
where |X ∩ Y| represents the number of common attributes of the 2 users or items, and |X ∪ Y| represents the number of all attributes of the 2 users or items.
3. The method according to claim 1, wherein in step 2.1 the sensitivity USens of the user factor matrix and the sensitivity VSens of the item factor matrix are given by the two sensitivity formulas of the correlation target perturbation mechanism (equation images, not reproduced in this text), in which RRange represents the value range of the scoring data, URange represents the value range of the user correlation coefficients, VRange represents the value range of the item correlation coefficients, Δ^U_{i,o} represents the correlation coefficient between user i and user o, Δ^V_{j,w} represents the correlation coefficient between item j and item w, o ∈ [n]_{-i} indicates that user o belongs to the set of users 1 to n excluding user i, w ∈ [m]_{-j} indicates that item w belongs to the set of items 1 to m excluding item j, n is the number of users, m is the number of items, and d is the decomposition dimension.
4. The method according to claim 1, wherein in step 2.2 the i-th column vector n^U_i of the user random noise matrix is drawn as
n^U_i ~ Lap(USens / ε)
and the j-th column vector n^V_j of the item random noise matrix is drawn as
n^V_j ~ Lap(VSens / ε)
where USens represents the sensitivity of the user factor matrix, VSens represents the sensitivity of the item factor matrix, ε represents the set privacy budget, Lap(·) represents the probability density function of the Laplace distribution, and ~ indicates that the value of the random vector is proportional to that probability density function.
5. The method according to claim 1, wherein in step 3.4 the partial derivative of the objective function with respect to the i-th column of the user factor matrix U is given by the gradient formula of the method (equation image, not reproduced in this text), in which v_j represents the column vector of the j-th item in the item factor matrix V, u_i represents the column vector of the i-th user in the user factor matrix U, r_ij represents the score of the i-th user for the j-th item, λ is the user regularization parameter, u_l represents the column vector of the l-th user in the user factor matrix U, Δ^U_{i,l} represents the correlation coefficient between the i-th user and the l-th user, and n^U_i represents the i-th column vector of the user random noise matrix; the formula also involves the set of (user, item) pairs that have scores in the scoring matrix R, the number M of such pairs, and the set of items j scored by the i-th user; l ∈ [n]_{-i} indicates that user l belongs to the set of users 1 to n excluding user i, and T indicates transposition.
6. The method according to claim 1, wherein in step 3.5 the partial derivative of the objective function with respect to the j-th column of the item factor matrix V is given by the corresponding gradient formula (equation image, not reproduced in this text), in which v_j represents the column vector of the j-th item in the item factor matrix V, u_i represents the column vector of the i-th user in the user factor matrix U, r_ij represents the score of the i-th user for the j-th item, μ is the item regularization parameter, v_k represents the column vector of the k-th item in the item factor matrix V, Δ^V_{j,k} represents the correlation coefficient between the j-th item and the k-th item, and n^V_j represents the j-th column vector of the item random noise matrix; the formula also involves the set of (user, item) pairs that have scores in the scoring matrix R, the number M of such pairs, and the set of users i who scored item j; k ∈ [m]_{-j} indicates that item k belongs to the set of items 1 to m excluding item j, and T indicates transposition.
CN201711065040.4A 2017-11-02 2017-11-02 Multi-correlation differential privacy matrix decomposition method under non-independent same-distribution environment Active CN107766742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711065040.4A CN107766742B (en) 2017-11-02 2017-11-02 Multi-correlation differential privacy matrix decomposition method under non-independent same-distribution environment


Publications (2)

Publication Number Publication Date
CN107766742A CN107766742A (en) 2018-03-06
CN107766742B true CN107766742B (en) 2021-02-19

Family

ID=61272434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711065040.4A Active CN107766742B (en) 2017-11-02 2017-11-02 Multi-correlation differential privacy matrix decomposition method under non-independent same-distribution environment

Country Status (1)

Country Link
CN (1) CN107766742B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443061B (en) * 2018-05-03 2023-06-20 创新先进技术有限公司 Data encryption method and device
CN111079177B (en) * 2019-12-04 2023-01-13 湖南大学 Privacy protection method based on wavelet transformation and used for time correlation in track data
CN111177781A (en) * 2019-12-30 2020-05-19 北京航空航天大学 Differential privacy recommendation method based on heterogeneous information network embedding
CN112668044B (en) * 2020-12-21 2022-04-12 中国科学院信息工程研究所 Privacy protection method and device for federal learning
CN113204793A (en) * 2021-06-09 2021-08-03 辽宁工程技术大学 Recommendation method based on personalized differential privacy protection
CN113821732B (en) * 2021-11-24 2022-02-18 阿里巴巴达摩院(杭州)科技有限公司 Item recommendation method and equipment for protecting user privacy and learning system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170091824A1 (en) * 2015-09-25 2017-03-30 The Provost, Fellows, Foundation Scholars, And The Other Members Of The Board Method and system for providing item recommendations in a privacy-enhanced manner
CN106557654B (en) * 2016-11-16 2020-03-17 中山大学 Collaborative filtering method based on differential privacy technology
CN106651549B (en) * 2017-01-09 2019-10-01 山东大学 A kind of personalized automobile recommended method and system merging supply and demand chain

Also Published As

Publication number Publication date
CN107766742A (en) 2018-03-06

Similar Documents

Publication Publication Date Title
CN107766742B (en) Multi-correlation differential privacy matrix decomposition method under non-independent same-distribution environment
Zhu et al. Fairness-aware tensor-based recommendation
Tutz et al. A penalty approach to differential item functioning in Rasch models
Chen et al. Privacy-preserving dynamic personalized pricing with demand learning
Garreta et al. Learning scikit-learn: machine learning in python
CN106909607B (en) A kind of collaborative filtering group recommending method based on random perturbation technology
CN107766745B (en) Hierarchical privacy protection method in hierarchical data release
Niu et al. A relaxed gradient based algorithm for solving Sylvester equations
CN111177781A (en) Differential privacy recommendation method based on heterogeneous information network embedding
CN108280217A (en) A kind of matrix decomposition recommendation method based on difference secret protection
CN107862219B (en) Method for protecting privacy requirements in social network
CN111400612B (en) Personalized recommendation method integrating social influence and project association
CN110837603B (en) Integrated recommendation method based on differential privacy protection
Chen et al. Deep tensor factorization for multi-criteria recommender systems
CN113918832B (en) Graph convolution collaborative filtering recommendation system based on social relationship
Wang et al. DNN-DP: Differential privacy enabled deep neural network learning framework for sensitive crowdsourcing data
CN113918834B (en) Graph convolution collaborative filtering recommendation method fusing social relations
CN108470052A (en) A kind of anti-support attack proposed algorithm based on matrix completion
CN108628955A (en) The personalized method for secret protection and system of commending system
CN110490002A (en) A kind of multidimensional crowdsourcing data true value discovery method based on localization difference privacy
Kasap et al. A polynomial modeling based algorithm in top-N recommendation
Feng et al. Edge–cloud-aided differentially private tucker decomposition for cyber–physical–social systems
Chiang A Note on the⊤‐Stein Matrix Equation
EP4291314A1 (en) Automatic detection of prohibited gaming content
CN113342994A (en) Recommendation system based on non-sampling cooperative knowledge graph network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant