CN113407986B

CN113407986B - Frequent item set mining method for local differential privacy protection based on singular value decomposition

Info

Publication number: CN113407986B
Application number: CN202110556455.1A
Authority: CN
Inventors: 董恺; 池平川
Original assignee: Nanjing Yizhi Network Space Technology Innovation Research Institute Co ltd; Southeast University
Current assignee: Nanjing Yizhi Network Space Technology Innovation Research Institute Co ltd; Southeast University
Priority date: 2021-05-21
Filing date: 2021-05-21
Publication date: 2024-02-23
Anticipated expiration: 2041-05-21
Also published as: CN113407986A

Abstract

The invention discloses a frequent item set mining method with local differential privacy protection based on singular value decomposition, which comprises the following steps. Step 1: estimate the frequencies of frequent items. Step 2: the server builds an initial matrix from the frequent item frequency estimates of step 1, performs singular value decomposition on the matrix, and sends the resulting left and right matrices to the users. Step 3: each user builds a matrix from the items it holds locally, computes the corresponding singular value matrix from the received left and right matrices and the frequent item set group, perturbs the singular value matrix and uploads it to the server. Step 4: the server aggregates and analyzes the received singular value matrices, mines the most frequent item set, updates the initial matrix, computes new left and right singular matrices and sends them to the users. Step 5: steps 3 and 4 are repeated K times, after which the server obtains the final top-K frequent item sets. The method yields accurate mining results while strictly protecting the privacy of users' item set information.

Description

Frequent item set mining method for local differential privacy protection based on singular value decomposition

Technical Field

The invention belongs to the technical field of information security.

Background

With the development of the economy and of science and technology, information technologies such as computing, communication and sensing have advanced rapidly, making modern family life more convenient and comfortable. The smart home has moved from concept to reality and become a familiar term. The intelligence of a smart home lies mainly in the automatic operation of its devices: typically, a sensor detects a trigger condition and, when the condition is met, the system directs the corresponding device to perform the corresponding action. Such trigger-action combinations are the cornerstone of smart home automation. The system needs to provide reasonable combinations to help users deploy smart applications quickly, and it also needs to learn new combinations from users for subsequent improvement and optimization. If triggers and actions are both regarded as items owned by users, then the combinations that occur frequently among the large number of user-owned trigger-action combinations correspond to frequent item sets in a data set.

A frequent item set is a common form of association between data: a collection of several items that frequently occur together in a data set. Frequent item set mining, however, typically requires collecting and analyzing users' raw data. How to mine frequent item sets while protecting users' private information is both key to realizing various smart applications and a technical problem facing their further development.

In recent years, differential privacy has been proposed as a rigorous definition of privacy protection. By adding noise with specific properties to users' personal data, differential privacy prevents a third party from inferring personal information while preserving the overall statistics of the data. Traditional differential privacy, however, requires a trusted server, and the server can still observe users' original data. To avoid relying on a trusted third-party server, local differential privacy was proposed, in which noise is added on the user side: each user uploads only perturbed information, so the original information is never exposed to the server.
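For reference, the standard definition of ε-local differential privacy can be written as below. The symbols A (the randomized mechanism run on the user side), x and x' (any two possible inputs) and t (any possible output) are notation chosen for this note and do not appear elsewhere in the text.

```latex
\Pr[\,A(x) = t\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,A(x') = t\,]
\qquad \text{for all inputs } x, x' \text{ and all outputs } t
```

A smaller ε forces the output distributions for different inputs closer together, which gives stronger privacy at the cost of more noise.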

One prior-art approach guesses frequent item set candidates under the assumption that frequent item sets are composed entirely of frequent items. It first finds the frequent items in the data set under local differential privacy, then constructs frequent item set candidates according to this assumption and sends them to users; users upload the item sets they hold, and the server determines the final top-k frequent item sets through aggregation. However, this approach ignores the possibility that a frequent item set may consist of relatively less frequent items, so some highly correlated frequent item sets may be missed and the estimate becomes inaccurate.

There is also the CALM method, in which each user is assigned to a preset marginal table, perturbs and uploads the values of the corresponding attributes in that table, and the server recovers the joint distribution of multiple attributes through aggregation. However, users in this method must upload multiple values, so the privacy budget is split, which increases both noise and communication overhead.

How to design a frequent item set mining method based on local differential privacy that balances data privacy, mining accuracy and communication overhead remains a difficult problem.

Disclosure of Invention

Purpose of the invention: to solve the above problems in the prior art, the invention provides a frequent item set mining method with local differential privacy protection based on singular value decomposition.

Technical solution: the invention provides a frequent item set mining method with local differential privacy protection based on singular value decomposition, which comprises the following steps:

step 1: estimate the frequencies of the items held by smart home users, sort them in descending order, select the items ranked in the top K by frequency as the frequent items, and number them in order;

step 2: divide the users into K groups; combine the K frequent items selected in step 1 in pairs, each pair being a frequent item set, and build an initial K×K matrix M from these frequent item sets; set up an initial frequent item set group FIS, which is empty;

step 3: let k = 1;

step 4: perform singular value decomposition on M to obtain two orthogonal matrices U and V; the server sends U and V^T to the k-th group of users, where T denotes matrix transposition;

step 5: the r-th user in the k-th group receives U and V^T, builds the singular matrix corresponding to the r-th user from U, V^T, the FIS and the frequent items it holds, perturbs the singular matrix to obtain perturbed information, and uploads the perturbed information to the server, where r = 1, 2, …, R and R is the total number of users in the k-th group;

step 6: the server aggregates and analyzes the R pieces of perturbed information received, mines the most frequent item set, puts it into the FIS, and sets the element of M corresponding to that frequent item set to 0;

step 7: let k = k + 1 and check whether k is greater than K; if so, stop and take the FIS as the mining result; otherwise, return to step 4.

Further, step 1 specifically comprises: dividing the smart home users into three groups. Each user in the first group randomly selects one of its items, perturbs it and uploads the perturbed information to the server; the server aggregates and analyzes the perturbed information from the first group to obtain the items ranked in the top 2K by frequency, which form the candidate item set. The server sends the candidate item set to the second group; each user in the second group computes the intersection of its own item set with the candidate item set, perturbs the number of items in the intersection and uploads the result; the server aggregates and recovers the perturbed information from the second group to obtain the sampling number, and sends the sampling number and the candidate item set to the third group. Each user in the third group randomly selects one item from the intersection of its own item set with the candidate item set, perturbs it and uploads the result; the server aggregates and recovers the perturbed information from the third group to obtain the frequency estimates of the items held by all users.

Further, in step 2, the value of the element in row i and column j of the matrix M, which corresponds to the frequent item set formed by the i-th and j-th frequent items, is calculated as:

$$m(i,j) = \min\bigl(f(i),\, f(j)\bigr)$$

where min(·) is the minimum function, f(i) is the frequency of the i-th frequent item, f(j) is the frequency of the j-th frequent item, i = 1, 2, …, K and j = 1, 2, …, K.
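The construction of M can be sketched in a few lines. The function name build_initial_matrix and the example frequencies are illustrative assumptions, and indices are 0-based in the code while the text counts from 1.

```python
def build_initial_matrix(freqs):
    """Return the K x K matrix with m(i, j) = min(f(i), f(j)).

    `freqs` lists the estimated frequencies of the K frequent items,
    sorted from most to least frequent.
    """
    k = len(freqs)
    return [[min(freqs[i], freqs[j]) for j in range(k)] for i in range(k)]


# Illustrative frequencies only; they are not taken from the patent.
freqs = [0.5, 0.4, 0.2]
M = build_initial_matrix(freqs)
for row in M:
    print(row)
```

Each entry is an upper bound on the frequency of the corresponding pair, since a pair cannot occur more often than its rarer member.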

Further, the step 5 specifically includes:

step 3.1: compare the frequent items held by the r-th user with the frequent item sets in the FIS one by one. If the set of frequent items held by the r-th user is identical to one of the frequent item sets, set all elements of the user's singular matrix S_r to 0, stop the comparison and go to step 3.3. If the intersection of the user's frequent items with a frequent item set contains 0 or 1 elements, the user's frequent items are left unchanged. If the intersection contains 2 elements, delete either one of the frequent items in the intersection from the user's frequent items, thereby updating them, and proceed to the next comparison. Once all frequent item sets in the FIS have been compared, go to step 3.2;

step 3.2: represent the updated frequent items held by the user with a K×K matrix M_r defined as follows:

$$M_r(x,y) = \begin{cases} 1, & x \in Q \text{ and } y \in Q \\ 0, & \text{otherwise} \end{cases}$$

where M_r(x, y) denotes the element in row x and column y of M_r; x = 1, 2, …, K; y = 1, 2, …, K; and Q is the set of numbers of the frequent items held by the user after the update;

compute the singular matrix of the r-th user from U and V^T:

$$S_r = U^{+} M_r \,(V^{T})^{+}$$

where U^+ = U^T and (V^T)^+ = V;

step 3.3: map S_r to a value v in the range [-1, 1], and perturb v to obtain the perturbed information t_1 or t_2; the r-th user uploads t_1 to the server with probability p, or t_2 with probability 1 - p, where ε is the privacy parameter.
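A minimal sketch of such a two-valued randomized response follows. It assumes the one-dimensional mechanism of Duchi et al. as a stand-in, because the mechanism is not named in the text above, and it assumes a linear mapping onto [-1, 1]. The function names and the bounds lo and hi are hypothetical.

```python
import math
import random


def map_to_unit_range(s, lo, hi):
    """Linearly map s from [lo, hi] onto [-1, 1] (assumed mapping)."""
    return 2.0 * (s - lo) / (hi - lo) - 1.0


def upload_probability(v, epsilon):
    """Probability of reporting the positive value for v in [-1, 1]."""
    e = math.exp(epsilon)
    return (v * (e - 1.0) + e + 1.0) / (2.0 * (e + 1.0))


def perturb(v, epsilon, rng):
    """Report +c with probability p, otherwise -c, so that E[report] = v."""
    e = math.exp(epsilon)
    c = (e + 1.0) / (e - 1.0)
    return c if rng.random() < upload_probability(v, epsilon) else -c


rng = random.Random(0)
v = map_to_unit_range(7.5, 0.0, 10.0)   # illustrative value and bounds
report = perturb(v, 1.0, rng)
print(v, report)
```

Under this assumed mechanism the report takes one of two fixed values, so a single bit identifies it, and the ratio of reporting probabilities between the extreme inputs v = 1 and v = -1 equals e^ε.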

Further, the step 6 specifically includes:

the server aggregates and analyzes the received perturbed information to obtain an estimated matrix, and from it calculates a K×K matrix;

the frequent item set corresponding to the element of M at the same position as the maximum element of this K×K matrix is added to the frequent item set group FIS.

Beneficial effects:

(1) The privacy of users' item set information is protected under the strict definition of local differential privacy;

(2) The frequent item set mining results are accurate, and frequent item sets composed of items that are not the most frequent are not overlooked;

(3) Communication overhead between the user side and the server side is reduced.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

The invention provides a new privacy-preserving frequent item set mining method based on the local differential privacy framework, which balances data privacy, mining accuracy and communication overhead. The basic idea is as follows: first, on the distributed user side, a low-dimensional singular value matrix is used to represent each user's original high-dimensional sensitive data matrix; second, the server collects the insensitive singular values and recovers from them an accurate estimate of the frequent item sets.

As shown in FIG. 1, this embodiment provides a frequent item set mining method with local differential privacy protection based on singular value decomposition:

step 1: frequent item frequency estimation

Users may own a wide variety of items. If every possible item were used as the input domain, the estimation error would be too large, because the differential privacy error grows linearly with the size of the value domain, and the utility of the data would suffer. The range of candidate items therefore needs to be narrowed.

The smart home users are divided into three groups. Each user in the first group randomly selects one of its items, perturbs it and uploads it to the server; the server aggregates and analyzes the received information to obtain the items ranked in the top 2K by frequency, which together form the candidate item set. The server sends the candidate item set to the second group; each user in the second group computes the intersection of its own items with the candidate item set, perturbs the number of items in the intersection and uploads it; the server aggregates and recovers the perturbed data from the second group and determines the sampling number, then sends the sampling number and the candidate item set to the third group. Each user in the third group computes the intersection of its own items with the candidate item set; if the intersection contains fewer items than the sampling number, virtual items are added until the number of items equals the sampling number. Each user in the third group then randomly selects one item from its intersection, perturbs it and uploads it; the server aggregates and recovers the received information to estimate the frequencies of the items held by users, and obtains the top-K frequent items, that is, the K items with the highest frequencies.
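The padding and sampling performed by a third-group user can be sketched as follows. The function name pad_and_sample and the ("dummy", n) representation of virtual items are choices made for this sketch, and the subsequent perturbation of the selected item is not shown.

```python
import random


def pad_and_sample(owned_items, candidate_set, sample_size, rng):
    """Pick the single item a third-group user will perturb and upload.

    The intersection of the user's items with the candidate set is padded
    with virtual (dummy) items up to `sample_size`, then one entry is drawn
    uniformly at random. Intersections larger than `sample_size` are left
    as they are, since the text does not say how they are handled.
    """
    common = sorted(set(owned_items) & set(candidate_set))
    padded = list(common)
    for n in range(sample_size - len(common)):
        padded.append(("dummy", n))
    return rng.choice(padded)


rng = random.Random(0)
choice = pad_and_sample(["a", "x"], ["a", "b", "c"], 3, rng)
print(choice)
```

Padding makes every user draw from a list of the same length, so the chance that a real item is reported is the same known value for all padded users.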

Step 2: the users are divided into K groups. The server builds an initial matrix from the frequent item estimates and performs singular value decomposition on it to obtain the left and right matrices (orthogonal matrices, i.e. the left and right singular matrices), which are sent to the k-th group of users, with k = 1 at this point. An initial frequent item set group FIS is set up, which is empty at this point;

The final goal of the invention is to obtain a frequent item set group containing the top-K (K most frequent) item sets, so the users need to be divided into K groups. Let i and j denote two different frequent items, and let f(i) and f(j) denote the frequencies of the i-th and j-th frequent items, respectively, with i = 1, 2, …, K and j = 1, 2, …, K. Using the frequent item frequency estimates from step 1, the server sets the element in row i and column j of M, which corresponds to the frequent item set formed by the i-th and j-th frequent items, to m(i, j) = min(f(i), f(j)). The smaller of the two frequencies is chosen because, for any item i with frequency f(i), the frequency of any frequent item set containing it cannot exceed f(i); to keep the estimate accurate, each element is thus set to the largest value that the frequency of the corresponding item set can take. Singular value decomposition M = UΣV^T yields two orthogonal matrices U and V, where Σ is a diagonal matrix. To reduce communication overhead, this embodiment approximates the matrix using only n = 1 singular value, so that the truncated U and V^T have orders K×n and n×K, respectively; they are sent to the first group of users, where T denotes matrix transposition.
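The rank-one truncation can be illustrated without a linear algebra library. The sketch below uses power iteration, which is a substitute chosen here and is not the decomposition routine prescribed by the patent. It assumes the single retained singular value is the largest one, and the function name is hypothetical.

```python
import math


def leading_singular_triplet(M, iterations=200):
    """Approximate the largest singular value of M and its singular vectors.

    Runs power iteration on M^T M and returns (u, sigma, v) with unit
    vectors u and v, so that M is approximated by sigma * u * v^T.
    Assumes M is non-zero with non-negative entries.
    """
    rows, cols = len(M), len(M[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # w = M v, then z = M^T w, renormalized to unit length
        w = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        z = [sum(M[i][j] * w[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]
    w = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in w))
    u = [x / sigma for x in w]
    return u, sigma, v


# Illustrative matrix built as min(f(i), f(j)) from frequencies 0.5, 0.4, 0.2.
M = [[0.5, 0.4, 0.2], [0.4, 0.4, 0.2], [0.2, 0.2, 0.2]]
u, sigma, v = leading_singular_triplet(M)
print(sigma, u, v)
```

Only the K entries of u and the K entries of v need to be sent to each user in this setting, instead of two full K×K matrices.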

Step 3: each user in the k-th group builds a matrix M_r from its locally held frequent items and the FIS (when k = 1 the FIS is empty and has no effect, so the matrix depends only on the locally held frequent items), then computes a singular value matrix from U and V^T, perturbs it and uploads it to the server; in this embodiment, perturbation means adding differential privacy noise;

step 3.1: compare the frequent items held by the r-th user with the frequent item sets in the FIS one by one. If the frequent items held by the r-th user are exactly one of the frequent item sets, set all elements of the user's singular value matrix S_r to 0, stop the comparison and go to step 3.3. If the intersection of the user's frequent items with a frequent item set contains 0 or 1 items, the user's frequent items are left unchanged. If the intersection contains 2 items, then, to avoid affecting the estimates of other frequent item sets that contain these frequent items, one of the frequent items in the intersection is chosen at random and deleted from the user's frequent items (for example, if the intersection contains the two frequent items a and b and the user holds the three frequent items a, b and d, then either a or b is deleted at random from a, b, d), thereby updating the user's frequent items before the next comparison. Once all frequent item sets in the FIS have been compared, go to step 3.2. The frequent items originally held by the user are still retained in this step;
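The comparison against the FIS can be sketched as below. The function name reconcile_with_fis is hypothetical, and the sketch assumes that each comparison is made against the user's current working set, a detail the text leaves open.

```python
import random


def reconcile_with_fis(held_items, fis, rng):
    """Return (updated_items, zero_flag) for one user.

    `fis` is the list of frequent item sets mined so far, each a 2-element
    frozenset. zero_flag is True when the user's items equal one of them,
    in which case S_r is set to all zeros and the comparison stops.
    """
    working = set(held_items)   # the original list is left untouched
    for item_set in fis:
        if working == item_set:
            return working, True
        common = working & item_set
        if len(common) == 2:
            # drop one of the two shared items, chosen at random
            working.discard(rng.choice(sorted(common)))
        # 0 or 1 shared items: nothing changes
    return working, False


rng = random.Random(1)
updated, zero_flag = reconcile_with_fis(["a", "b", "d"], [frozenset({"a", "b"})], rng)
print(updated, zero_flag)
```

Working on a copy mirrors the remark that the items originally held by the user are retained.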

step 3.2: the updated frequent items held by the user are represented by a K×K matrix M_r defined as follows:

$$M_r(x,y) = \begin{cases} 1, & x \in Q \text{ and } y \in Q \\ 0, & \text{otherwise} \end{cases}$$

where M_r(x, y) denotes the element in row x and column y of M_r; x = 1, 2, …, K; y = 1, 2, …, K; and Q is the set of numbers of the frequent items held by the user after the update. Every position in the matrix carries a value; M_r is not a diagonal matrix like Σ. For example, suppose the user holds the frequent items a and d after the update, and the K frequent items, ranked by frequency, are a, b, c, d with numbers 1, 2, 3, 4. The numbers of the user's frequent items are then 1 and 4, so in M_r the elements at positions (1,1), (1,4), (4,1) and (4,4) are 1 and all others are 0.
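The worked example with items a and d can be reproduced with a short sketch. The function name build_user_matrix is hypothetical; item numbers are 1-based as in the text.

```python
def build_user_matrix(q, k):
    """K x K 0/1 matrix whose (x, y) entry is 1 when both x and y are in Q.

    `q` holds the 1-based numbers of the frequent items the user still
    holds after the comparison with the FIS.
    """
    return [[1 if (x in q and y in q) else 0 for y in range(1, k + 1)]
            for x in range(1, k + 1)]


M_r = build_user_matrix({1, 4}, 4)
for row in M_r:
    print(row)
```

The four corner entries are 1 and every other entry is 0, matching the positions (1,1), (1,4), (4,1) and (4,4) named in the example.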

The r-th user in the k-th group receives U and V^T and, by the properties of orthogonal matrices, obtains their Moore-Penrose generalized inverses U^+ = U^T and (V^T)^+ = V; the singular matrix of the r-th user is then calculated as:

$$S_r = U^{+} M_r \,(V^{T})^{+}$$
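As a worked special case, assume n = 1, so that U reduces to a single column u and V to a single column v, each of length K, and assume M_r has entry 1 exactly where both indices belong to the user's item number set Q. The symbols u, v, u_x and v_y are notation introduced for this note. The product then collapses to a single number:

```latex
S_r \;=\; u^{T} M_r\, v
    \;=\; \sum_{x=1}^{K} \sum_{y=1}^{K} u_x \, M_r(x,y) \, v_y
    \;=\; \Bigl(\sum_{x \in Q} u_x\Bigr) \Bigl(\sum_{y \in Q} v_y\Bigr)
```

Under these assumptions the user only needs to sum the entries of u and v at the positions of its own items and multiply the two sums.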

step 3.3: in this embodiment, U and V^T have orders K×n and n×K with n = 1, so S_r contains only one element. The user applies differential privacy perturbation to S_r as follows. First, the value range D of the singular values is obtained from the left and right singular matrices U and V^T and projected onto [-1, 1], and the user maps S_r to the corresponding value v ∈ [-1, 1]. A random response mechanism then uploads one perturbed value with probability p, or the other perturbed value with probability 1 - p, where ε is the privacy parameter. The user therefore only needs to send 1 bit of data to the server, which greatly reduces communication overhead. The probabilities p and 1 - p are defined as follows:

Step 4: let k = k + 1 and check whether k is greater than K. If so, stop and take the FIS as the final mining result, that is, the final top-K frequent item sets. Otherwise, the server aggregates and analyzes the received perturbed information, mines the most frequent item set and puts it into the FIS, updates the initial matrix and the FIS, and returns to step 3;

The server first aggregates the received perturbed information and recovers an estimated matrix, and then calculates from it a matrix with the same dimensions as M. The server selects the maximum element of this calculated matrix, maps its position to the same position in M, and adds the frequent item set corresponding to that element of M to the frequent item set group FIS. The elements of M corresponding to the frequent item sets in the FIS are set to 0, singular value decomposition is performed on the updated M to obtain new orthogonal matrices, and the FIS is sent together with the new orthogonal matrices to the k-th group of users.
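One server round might look like the rough sketch below, which rests on several assumptions not stated in the text: the reports are averaged; the average is mapped back linearly from [-1, 1] to a range [lo, hi]; the K×K matrix is rebuilt as u · ŝ · v^T from the rank-one vectors; only pairs of distinct items not yet in the FIS are considered; and both symmetric entries of M are cleared. All names are hypothetical.

```python
def server_round(reports, u, v, lo, hi, M, fis):
    """Estimate S, rebuild a K x K score matrix, pick and record one pair.

    reports: perturbed values from one user group; u, v: singular vectors
    that were sent out; M: current K x K matrix (modified in place);
    fis: list of (i, j) index pairs mined so far (appended to).
    """
    v_hat = sum(reports) / len(reports)            # mean of the reports
    s_hat = (v_hat + 1.0) * (hi - lo) / 2.0 + lo   # undo the [-1, 1] mapping
    k = len(u)
    best_pair, best_score = None, None
    for i in range(k):
        for j in range(i + 1, k):                  # pairs of distinct items
            if (i, j) in fis:
                continue
            score = u[i] * s_hat * v[j]            # entry of u * s_hat * v^T
            if best_score is None or score > best_score:
                best_pair, best_score = (i, j), score
    if best_pair is not None:
        i, j = best_pair
        fis.append(best_pair)
        M[i][j] = 0.0
        M[j][i] = 0.0
    return best_pair


# Illustrative inputs only.
u = [0.7, 0.6, 0.4]
v = [0.7, 0.6, 0.4]
M = [[0.5, 0.4, 0.2], [0.4, 0.4, 0.2], [0.2, 0.2, 0.2]]
fis = []
first = server_round([1.0, 1.0], u, v, 0.0, 2.0, M, fis)
second = server_round([1.0, 1.0], u, v, 0.0, 2.0, M, fis)
print(first, second, fis)
```

In the full procedure the vectors u and v would be recomputed from the updated M before the next round; the sketch reuses them only to keep the example short.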

The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims

1. A frequent item set mining method with local differential privacy protection based on singular value decomposition, characterized by comprising the following steps:

step 3: let k=1;

step 4: perform singular value decomposition on M to obtain two orthogonal matrices U and V^T; the server sends U and V^T to the k-th group of users;

step 5: the r-th user in the k-th group receives U and V^T, builds the singular matrix corresponding to the r-th user, perturbs the singular matrix to obtain perturbed information, and uploads the perturbed information to the server, where r = 1, 2, …, R and R is the total number of users in the k-th group;

2. The frequent item set mining method with local differential privacy protection based on singular value decomposition according to claim 1, wherein step 1 specifically comprises: dividing the smart home users into three groups. Each user in the first group randomly selects one of its items, perturbs it and uploads the perturbed information to the server; the server aggregates and analyzes the perturbed information from the first group to obtain the items ranked in the top 2K by frequency, which form the candidate item set. The server sends the candidate item set to the second group; each user in the second group computes the intersection of its own item set with the candidate item set, perturbs the number of items in the intersection and uploads the result; the server aggregates and recovers the perturbed information from the second group to obtain the sampling number, and sends the sampling number and the candidate item set to the third group. Each user in the third group randomly selects one item from the intersection of its own item set with the candidate item set, perturbs it and uploads the result; the server aggregates and recovers the perturbed information from the third group to obtain the frequency estimates of the items held by all users.

3. The frequent item set mining method with local differential privacy protection based on singular value decomposition according to claim 1, wherein in step 2 the value of the element in row i and column j of the matrix M, which corresponds to the frequent item set formed by the i-th and j-th frequent items, is calculated as:

$$m(i,j) = \min\bigl(f(i),\, f(j)\bigr)$$

4. The method for mining frequent item sets of local differential privacy protection based on singular value decomposition according to claim 1, wherein: the step 5 specifically comprises the following steps:

step 3.1: compare the frequent items held by the r-th user with the frequent item sets in the FIS one by one. If the set of frequent items held by the r-th user is identical to one of the frequent item sets, set all elements of the user's singular matrix S_r to 0, stop the comparison and go to step 3.3. If the intersection of the user's frequent items with a frequent item set contains 0 or 1 elements, the user's frequent items are left unchanged. If the intersection contains 2 elements, delete either one of the frequent items in the intersection from the user's frequent items, thereby updating them, and proceed to the next comparison. Once all frequent item sets in the FIS have been compared, go to step 3.2;

compute the singular matrix of the r-th user from U and V^T:

$$S_r = U^{+} M_r \,(V^{T})^{+}$$

where U^+ = U^T, T denotes matrix transposition, and (V^T)^+ = V;

step 3.3: map S_r to a value v in the range [-1, 1], and perturb v to obtain the perturbed information t_1 or t_2; the r-th user uploads t_1 to the server with probability p, or t_2 with probability 1 - p, where ε is the privacy parameter.

5. The method for mining frequent item sets for local differential privacy protection based on singular value decomposition according to claim 1, wherein: the step 6 specifically comprises the following steps: