WO2022079905A1

WO2022079905A1 - Data aggregation device, data aggregation system, data aggregation method, and program

Info

Publication number: WO2022079905A1
Application number: PCT/JP2020/039120
Authority: WO
Inventors: 聡長谷川; 尭之三浦
Original assignee: 日本電信電話株式会社
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2022-04-21
Also published as: JPWO2022079905A1

Abstract

Provided is a data aggregation device comprising: a computing unit that computes a frequency vector of original data by calculating a formula having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having the transition probability of each pair before and after the randomization, and a frequency vector of the original data to be estimated; and an output unit that outputs the frequency vector of the original data computed by the computing unit.

Description

Data aggregation devices, data aggregation systems, data aggregation methods, and programs

The present invention relates to a technique for concealing individual data from a database by a probabilistic method.

As a technique for concealing individual data in a database by a probabilistic method, there is, for example, the technique disclosed in Non-Patent Document 1. In the concealment process disclosed in Non-Patent Document 1 and elsewhere, each data item is kept unchanged with a certain probability and is otherwise rewritten at random. Depending on the randomization method, the set of input data and the set of output data may be the same or different.

If randomly rewritten data is used directly for data analysis, errors arise. Therefore, when randomized individual data is collected, aggregation processing is performed so as to remove the influence of the randomization as far as possible (for example, Non-Patent Document 2). The data aggregation process mainly concerns the frequency of appearance of data values, which is a basic form of analysis (Non-Patent Documents 1 and 2). Hereinafter, frequency aggregation is treated as the data aggregation process.

In the conventional technique, the data aggregation process in the method in which the input data set and the output data set are different does not satisfy the non-negative constraint and the total number constraint. Therefore, it is necessary to make corrections in order to satisfy the non-negative constraint and the total number constraint (Non-Patent Documents 1 and 2). Since an error occurs due to this correction, a method that satisfies the non-negative constraint and the total number constraint as much as possible is preferred.

The present invention has been made in view of the above points, and an object thereof is to provide a technique for efficiently performing data aggregation processing while simultaneously satisfying the non-negative constraint and the total number constraint in a data randomization method in which the input data set and the output data set are different.

According to the disclosed technique, there is provided a data aggregation device including: a calculation unit that calculates a frequency vector of original data by evaluating an expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair before and after randomization, and the frequency vector of the original data to be estimated; and an output unit that outputs the frequency vector of the original data calculated by the calculation unit.

According to the disclosed technique, a technique for efficiently performing data aggregation processing while simultaneously satisfying the non-negative constraint and the total number constraint is provided.

FIG. 1 is a block diagram of the data aggregation device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing flow of the data aggregation device.
FIG. 3 is a diagram showing an example of the hardware configuration of the device.
FIG. 4 is an example of the transition probability matrix in Example 1.
FIG. 5 is a diagram illustrating the reduction of the calculation amount in Example 2.
FIG. 6 is an example of a transition probability matrix having only non-zero columns.

Hereinafter, an embodiment of the present invention (the present embodiment) will be described with reference to the drawings. The embodiments described below are merely examples, and the embodiments to which the present invention is applied are not limited to the following embodiments.

(Device configuration example)
FIG. 1 shows a configuration diagram of the data aggregation device 100 according to the present embodiment. As shown in FIG. 1, the data aggregation device 100 has an input unit 110, a calculation unit 120, an output unit 130, and a data storage unit 140. FIG. 1 also shows a terminal 200 that performs data randomization processing. A system having the data aggregation device 100 and the terminal 200 may be referred to as a data aggregation system.

The data to be aggregated is input to the input unit 110 of the data aggregation device 100. The calculation unit 120 randomizes the input data and executes the aggregation process. The output unit 130 outputs the result of the aggregation process. The aggregation process in this embodiment is frequency estimation of the original data.

Data input from the input unit 110 can be stored in the data storage unit 140 (database). Thereby, for example, the calculation unit 120 can perform the aggregation process on the data to be aggregated acquired in the past, and the aggregation result can be output from the output unit 130.

For example, the data storage unit 140 stores N collected records (that is, N pieces of data), and the calculation unit 120 performs, on that data, aggregation processing that simultaneously satisfies the non-negative constraint and the total number constraint.

Further, the original data may be randomized in the terminal 200, and the randomized data may be transmitted from the terminal 200. Randomized data is input to the input unit 110, and the randomized data is stored in the data storage unit 140. The arithmetic unit 120 may perform aggregation processing on the randomized data. Randomized data may be input to the input unit 110 without storing prior data in the data storage unit 140, and the calculation unit 120 may execute the aggregation process on the input data.

Hereinafter, the operation executed by the data aggregation device 100 will be described more specifically.

(Operation example of data aggregation device 100)
N pieces of data are input to the input unit 110. Each piece of data takes a value in the set X = [D] = {0, ..., D-1} of D kinds of values. In the calculation unit 120, each data item x ∈ X is input to the stochastic mechanism M: X → Z, and an output z ∈ Z is obtained. That is, a set of randomized data with values in Z is obtained from the data with values in X.

As described above, by executing the randomization on the terminal 200 outside the data aggregation device 100, the randomized data may be input to the input unit 110.

The calculation unit 120 aggregates the set of randomized data $\{z_i\}_{i=0}^{N-1}$ while taking the influence of the mechanism M into account, and estimates the frequency vector $h_X = \{h_X(0), \ldots, h_X(D-1)\}$ in which the frequency of each data value x ∈ X is stored (where $h_X(0) \ge 0, \ldots, h_X(D-1) \ge 0$ and $\sum_{i=0}^{D-1} h_X(i) = N$). Here, $h_X(0) \ge 0, \ldots, h_X(D-1) \ge 0$ is the non-negative constraint, and $\sum_{i=0}^{D-1} h_X(i) = N$ is the total number constraint.

In the stochastic mechanism M: X → Z, there is a conditional probability Pr[z|x] for each input x ∈ X and output z ∈ Z, and the randomization process is performed according to this conditional probability. Let $P \in [0,1]^{|X| \times |Z|}$ be the |X| × |Z| matrix whose entries are Pr[z|x] for every x ∈ X and z ∈ Z. |X| is the number of kinds of values in X, and |Z| is the number of kinds of values in Z. P may be called a transition probability matrix.
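As an illustration of randomization according to Pr[z|x], the following is a minimal sketch in which row x of P is used as the sampling distribution for the output z. The function name `randomize`, the 2 × 3 matrix, and the fixed seed are assumptions made for this example only and do not appear in the original description.

```python
import random


def randomize(x, P, rng=random):
    """Draw an output index z with probability P[x][z] (row x of P)."""
    row = P[x]
    return rng.choices(range(len(row)), weights=row, k=1)[0]


# Hypothetical 2 x 3 transition probability matrix: |X| = 2, |Z| = 3.
P = [
    [0.7, 0.2, 0.1],  # Pr[z | x = 0]
    [0.1, 0.2, 0.7],  # Pr[z | x = 1]
]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
z_samples = [randomize(0, P, rng) for _ in range(5)]
```

Each row of P sums to 1 because it is a probability distribution over the outputs for one input value.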

Further, the calculation unit 120 uses the frequency vector of the randomized data, $h_Z = \{h_Z(0), \ldots, h_Z(D-1)\}$. Here, $h_Z(0) \ge 0, \ldots, h_Z(D-1) \ge 0$ and $\sum_{i=0}^{D-1} h_Z(i) = N$. It should be noted that $\sum_{i=0}^{D-1} h_Z(i) = N$ is an example. Further, $h_Z$ may be calculated by a device external to the data aggregation device 100 and input to the data aggregation device 100.

The calculation unit 120 estimates the frequency $h_X(x)$ by repeatedly calculating the following equation (1).

$$h_X(x)^{t+1} = h_X(x)^{t} \sum_{z \in Z} \frac{\Pr[z \mid x]\, h_Z(z)}{\sum_{x' \in X} \Pr[z \mid x']\, h_X(x')^{t}} \qquad (1)$$

Here, $h_X(x)^t$ denotes the value of $h_X(x)$ at the t-th iteration. The frequency of each value can be calculated by equation (1). Further, the calculation unit 120 can calculate the whole frequency vector $h_X$ at once by repeatedly calculating the following equation (2).

$$h^{t+1}_X = h^t_X \cdot \left(P \left(h_Z / h^t_X P\right)^T\right)^T \qquad (2)$$

In equation (2), $h^t_X$ is a frequency vector and is a D-dimensional row vector. P is the |X| × |Z| matrix (D × |Z| matrix) whose entries are the above-mentioned Pr[z|x]. $h_Z$, the frequency vector built from the randomized data $\{z_i\}_{i=0}^{N-1}$, is a |Z|-dimensional row vector.

$h^t_X P$ is the product of a D-dimensional row vector and a D × |Z| matrix, and is a |Z|-dimensional row vector.

$(h_Z / h^t_X P)$ is the |Z|-dimensional row vector obtained by dividing each element of $h_Z$ by the corresponding element of $h^t_X P$. $(P (h_Z / h^t_X P)^T)^T$ is a D-dimensional row vector. $h^t_X \cdot (P (h_Z / h^t_X P)^T)^T$ is the D-dimensional row vector consisting of the element-wise product of two D-dimensional row vectors.

The calculation unit 120 sets the initial value to $h^0_X = \{N/|X|, \ldots, N/|X|\}$ and, for a predetermined $\eta > 0$, obtains $h_X$ by calculating the iterative equation (2) until $\|h^{t+1}_X - h^t_X\| < \eta$. That is, equation (2) is calculated until the magnitude of the difference between $h^{t+1}_X$ and $h^t_X$ becomes smaller than the threshold value η.
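The iteration described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `estimate_frequencies`, the safety cap `max_iter`, and the use of the Euclidean norm for the stopping test are assumptions, since the description only requires that the magnitude of the difference fall below η. The sketch also assumes that every output value with non-zero frequency has a non-zero column in P.

```python
def estimate_frequencies(P, h_Z, eta=1e-9, max_iter=100_000):
    """Iterate h <- h * (P (h_Z / (h P))^T)^T until the change is below eta.

    P   : D x |Z| transition probability matrix as a list of rows.
    h_Z : frequency vector of the randomized data, length |Z|.
    """
    D = len(P)
    n_out = len(h_Z)
    N = sum(h_Z)
    h = [N / D] * D  # initial value h^0_X = {N/|X|, ..., N/|X|}
    for _ in range(max_iter):  # max_iter is an assumed safety cap
        # h^t_X P : |Z|-dimensional row vector
        hP = [sum(h[x] * P[x][z] for x in range(D)) for z in range(n_out)]
        # Element-wise division h_Z / (h^t_X P); entries with h_Z = 0 stay 0.
        v = [h_Z[z] / hP[z] if h_Z[z] > 0 else 0.0 for z in range(n_out)]
        # Element-wise product of h^t_X with (P v^T)^T
        h_next = [h[x] * sum(P[x][z] * v[z] for z in range(n_out))
                  for x in range(D)]
        # Euclidean norm of the difference (assumed choice of norm)
        diff = sum((a - b) ** 2 for a, b in zip(h_next, h)) ** 0.5
        h = h_next
        if diff < eta:
            break
    return h


# Hypothetical example: true frequencies (70, 30) give h_Z = (70, 30) P = (65, 35).
P_example = [[0.8, 0.2], [0.3, 0.7]]
h_est = estimate_frequencies(P_example, [65, 35])
```

In this hypothetical example the vector (70, 30) is a fixed point of the update, because (70, 30) P equals h_Z exactly, so the estimate approaches it while its entries stay non-negative and sum to N = 100.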

Here, the calculation result of equation (2) does not change even if P is multiplied by a constant α. Therefore, equation (2) may be written as the following equation (3).

$$h^{t+1}_X = h^t_X \cdot \left(\alpha P \left(h_Z / h^t_X \alpha P\right)^T\right)^T \qquad (3)$$

If α = 1, equation (2) is obtained. Further, as will be described later, by choosing an appropriate value of α in equation (3), it is possible to avoid values in P so small that they cannot be handled even by the double precision floating point type.
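The invariance to α can be checked directly. The following short derivation is a sketch that assumes α > 0 and writes the α-scaled update in the form used in Section 2 of the summary; the factor α introduced in the outer matrix cancels against the factor 1/α produced by the element-wise division.

```latex
h^t_X \cdot \left(\alpha P \left(\frac{h_Z}{h^t_X\,\alpha P}\right)^{T}\right)^{T}
  = h^t_X \cdot \left(\alpha P \cdot \frac{1}{\alpha}\left(\frac{h_Z}{h^t_X P}\right)^{T}\right)^{T}
  = h^t_X \cdot \left(P \left(\frac{h_Z}{h^t_X P}\right)^{T}\right)^{T}
```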

<About the basis of the formula>
Equation (1) is derived from Bayes' theorem as follows. By Bayes' theorem, Pr[x|z] is given by

$$\Pr[x \mid z] = \frac{\Pr[z \mid x]\,\Pr[x]}{\sum_{x' \in X} \Pr[z \mid x']\,\Pr[x']} \qquad (4)$$

Further, from

$$\Pr[x] = \sum_{z \in Z} \Pr[x \mid z]\,\Pr[z] \qquad (5)$$

together with $h_X(x) = \Pr[x] \times N$ (and likewise $h_Z(z) = \Pr[z] \times N$), equation (1) is derived. Expressing equation (1) as a vector and matrix calculation gives equation (2), and introducing α gives equation (3).

<Processing flow>
An example of the processing flow of the data aggregation device 100 when equation (3) is used will be described with reference to the flowchart of FIG. 2. Here, an example is shown in which $h_Z$ is input to the data aggregation device 100.

In S101, P, $h_Z$, α > 0, and η > 0 are input to the input unit 110.

In S102, the calculation unit 120 starts the calculation of equation (3) with $h^0_X = \{N/|X|, \ldots, N/|X|\}$.

In S103, the calculation unit 120 determines whether $\|h^{t+1}_X - h^t_X\| < \eta$. If the determination is No, the process returns to S102 and equation (3) is calculated again. If the determination is Yes, the $h^{t+1}_X$ (or $h^t_X$) at that time is output from the output unit 130 as the frequency vector $h_X$ of the calculation result (S104).

(Hardware configuration example)
The data aggregation device 100 and the terminal 200 (collectively referred to as "devices") in the present embodiment can be realized by, for example, causing a computer to execute a program describing the processing contents described in the present embodiment. The "computer" may be a physical machine or a virtual machine on the cloud. When using a virtual machine, the "hardware" described here is virtual hardware.

The above program can be recorded on a computer-readable recording medium (portable memory, etc.), saved, and distributed. It is also possible to provide the above program through a network such as the Internet or e-mail.

FIG. 3 is a diagram showing an example of the hardware configuration of the above computer. The computer of FIG. 3 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are connected to each other by a bus BS, respectively.

The program that realizes the processing on the computer is provided by, for example, a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed in the auxiliary storage device 1002 from the recording medium 1001 via the drive device 1000. However, the program does not necessarily have to be installed from the recording medium 1001, and may be downloaded from another computer via the network. The auxiliary storage device 1002 stores the installed program and also stores necessary files, data, and the like.

The memory device 1003 reads and stores the program from the auxiliary storage device 1002 when there is an instruction to start the program. The CPU 1004 realizes the function related to the device according to the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network. The display device 1006 displays a GUI (Graphical User Interface) or the like by a program. The input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, and the like, and is used for inputting various operation instructions. The output device 1008 outputs the calculation result.

Hereinafter, Examples 1 to 3 are described as more specific examples of the method of calculating the frequency vector $h_X$ by repeatedly calculating equation (2) or equation (3) as described so far. Example 1 is an example in which data is randomized by the RAPPOR method described in Non-Patent Document 1. Example 2 shows a method of calculating efficiently when |Z| > N. Because numerical calculation becomes problematic when |Z| is large, Example 3 presents a method of solving this by using α. The details of each example are described below.

(Example 1)
In Example 1, a case where data is randomized by the RAPPOR method described in Non-Patent Document 1 will be described.

RAPPOR (also called Basic One-time RAPPOR or Unary Encoding) is a method in which data x ∈ X is converted into a D-dimensional binary vector, and perturbation (randomization) and aggregation are performed for each value of the vector. In Example 1, it is assumed that the calculation unit 120 of the data aggregation device 100 performs the D-dimensional binary vectorization and the randomization.

The calculation unit 120 inputs the input x to Encode: X → {0,1}^D in order to convert it into a D-dimensional binary vector. That is, each x is converted into a D-dimensional binary vector. The output after conversion is a D-dimensional binary vector in one-hot notation. That is, Encode(x) = (0, ..., 0, 1, 0, ..., 0) = b. In the vector b, only the i-th position is 1, and the other positions are 0.
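The one-hot conversion can be sketched as follows. The function name `encode` and the zero-based indexing (value x sets position x, matching X = {0, ..., D-1}) are assumptions made for this illustration.

```python
def encode(x, D):
    """Return the D-dimensional one-hot binary vector for x in {0, ..., D-1}."""
    if not 0 <= x < D:
        raise ValueError("x must be in the range 0 .. D-1")
    return tuple(1 if i == x else 0 for i in range(D))


b = encode(1, 3)  # one-hot vector for x = 1 with D = 3
```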

The calculation unit 120 applies the RAPPOR randomization process Perturb to b and obtains the randomized vector b' as shown below.

Perturb(b) = b'

In the randomization process Perturb, each value of the vector b is randomized according to the probability shown in the following equation. When the i-th value of the vector b is b[i],

$$\Pr[b'[i] = 1] = \begin{cases} p & \text{if } b[i] = 1 \\ q & \text{if } b[i] = 0 \end{cases} \qquad (6)$$

Equation (6) means that if the i-th value of the vector b is 1, the value is set to 1 with probability p and to 0 with probability 1-p, and if the i-th value of the vector b is 0, the value is set to 1 with probability q and to 0 with probability 1-q.
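The per-position randomization described here can be sketched as follows. The function name `perturb` and the `rng` parameter are assumptions for this illustration; each position is set to 1 with probability p when the input bit is 1 and with probability q when the input bit is 0.

```python
import random


def perturb(b, p, q, rng=random):
    """Randomize each bit of b independently.

    A bit equal to 1 becomes 1 with probability p (0 with probability 1-p);
    a bit equal to 0 becomes 1 with probability q (0 with probability 1-q).
    """
    return tuple(
        1 if rng.random() < (p if bit == 1 else q) else 0
        for bit in b
    )


rng = random.Random(0)  # fixed seed so that the sketch is repeatable
b_prime = perturb((0, 1, 0), p=0.75, q=0.25, rng=rng)
```

Note that the output is not necessarily one-hot, since every position is randomized independently.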

In order to construct the transition probability matrix P in the equation (2) and the equation (3), it is sufficient to know Pr [z | x] which is each element of the matrix. In the first embodiment, since the input and output of the Perturb are binary vectors in RAPPOR, the transition probability of the binary vector may be obtained.

The transition probability of a binary vector is the product of the transition probabilities of its individual values. Further, since Perturb has D kinds of input values and 2^D kinds of output values, the transition probability matrix P is a D × 2^D matrix. Although the input is a one-hot binary vector, this differs from Randomized Response because the output is not always one-hot.

For example, when D = 3, there are 3 kinds of inputs to Perturb and 8 kinds of outputs. Further, for example, when the input is b = (0,1,0) and the output is Perturb(b) = (1,1,0), the transition probability is p(1-q)q from equation (6). The transition probability matrix $P \in [0,1]^{3 \times 8}$ for D = 3 is shown in FIG. 4. In FIG. 4, the rows correspond to the inputs to Perturb and the columns correspond to the outputs, and each value of the matrix is the product of the transition probabilities defined in equation (6). For example, the transition probability from b = (0,0,1) to b' = (0,0,0) is (1-p)(1-q)².
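The construction of this matrix can be sketched as follows. The function names and the lexicographic column order (0,0,0), (0,0,1), ..., (1,1,1) are assumptions for this illustration; the column order actually used in FIG. 4 is not stated in the text.

```python
from itertools import product


def encode(x, D):
    """One-hot binary vector for x in {0, ..., D-1} (zero-based, assumed)."""
    return tuple(1 if i == x else 0 for i in range(D))


def bit_probability(bit_in, bit_out, p, q):
    """Probability that one input bit becomes bit_out under Perturb."""
    prob_one = p if bit_in == 1 else q
    return prob_one if bit_out == 1 else 1.0 - prob_one


def transition_matrix(D, p, q):
    """Return (outputs, P) where P[x][j] = Pr[outputs[j] | Encode(x)]."""
    outputs = list(product((0, 1), repeat=D))  # 2^D binary vectors
    P = []
    for x in range(D):
        b = encode(x, D)
        row = []
        for b_out in outputs:
            prob = 1.0
            # Product of the per-position transition probabilities
            for bit_in, bit_out in zip(b, b_out):
                prob *= bit_probability(bit_in, bit_out, p, q)
            row.append(prob)
        P.append(row)
    return outputs, P


p, q = 0.75, 0.25  # hypothetical parameter values
outputs, P = transition_matrix(3, p, q)
# Entry for input b = (0,1,0) and output b' = (1,1,0)
value = P[1][outputs.index((1, 1, 0))]
```

With these hypothetical parameters, `value` corresponds to p(1-q)q, and each row of P sums to 1.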

As for $h_Z$ in equation (2) or equation (3), the calculation unit 120 collects the perturbed vectors Perturb(Encode($x_i$)) = $b'_i$ into $\{b'_i\}_{i=1}^{N}$, and $h_Z$ can be constructed by finding the frequency $h_Z(b')$ of each vector b' ∈ Z. In addition, α (when equation (3) is used) and η may be determined as appropriate. As for α in equation (3), it is preferable to use the α described in Example 3.
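Counting the perturbed vectors can be sketched as follows. The function name `frequency_vector`, the lexicographic ordering of the 2^D possible outputs, and the sample data are assumptions for this illustration. Enumerating all 2^D outputs is only practical for small D.

```python
from collections import Counter
from itertools import product


def frequency_vector(perturbed, D):
    """Return h_Z as a list indexed by the 2^D binary vectors in lexicographic order."""
    counts = Counter(perturbed)
    outputs = list(product((0, 1), repeat=D))
    return [counts.get(b_out, 0) for b_out in outputs]


# Hypothetical perturbed vectors b'_1, ..., b'_4 for D = 3
perturbed = [(0, 0, 1), (0, 0, 1), (1, 0, 0), (0, 0, 0)]
h_Z = frequency_vector(perturbed, 3)
```

The entries of `h_Z` sum to the number of collected vectors, and most entries are 0 when 2^D exceeds that number.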

(Example 2)
Next, Example 2 will be described. In the straightforward method of calculating equation (2) or equation (3), both the space complexity and the time complexity are O(|X||Z|) = O(D|Z|). However, when the condition |Z| > N is satisfied, both can be reduced to O(DN) by exploiting the fact that intermediate results of the calculation become 0.

Example 2 focuses on the sparseness of $h_Z$ when |Z| > N. The size of $h_Z$ is |Z|, but since |Z| > N, some entries of $h_Z$ are always 0. That is, the randomized data is $\{z_i\}_{i=1}^{N}$ and the number of its elements is N. Therefore, when the number |Z| of kinds of values that an element can take is larger than N, some of the possible values never appear. In other words, there is always a value whose frequency is 0.

Here, let part of the calculation of equation (2) be $v = h_Z / h^t_X P$. Then equation (2) becomes
$h^{t+1}_X = h^t_X \cdot (P v^T)^T$.
In the calculation of v, wherever an entry of $h_Z$ is 0, the corresponding entry of v is always 0. Therefore, all columns of P that correspond to zero entries of $h_Z$ may be treated as 0. Likewise, in $h^t_X \cdot (P v^T)^T$, all columns of P that correspond to zero entries of v may be treated as 0.

From these, P may be replaced by a matrix P' having only the columns at the non-zero positions of $h_Z$, and $h_Z$ may be replaced by a vector $h'_Z$ having only the non-zero entries. When P' and $h'_Z$ are used to exploit the sparseness as described above, equation (2) becomes as follows.

$$h^{t+1}_X = h^t_X \cdot \left(P' \left(h'_Z / h^t_X P'\right)^T\right)^T$$

The same applies to equation (3): when sparseness is used, P and $h_Z$ in equation (3) may be replaced with P' and $h'_Z$.

FIG. 5 shows an image of the reduction in computational complexity using sparseness. In FIG. 5, the black-painted portions of the vector and the matrix are assumed to be 0. As shown in FIG. 5, zeros occur at the same positions of P and $h_Z$, and the amount of calculation can be reduced by using P' and $h'_Z$, from which the zero portions are deleted.

When the 3 × 2³ transition probability matrix P (see FIG. 4) is constructed using RAPPOR (Unary Encoding) described in Example 1, and $h_Z((0,1,0)) = 0$, $h_Z((0,1,1)) = 0$, $h_Z((1,0,1)) = 0$, $h_Z((1,1,0)) = 0$, and $h_Z((1,1,1)) = 0$, these columns are deleted, so P' becomes the matrix shown in FIG. 6.
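The column deletion can be sketched as follows. The function names, the 2 × 4 matrix, and the frequency values are hypothetical and chosen only so that |Z| = 4 exceeds N = 3; the sketch compares one update step computed with the full P and h_Z against the same step computed with the reduced P' and h'_Z.

```python
def drop_zero_columns(P, h_Z):
    """Keep only the positions where h_Z is non-zero, in both P and h_Z."""
    keep = [z for z, count in enumerate(h_Z) if count != 0]
    P_reduced = [[row[z] for z in keep] for row in P]
    h_reduced = [h_Z[z] for z in keep]
    return P_reduced, h_reduced


def update(h, P, h_Z):
    """One step of h <- h * (P (h_Z / (h P))^T)^T, skipping zero frequencies."""
    D, n_out = len(P), len(h_Z)
    hP = [sum(h[x] * P[x][z] for x in range(D)) for z in range(n_out)]
    v = [h_Z[z] / hP[z] if h_Z[z] else 0.0 for z in range(n_out)]
    return [h[x] * sum(P[x][z] * v[z] for z in range(n_out)) for x in range(D)]


# Hypothetical data: |X| = 2, |Z| = 4, N = 3
P = [[0.4, 0.3, 0.2, 0.1],
     [0.1, 0.2, 0.3, 0.4]]
h_Z = [2, 0, 1, 0]
h0 = [1.5, 1.5]  # N / |X| for each entry

P_reduced, h_reduced = drop_zero_columns(P, h_Z)
full_step = update(h0, P, h_Z)
reduced_step = update(h0, P_reduced, h_reduced)
```

Both steps give the same vector, because the deleted columns only ever contribute terms that are multiplied by 0.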

(Example 3)
This is not limited to the application of RAPPOR as in Example 1: each value of the transition probability matrix becomes smaller as D becomes larger. Therefore, numerical precision becomes a problem when the method of this embodiment is implemented on a computer used as the data aggregation device 100. For example, suppose that b is perturbed by RAPPOR with p + q = 1. Then the transition probability that the binary vectors before and after the perturbation do not match in any position is

$$(1-p)\,q^{D-1} = q^{D}$$

For example, in the case of ε = 1 and D = 764, or ε = 4 and D = 350, this transition probability is the very small value 5E-324, and a value smaller than that cannot be handled even by a double precision floating point variable.
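The underflow can be reproduced with a short sketch. The text does not state how p and q depend on ε; the sketch assumes the common RAPPOR parameterization p = e^{ε/2} / (e^{ε/2} + 1) and q = 1 / (e^{ε/2} + 1), which satisfies p + q = 1. The function name `all_mismatch_probability` is also an assumption. The product of D factors is formed by repeated multiplication, as it would be when a matrix entry is built position by position.

```python
import math


def all_mismatch_probability(eps, D):
    """Probability that no position matches after perturbation, with p + q = 1.

    ASSUMPTION: q = 1 / (exp(eps / 2) + 1), so that (1 - p) * q**(D - 1) = q**D.
    """
    q = 1.0 / (math.exp(eps / 2.0) + 1.0)
    prob = 1.0
    for _ in range(D):
        prob *= q  # one factor per position of the binary vector
    return prob


near_limit_eps1 = all_mismatch_probability(1.0, 764)  # tiny but still positive
near_limit_eps4 = all_mismatch_probability(4.0, 350)  # tiny but still positive
underflow_eps1 = all_mismatch_probability(1.0, 800)   # too small for a double
underflow_eps4 = all_mismatch_probability(4.0, 400)   # too small for a double
```

Under this assumed parameterization, the first two values remain positive at the very bottom of the double precision range, while the larger values of D underflow to exactly 0.0.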

In Example 3, by adjusting α, the values can be handled by the double precision floating point type even if D, the number of kinds of values in X, is large. Specifically, it is as follows.

In the transition probability matrices P and P' of RAPPOR shown in FIGS. 4 and 6, each transition probability is a product of factors drawn from the four values p, 1-p, q, and 1-q. In P' of FIG. 6, it can be seen that 1-q appears as a common factor in every value.

Therefore, by setting α = 1/(1-q) in equation (3) with P replaced by P', the redundant multiplication is removed in the calculation of "αP'", while the differences between the values can still be expressed. For example, (1-p)(1-q)² in P' becomes (1-p)(1-q) in "αP'", which removes a redundant multiplication by a value smaller than 1.

That is, in Example 3, the reciprocal of the product of the combination of the probabilities p, 1-p, q, and 1-q that appears in common in every value of P (or P') is used as the constant α in equation (3). In other words, each transition probability that is a value of P or P' is a product of a plurality of probability values used in the randomization process, and when the same value is used in common as a factor of that product in every value of P or P', a value based on that same value is used as α. If no probability appears in common in every value, α = 1 may be used, for example.
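The effect of this choice of α can be checked numerically. The following sketch uses a hypothetical 3 × 3 reduced matrix for D = 3 whose columns are the outputs (0,0,0), (0,0,1), and (1,0,0), so that every entry contains the factor 1-q; the column selection, the frequencies, and the function name `update` are assumptions and are not taken from FIG. 6.

```python
def update(h, P, h_Z, alpha=1.0):
    """One step of h <- h * (aP (h_Z / (h aP))^T)^T with aP = alpha * P."""
    D, n_out = len(P), len(h_Z)
    aP = [[alpha * P[x][z] for z in range(n_out)] for x in range(D)]
    hP = [sum(h[x] * aP[x][z] for x in range(D)) for z in range(n_out)]
    v = [h_Z[z] / hP[z] for z in range(n_out)]
    return [h[x] * sum(aP[x][z] * v[z] for z in range(n_out)) for x in range(D)]


p, q = 0.75, 0.25  # hypothetical parameter values
# Rows: inputs (1,0,0), (0,1,0), (0,0,1); columns: outputs (0,0,0), (0,0,1), (1,0,0)
P_reduced = [
    [(1 - p) * (1 - q) ** 2, (1 - p) * (1 - q) * q, p * (1 - q) ** 2],
    [(1 - p) * (1 - q) ** 2, (1 - p) * (1 - q) * q, q * (1 - p) * (1 - q)],
    [(1 - p) * (1 - q) ** 2, p * (1 - q) ** 2, q * (1 - p) * (1 - q)],
]
h_Z_reduced = [2, 1, 1]     # hypothetical non-zero frequencies, N = 4
h0 = [4 / 3, 4 / 3, 4 / 3]  # N / |X| for each entry

alpha = 1.0 / (1.0 - q)     # reciprocal of the common factor 1 - q
plain = update(h0, P_reduced, h_Z_reduced)
scaled = update(h0, P_reduced, h_Z_reduced, alpha=alpha)
smallest_plain = min(min(row) for row in P_reduced)
smallest_scaled = alpha * smallest_plain
```

The two results agree up to rounding, while the smallest entry of the scaled matrix is larger than that of the unscaled one, which is the property that delays underflow as D grows.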

(Effect of embodiment)
According to the technique described in this embodiment, it is possible to estimate the frequency of the original data while simultaneously satisfying the non-negative constraint and the total number constraint based on the data randomized from the original data.
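One way to see why both constraints hold is to write the update element by element, that is, $h_X(x)^{t+1} = h_X(x)^t \sum_z \Pr[z|x]\, h_Z(z) / \sum_{x'} \Pr[z|x']\, h_X(x')^t$, which is the per-element form of the vector update with α = 1. The following is a sketch under the assumption that the denominator is non-zero for every z with $h_Z(z) > 0$. Non-negativity follows because a non-negative $h_X(x)^t$ is multiplied by a sum of non-negative terms, starting from the positive initial value N/|X|. The total is preserved as shown below.

```latex
\sum_{x \in X} h_X(x)^{t+1}
  = \sum_{x \in X} h_X(x)^{t} \sum_{z \in Z}
      \frac{\Pr[z \mid x]\, h_Z(z)}{\sum_{x' \in X} \Pr[z \mid x']\, h_X(x')^{t}}
  = \sum_{z \in Z} h_Z(z)\,
      \frac{\sum_{x \in X} \Pr[z \mid x]\, h_X(x)^{t}}{\sum_{x' \in X} \Pr[z \mid x']\, h_X(x')^{t}}
  = \sum_{z \in Z} h_Z(z) = N
```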

Further, as described in Example 2, when |Z| > N, the amount of calculation can be reduced by using the sparseness of the frequency of the randomized data. Further, as described in Example 3, when a factor appears in common in every value of P (any one of p, 1-p, q, and 1-q, or a product of a combination of them), it can be canceled by α. As a result, the calculation becomes more efficient, and the situation in which values in P cannot be handled by the double precision floating point type can be avoided.

(Summary of embodiments)
This specification describes at least the data aggregation device, the data aggregation system, the data aggregation method, and the program described in each of the following sections.
(Section 1)
A data aggregation device including: a calculation unit that calculates a frequency vector of original data by evaluating an expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair before and after randomization, and the frequency vector of the original data to be estimated; and
an output unit that outputs the frequency vector of the original data calculated by the calculation unit.
(Section 2)
The data aggregation device according to Section 1, wherein, when X is the set of the original data x, Z is the set of the randomized data z, P is the transition probability matrix, $h_Z$ is the frequency vector of the randomized data, $h_X$ is the frequency vector of the original data, $h^t_X$ is $h_X$ at the t-th iteration, and α is a constant,
the expression is a first equation expressed by
$h^{t+1}_X = h^t_X \cdot (\alpha P (h_Z / h^t_X \alpha P)^T)^T$, and
the calculation unit calculates the first equation until the difference between $h^{t+1}_X$ and $h^t_X$ becomes smaller than a threshold value.
(Section 3)
The data aggregation device according to Section 2, wherein, when the number of kinds of values that can be taken in the set Z is larger than the number of elements in the set Z, $h'_Z$ is the frequency vector consisting only of the non-zero elements of the frequency vector $h_Z$ of the randomized data, and P' is the matrix consisting only of the columns of the transition probability matrix P that correspond to the elements present in $h'_Z$,
the calculation unit calculates, instead of the first equation, a second equation expressed by $h^{t+1}_X = h^t_X \cdot (\alpha P' (h'_Z / h^t_X \alpha P')^T)^T$.
(Section 4)
The data aggregation device according to Section 2 or 3, wherein each transition probability that is a value of P or P' is a product of a plurality of probability values used in the randomization process, and when the same value is used in common as a factor of the product in every value of P or P', a value based on that same value is used as α.
(Section 5)
A data aggregation system including the data aggregation device according to any one of Sections 1 to 4, and a terminal that randomizes the original data and transmits the randomized data to the data aggregation device.
(Section 6)
A data aggregation method including: a calculation step of calculating a frequency vector of original data by evaluating an expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair before and after randomization, and the frequency vector of the original data to be estimated; and
an output step of outputting the frequency vector of the original data calculated in the calculation step.
(Section 7)
A program for causing a computer to function as each unit of the data aggregation device according to any one of Sections 1 to 4.

Although the present embodiment has been described above, the present invention is not limited to such a specific embodiment, and various modifications and changes can be made within the scope of the gist of the present invention described in the claims.

100 Data aggregation device
110 Input unit
120 Calculation unit
130 Output unit
140 Storage unit
200 Terminal
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims

A data aggregation device including: a calculation unit that calculates a frequency vector of original data by evaluating an expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair before and after randomization, and the frequency vector of the original data to be estimated; and
an output unit that outputs the frequency vector of the original data calculated by the calculation unit.
The data aggregation device according to claim 1, wherein, when X is the set of the original data x, Z is the set of the randomized data z, P is the transition probability matrix, $h_Z$ is the frequency vector of the randomized data, $h_X$ is the frequency vector of the original data, $h^t_X$ is $h_X$ at the t-th iteration, and α is a constant,
the expression is a first equation expressed by
$h^{t+1}_X = h^t_X \cdot (\alpha P (h_Z / h^t_X \alpha P)^T)^T$, and
the calculation unit calculates the first equation until the difference between $h^{t+1}_X$ and $h^t_X$ becomes smaller than a threshold value.
The data aggregation device according to claim 2, wherein, when the number of kinds of values that can be taken in the set Z is larger than the number of elements in the set Z, $h'_Z$ is the frequency vector consisting only of the non-zero elements of the frequency vector $h_Z$ of the randomized data, and P' is the matrix consisting only of the columns of the transition probability matrix P that correspond to the elements present in $h'_Z$,
the calculation unit calculates, instead of the first equation, a second equation expressed by $h^{t+1}_X = h^t_X \cdot (\alpha P' (h'_Z / h^t_X \alpha P')^T)^T$.
The data aggregation device according to claim 2 or 3, wherein each transition probability that is a value of P or P' is a product of a plurality of probability values used in the randomization process, and when the same value is used in common as a factor of the product in every value of P or P', a value based on that same value is used as α.
A data aggregation system including the data aggregation device according to any one of claims 1 to 4 and a terminal that randomizes the original data and transmits the randomized data to the data aggregation device.
A data aggregation method including: a calculation step of calculating a frequency vector of original data by evaluating an expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair before and after randomization, and the frequency vector of the original data to be estimated; and
an output step of outputting the frequency vector of the original data calculated in the calculation step.
A program for causing a computer to function as each unit of the data aggregation device according to any one of claims 1 to 4.