WO2022079905A1 - Data aggregation device, data aggregation system, data aggregation method, and program - Google Patents

Data aggregation device, data aggregation system, data aggregation method, and program

Info

Publication number
WO2022079905A1
WO2022079905A1 PCT/JP2020/039120 JP2020039120W WO2022079905A1 WO 2022079905 A1 WO2022079905 A1 WO 2022079905A1 JP 2020039120 W JP2020039120 W JP 2020039120W WO 2022079905 A1 WO2022079905 A1 WO 2022079905A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
frequency vector
data aggregation
value
original data
Application number
PCT/JP2020/039120
Other languages
French (fr)
Japanese (ja)
Inventor
聡 長谷川
尭之 三浦
Original Assignee
日本電信電話株式会社
Application filed by 日本電信電話株式会社
Priority to JP2022556814A (JPWO2022079905A1)
Priority to PCT/JP2020/039120 (WO2022079905A1)
Publication of WO2022079905A1

Classifications

    • G: PHYSICS
    • G09: EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09C: CIPHERING OR DECIPHERING APPARATUS FOR CRYPTOGRAPHIC OR OTHER PURPOSES INVOLVING THE NEED FOR SECRECY
    • G09C1/00: Apparatus or methods whereby a given sequence of signs, e.g. an intelligible text, is transformed into an unintelligible sequence of signs by transposing the signs or groups of signs or by replacing them by others according to a predetermined system

Definitions

  • the present invention relates to a technique for concealing individual data from a database by a probabilistic method.
  • as a technique for concealing individual data in a database by a probabilistic method, there is, for example, the technique disclosed in Non-Patent Document 1 and the like. In that concealment process, each data item is maintained with a certain probability and is otherwise rewritten randomly. Depending on the randomization method, the input data set and the output data set may be the same or different.
  • data aggregation mainly deals with the appearance frequency of data, which is a basic analysis (Non-Patent Documents 1 and 2).
  • hereafter, frequency counting is treated as the data aggregation processing.
  • a correction is needed to satisfy the non-negative constraint and the total number constraint (Non-Patent Documents 1 and 2). Since this correction introduces error, a method that satisfies the non-negative constraint and the total number constraint as far as possible is preferred.
  • the present invention has been made in view of the above points, and its purpose is to provide a technique for efficiently performing data aggregation processing while simultaneously satisfying a non-negative constraint and a total number constraint in a data randomization method in which the input data set and the output data set are different.
  • according to the disclosed technique, there is provided a data aggregation device including: a calculation unit that calculates the frequency vector of the original data by evaluating an expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair of values before and after randomization, and the frequency vector of the original data to be estimated; and an output unit that outputs the frequency vector of the original data calculated by the calculation unit.
  • a technique for efficiently performing data aggregation processing while simultaneously satisfying the non-negative constraint and the total number constraint is provided.
  • FIG. 1 shows a configuration diagram of the data aggregation device 100 according to the present embodiment.
  • the data aggregation device 100 in the present embodiment has an input unit 110, a calculation unit 120, an output unit 130, and a storage unit 140. Further, FIG. 1 shows a case where there is a terminal 200 that performs data randomization processing.
  • a system having the data aggregation device 100 and the terminal 200 may be referred to as a data aggregation system.
  • the data to be aggregated is input to the input unit 110 of the data aggregation device 100.
  • the arithmetic unit 120 randomizes the input data and executes the aggregation process.
  • the output unit 130 outputs the result of the aggregation process.
  • the aggregation process in this embodiment is frequency estimation of the original data.
  • Data input from the input unit 110 can be stored in the data storage unit 140 (database).
  • the calculation unit 120 can perform the aggregation process on the data to be aggregated acquired in the past, and the aggregation result can be output from the output unit 130.
  • the data storage unit 140 stores data in which the number of collected records is N (that is, N data), and the arithmetic unit 120 simultaneously satisfies the non-negative constraint and the total number constraint for the data. Perform aggregation processing.
  • the original data may be randomized in the terminal 200, and the randomized data may be transmitted from the terminal 200.
  • Randomized data is input to the input unit 110, and the randomized data is stored in the data storage unit 140.
  • the arithmetic unit 120 may perform aggregation processing on the randomized data. Randomized data may be input to the input unit 110 without storing prior data in the data storage unit 140, and the calculation unit 120 may execute the aggregation process on the input data.
  • N pieces of data are input to the input unit 110.
  • the data x ∈ X included in X is input to the stochastic mechanism M: X → Z, and the output z ∈ Z is obtained. That is, a set Z of randomized data is obtained from X.
  • the randomized data may be input to the input unit 110.
  • $h_X(0) \ge 0, \ldots, h_X(D-1) \ge 0$ is the non-negative constraint
  • h Z may be calculated by an external device of the data aggregation device 100, and h Z may be input to the data aggregation device 100.
  • the calculation unit 120 estimates the frequency $h_X(x)$ by repeatedly calculating equation (1).
  • $h_X(x)^t$ denotes the value of $h_X(x)$ at the t-th iteration.
  • the frequency of each value can be calculated by equation (1).
  • the calculation unit 120 can calculate the whole frequency vector $h_X$ at once by repeatedly calculating equation (2).
  • $h_X^t$ is a frequency vector and is a D-dimensional row vector.
  • P is a |X| × |Z| matrix (D × |Z| matrix) having Pr[z|x] as its entries.
  • $h_X^t P$ is the product of a D-dimensional row vector and a D × |Z| matrix, and is a |Z|-dimensional row vector.
  • $(h_Z / h_X^t P)$ is the |Z|-dimensional row vector obtained by dividing each element of $h_Z$ by the corresponding element of $h_X^t P$.
  • $(P (h_Z / h_X^t P)^T)^T$ is a D-dimensional row vector.
  • $h_X^t \cdot (P (h_Z / h_X^t P)^T)^T$ is the D-dimensional row vector formed by the element-wise product of two D-dimensional row vectors.
  • $h_X$ is obtained by calculating the iterative equation (2) until $\|h_X^{t+1} - h_X^t\| < \eta$.
  • <Basis of the formula> Equation (1) is derived based on Bayes' theorem, which is used to express Pr[x|z].
  • the calculation unit 120 determines whether $\|h_X^{t+1} - h_X^t\| < \eta$ holds.
  • the data aggregation device 100 and the terminal 200 in the present embodiment can be realized by, for example, causing a computer to execute a program describing the processing contents described in the present embodiment.
  • the "computer" may be a physical machine or a virtual machine on the cloud.
  • when a virtual machine is used, the "hardware" described here is virtual hardware.
  • the above program can be recorded on a computer-readable recording medium (portable memory, etc.), saved, and distributed. It is also possible to provide the above program through a network such as the Internet or e-mail.
  • FIG. 3 is a diagram showing an example of the hardware configuration of the above computer.
  • the computer of FIG. 3 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are connected to each other by a bus BS, respectively.
  • the program that realizes the processing on the computer is provided by, for example, a recording medium 1001 such as a CD-ROM or a memory card.
  • when the recording medium 1001 storing the program is set in the drive device 1000, the program is installed in the auxiliary storage device 1002 from the recording medium 1001 via the drive device 1000.
  • the program does not necessarily have to be installed from the recording medium 1001, and may be downloaded from another computer via the network.
  • the auxiliary storage device 1002 stores the installed program and also stores necessary files, data, and the like.
  • the memory device 1003 reads and stores the program from the auxiliary storage device 1002 when there is an instruction to start the program.
  • the CPU 1004 realizes the function related to the device according to the program stored in the memory device 1003.
  • the interface device 1005 is used as an interface for connecting to a network.
  • the display device 1006 displays a GUI (Graphical User Interface) or the like by a program.
  • the input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, and the like, and is used for inputting various operation instructions.
  • the output device 1008 outputs the calculation result.
  • Example 1 is an example of randomizing data by the RAPPOR method described in Non-Patent Document 1.
  • Example 2 shows a method of calculating efficiently when |Z| > N.
  • Example 3 presents a method that uses α to address the numerical inefficiency that arises when Z is large.
  • the details of each embodiment will be described.
  • Example 1: in Example 1, the case where data is randomized by the RAPPOR method described in Non-Patent Document 1 will be described.
  • RAPPOR is also called Basic One-time RAPPOR or Unary Encoding.
  • RAPPOR is a method in which data x ∈ X is converted into a D-dimensional binary vector, and perturbation (randomization) and aggregation are performed for each value of the vector.
  • the arithmetic unit 120 of the data aggregation device 100 performs the D-dimensional binary vectorization and the randomization.
  • the arithmetic unit 120 inputs x to Encode: X → {0,1}^D in order to convert it into a D-dimensional binary vector. That is, each x is converted into a D-dimensional binary vector.
  • the arithmetic unit 120 applies the RAPPOR randomization process Perturb to b to obtain the randomized vector b', as shown below.
  • Perturb(b) = b'. In the randomization process Perturb, each value of the vector b is randomized according to the probabilities shown in the following equation.
  • the i-th value of the vector b is denoted b[i].
  • Equation (6) means that if the i-th value of the vector b is 1, the output value is set to 1 with probability p and to 0 with probability 1-p, and if the i-th value of the vector b is 0, the output value is set to 1 with probability q and to 0 with probability 1-q.
  • the transition probability matrix P in the equation (2) and the equation (3) it is sufficient to know Pr [z
  • it suffices to obtain the transition probability of the binary vector.
  • the transition probability of the binary vector is the product of the transition probabilities of its individual values. Since Perturb takes D kinds of input values and produces 2^D kinds of output values, the transition probability matrix P is a D × 2^D matrix. Although the input is a one-hot binary vector, the output is not necessarily one-hot, which distinguishes this method from Randomized Response.
  • when using equation (3)
  • may be appropriately determined.
  • as α in equation (3), it is preferable to use the α described in Example 3.
  • both the spatial complexity and the time complexity are O (
  • ) O (D
  • both the spatial complexity and the time complexity can be reduced to O (DN) by utilizing the fact that the intermediate result of the calculation becomes 0.
  • the value of all the columns corresponding to the value of the vector of h Z in P may be 0.
  • the value may be 0 in all columns of P corresponding to the value of v being 0.
  • P may be replaced by a matrix P' having only the columns at the non-zero positions of $h_Z$
  • $h_Z$ may be replaced by a vector $h'_Z$ having only the non-zero positions.
  • FIG. 5 shows an image of computational complexity reduction using sparseness.
  • the black-painted portion in the vector and the matrix is 0.
  • the transition probability is a very small value of 5E-324, and a value smaller than that cannot be handled even by a double precision floating point type variable.
  • in Example 3, by adjusting α, even if D, the number of kinds of values of X, is large, the values can be handled by the double-precision floating-point type. Specifically, it is as follows.
  • in the transition probability matrices P and P' of RAPPOR shown in FIGS. 4 and 6, each transition probability is a product of the four values p, 1-p, q, and 1-q. In P' of FIG. 6, it can be seen that 1-q appears in common in every value.
  • in Example 3, the reciprocal of the product of the combination of probabilities p, 1-p, q, 1-q that appears in common in P (or P') is used as the constant α in equation (3).
  • the amount of calculation can be reduced by using the sparseness of the frequencies of the randomized data. Further, as described in Example 3, when some factor appears in common in every value of P (any one of p, 1-p, q, 1-q, or a product of a combination of them), it can be canceled by α. As a result, the calculation becomes more efficient, and the situation where a value in P cannot be handled by the double-precision floating-point type is avoided.
  • This specification describes at least the data aggregation device, the data aggregation system, the data aggregation method, and the program described in each of the following sections.
  • (Section 1) An expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each set before and after randomization, and a frequency vector of the original data to be estimated.
  • a calculation unit that calculates the frequency vector of the original data by calculation
  • a data aggregation device including an output unit that outputs a frequency vector of the original data calculated by the calculation unit.
  • $h'_Z$ is a frequency vector consisting of only the non-zero elements of the frequency vector $h_Z$ of the randomized data.
  • P transition probability matrix
  • the data aggregation device according to Section 2. (Section 4)
  • the transition probability which is the value of P or P', consists of the product of the values of a plurality of probabilities used in the randomization process, and is common to each value of P or P'as an element of the product.
  • (Section 5) A data aggregation system including the data aggregation device according to any one of Sections 1 to 4, and a terminal that randomizes the original data and transmits the randomized data to the data aggregation device.
  • (Section 6) An expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each set before and after randomization, and a frequency vector of the original data to be estimated.
  • (Section 7) A program for causing a computer to function as each unit of the data aggregation device according to any one of Sections 1 to 4.
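
The RAPPOR transition probabilities described for Example 1 (each bit kept or flipped independently with probabilities p and q, and the vector-level probability formed as the product over the D bits) can be illustrated with a small sketch. This is not the patent's implementation: the function names `encode`, `bit_prob`, and `transition_matrix`, and the values D = 3, p = 0.75, q = 0.25, are assumptions made for illustration only.

```python
from itertools import product

def encode(x, D):
    """Hypothetical helper: one-hot D-dimensional binary vector for x in {0, ..., D-1}."""
    return [1 if i == x else 0 for i in range(D)]

def bit_prob(b_in, b_out, p, q):
    """Probability that one input bit b_in is output as b_out.

    An input 1 becomes 1 with probability p; an input 0 becomes 1 with probability q.
    """
    prob_one = p if b_in == 1 else q
    return prob_one if b_out == 1 else 1.0 - prob_one

def transition_matrix(D, p, q):
    """Return (outputs, P) where P is the D x 2^D matrix of Pr[z|x]."""
    outputs = list(product([0, 1], repeat=D))  # all 2^D possible output vectors
    P = []
    for x in range(D):
        b = encode(x, D)
        row = []
        for out in outputs:
            pr = 1.0
            for b_in, b_out in zip(b, out):
                pr *= bit_prob(b_in, b_out, p, q)  # product over the D bits
            row.append(pr)
        P.append(row)
    return outputs, P

# Illustrative parameters (assumed, not taken from the patent).
outputs, P = transition_matrix(3, 0.75, 0.25)
```

Each row of `P` sums to 1 because every input must map to some output vector. The number of columns grows as 2^D, which is why the sparseness and scaling ideas of Examples 2 and 3 matter for larger D.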

Abstract

Provided is a data aggregation device comprising: a computing unit that computes a frequency vector of original data by calculating a formula having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having the transition probability of each pair before and after the randomization, and a frequency vector of the original data to be estimated; and an output unit that outputs the frequency vector of the original data computed by the computing unit.

Description

Data aggregation device, data aggregation system, data aggregation method, and program
The present invention relates to a technique for concealing individual data in a database by a probabilistic method.
As a technique for concealing individual data in a database by a probabilistic method, there is, for example, the technique disclosed in Non-Patent Document 1 and the like. In the concealment process disclosed there, each data item is maintained with a certain probability and is otherwise rewritten randomly. Depending on the randomization method, the input data set and the output data set may be the same or different.
Because randomly rewritten data introduce errors when used as they are for data analysis, randomized individual data are aggregated so as to remove the influence of randomization as far as possible (for example, Non-Patent Document 2). Data aggregation mainly deals with the appearance frequency of data, which is a basic analysis (Non-Patent Documents 1 and 2). In the following, frequency counting is treated as the data aggregation process.
In the conventional technique, data aggregation for a method in which the input data set and the output data set differ does not satisfy the non-negative constraint and the total number constraint. A correction is therefore needed to satisfy these constraints (Non-Patent Documents 1 and 2). Because this correction introduces error, a method that satisfies the non-negative constraint and the total number constraint as far as possible is preferred.
The present invention has been made in view of the above points, and its purpose is to provide a technique for efficiently performing data aggregation while simultaneously satisfying the non-negative constraint and the total number constraint in a data randomization method in which the input data set and the output data set differ.
According to the disclosed technique, there is provided a data aggregation device including:
a calculation unit that calculates a frequency vector of original data by evaluating an expression having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair of values before and after randomization, and the frequency vector of the original data to be estimated; and
an output unit that outputs the frequency vector of the original data calculated by the calculation unit.
According to the disclosed technique, a technique is provided for efficiently performing data aggregation while simultaneously satisfying the non-negative constraint and the total number constraint.
FIG. 1 is a block diagram of the data aggregation device in an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing flow of the data aggregation device.
FIG. 3 is a diagram showing an example of the hardware configuration of the device.
FIG. 4 is an example of the transition probability matrix in Example 1.
FIG. 5 is a diagram showing an image of the reduction in the amount of calculation in Example 2.
FIG. 6 is an example of a transition probability matrix having only non-zero columns.
Hereinafter, an embodiment of the present invention (the present embodiment) will be described with reference to the drawings. The embodiment described below is merely an example, and the embodiments to which the present invention is applied are not limited to it.
(Device configuration example)
FIG. 1 shows a configuration diagram of the data aggregation device 100 according to the present embodiment. As shown in FIG. 1, the data aggregation device 100 has an input unit 110, a calculation unit 120, an output unit 130, and a data storage unit 140. FIG. 1 also shows a case where there is a terminal 200 that performs data randomization. A system having the data aggregation device 100 and the terminal 200 may be referred to as a data aggregation system.
The data to be aggregated is input to the input unit 110 of the data aggregation device 100. The calculation unit 120 randomizes the input data and executes the aggregation process. The output unit 130 outputs the result of the aggregation process. The aggregation process in this embodiment is frequency estimation of the original data.
Data input from the input unit 110 can be stored in the data storage unit 140 (database). This allows, for example, the calculation unit 120 to perform the aggregation process on data acquired in the past and the output unit 130 to output the result.
For example, the data storage unit 140 stores data with N collected records (that is, N data items), and the calculation unit 120 performs the aggregation process on that data while simultaneously satisfying the non-negative constraint and the total number constraint.
Alternatively, the original data may be randomized in the terminal 200, and the randomized data may be transmitted from the terminal 200. In that case, the randomized data is input to the input unit 110 and stored in the data storage unit 140, and the calculation unit 120 may perform the aggregation process on the randomized data. The randomized data may also be input to the input unit 110 without any data being stored beforehand in the data storage unit 140, with the calculation unit 120 executing the aggregation process directly on the input data.
Hereinafter, the operation executed by the data aggregation device 100 will be described more specifically.
(Operation example of data aggregation device 100)
N data items are input to the input unit 110. The N data items take values in a data set X = [D] = {0, ..., D-1} having D kinds of values. In the calculation unit 120, each data item x ∈ X is input to the stochastic mechanism M: X → Z, and the output z ∈ Z is obtained. That is, a set Z of randomized data is obtained from X.
As described above, the randomization may instead be executed on the terminal 200 outside the data aggregation device 100, so that already randomized data is input to the input unit 110.
The calculation unit 120 aggregates the set of randomized data $\{z_i\}_{i=0}^{N-1}$ while taking the influence of the mechanism M into account, and estimates the frequency vector $h_X = \{h_X(0), \ldots, h_X(D-1)\}$ that stores the frequency of each value x ∈ X, where $h_X(0) \ge 0, \ldots, h_X(D-1) \ge 0$ and $\sum_{i=0}^{D-1} h_X(i) = N$. The condition $h_X(0) \ge 0, \ldots, h_X(D-1) \ge 0$ is the non-negative constraint, and $\sum_{i=0}^{D-1} h_X(i) = N$ is the total number constraint.
The stochastic mechanism M: X → Z has a conditional probability Pr[z|x] for each input x ∈ X and output z ∈ Z, and randomization is performed according to that conditional probability. Let $P \in [0,1]^{|X| \times |Z|}$ be the |X| × |Z| matrix whose entries are Pr[z|x] for every x ∈ X and z ∈ Z. Here |X| is the number of kinds of values in X, and |Z| is the number of kinds of values in Z. P may be called a transition probability matrix.
The calculation unit 120 also computes, as $h_Z = \{h_Z(0), \ldots, h_Z(D-1)\}$, the frequency vector formed from the randomized data $\{z_i\}_{i=0}^{N-1}$ that was randomized according to Pr[z|x]. Here $h_Z(0) \ge 0, \ldots, h_Z(D-1) \ge 0$ and $\sum_{i=0}^{D-1} h_Z(i) = N$; that $\sum_{i=0}^{D-1} h_Z(i) = N$ holds is one example. Alternatively, $h_Z$ may be calculated by a device external to the data aggregation device 100 and then input to the data aggregation device 100.
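
Forming the frequency vector of the randomized data amounts to counting occurrences of each output value. The following is a minimal sketch under the assumption that the output values are labeled by the integers 0, ..., num_values-1; the function name `frequency_vector` and the sample data are hypothetical.

```python
from collections import Counter

def frequency_vector(randomized, num_values):
    """Count how often each value 0..num_values-1 occurs in the randomized data."""
    counts = Counter(randomized)
    return [counts.get(z, 0) for z in range(num_values)]

# Six randomized records taking values in {0, 1, 2, 3} (made-up sample data).
z_data = [2, 0, 2, 1, 2, 0]
h_z = frequency_vector(z_data, 4)
```

Every entry of `h_z` is a non-negative count, and the entries sum to the number of records N, which matches the properties stated for $h_Z$ above.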
The calculation unit 120 estimates the frequency $h_X(x)$ by repeatedly calculating equation (1).

$$h_X(x)^{t+1} = h_X(x)^{t} \sum_{z \in Z} \frac{\Pr[z|x]\, h_Z(z)}{\sum_{x' \in X} \Pr[z|x']\, h_X(x')^{t}} \tag{1}$$

Here, $h_X(x)^t$ denotes the value of $h_X(x)$ at the t-th iteration. The frequency of each value can be calculated by equation (1). The calculation unit 120 can also calculate the whole frequency vector $h_X$ at once by repeatedly calculating equation (2).

$$h_X^{t+1} = h_X^{t} \cdot \left(P \left(h_Z / h_X^{t} P\right)^{T}\right)^{T} \tag{2}$$

In equation (2), $h_X^t$ is a frequency vector and is a D-dimensional row vector. P is the |X| × |Z| matrix (D × |Z| matrix) having the above Pr[z|x] as its entries. $h_Z$, the frequency vector formed from the randomized data $\{z_i\}_{i=0}^{N-1}$, is a |Z|-dimensional row vector.
$h_X^t P$ is the product of a D-dimensional row vector and a D × |Z| matrix, and is a |Z|-dimensional row vector.
$(h_Z / h_X^t P)$ is the |Z|-dimensional row vector obtained by dividing each element of $h_Z$ by the corresponding element of $h_X^t P$. $(P (h_Z / h_X^t P)^T)^T$ is a D-dimensional row vector. $h_X^t \cdot (P (h_Z / h_X^t P)^T)^T$ is the D-dimensional row vector formed by the element-wise product of two D-dimensional row vectors.
The calculation unit 120 sets the initial value to $h_X^0 = \{N/|X|, \ldots, N/|X|\}$ and, for a predetermined η > 0, obtains $h_X$ by repeatedly calculating equation (2) until $\|h_X^{t+1} - h_X^t\| < \eta$. That is, equation (2) is calculated until the magnitude of the difference between $h_X^{t+1}$ and $h_X^t$ becomes smaller than the threshold η.
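
The iteration just described can be sketched in plain Python. This is an illustrative sketch rather than the patent's implementation: the function names `update` and `estimate`, the use of the Euclidean norm for the stopping test, the `max_iter` safety cap, and the small 2 × 3 example matrix are all assumptions.

```python
def update(h_x, P, h_z):
    """One application of equation (2): h_X * (P (h_Z / (h_X P))^T)^T."""
    D, n_z = len(P), len(h_z)
    # h_X P: frequency of each randomized value predicted by the current estimate.
    pred = [sum(h_x[x] * P[x][z] for x in range(D)) for z in range(n_z)]
    # Element-wise ratio h_Z / (h_X P); positions where h_Z is 0 contribute nothing.
    ratio = [h_z[z] / pred[z] if h_z[z] > 0 else 0.0 for z in range(n_z)]
    # Element-wise product of h_X with P applied to the ratio vector.
    return [h_x[x] * sum(P[x][z] * ratio[z] for z in range(n_z)) for x in range(D)]

def estimate(P, h_z, eta=1e-9, max_iter=100000):
    """Iterate from the uniform start N/|X| until the change is below eta."""
    n, D = sum(h_z), len(P)
    h = [n / D] * D
    for _ in range(max_iter):  # max_iter is an assumed safety cap
        nxt = update(h, P, h_z)
        diff = sum((a - b) ** 2 for a, b in zip(nxt, h)) ** 0.5  # assumed Euclidean norm
        h = nxt
        if diff < eta:
            break
    return h

# Made-up mechanism with |X| = 2 and |Z| = 3; h_z equals [30, 70] multiplied by P.
P = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
h_z = [28, 20, 52]
h_est = estimate(P, h_z)  # approaches [30, 70]
```

Each update multiplies non-negative quantities, so the estimate stays non-negative, and summing the updated entries gives back the total of `h_z`, so the total number N is preserved at every iteration.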
Here, the result of equation (2) does not change even if P is multiplied by a constant α. Equation (2) may therefore be written as

$$h_X^{t+1} = h_X^{t} \cdot \left(\alpha P \left(h_Z / h_X^{t} \alpha P\right)^{T}\right)^{T} \tag{3}$$

Setting α = 1 gives equation (2). As described later, choosing an appropriate value for α in equation (3) avoids entries of P that are too small to be handled even by a double-precision floating-point type.
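
The invariance under scaling can be checked numerically: α appears once in the numerator and once in the denominator of each term, so it cancels. The sketch below is illustrative only; the function name `update_scaled`, the example matrix, and the value α = 1e6 are assumptions.

```python
def update_scaled(h_x, P, h_z, alpha=1.0):
    """One application of equation (3), using alpha * P in place of P."""
    D, n_z = len(P), len(h_z)
    aP = [[alpha * P[x][z] for z in range(n_z)] for x in range(D)]
    pred = [sum(h_x[x] * aP[x][z] for x in range(D)) for z in range(n_z)]
    ratio = [h_z[z] / pred[z] if h_z[z] > 0 else 0.0 for z in range(n_z)]
    return [h_x[x] * sum(aP[x][z] * ratio[z] for z in range(n_z)) for x in range(D)]

# Made-up example values.
P = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
h_z = [28, 20, 52]
h0 = [50.0, 50.0]

plain = update_scaled(h0, P, h_z)               # alpha = 1, i.e. equation (2)
scaled = update_scaled(h0, P, h_z, alpha=1e6)   # same step with P scaled by 1e6
```

Up to floating-point rounding, `plain` and `scaled` agree. In practice the scaled matrix would be built directly from the rescaled probabilities rather than by multiplying values that have already underflowed.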
<Basis of the formula>
Equation (1) is derived from Bayes' theorem as follows. By Bayes' theorem, Pr[x|z] is

$$\Pr[x|z] = \frac{\Pr[z|x]\Pr[x]}{\sum_{x' \in X}\Pr[z|x']\Pr[x']} \tag{4}$$

In addition, equation (1) is derived by using the relationship of equation (5) (Figure JPOXMLDOC01-appb-M000005) together with $h_X(x) = \Pr[x] \times N$. Expressing equation (1) as a vector and matrix calculation gives equation (2), and introducing α then gives equation (3).
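
The derivation can be written out step by step. The following is a sketch only: the figure for equation (5) is not reproduced in this text, so the marginalization identity and the relation $\Pr[z] = h_Z(z)/N$ used below are assumptions, chosen to be consistent with Bayes' theorem, with $h_X(x) = \Pr[x] \times N$, and with the vector form described for equation (2).

```latex
\begin{align*}
\Pr[x \mid z] &= \frac{\Pr[z \mid x]\Pr[x]}{\sum_{x'} \Pr[z \mid x']\Pr[x']}
  && \text{(Bayes' theorem)} \\
\Pr[x] &= \sum_{z} \Pr[x \mid z]\Pr[z]
  && \text{(assumed marginalization)} \\
\frac{h_X(x)}{N} &= \sum_{z} \frac{\Pr[z \mid x]\, h_X(x)/N}{\sum_{x'} \Pr[z \mid x']\, h_X(x')/N} \cdot \frac{h_Z(z)}{N}
  && \text{(with } \Pr[x] = h_X(x)/N,\ \Pr[z] = h_Z(z)/N \text{)} \\
h_X(x) &= h_X(x) \sum_{z} \frac{\Pr[z \mid x]\, h_Z(z)}{\sum_{x'} \Pr[z \mid x']\, h_X(x')}
\end{align*}
```

Reading the right-hand side of the last line as the estimate at iteration t and the left-hand side as the estimate at iteration t+1 gives the fixed-point update that the text describes for equation (1).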
<Processing flow>
An example of the processing flow of the data aggregation device 100 when equation (3) is used is described with reference to the flowchart of FIG. 2. Here, the case where $h_Z$ is input to the data aggregation device 100 is shown.
In S101, P, $h_Z$, α > 0, and η > 0 are input to the input unit 110.
In S102, the calculation unit 120 starts the calculation of equation (3) with $h_X^0 = \{N/|X|, \ldots, N/|X|\}$.
In S103, the calculation unit 120 determines whether $\|h_X^{t+1} - h_X^t\| < \eta$ holds. If the determination is No, the process returns to S102 and equation (3) is calculated again. If the determination is Yes, $h_X^{t+1}$ (or $h_X^t$) at that point is output from the output unit 130 as the resulting frequency vector $h_X$ (S104).
 (ハードウェア構成例)
 本実施の形態におけるデータ集約装置100及び端末200("装置"と総称する)は、例えば、コンピュータに、本実施の形態で説明する処理内容を記述したプログラムを実行させることにより実現可能である。なお、この「コンピュータ」は、物理マシンであってもよいし、クラウド上の仮想マシンであってもよい。仮想マシンを使用する場合、ここで説明する「ハードウェア」は仮想的なハードウェアである。
(Hardware configuration example)
The data aggregation device 100 and the terminal 200 (collectively referred to as "devices") in the present embodiment can be realized by, for example, causing a computer to execute a program describing the processing contents described in the present embodiment. The "computer" may be a physical machine or a virtual machine on the cloud. When using a virtual machine, the "hardware" described here is virtual hardware.
The above program can be recorded on a computer-readable recording medium (such as a portable memory) and then stored or distributed. The program can also be provided through a network, such as the Internet or e-mail.
FIG. 3 is a diagram showing an example of the hardware configuration of the computer. The computer of FIG. 3 includes a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are connected to one another by a bus BS.
The program that realizes the processing on the computer is provided by, for example, a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 into the auxiliary storage device 1002 via the drive device 1000. However, the program does not necessarily have to be installed from the recording medium 1001 and may instead be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program as well as necessary files, data, and the like.
The memory device 1003 reads the program from the auxiliary storage device 1002 and stores it when an instruction to start the program is given. The CPU 1004 realizes the functions of the device according to the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network. The display device 1006 displays a GUI (Graphical User Interface) or the like provided by the program. The input device 1007 is composed of a keyboard and mouse, buttons, a touch panel, or the like, and is used for inputting various operation instructions. The output device 1008 outputs the calculation result.
Hereinafter, Examples 1 to 3 will be described as more specific examples of the method of calculating the frequency vector $h_X$ by repeatedly calculating equation (2) or equation (3) described so far. Example 1 is an example in which data is randomized by the RAPPOR method described in Non-Patent Document 1. Example 2 shows a method of calculating efficiently when $|Z| > N$. Example 3 presents a method that uses $\alpha$ to resolve the numerical inefficiency that arises when $|Z|$ is large. The details of each example are described below.
(Example 1)
In Example 1, a case where data is randomized by the RAPPOR method described in Non-Patent Document 1 will be described.
RAPPOR (also called Basic One-time RAPPOR or Unary Encoding) is a method in which each data item $x \in X$ is converted into a $D$-dimensional binary vector, and perturbation (randomization) and aggregation are performed for each element of the vector. In Example 1, the arithmetic unit 120 of the data aggregation device 100 performs the $D$-dimensional binary vectorization and the randomization.
The arithmetic unit 120 inputs each $x$ to $\mathrm{Encode}: X \to \{0,1\}^D$ to convert it into a $D$-dimensional binary vector. The converted output is a $D$-dimensional binary vector in one-hot notation, that is, $\mathrm{Encode}(x) = (0, \ldots, 0, 1, 0, \ldots, 0) = b$. In the vector $b$, only the $i$-th position is 1, and all other positions are 0.
The arithmetic unit 120 applies the RAPPOR randomization process Perturb to $b$ to obtain the randomized vector $b'$ as follows.
Perturb(b) = b'
In the randomization process Perturb, each element of the vector $b$ is randomized according to the probability given by the following equation. Letting $b[i]$ denote the $i$-th value of the vector $b$,
$$\Pr[b'[i] = 1] = \begin{cases} p & \text{if } b[i] = 1 \\ q & \text{if } b[i] = 0 \end{cases} \tag{6}$$
holds. Equation (6) means that if the $i$-th value of the vector $b$ is 1, the value is set to 1 with probability $p$ and to 0 with probability $1-p$, and if the $i$-th value of the vector $b$ is 0, the value is set to 1 with probability $q$ and to 0 with probability $1-q$.
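Encode and Perturb as described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the lower-case function names, the 0-based index for $x$, and the optional `rng` parameter are introduced here and do not come from the specification.

```python
import random

def encode(x, D):
    """One-hot encode x in {0, ..., D-1} (0-based indexing is an assumption)."""
    b = [0] * D
    b[x] = 1
    return tuple(b)

def perturb(b, p, q, rng=random):
    """Randomize every bit independently according to equation (6).

    A bit equal to 1 stays 1 with probability p; a bit equal to 0
    becomes 1 with probability q.
    """
    return tuple(
        1 if rng.random() < (p if bit == 1 else q) else 0
        for bit in b
    )
```

With $p = 1$ and $q = 0$ the vector is returned unchanged, and with $p = 0$ and $q = 1$ every bit is flipped, which gives a quick sanity check of the two branches.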
To construct the transition probability matrix $P$ in equations (2) and (3), it is sufficient to know $\Pr[z \mid x]$, which is each element of the matrix. In Example 1, since the input and output of Perturb in RAPPOR are binary vectors, it suffices to obtain the transition probability between binary vectors.
The transition probability of a binary vector is the product of the transition probabilities of its individual elements. Since Perturb has $D$ possible input values and $2^D$ possible output values, the transition probability matrix $P$ is a $D \times 2^D$ matrix. Note that although the input is a one-hot binary vector, the output is not necessarily one-hot, which differs from Randomized Response.
For example, when $D = 3$, there are 3 possible inputs to Perturb and 8 possible outputs. If the input is $b = (0,1,0)$ and the output is $\mathrm{Perturb}(b) = (1,1,0)$, the transition probability is $p(1-q)q$ from equation (6). The transition probability matrix $P$ for $D = 3$ (a $3 \times 2^3$ matrix) is shown in FIG. 4. In FIG. 4, the rows correspond to the inputs to Perturb and the columns to the outputs, and each value of the matrix is a product of the transition probabilities defined in equation (6). For example, the figure shows that the transition probability from $b = (0,0,1)$ to $b' = (0,0,0)$ is $(1-p)(1-q)^2$.
For $h_Z$ in equation (2) or equation (3), the arithmetic unit 120 aggregates the perturbed vectors $\mathrm{Perturb}(\mathrm{Encode}(x_i)) = b'_i$ into $\{b'_i\}_{i=1}^{N}$ and obtains the frequency $h_Z(b')$ of each vector $b' \in Z$. In addition, $\alpha$ (when equation (3) is used) and $\eta$ are set appropriately. For $\alpha$ in equation (3), it is preferable to use the $\alpha$ described in Example 3.
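Building $h_Z$ amounts to counting how often each randomized vector occurs. A minimal sketch, in which the function name `frequency_vector` and the explicit `outputs` list that fixes the column order are assumptions introduced here:

```python
from collections import Counter

def frequency_vector(randomized, outputs):
    """Count each possible output vector, in the column order given by outputs."""
    counts = Counter(randomized)                # randomized is a list of tuples b'_i
    return [counts.get(z, 0) for z in outputs]
```

The entries sum to $N$, the number of collected vectors, as long as every collected vector is listed in `outputs`.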
(Example 2)
Next, Example 2 will be described. A naive method of calculating equation (2) or equation (3) requires $O(|X||Z|) = O(D|Z|)$ in both space and time complexity. When the condition $|Z| > N$ holds, however, both can be reduced to $O(DN)$ by exploiting the fact that some intermediate results of the calculation are 0.
Example 2 focuses on the sparseness of $h_Z$ when $|Z| > N$. The size of $h_Z$ is $|Z|$, but since $|Z| > N$, some of the values of $h_Z$ are necessarily 0. That is, the randomized data are $\{z_i\}_{i=1}^{N}$ and their number is $N$, so when the number $|Z|$ of distinct values that a randomized data item can take is larger than $N$, there is always a value that does not appear, that is, a value whose frequency is 0.
Here, let part of the calculation of equation (2) be $v = h_Z / h^t_X P$. Then equation (2) becomes
$h^{t+1}_X = h^t_X \cdot (P v^T)^T$.
In the calculation of $v$, wherever the value of $h_Z$ is 0, the resulting value of $v$ is also 0. Therefore, the columns of $P$ corresponding to the zero entries of $h_Z$ may all be set to 0. Likewise, in $h^t_X \cdot (P v^T)^T$, the columns of $P$ corresponding to the zero entries of $v$ may all be set to 0.
From the above, $P$ may be replaced with a matrix $P'$ that has only the columns at the non-zero positions of $h_Z$, and $h_Z$ may be replaced with a vector $h'_Z$ that has only its non-zero entries. Equation (2) using $P'$ and $h'_Z$ in this way is as follows.
$$h^{t+1}_X = h^t_X \cdot \left(P' \left(h'_Z / h^t_X P'\right)^T\right)^T$$
The same applies to equation (3): to exploit the sparseness, $P$ and $h_Z$ in equation (3) are replaced with $P'$ and $h'_Z$.
FIG. 5 illustrates the reduction in the amount of calculation obtained by exploiting sparseness. In FIG. 5, the black-filled portions of the vectors and matrices are 0. As shown in FIG. 5, $P$ and $h_Z$ are 0 at the same positions, and the amount of calculation can be reduced by using $P'$ and $h'_Z$, from which those zero portions are deleted.
When the $3 \times 2^3$ transition probability matrix $P$ (see FIG. 4) is constructed using RAPPOR (Unary Encoding) described in Example 1, and $h_Z((0,1,0)) = 0$, $h_Z((0,1,1)) = 0$, $h_Z((1,0,1)) = 0$, $h_Z((1,1,0)) = 0$, and $h_Z((1,1,1)) = 0$, the corresponding columns of $P$ are deleted, so $P'$ is the matrix shown in FIG. 6.
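One way to obtain $P'$ and $h'_Z$ without ever materializing the $2^D$ columns is to build columns only for the vectors that were actually observed. The following is a sketch under assumptions: the names `transition_prob` and `sparse_inputs`, the 0-based one-hot index, and the sorted column order are introduced here for illustration.

```python
from collections import Counter

def transition_prob(x, z, p, q):
    """Pr[z | one-hot input with its 1 at position x], following equation (6)."""
    prob = 1.0
    for i, bit in enumerate(z):
        one = p if i == x else q
        prob *= one if bit == 1 else 1.0 - one
    return prob

def sparse_inputs(randomized, D, p, q):
    """Return (P', h'_Z, observed) built only from output vectors that occur."""
    counts = Counter(randomized)
    observed = sorted(counts)                   # column order is a choice made here
    h_Z_prime = [counts[z] for z in observed]   # non-zero counts only
    P_prime = [[transition_prob(x, z, p, q) for z in observed]
               for x in range(D)]
    return P_prime, h_Z_prime, observed
```

For the FIG. 6 situation, where only $(0,0,0)$, $(0,0,1)$, and $(1,0,0)$ are observed, this yields a $3 \times 3$ matrix instead of the full $3 \times 8$ matrix. The number of columns is at most $N$, regardless of $D$.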
(Example 3)
Not only when RAPPOR is applied as in Example 1, each value of the transition probability matrix becomes smaller as $D$ becomes larger. Numerical precision therefore becomes a problem when the method of the present embodiment is implemented on the computer used as the data aggregation device 100. For example, suppose that $b$ is perturbed by RAPPOR with $p + q = 1$. Then, by equation (6), the transition probability that the binary vector values do not match at any position before and after the perturbation is
$$(1-p)\,q^{D-1} = q^{D}$$
which decreases exponentially in $D$. For example, for the privacy parameter $\varepsilon = 1$ with $D = 764$, or for $\varepsilon = 4$ with $D = 350$, this transition probability is as small as 5E-324, and smaller values can no longer be represented even by a double-precision floating-point variable.
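The underflow can be reproduced with a short experiment. The relation between $\varepsilon$ and $q$ is not stated in this section; the sketch below assumes the common RAPPOR setting $q = 1/(e^{\varepsilon/2} + 1)$ with $p = 1 - q$, and the function names are hypothetical.

```python
import math

def all_flip_probability(eps, D):
    """q^D, assuming q = 1 / (exp(eps / 2) + 1); this relation is an assumption."""
    q = 1.0 / (math.exp(eps / 2.0) + 1.0)
    return math.pow(q, D)

def log_all_flip_probability(eps, D):
    """Natural logarithm of the same quantity; it stays representable."""
    return -D * math.log(math.exp(eps / 2.0) + 1.0)

tiny = all_flip_probability(1.0, 764)   # at the very bottom of the double range
gone = all_flip_probability(1.0, 800)   # underflows to exactly 0.0
```

Under this assumption both parameter pairs quoted above land at the edge of the subnormal range of IEEE 754 doubles, and a slightly larger $D$ yields exactly zero, after which different columns of $P$ can no longer be distinguished.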
In Example 3, $\alpha$ is adjusted so that the values can be handled in double-precision floating point even when $D$, the number of distinct values of $X$, is large. The details are as follows.
In the RAPPOR transition probability matrices $P$ and $P'$ shown in FIGS. 4 and 6, each transition probability is a product of a combination of the four values $p$, $1-p$, $q$, and $1-q$. In $P'$ of FIG. 6, the factor $1-q$ appears in every value.
Therefore, by setting $\alpha = 1/(1-q)$ in equation (3) with $P$ replaced by $P'$, the redundant multiplication is removed in the calculation of $\alpha P'$, and the differences between the values become representable. For example, $(1-p)(1-q)^2$ in $P'$ becomes $(1-p)(1-q)$ in $\alpha P'$, which reduces the extra multiplications by values smaller than 1.
That is, in Example 3, the reciprocal of the product of the combination of the probabilities $p$, $1-p$, $q$, $1-q$ that appears in common in every value of $P$ (or $P'$) may be used as the constant $\alpha$ in equation (3). In other words, when each transition probability in $P$ or $P'$ is a product of a plurality of probability values used in the randomization process and the same value is used as a factor of that product in every entry of $P$ or $P'$, a value based on that same value is used as $\alpha$. If no probability appears in common in every value, $\alpha = 1$ is used, for example.
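One way to realize this choice of $\alpha$ is to track, for every entry, how many times each of $p$, $1-p$, $q$, $1-q$ occurs, and to drop the exponents shared by all entries before multiplying. The sketch below is an illustration under assumptions: the function name `scaled_rappor_matrix` and the exponent bookkeeping are introduced here and are not taken from the specification.

```python
def scaled_rappor_matrix(observed, D, p, q):
    """Return (alpha, alpha * P') with the common factors divided out up front."""
    factors = (p, 1.0 - p, q, 1.0 - q)
    exps = []                                   # one exponent 4-tuple per entry
    for x in range(D):
        for z in observed:
            e = [0, 0, 0, 0]
            for i, bit in enumerate(z):
                if i == x:
                    e[0 if bit else 1] += 1     # factor p or (1 - p)
                else:
                    e[2 if bit else 3] += 1     # factor q or (1 - q)
            exps.append(e)
    # exponent of each factor that is shared by every entry of P'
    common = [min(e[k] for e in exps) for k in range(4)]
    alpha = 1.0
    for f, c in zip(factors, common):
        alpha /= f ** c                         # alpha = 1 / (common product)
    n_cols = len(observed)
    aP = []
    for x in range(D):
        row = []
        for j in range(n_cols):
            e = exps[x * n_cols + j]
            val = 1.0
            for f, ek, ck in zip(factors, e, common):
                val *= f ** (ek - ck)           # only the non-shared factors
            row.append(val)
        aP.append(row)
    return alpha, aP
```

For the three observed columns of FIG. 6 this gives $\alpha = 1/(1-q)$, and the entry $(1-p)(1-q)^2$ is stored as $(1-p)(1-q)$, in line with the description above. The scaled entries are computed directly, so the shared small factor is never multiplied in.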
(Effects of the embodiment)
With the technique described in the present embodiment, the frequencies of the original data can be estimated from data randomized from the original data while simultaneously satisfying the non-negativity constraint and the total-count constraint.
In addition, as described in Example 2, when $|Z| > N$, the amount of calculation can be reduced by exploiting the sparseness of the frequencies of the randomized data. Further, as described in Example 3, when some factor (any one of $p$, $1-p$, $q$, $1-q$, or a product of a combination of them) appears in common in every value of $P$, it can be cancelled by $\alpha$. This makes the calculation more efficient and avoids the situation in which values in $P$ can no longer be handled in double-precision floating point.
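The two constraints can be checked directly from the update rule. The following derivation is a sketch that uses the element notation $P_{xz} = \Pr[z \mid x]$ (introduced here for illustration) and assumes that the denominator is positive for every $z$ with $h_Z(z) > 0$:

```latex
\sum_{x} h^{t+1}_X(x)
  = \sum_{x} h^{t}_X(x) \sum_{z} \alpha P_{xz}\,
      \frac{h_Z(z)}{\sum_{x'} h^{t}_X(x')\,\alpha P_{x'z}}
  = \sum_{z} h_Z(z)\,
      \frac{\sum_{x} h^{t}_X(x)\,P_{xz}}{\sum_{x'} h^{t}_X(x')\,P_{x'z}}
  = \sum_{z} h_Z(z) = N
```

Every factor on the right-hand side of the update is non-negative, so $h^{t+1}_X$ stays non-negative whenever $h^{t}_X$ is, and the constant $\alpha$ cancels between numerator and denominator.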
(Summary of embodiments)
This specification describes at least the data aggregation device, data aggregation system, data aggregation method, and program set forth in the following items.
(Item 1)
A data aggregation device comprising:
an arithmetic unit that calculates a frequency vector of original data by calculating an equation having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair of values before and after randomization, and the frequency vector of the original data to be estimated; and
an output unit that outputs the frequency vector of the original data calculated by the arithmetic unit.
(Item 2)
The data aggregation device according to Item 1, wherein, where $X$ is the set of the original data $x$, $Z$ is the set of the randomized data $z$, $P$ is the transition probability matrix, $h_Z$ is the frequency vector of the randomized data, $h_X$ is the frequency vector of the original data, $h^t_X$ is $h_X$ at the $t$-th iteration, and $\alpha$ is a constant,
the equation is a first equation expressed as
$h^{t+1}_X = h^t_X \cdot (\alpha P (h_Z / h^t_X \alpha P)^T)^T$, and
the arithmetic unit calculates the first equation until the magnitude of the difference between $h^{t+1}_X$ and $h^t_X$ becomes smaller than a threshold value.
(Item 3)
The data aggregation device according to Item 2, wherein, in a case where the number of kinds of values that can be taken in the set $Z$ is larger than the number of elements in the set $Z$, where $h'_Z$ is a frequency vector consisting only of the elements with non-zero frequency in the frequency vector $h_Z$ of the randomized data, and $P'$ is a matrix consisting only of the columns of the transition probability matrix $P$ that correspond to the elements present in $h'_Z$,
the arithmetic unit calculates, instead of the first equation, a second equation expressed as $h^{t+1}_X = h^t_X \cdot (\alpha P' (h'_Z / h^t_X \alpha P')^T)^T$.
(Item 4)
The data aggregation device according to Item 2 or 3, wherein each transition probability that is a value of $P$ or $P'$ is a product of a plurality of probability values used in the randomization process, and, in a case where the same value is used as a factor of the product in common in every value of $P$ or $P'$, a value based on that same value is used as $\alpha$.
(Item 5)
A data aggregation system comprising: the data aggregation device according to any one of Items 1 to 4; and a terminal that randomizes the original data and transmits the randomized data to the data aggregation device.
(Item 6)
A data aggregation method comprising:
an arithmetic step of calculating a frequency vector of original data by calculating an equation having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair of values before and after randomization, and the frequency vector of the original data to be estimated; and
an output step of outputting the frequency vector of the original data calculated in the arithmetic step.
(Item 7)
A program for causing a computer to function as each unit of the data aggregation device according to any one of Items 1 to 4.
Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
100 Data aggregation device
110 Input unit
120 Arithmetic unit
130 Output unit
140 Storage unit
200 Terminal
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims (7)

  1.  A data aggregation device comprising:
     an arithmetic unit that calculates a frequency vector of original data by calculating an equation having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair of values before and after randomization, and the frequency vector of the original data to be estimated; and
     an output unit that outputs the frequency vector of the original data calculated by the arithmetic unit.
  2.  The data aggregation device according to claim 1, wherein, where $X$ is the set of the original data $x$, $Z$ is the set of the randomized data $z$, $P$ is the transition probability matrix, $h_Z$ is the frequency vector of the randomized data, $h_X$ is the frequency vector of the original data, $h^t_X$ is $h_X$ at the $t$-th iteration, and $\alpha$ is a constant,
     the equation is a first equation expressed as
     $h^{t+1}_X = h^t_X \cdot (\alpha P (h_Z / h^t_X \alpha P)^T)^T$, and
     the arithmetic unit calculates the first equation until the magnitude of the difference between $h^{t+1}_X$ and $h^t_X$ becomes smaller than a threshold value.
  3.  The data aggregation device according to claim 2, wherein, in a case where the number of kinds of values that can be taken in the set $Z$ is larger than the number of elements in the set $Z$, where $h'_Z$ is a frequency vector consisting only of the elements with non-zero frequency in the frequency vector $h_Z$ of the randomized data, and $P'$ is a matrix consisting only of the columns of the transition probability matrix $P$ that correspond to the elements present in $h'_Z$,
     the arithmetic unit calculates, instead of the first equation, a second equation expressed as $h^{t+1}_X = h^t_X \cdot (\alpha P' (h'_Z / h^t_X \alpha P')^T)^T$.
  4.  The data aggregation device according to claim 2 or 3, wherein each transition probability that is a value of $P$ or $P'$ is a product of a plurality of probability values used in the randomization process, and, in a case where the same value is used as a factor of the product in common in every value of $P$ or $P'$, a value based on that same value is used as $\alpha$.
  5.  A data aggregation system comprising: the data aggregation device according to any one of claims 1 to 4; and a terminal that randomizes the original data and transmits the randomized data to the data aggregation device.
  6.  A data aggregation method comprising:
     an arithmetic step of calculating a frequency vector of original data by calculating an equation having a frequency vector of randomized data obtained by randomizing the original data, a transition probability matrix having a transition probability for each pair of values before and after randomization, and the frequency vector of the original data to be estimated; and
     an output step of outputting the frequency vector of the original data calculated in the arithmetic step.
  7.  A program for causing a computer to function as each unit of the data aggregation device according to any one of claims 1 to 4.
PCT/JP2020/039120 2020-10-16 2020-10-16 Data aggregation device, data aggregation system, data aggregation method, and program WO2022079905A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022556814A JPWO2022079905A1 (en) 2020-10-16 2020-10-16
PCT/JP2020/039120 WO2022079905A1 (en) 2020-10-16 2020-10-16 Data aggregation device, data aggregation system, data aggregation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/039120 WO2022079905A1 (en) 2020-10-16 2020-10-16 Data aggregation device, data aggregation system, data aggregation method, and program

Publications (1)

Publication Number Publication Date
WO2022079905A1 true WO2022079905A1 (en) 2022-04-21

Family

ID=81209025

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/039120 WO2022079905A1 (en) 2020-10-16 2020-10-16 Data aggregation device, data aggregation system, data aggregation method, and program

Country Status (2)

Country Link
JP (1) JPWO2022079905A1 (en)
WO (1) WO2022079905A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017212669A (en) * 2016-05-27 2017-11-30 日本電信電話株式会社 Data disturbance apparatus, data disturbance method, and data disturbance program
JP2018055057A (en) * 2016-09-30 2018-04-05 日本電信電話株式会社 Data disturbing device, method and program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HORIGOME, HIKARU ET AL.: "Privacy-Preserving Estimation Population Distribution using Local Differential Privacy", MULTIMEDIA, DISTRIBUTED, COOPERATIVE, AND MOBILE (DICOM02020) SYMPOSIUM, 24 June 2020 (2020-06-24), pages 1 - 8, Retrieved from the Internet <URL:https://windy/mind_meiji.ac_jp/paper/2020/bachelor> [retrieved on 20210909] *
ZITAO LI; TIANHAO WANG; MILAN LOPUHAÄ-ZWAKENBERG; BORIS SKORIC; NINGHUI LI: "Estimating Numerical Distributions under Local Differential Privacy", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 December 2019 (2019-12-02), XP081543898 *

Also Published As

Publication number Publication date
JPWO2022079905A1 (en) 2022-04-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20957734

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022556814

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20957734

Country of ref document: EP

Kind code of ref document: A1