JP2017203904A

JP2017203904A - Privacy protection device

Info

Publication number: JP2017203904A
Application number: JP2016096209A
Authority: JP
Inventors: Masayuki Terada; Takayasu Yamaguchi
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2016-05-12
Filing date: 2016-05-12
Publication date: 2017-11-16
Anticipated expiration: 2036-05-12
Also published as: JP6711689B2

Abstract

PROBLEM TO BE SOLVED: To provide a privacy protection device capable of providing individual data that satisfies the differential privacy standard and is highly useful. SOLUTION: A privacy protection device 10 comprises: an input unit 11 for receiving an input of first individual data; an aggregation unit 12 for generating first aggregated data by aggregating the first individual data received by the input unit 11; a random number application unit 13 for generating second aggregated data by applying a random number of a predetermined strength to the first aggregated data; a refinement unit 14 for generating, from the second aggregated data, third aggregated data in which each element is a non-negative integer and the sum of all elements is equal to the sum of all elements of the second aggregated data; an individual data generation unit 15 for generating second individual data from the third aggregated data; and an output unit 16 for outputting the second individual data. SELECTED DRAWING: Figure 1

Description

The present invention relates to a privacy protection device.

In recent years, various new standards and methods for publishing useful data while protecting privacy have been proposed in fields such as information security and database processing. These technologies are called privacy-preserving data publishing (PPDP) technologies. However, the PPDP technologies differ in their assumptions about the attacker's objectives, capabilities, and background knowledge, which makes it difficult to discuss their security on a common basis, and so they are not easy to apply to actual data utilization. That is, applying these technologies in practice requires deciding appropriately, for each kind of data and each application, which privacy protection standard to adopt and which method to use to protect privacy, and making this decision for every data utilization is difficult in practice.

Differential privacy has therefore attracted attention (see Patent Documents 1 and 2 and Non-Patent Documents 1 to 3). Differential privacy is a privacy protection standard whose security rests on the difficulty of determining, from processed data, whether or not a given person is included in the database from which the processed data was created. Unlike many other privacy protection standards, it has the excellent property of providing mathematical security against attackers with arbitrary background knowledge and against unknown attacks. A means of satisfying the differential privacy standard is called a "mechanism." A representative differential privacy mechanism is the Laplace mechanism, which can be realized by the simple means of adding Laplace noise to a query result.

The Laplace mechanism can create aggregated data that satisfies the differential privacy standard simply and quickly. However, it cannot be used directly to anonymize the original data before aggregation, that is, the individual data, so that the individual data satisfies the differential privacy standard. One means of anonymizing individual data so that it satisfies the differential privacy standard is a method based on PRAM (Post Randomization), in which the value of each record in the individual data is changed to another value with a predetermined probability. In a PRAM-based method that satisfies the differential privacy standard, the strength (security) of the differential privacy obtained depends on the probability with which values are changed (see Non-Patent Document 4).

Patent Document 1: JP 2012-133320 A
Patent Document 2: JP 2016-012074 A

Non-Patent Document 1: Cynthia Dwork. Differential Privacy. In Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener, editors, Proc. 33rd Intl. Conf. Automata, Languages and Programming - Volume Part II, Vol. 4052 of Lecture Notes in Computer Science, pp. 1-12. Springer, 2006.
Non-Patent Document 2: Cynthia Dwork. Differential privacy: a survey of results. In Proc. 5th Intl. Conf. Theory and Applications of Models of Computation, pp. 1-19. Springer-Verlag, April 2008.
Non-Patent Document 3: Masayuki Terada and 3 others, Application of differential privacy to large-scale aggregate data, IPSJ Journal, Vol. 56, No. 9, pp. 1801-1816, September 2015.
Non-Patent Document 4: Dai Ikarashi et al., k-anonymous microdata release via post randomisation method, eprint arXiv, Vol. 1504.05353, pp. 1-22, 2015.

However, generating anonymized individual data with sufficient security by a PRAM-based method requires changing record values with a very high probability. Providing sufficient security therefore greatly impairs the usefulness of the data.

The present invention has been made in view of the above problem, and an object of the present invention is to provide a privacy protection device capable of providing individual data that satisfies the differential privacy standard and is highly useful.

A privacy protection device according to one aspect of the present invention is a privacy protection device that receives an input of first individual data including a plurality of records and outputs second individual data, and comprises: input means for receiving the input of the first individual data; aggregation means for generating first aggregated data by aggregating the first individual data received by the input means; random number application means for generating second aggregated data by applying a random number of a predetermined strength to the first aggregated data; refinement means for generating, from the second aggregated data, third aggregated data in which each element is a non-negative integer and the sum of all elements is equal to the sum of all elements of the second aggregated data; individual data generation means for generating the second individual data from the third aggregated data; and output means for outputting the second individual data.

In this privacy protection device, a random number is applied to the first aggregated data to generate the second aggregated data. Because the random number is applied to the aggregated data, rather than the record values of the individual data being changed directly as in PRAM, aggregated data that satisfies the differential privacy standard can be generated without greatly impairing the usefulness of the data. The second aggregated data to which the random number has been applied may, however, contain negative or fractional elements, and when it does, individual data cannot always be generated from the aggregated data as it stands. In this respect, generating third aggregated data in which each element is a non-negative integer and the sum of the elements is equal to the sum of all elements of the second aggregated data makes it possible to generate second individual data corresponding to the third aggregated data. Since the second individual data is based on the third aggregated data generated from the second aggregated data, it satisfies the differential privacy standard and is highly useful, as is the second aggregated data. Thus, according to the present invention, individual data that satisfies the differential privacy standard and is highly useful can be provided.

In the above privacy protection device, the refinement means may generate the third aggregated data by performing a nearest neighbor search with respect to the elements of the second aggregated data. This gives the second individual data a record composition similar to that of the first individual data, which improves the usefulness of the second individual data.

According to the present invention, individual data that satisfies the differential privacy standard and is highly useful can be provided.

FIG. 1 is a diagram schematically showing the configuration of the privacy protection device according to the first embodiment.
FIG. 2 is a hardware configuration diagram of the privacy protection device of FIG. 1.
FIG. 3 is a flowchart showing a series of processes of the privacy protection method executed by the privacy protection device of FIG. 1.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

[First Embodiment]
FIG. 1 is a diagram schematically showing the configuration of the privacy protection device according to the first embodiment. As shown in FIG. 1, the privacy protection device 10 is a device that receives first individual data D including a plurality of records and outputs second individual data D+, and is configured by, for example, an information processing device such as a server device. When individual data is to be published, the privacy protection device 10 applies privacy protection processing to the first individual data D in order to prevent leakage of privacy-related information (personal information) about the people included in the database. For example, the privacy protection device 10 protects privacy when data such as sales histories of stores or GPS location information is provided or disclosed.

The first individual data D is the target of the privacy protection processing performed by the privacy protection device 10. The first individual data D is a database composed of one or more records, each associated with an individual, and each record has one or more attribute values. That is, the first individual data D is expressed as a multiset whose elements are the records.

This is described in detail below. Let x_i be the record associated with an individual i. The record x_i is an ordered tuple of d attribute values x_{ij}, each representing some information about that individual. In any record, the j-th attribute value belongs to a set A_j. Here, A_j is called an attribute, and the direct product of all attributes, A = A_1 × A_2 × ... × A_d, is called the attribute space.

Individual data D composed of n records is then expressed as the following multiset of n elements whose underlying set is the attribute space A. Here, {{·}} denotes a multiset, (·) an ordered tuple, and {} a set. D, x_i, and A are given by equations (1) to (3) below.
D = {{x_1, x_2, ..., x_n}} ... (1)
x_i = (x_{i1}, x_{i2}, ..., x_{id}) ... (2)
A = A_1 × A_2 × ... × A_d ... (3)
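
As a non-limiting illustration of equations (1) to (3), the following minimal Python sketch represents individual data as a multiset using `collections.Counter`. The attributes and values (sex, age group) and all variable names are hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import product

# Hypothetical attributes A_1 (sex) and A_2 (age group); not taken from the source.
A_1 = ("male", "female")
A_2 = ("20s", "30s")

# Attribute space A = A_1 x A_2: every value a record can take.
A = list(product(A_1, A_2))

# Each record x_i is an ordered tuple (x_i1, x_i2) of attribute values.
records = [("male", "30s"), ("female", "20s"), ("male", "30s")]

# D as a multiset: duplicates are counted and record order is ignored.
D = Counter(records)

print(D[("male", "30s")])               # multiplicity of this record in D
print(D == Counter(reversed(records)))  # reordering the records gives the same D
```

A `Counter` captures both properties that motivate the multiset representation: identical records may occur several times, and two data sets that differ only in record order compare equal.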

The individual data D is expressed as a multiset, rather than as a set or an ordered tuple, for two reasons: individual data may contain multiple records having the same combination of attribute values, and individual data sets that differ only in the order of their records should be regarded as identical.

Functionally, the privacy protection device 10 includes an input unit 11 (input means), an aggregation unit 12 (aggregation means), a random number application unit 13 (random number application means), a refinement unit 14 (refinement means), an individual data generation unit 15 (individual data generation means), and an output unit 16 (output means). The privacy protection device 10 is configured by the hardware shown in FIG. 2.

FIG. 2 is a hardware configuration diagram of the privacy protection device 10. As shown in FIG. 2, the privacy protection device 10 is physically configured as a computer including hardware such as one or more CPUs (Central Processing Units) 101, a RAM (Random Access Memory) 102 and a ROM (Read Only Memory) 103 serving as main storage devices, a communication module 104 serving as a data transmission and reception device, an auxiliary storage device 105 such as a hard disk device, an input device 106 such as a keyboard that accepts user input, and an output device 107 such as a display. Each function of the privacy protection device 10 in FIG. 1 is realized by loading one or more predetermined pieces of computer software onto hardware such as the CPU 101 and the RAM 102, thereby operating the communication module 104, the input device 106, and the output device 107 under the control of the CPU 101, and by reading and writing data in the RAM 102 and the auxiliary storage device 105.

Returning to FIG. 1, the functional configuration of the privacy protection device 10 is described in detail. The input unit 11 functions as input means for receiving the input of the first individual data D. The input unit 11 receives the first individual data D from outside the privacy protection device 10 and outputs the received first individual data D to the aggregation unit 12.

The aggregation unit 12 functions as aggregation means for generating first aggregated data V by aggregating the first individual data D received by the input unit 11. The first aggregated data V is a collection of values, each obtained by counting the number of records in the first individual data D whose attribute values (or combinations of attribute values) satisfy a given condition.

Let A be the attribute space of D, and let Count(D, C_k) denote the number of records of D that belong to a subspace C_k of A. This is called a count query result. For an aggregation condition C = (C_1, C_2, ..., C_p), which is an ordered tuple of arbitrary subspaces C_k, the first aggregated data V is the ordered tuple of the count query results Count(D, C_k) corresponding to the elements of C, as given by equation (4) below.
V(D, C) = (v_1, v_2, ..., v_p), v_k = Count(D, C_k) ... (4)
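
Equation (4) can be sketched as follows. This is an illustrative example only: the function names `count` and `aggregate`, the sample records, and the aggregation condition (records grouped by sex) are assumptions introduced for this sketch and do not appear in the source.

```python
from collections import Counter

def count(D, C_k):
    """Count(D, C_k): number of records in the multiset D whose value lies in C_k."""
    return sum(m for x, m in D.items() if x in C_k)

def aggregate(D, C):
    """V(D, C) = (v_1, ..., v_p) with v_k = Count(D, C_k), following equation (4)."""
    return tuple(count(D, C_k) for C_k in C)

# Hypothetical individual data with attributes (sex, age group).
D = Counter([("male", "30s"), ("female", "20s"), ("male", "30s"), ("male", "20s")])

# Hypothetical aggregation condition: two disjoint subspaces of the attribute space.
C = (
    {("male", "20s"), ("male", "30s")},      # C_1: all male records
    {("female", "20s"), ("female", "30s")},  # C_2: all female records
)

V = aggregate(D, C)
print(V)  # (3, 1)
```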

In creating the first aggregated data V, direct products of mutually disjoint subsets of the value range A_j of each attribute are generally used as the aggregation condition C. Aggregated data created in this way is called a contingency table, and each value v_k in the contingency table is called a cell or a cell value.

Further, consider first aggregated data V over the attribute space A = {a_1, a_2, ..., a_p} in which c_i = {a_i}; that is, each element c_i of the aggregation condition is a set containing exactly one element of the attribute space, and the elements of the aggregation condition correspond one-to-one, without duplication, to all elements of the attribute space. This first aggregated data V is a contingency table, and since no contingency table with a finer aggregation condition can be created, it is hereinafter called a complete contingency table. The number of cells |V| of a complete contingency table is equal to the cardinality |A| of A.

In this case, each element v_i of the complete contingency table V is nothing other than the multiplicity m_D(a_i), in the first individual data D (a multiset), of the element a_i of its underlying set, the attribute space A. Since any multiset is uniquely defined by its underlying set and the multiplicity of each element, D is uniquely determined once the pair (V, A) is given. The following theorem therefore holds.

(Theorem 1) Let f_A be the mapping that generates the complete contingency table V, as the first aggregated data, from first individual data D having the attribute space A (V = f_A(D)). Then f_A has an inverse mapping f^{-1}_A such that D = f^{-1}_A(V).
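
Theorem 1 can be illustrated with the following minimal sketch, in which `f_A` builds a complete contingency table and `f_A_inverse` rebuilds the multiset from it. The function names, the fixed ordering of the attribute space as a Python list, and the sample data are assumptions made for this example.

```python
from collections import Counter
from itertools import product

def f_A(D, A):
    """Complete contingency table: one cell per element a_i of A, holding m_D(a_i)."""
    return [D[a] for a in A]  # a Counter returns 0 for elements that do not occur

def f_A_inverse(V, A):
    """Rebuild the multiset D from the cell values V and the attribute space A."""
    return Counter({a: v for a, v in zip(A, V) if v > 0})

# Hypothetical attribute space and individual data.
A = list(product(("male", "female"), ("20s", "30s")))
D = Counter([("male", "30s"), ("female", "20s"), ("male", "30s")])

V = f_A(D, A)
print(V)                        # [0, 2, 1, 0]
print(f_A_inverse(V, A) == D)   # True: D is recovered from (V, A)
```

The round trip succeeds because each cell of the complete contingency table stores exactly the multiplicity of one element of the attribute space.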

The first aggregated data V generated by the aggregation unit 12 may be any contingency table generated from the first individual data, but in view of the above theorem it is particularly desirable that it be a complete contingency table.

The random number application unit 13 functions as random number application means for generating second aggregated data V* by applying a random number of a predetermined strength to the first aggregated data V. The second aggregated data V* satisfies the differential privacy standard. The random number is one whose addition allows the differential privacy standard to be satisfied. Examples of such random numbers include Laplace noise (Laplace random numbers), which follows the Laplace distribution, and geometric noise (geometric random numbers), which follows the geometric distribution. A means of satisfying the differential privacy standard by applying Laplace noise is called the Laplace mechanism, and a means of satisfying the differential privacy standard by applying geometric noise is called the geometric mechanism.

The following description uses Laplace noise as the random number. Laplace noise is a random number drawn independently from the Laplace distribution, which is a kind of probability distribution. Let Lap(λ) denote Laplace noise generated according to the Laplace distribution with mean 0 and scale λ. Specifically, when the second aggregated data V* is generated by adding an independently generated Lap(1/ε) to every element of the first aggregated data V, the second aggregated data V* is guaranteed to satisfy the security strength called ε-differential privacy.
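
The addition of Lap(1/ε) to each cell can be sketched as follows. This is a minimal example using only the Python standard library; it relies on the fact that the difference of two independent exponential random variables with mean λ follows the Laplace distribution with mean 0 and scale λ. The function names and the sample values are assumptions introduced for illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    # random.expovariate takes the rate, i.e. 1 / mean.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(V, epsilon, rng=None):
    """Return V* = V + Lap(1/epsilon), with independent noise for every cell."""
    if rng is None:
        rng = random.Random()
    return [v + laplace_noise(1.0 / epsilon, rng) for v in V]

V = [0, 2, 1, 0]                      # hypothetical first aggregated data
V_star = laplace_mechanism(V, epsilon=1.0, rng=random.Random(0))
print(V_star)                         # real-valued cells, possibly negative
```

The output cells are real numbers and may be negative, which is why a refinement step is needed before individual data can be generated.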

When geometric noise is used as the random number, the Laplace noise described above is simply replaced with geometric noise. In this case as well, second aggregated data V* that satisfies differential privacy, with a security strength corresponding to the scale of the geometric noise used, can be obtained.
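
A possible form of the geometric noise is sketched below. The source does not specify how the geometric noise is parameterized; this example assumes the commonly used two-sided geometric distribution with P(k) proportional to α^|k| and α = exp(-ε), sampled as the difference of two independent geometric random numbers. The function names and this parameterization are assumptions for illustration.

```python
import math
import random

def geometric(alpha, rng):
    """Sample k in {0, 1, 2, ...} with P(k) = (1 - alpha) * alpha**k by inversion."""
    u = 1.0 - rng.random()  # uniform on (0, 1], which avoids log(0)
    return int(math.log(u) / math.log(alpha))

def two_sided_geometric(alpha, rng):
    """Difference of two i.i.d. geometric samples: P(k) is proportional to alpha**|k|."""
    return geometric(alpha, rng) - geometric(alpha, rng)

def geometric_mechanism(V, epsilon, rng=None):
    """Return V* with independent integer noise per cell (assumed alpha = exp(-epsilon))."""
    if rng is None:
        rng = random.Random()
    alpha = math.exp(-epsilon)
    return [v + two_sided_geometric(alpha, rng) for v in V]

V = [0, 2, 1, 0]                      # hypothetical first aggregated data
V_star = geometric_mechanism(V, epsilon=1.0, rng=random.Random(0))
print(V_star)                         # integer cells, possibly negative
```

With this kind of noise the cells of V* remain integers, although they can still be negative.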

Thus, by using Laplace noise (Laplace random numbers), which follows the Laplace distribution, or geometric noise (geometric random numbers), which follows the geometric distribution, the second aggregated data V* is guaranteed to satisfy the differential privacy standard.

The refinement unit 14 functions as refinement means for generating, from the second aggregated data V*, third aggregated data V+ in which each element is a non-negative integer and the sum of all elements is equal to the sum of all elements of the second aggregated data V*. That is, the refinement unit 14 generates, from the second aggregated data V*, third aggregated data V+ that satisfies all of the following constraints: no element is negative, every element is an integer, and the sum of all elements is equal to the sum of all elements included in the second aggregated data V*.

Any method may be used to generate the third aggregated data V+ as long as V+ satisfies all of the above constraints. As a simple example, third aggregated data V+ satisfying all of the constraints can be generated by setting every negative element of V* to 0, truncating the fractional part of each element, and then adding or subtracting a constant evenly across the elements so that the total matches. The method by which the refinement unit 14 generates the third aggregated data V+ is not, however, limited to this one.
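
The simple method described above might be sketched as follows. Two points are assumptions of this sketch rather than statements of the source: the target total is taken to be the sum of V* rounded to the nearest non-negative integer (since integer cells cannot sum to a fractional value), and the even adjustment is carried out by adding or subtracting 1 from the cells in turn. The function name `refine_simple` is hypothetical.

```python
import math

def refine_simple(V_star):
    """Clip negatives to 0, drop fractions, then adjust cells in turn to match the total."""
    # Assumption: the target total is the rounded, non-negative sum of V*.
    total = max(0, round(sum(V_star)))
    V_plus = [max(0, math.floor(x)) for x in V_star]
    diff = total - sum(V_plus)
    i = 0
    while diff != 0:
        if diff > 0:            # total too small: add 1 to the current cell
            V_plus[i] += 1
            diff -= 1
        elif V_plus[i] > 0:     # total too large: subtract 1, never going below 0
            V_plus[i] -= 1
            diff += 1
        i = (i + 1) % len(V_plus)
    return V_plus

V_star = [2.7, -1.3, 0.4, 3.9]   # hypothetical second aggregated data
V_plus = refine_simple(V_star)
print(V_plus, sum(V_plus))
```

The loop terminates because, whenever the running total exceeds the non-negative target, at least one cell is positive and can be decremented.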

The individual data generation unit 15 functions as individual data generation means for generating second individual data D+ from the third aggregated data V+ by a predetermined procedure. As described above, if the third aggregated data V+ is a complete contingency table, each element of V+ represents the multiplicity of the corresponding element of the underlying set A of the first individual data D. For example, suppose that an element v_i of V+ corresponds to the element a_i = (male, 30s) of A and that v_i = 3. This means that the second individual data D+ contains three records of (male, 30s).

The output unit 16 functions as output means for outputting the second individual data D+ generated by the individual data generation unit 15. The output unit 16 receives the second individual data D+ from the individual data generation unit 15 and outputs it to the outside of the privacy protection device 10. For example, the output unit 16 outputs the second individual data D+ to a database for publication, thereby creating a database of privacy-protected individual data.

Next, the processing of the privacy protection method executed by the privacy protection device 10 described above is explained with reference to FIG. 3. FIG. 3 is a flowchart showing a series of processes of the privacy protection method executed by the privacy protection device 10 of FIG. 1.

As shown in FIG. 3, the input unit 11 first receives the input of the first individual data D (step S1). The input unit 11 then outputs the first individual data D to the aggregation unit 12, and the aggregation unit 12 aggregates the first individual data D to generate the first aggregated data V (step S2).

Subsequently, the random number application unit 13 applies a random number of a predetermined strength to the first aggregated data V to generate the second aggregated data V* (step S3). The refinement unit 14 then generates, from the second aggregated data V*, the third aggregated data V+ in which each element is a non-negative integer and the sum of all elements is equal to the sum of all elements of the second aggregated data V* (step S4).

Subsequently, the individual data generation unit 15 generates the second individual data D+ from the third aggregated data V+ by a predetermined procedure (step S5). Finally, the output unit 16 outputs the second individual data D+ generated by the individual data generation unit 15 to the outside (step S6). This completes the processing of the privacy protection method executed by the privacy protection device 10.
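
Steps S1 to S6 can be put together in the following end-to-end sketch. It is an illustration under stated assumptions, not the implementation of the embodiment: the function name `protect`, the use of a complete contingency table, Laplace noise sampled as a difference of exponentials, the rounding of the target total to an integer, and the cell-by-cell adjustment in step S4 are all choices made for this example.

```python
import math
import random
from collections import Counter
from itertools import product

def protect(records, A, epsilon, rng):
    """Return second individual data D+ as a list of records (illustrative only)."""
    D = Counter(records)                                  # S1: accept D as a multiset
    V = [D[a] for a in A]                                 # S2: complete contingency table
    # S3: add independent Lap(1/epsilon) noise to every cell.
    V_star = [v + rng.expovariate(epsilon) - rng.expovariate(epsilon) for v in V]
    # S4: refine to non-negative integers with the (rounded) total of V*.
    total = max(0, round(sum(V_star)))
    V_plus = [max(0, math.floor(x)) for x in V_star]
    diff, i = total - sum(V_plus), 0
    while diff != 0:
        if diff > 0:
            V_plus[i] += 1
            diff -= 1
        elif V_plus[i] > 0:
            V_plus[i] -= 1
            diff += 1
        i = (i + 1) % len(V_plus)
    # S5 and S6: emit each element a_i of A exactly v_i times.
    return [a for a, v in zip(A, V_plus) for _ in range(v)]

# Hypothetical attribute space and first individual data.
A = list(product(("male", "female"), ("20s", "30s")))
records = [("male", "30s")] * 5 + [("female", "20s")] * 3
D_plus = protect(records, A, epsilon=1.0, rng=random.Random(0))
print(Counter(D_plus))
```

The output contains only records drawn from the attribute space A, and their counts are the refined cell values rather than the original ones.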

Next, the operation and effects of the privacy protection device 10 are described.

As described above, the privacy protection device 10 is a privacy protection device that receives an input of first individual data D including a plurality of records and outputs second individual data D+, and includes: the input unit 11, which receives the input of the first individual data D; the aggregation unit 12, which generates the first aggregated data V by aggregating the first individual data D received by the input unit 11; the random number application unit 13, which generates the second aggregated data V* by applying a random number of a predetermined strength to the first aggregated data V; the refinement unit 14, which generates, from the second aggregated data V*, the third aggregated data V+ in which each element is a non-negative integer and the sum of all elements is equal to the sum of all elements of the second aggregated data V*; the individual data generation unit 15, which generates the second individual data D+ from the third aggregated data V+; and the output unit 16, which outputs the second individual data D+.

In this privacy protection device 10, random numbers are added to the first aggregated data V to generate the second aggregated data V*. Because the random numbers are applied to the aggregated data, rather than record values of the individual data being altered directly as in, for example, PRAM, aggregated data satisfying the differential privacy criterion can be generated without greatly impairing the usefulness of the data. The second aggregated data V* may, however, contain negative or non-integer elements, and when it does, individual data cannot always be generated from it. Generating the third aggregated data V+, in which each element is a non-negative integer and the element sum equals the sum of all elements of V*, makes it possible to generate second individual data D+ corresponding to V+. Since D+ is based on V+, which is in turn generated from V*, it satisfies the differential privacy criterion just as V* does and is also highly useful. Thus, according to the present invention, individual data that satisfies the differential privacy criterion and is highly useful can be provided.
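
The reason negative or non-integer elements block the generation of individual data can be illustrated with a minimal Python sketch. It assumes that aggregation means counting the records that share each attribute combination and that generation means emitting one record per counted unit; the function names `aggregate` and `generate_records` and the sample attribute values are hypothetical and do not appear in the specification.

```python
from collections import Counter


def aggregate(records):
    """Count how many records share each attribute combination (assumed form of V)."""
    return Counter(records)


def generate_records(counts):
    """Expand a count table into records.

    Only possible when every count is a non-negative integer, which is
    why a table such as V* cannot always be used directly.
    """
    records = []
    for key, count in counts.items():
        if count < 0 or count != int(count):
            raise ValueError(f"cannot emit {count} records for {key}")
        records.extend([key] * int(count))
    return records


# Hypothetical individual data D: (age band, region) per record.
d = [("20s", "Tokyo"), ("20s", "Tokyo"), ("30s", "Osaka")]
v = aggregate(d)
d_plus = generate_records(v)
```

A table containing a count such as -1.3 makes `generate_records` raise an error, whereas any table of non-negative integers expands into a well-defined set of records.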

[Second Embodiment]
Next, the privacy protection device 10 according to a second embodiment of the present invention is described. As explained in the first embodiment, the refinement unit 14 of the privacy protection device 10 generates, from the second aggregated data V*, third aggregated data V+ that satisfies all of the following constraints: no element is negative, every element is an integer, and the sum of all elements equals the sum of all elements of the second aggregated data V*. For the individual data generation unit 15 to obtain highly useful second individual data D+ that better preserves the properties of the first individual data D, it is desirable that the second aggregated data V* and the third aggregated data V+ take values close to each other.

Accordingly, in the privacy protection device 10 according to the second embodiment, the refinement unit 14 functions as refinement means that generates the third aggregated data V+ from the second aggregated data V* by a nearest neighbor search among the data satisfying the constraints. The remaining configuration is the same as in the first embodiment.

That is, the refinement unit 14 of the second embodiment generates the third aggregated data V+ by performing a nearest neighbor search on the elements of the second aggregated data V*. When the second aggregated data V* is regarded as a point in an n-dimensional vector space, this search amounts to finding, among the non-negative integer lattice points on the (n-1)-dimensional hyperplane Z that satisfies the above constraints, the point closest to V*. The hyperplane Z contains the first aggregated data V, and at any point of Z the L1 norm measured from the origin equals the L1 norm of V measured from the origin.

In its simplest form, this search enumerates all lattice points that satisfy the above constraints and selects the one at the shortest distance from the second aggregated data V*. Various definitions of this distance are possible. For example, norms such as the L1 norm and the L2 norm (Euclidean distance), or distances based on statistics between samples such as the KS distance, may be used, but the definition is not limited to these.
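
The exhaustive form of this search can be sketched as follows. This is a minimal illustration and not the implementation of the specification: the names `lattice_points` and `nearest_refinement` are hypothetical, the target sum `total` is assumed to be supplied as an integer, and the distance is assumed to be the p-th power of the Lp norm, which yields the same nearest point as the L1 norm (p=1) or the L2 norm (p=2).

```python
def lattice_points(n, total):
    """Yield every length-n tuple of non-negative integers that sums to total."""
    if n == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in lattice_points(n - 1, total - first):
            yield (first,) + rest


def nearest_refinement(v_star, total, p=2):
    """Return the constrained lattice point closest to v_star.

    Enumerates all candidates, so it is practical only for small n and total.
    Ties are broken by enumeration order (an arbitrary choice in this sketch).
    """
    def distance(candidate):
        return sum(abs(c - v) ** p for c, v in zip(candidate, v_star))

    return min(lattice_points(len(v_star), total), key=distance)


# Hypothetical noisy table V* with a negative and two fractional elements.
v_plus = nearest_refinement([2.6, -0.7, 1.1], total=3)
```

The number of candidates grows combinatorially with the number of elements n and the total, which is why exhaustive enumeration is workable only for small tables.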

When enumerating all data satisfying the above constraints is computationally difficult, it is also desirable to perform the nearest neighbor search using a data structure designed for efficient nearest neighbor search. A kd-tree is one example of such a data structure, but the data structure is not limited to this.

With such a privacy protection device 10, the second individual data D+ can be given a record composition similar to that of the first individual data D, which improves the usefulness of the second individual data D+.

DESCRIPTION OF SYMBOLS 10 ... privacy protection device, 11 ... input unit (input means), 12 ... aggregation unit (aggregation means), 13 ... random number assigning unit (random number assigning means), 14 ... refinement unit (refinement means), 15 ... individual data generation unit (individual data generation means), 16 ... output unit (output means), D ... first individual data, D+ ... second individual data, V ... first aggregated data, V* ... second aggregated data, V+ ... third aggregated data.

Claims

A privacy protection device that receives input of first individual data including a plurality of records and outputs second individual data, the device comprising:
input means for receiving input of the first individual data;
aggregation means for generating first aggregated data by aggregating the first individual data received by the input means;
random number assigning means for generating second aggregated data by adding a random number of a predetermined strength to the first aggregated data;
refinement means for generating, from the second aggregated data, third aggregated data in which each element is a non-negative integer and the sum of all elements is equal to the sum of all elements of the second aggregated data;
individual data generation means for generating the second individual data from the third aggregated data; and
output means for outputting the second individual data.

The privacy protection device according to claim 1, wherein the refinement means generates the third aggregated data by performing a nearest neighbor search on elements of the second aggregated data.