JP2018055612A

JP2018055612A - Device, method, and program for generating anonymized tables

Info

Publication number: JP2018055612A
Application number: JP2016194297A
Authority: JP
Inventors: Ryo Kikuchi (菊池 亮); Dai Ikarashi (五十嵐 大); Satoshi Hasegawa (長谷川 聡)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-09-30
Filing date: 2016-09-30
Publication date: 2018-04-05
Anticipated expiration: 2036-09-30
Also published as: JP6556681B2

Abstract

PROBLEM TO BE SOLVED: To provide an anonymized table generation technique which allows for appropriately determining a sampling fraction in accordance with the tolerable identification risk of individuals. SOLUTION: An anonymized table generation device includes: a sampling fraction generator 110 configured to generate, as a sampling fraction, a real number p (p<1) that satisfies a predetermined formula involving the tolerable individual identification risk ξ; and a sampling unit 120 configured to generate an anonymized table τ' by extracting [pn] records uniformly at random from a non-anonymized table τ (n is the number of records in the non-anonymized table τ). SELECTED DRAWING: Figure 1

Description

The present invention relates to an anonymization technique for protecting personal privacy information, and more particularly to sampling.

Consider the case where data is collected from various entities and analyzed. Such data may contain personal privacy information, so passing the data to another party as-is risks leaking that information. For this reason, anonymization techniques are being studied as methods of processing data so that the risk of leaking privacy information is reduced.

One conventional anonymization technique is sampling, a method that extracts records (Non-Patent Document 1). Sampling protects privacy by limiting the disclosed portion of a table, which is a set of records, to a subset of its records. For example, in the anonymization of various census data, sampling rates (the ratio of the disclosed portion to the entire data) of 0.01 or 0.03 are used in the United Kingdom, Canada, the United States, and elsewhere.

Non-Patent Document 1: "Current Status of Anonymization Technology" (匿名化技術の現状について), [online], [retrieved September 14, 2016], Internet <URL: http://www.kantei.go.jp/jp/singi/it2/pd/wg/dai1/siryou2_3.pdf>

As described above, values such as 0.01, 0.03, and 0.05 have been used empirically as sampling rates in official statistics. However, these values are derived mainly from rules of thumb, and it is currently unclear how much sampling with these values actually reduces the risk that an individual is identified (hereinafter referred to as the personal identification risk).

In addition, population uniqueness has been evaluated in studies that measure the effect of sampling. However, an evaluation of population uniqueness does not necessarily imply a reduction in the personal identification risk, and it is not obvious how much the personal identification risk is actually reduced.

Accordingly, an object of the present invention is to provide an anonymization table generation technique that appropriately determines the sampling rate in accordance with the allowable personal identification risk.

One aspect of the present invention is an anonymization table generation device that generates an anonymization table τ' from a pre-anonymization table τ, where τ is the pre-anonymization table, n is the number of records in the pre-anonymization table τ, n₀ is the number of records that an attacker is considered to know about the pre-anonymization table τ, ξ_allowable is the allowable personal identification risk, and [x] denotes the largest integer not exceeding a real number x. The device includes a sampling rate generation unit that generates, as the sampling rate, a real number p (p<1) satisfying the following expression for the allowable personal identification risk ξ_allowable,

and a sampling unit that generates the anonymization table τ' by extracting [pn] records uniformly at random from the pre-anonymization table τ.

According to the present invention, when an anonymization table is generated by sampling, the sampling rate can be determined appropriately in accordance with the allowable personal identification risk.

FIG. 1 is a block diagram showing the configuration of the anonymization table generation device 100.
FIG. 2 is a flowchart showing the operation of the anonymization table generation device 100.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate description is omitted.

<Definitions and Notation>
For a set A, |A| denotes the number of elements of A. For a real number x, [x] denotes the largest integer not exceeding x.

Data is represented in record form, and a set of records is called a table. Sampling at a sampling rate p (where p is a real number satisfying p<1) means extracting [pn] records uniformly at random from a table composed of n records.
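The sampling operation defined above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `sample_table`, the optional `rng` argument, and the lower bound 0 <= p are assumptions introduced here (the text only requires p<1).

```python
import math
import random


def sample_table(records, p, rng=None):
    """Draw floor(p * n) records uniformly at random, without replacement.

    `records` is any iterable of records; `p` is the sampling rate.
    The range check 0 <= p < 1 is an assumption of this sketch
    (the text itself only requires p < 1).
    """
    if not 0 <= p < 1:
        raise ValueError("sampling rate p must satisfy 0 <= p < 1")
    rng = rng if rng is not None else random.Random()
    records = list(records)
    k = math.floor(p * len(records))  # [pn]; p * n is subject to float rounding
    return rng.sample(records, k)     # uniform sampling without replacement
```

Because `p * len(records)` is computed in floating point, a product that is mathematically an integer may round slightly below it; an implementation that needs an exact [pn] could use rational arithmetic instead.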

The definitions used to discuss anonymization of a table are described below.

Let Τ and τ be the random variable and an instance, respectively, of the table before anonymization (hereinafter referred to as the pre-anonymization table). That is, τ is a pre-anonymization table.

Let Τ' and τ' be the random variable and an instance, respectively, of the table after anonymization (hereinafter referred to as the anonymization table). That is, τ' is an anonymization table.

Let R and R' be the record set of the pre-anonymization table and the record set of the anonymization table, respectively. That is, R is the set of all records of the pre-anonymization table τ, and R' is the set of all records of the anonymization table τ'.

Let V and V' be the set of combinations of attribute values that a record of the pre-anonymization table can take and the set of combinations of attribute values that a record of the anonymization table can take, respectively.

Let Δ and δ be the random variable and an instance, respectively, of the anonymization process (anonymization algorithm).

Let Π and π be the random variable and an instance, respectively, of a sampling whose domain is R and whose range is R'∪{φ} (where |R'|≦|R| and φ is a special symbol indicating that a record was not sampled). That is, π is a sampling R→R'∪{φ}.

Here, the pre-anonymization table τ, the anonymization table τ', and the anonymization process δ can also be regarded as functions τ: R→V, τ': R'→V', and δ: (R→V)→(R→V'), respectively.

δ is said to be an anonymization process if, for a sampling π: R→R', the relation δ(τ) = τ'∘π (composition of τ' with π) holds, where τ is an arbitrary pre-anonymization table and τ' is an arbitrary anonymization table. This relation expresses that the anonymization table is the result of applying data protection processing to the pre-anonymization table and shuffling its records.

In the following, we consider the case where the analogous relation Δ(Τ) = Τ'∘Π holds when Τ, Π, and Δ are mutually independent as random events.

<Significance of Background Knowledge>
This section examines the relationship between the attacker's background knowledge of the pre-anonymization table and the reduction of the personal identification risk by sampling.

First, consider the case where an attacker knows (some of) the attribute values of a certain individual's record and tries to identify that individual in the anonymization table. Even if the sampled anonymization table contains only one record with the attribute values known to the attacker, the pre-anonymization table may contain several such records. Compared with the case without sampling, it is therefore harder for the attacker to tell whether that single record belongs to the individual the attacker is trying to identify. In other words, sampling reduces the personal identification risk.

Next, consider the case where the attacker knows all the values in the pre-anonymization table. If such an attacker wants to identify, for example, an individual whose attribute values occur only once in the pre-anonymization table, then whenever that individual is sampled, the attacker can determine with certainty which record of the anonymization table belongs to that individual. In this case, sampling does not reduce the personal identification risk.

To summarize, sampling can reduce the personal identification risk when, from the attacker's point of view, there is uncertainty about the pre-anonymization table being sampled. In other words, if there is such uncertainty, then even when the anonymization table contains only one record with certain attribute values, the pre-anonymization table may have contained several, which makes identifying an individual harder. The following two cases can give rise to such uncertainty.
Case (1): The attacker's background knowledge of the pre-anonymization table is limited.
Case (2): Some processing (unknown to the attacker) is performed before sampling.

The first embodiment described below deals with case (1). That is, it derives the relationship between the sampling rate and the personal identification risk when the attacker's background knowledge of the pre-anonymization table is limited.

<First Embodiment>
The index of personal identification risk, the assumptions about the attacker (attacker model), and the evaluation of the personal identification risk are described in this order below.

[Index of personal identification risk]
Pk-anonymity (Reference Non-Patent Document 1) is an index for evaluating the personal identification risk when probabilistic anonymization is performed.
(Reference Non-Patent Document 1) D. Ikarashi, R. Kikuchi, K. Chida, and K. Takahashi, "k-anonymous microdata release via post randomisation method", Advances in Information and Computer Security - 10th International Workshop on Security, IWSEC 2015, Volume 9241 of the series Lecture Notes in Computer Science, pp.225-241, 2015.

However, Pk-anonymity assumes an arbitrary attacker (that is, it also covers attackers whose background knowledge is unrestricted). As discussed above, the argument therefore cannot proceed with Pk-anonymity when case (1) is assumed. Here, a new risk index modeled on Pk-anonymity is defined as the index of personal identification risk against an attacker with limited background knowledge.

When the anonymization process Δ and the anonymization table τ' satisfy the following expression for any pre-anonymization table τ, any record r∈R contained in the pre-anonymization table τ, any record r'∈R' contained in the anonymization table τ', and an attacker f_T having certain background knowledge, the personal identification risk against the attacker f_T is said to be ξ.

[Assumptions about the attacker (attacker model)]
Consider the case where the attacker has knowledge of part of the pre-anonymization table. For example, when patient information from every prefecture is collected and anonymized, a person working at a hospital in Kanagawa Prefecture knows the patient information for Kanagawa Prefecture but not the information for the other prefectures.

The following attacker is therefore assumed.
Assumption (1): Of the pre-anonymization table τ composed of n records, the attacker knows the values of a partial pre-anonymization table τ₀ composed of n₀ of those records.
Assumption (2): The attacker estimates that the distribution of the values of the records the attacker does not know is the same as the distribution of the records the attacker currently knows.

Because the attacker has no knowledge other than the partial pre-anonymization table τ₀ composed of n₀ records, assumption (2) is considered realistic, even compared with estimating the values of the unknown records to be uniformly distributed.

[Evaluation of personal identification risk]
Sampling is considered as the anonymization method. That is, where n is the number of records in the pre-anonymization table, records are extracted uniformly at random from the pre-anonymization table so that the number of records after sampling is [pn] (p is the sampling rate).

The following evaluates how much sampling reduces the personal identification risk against an attacker with the background knowledge defined in the attacker model (an attacker satisfying assumptions (1) and (2)). For simplicity, the description assumes pn=[pn]; this does not lose generality.

Let n^ = n - n₀ - (pn - 1). Also, let #_τ(v) denote the number of records in the pre-anonymization table τ whose attribute value is v. That is, #_τ(v) = |τ⁻¹({v})|.
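The two quantities just defined can be computed directly. The following sketch assumes, for illustration only, that a table is represented as a Python mapping from record identifiers to attribute-value tuples (mirroring the view of τ as a function R→V); the function names are hypothetical.

```python
from collections import Counter


def unknown_record_count(n, n0, pn):
    """n^ = n - n0 - (pn - 1), with pn the number of sampled records."""
    return n - n0 - (pn - 1)


def value_counts(table):
    """Return #_tau(v) for every attribute value v present in the table.

    `table` maps each record identifier r to its attribute value tau(r),
    so the count for v is the size of the preimage of {v} under tau.
    Values that do not occur have count 0.
    """
    return Counter(table.values())
```

With n = 100, n₀ = 20, and pn = 10, this gives n^ = 100 - 20 - 9 = 71.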

The personal identification risk ξ that satisfies Equation (1) for every attacker satisfying the above attacker model, that is, the maximum value of the personal identification risk ξ, is called the maximum personal identification risk ξ_max.

When pn records are sampled from a pre-anonymization table τ with n records, the maximum personal identification risk ξ_max for any partial pre-anonymization table τ₀ (with n₀ records) and any anonymization table τ' is expressed by the following equation.

Here, i and j are integers greater than or equal to 1, and the following is assumed:

The derivation of Equation (2) for the maximum personal identification risk ξ_max is described below.

To obtain the maximum personal identification risk ξ_max, which is the maximum probability that an individual is identified, first consider the partial pre-anonymization table τ₀, which is the background knowledge about the pre-anonymization table that is optimal for an attacker satisfying the above attacker model, together with the anonymization table τ'. As the partial pre-anonymization table τ₀ and anonymization table τ' that are most advantageous to the attacker when identifying a certain record r, consider the case where

holds, that is, the case where a record r that is unique in the partial pre-anonymization table τ₀ is also a unique record in the anonymization table τ', and where

(where R₀ is the set of all records of the partial pre-anonymization table τ₀) holds, that is, the case where no record of the anonymization table τ' other than the unique record is contained in the partial pre-anonymization table τ₀. Equation (4) is a condition that derives from the fact that the probability of record r appearing in the anonymization table is small if r is unique, and Equation (5) is a condition that derives from minimizing the number of records unknown to the attacker. Noting that the number of records unknown to the attacker is n^, for an attacker satisfying Equations (4) and (5),

Here, Τ_i in Equation (a) denotes the random variable of the i-th partial pre-anonymization table.

From the above, it follows that Equation (2) holds.

The following Equation (2)', which generalizes Equation (2), also holds. When [pn] records are sampled from a pre-anonymization table τ with n records, the maximum personal identification risk ξ_max for any partial pre-anonymization table τ₀ (with n₀ records) and any anonymization table τ' is expressed by the following equation.

[Anonymization table generation device]
The anonymization table generation device 100 of the first embodiment is described below with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the anonymization table generation device 100. FIG. 2 is a flowchart showing the operation of the anonymization table generation device 100.

As shown in FIG. 1, the anonymization table generation device 100 includes a sampling rate generation unit 110, a sampling unit 120, and a recording unit 190. The recording unit 190 is a component that records information necessary for the processing of the anonymization table generation device 100.

The anonymization table generation device 100 takes as input a pre-anonymization table τ composed of n records, the number of records n₀ that an attacker is considered to know about the pre-anonymization table τ, and the personal identification risk that can be tolerated (hereinafter referred to as the allowable personal identification risk) ξ_allowable, and generates an anonymization table τ'. The number of records n₀ known to the attacker and the allowable personal identification risk ξ_allowable may instead be given in advance and recorded in the recording unit 190.

The operation of the anonymization table generation device 100 is described below with reference to FIG. 2.

The sampling rate generation unit 110 generates, as the sampling rate, a real number p (where p<1) that satisfies the following expression, from the number of records n in the pre-anonymization table τ, the number of records n₀ that an attacker is considered to know about the pre-anonymization table τ, and the allowable personal identification risk ξ_allowable (S110).

Because the right-hand side of the above expression is monotonically increasing in p, the sampling rate p can be computed efficiently using the bisection method.
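The bisection search mentioned here can be sketched as follows. The function `risk` is a hypothetical, caller-supplied stand-in for the right-hand side of the expression, and the sketch assumes the condition has the form risk(p) <= ξ_allowable with `risk` nondecreasing on [0, 1); the names `solve_sampling_rate`, `tol`, and `max_iter` are likewise assumptions for illustration.

```python
def solve_sampling_rate(risk, xi_allowable, tol=1e-9, max_iter=200):
    """Bisection for the largest p in [0, 1) with risk(p) <= xi_allowable.

    `risk` must be nondecreasing in p (assumed, matching the monotonicity
    stated in the text). The result is accurate to within `tol`.
    """
    lo, hi = 0.0, 1.0
    if risk(lo) > xi_allowable:
        raise ValueError("no sampling rate satisfies the allowable risk")
    for _ in range(max_iter):
        if hi - lo <= tol:
            break
        mid = (lo + hi) / 2
        if risk(mid) <= xi_allowable:
            lo = mid  # mid is acceptable; search for a larger rate
        else:
            hi = mid  # mid is too risky; search for a smaller rate
    return lo         # always an acceptable rate, and always below 1
```

Returning the lower end of the bracket means the reported rate never exceeds the allowable risk, at the cost of being up to `tol` smaller than the exact threshold.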

The sampling unit 120 calculates [pn] using the sampling rate p generated in S110, and generates the anonymization table τ' by extracting [pn] records uniformly at random from the pre-anonymization table τ (S120).

According to the invention of this embodiment, when there is an attacker who knows the values of some of the records of the pre-anonymization table τ, an appropriate sampling rate can be determined in accordance with the allowable personal identification risk when generating an anonymization table by sampling. As a result, anonymization by sampling can be performed with the allowable personal identification risk taken into account.

<Modifications>
The present invention is not limited to the embodiment described above, and it goes without saying that it can be modified as appropriate without departing from the spirit of the invention. The various processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

<Supplementary Note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), RAM and ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity with such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for the processing of that program (the program is not limited to the external storage device and may, for example, be stored in a ROM, which is a read-only storage device). Data obtained through the processing of the program is stored as appropriate in the RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for the processing of each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer temporarily in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and the retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

100 Anonymization table generation device
110 Sampling rate generation unit
120 Sampling unit
190 Recording unit

Claims

Where τ is a pre-anonymization table, n is the number of records in the pre-anonymization table τ, n₀ is the number of records that an attacker is considered to know about the pre-anonymization table τ, ξ_allowable is an allowable personal identification risk, and [x] denotes the largest integer not exceeding a real number x,
an anonymization table generation device that generates an anonymization table τ' from the pre-anonymization table τ, comprising:
a sampling rate generation unit that generates, as a sampling rate, a real number p (p<1) satisfying the following expression for the allowable personal identification risk ξ_allowable; and

a sampling unit that generates the anonymization table τ' by extracting [pn] records uniformly at random from the pre-anonymization table τ.

The anonymization table generation device according to claim 1,
wherein the sampling rate generation unit calculates the sampling rate p using a bisection method.

Where τ is a pre-anonymization table, n is the number of records in the pre-anonymization table τ, n₀ is the number of records that an attacker is considered to know about the pre-anonymization table τ, ξ_allowable is an allowable personal identification risk, and [x] denotes the largest integer not exceeding a real number x,
an anonymization table generation method in which an anonymization table generation device generates an anonymization table τ' from the pre-anonymization table τ, comprising:
a sampling rate generation step of generating, as a sampling rate, a real number p (p<1) satisfying the following expression for the allowable personal identification risk ξ_allowable; and

a sampling step of generating the anonymization table τ' by extracting [pn] records uniformly at random from the pre-anonymization table τ.

A program for causing a computer to function as the anonymization table generation device according to claim 1 or 2.