JP5962472B2

JP5962472B2 - Anonymized data generation method, apparatus and program

Info

Publication number: JP5962472B2
Application number: JP2012264421A
Authority: JP
Inventors: 裕司山岡
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-12-03
Filing date: 2012-12-03
Publication date: 2016-08-03
Anticipated expiration: 2032-12-03
Also published as: JP2014109934A

Description

The present technology relates to techniques for anonymizing information.

There are cases where one wishes to disclose or provide a group of records containing numerical data to another party while keeping the confidential information of each record (for example, an identifier, also called an ID) secret. Even if the IDs are deleted before disclosure or provision, the other party may still be able to infer the ID of a record whose numerical values are highly distinctive.

For example, consider a case in which a collector A of vehicle position data provides the position data to a traffic condition investigation organization B or the like, with the IDs linked to vehicle drivers, owners, and so on concealed.

Suppose that the vehicle position data collected by the collector A is as shown in FIG. 1. There are five records in total, containing position data for three vehicles with IDs c1, c2, and c3. X represents latitude and Y represents longitude. The ID may be a vehicle ID or the ID of a user (a business operator using the vehicle). Suppose further that plotting the records of FIG. 1 on a map yields FIG. 2.

The traffic condition investigation organization B can use data such as that shown in FIGS. 1 and 2 for road improvement and maintenance planning. For example, it may find that many vehicles exceed the speed limit on the east-west arterial road in FIG. 2, and may then plan measures to reduce speeding on that road or to inspect it more frequently.

Even when providing data such as that shown in FIG. 1 to the traffic condition investigation organization B, the collector A keeps the IDs secret from organization B. For example, the collector A may have concluded a contract with the position data providers stating that the data will not be provided to others unless it is anonymized. A position data provider may want anonymization because, for example, the vehicle used may have been driven illegally or dangerously, such as by entering a no-entry zone or speeding, and the provider does not want anyone other than the collector A to know this.

Meanwhile, the traffic condition investigation organization B may not need the IDs at all, because the traffic condition surveys described above can be carried out without knowing the drivers or operators of the vehicles. The collector A therefore applies to data such as that of FIG. 1 a conversion that makes the IDs difficult to infer, so-called anonymization, and discloses or provides the converted data.

A simple anonymization method available to the collector A is to delete the IDs. When the traffic condition investigation organization B looks at the data of FIG. 1 with the IDs deleted, it cannot tell directly which record belongs to whom. However, there is the problem that, for some records, the ID can be inferred from the position values.

When the data of FIG. 1 with the IDs deleted is plotted on a map as in FIG. 2, the third record, (X, Y) = (8, 7), is seen to lie inside the "c1 parking lot". That is, even someone who can see only the data with the IDs deleted can tell that the ID of the third record is very likely c1, so the data can hardly be called sufficiently anonymized.

Similarly, the second record, (X, Y) = (14, 12), is on a public road, but at a point where vehicles other than those heading to the "c1 parking lot" are unlikely to pass, so it can be inferred to be a record of c1. The speed in this record is "40"; if the legal speed limit at this point is "30", it becomes known that c1 committed a violation.

Thus, some records cannot be sufficiently anonymized simply by deleting the ID.

There are also techniques that anonymize data not merely by deleting the IDs but by perturbing the data probabilistically. However, because the position data of some records changes greatly, these techniques may hinder the surveys carried out by the traffic condition investigation organization B. Lowering the probability of change reduces the adverse effect on the surveys, but it also raises the probability that, for example, the second or third record of the data shown in FIG. 1 is provided unchanged, so anonymization becomes insufficient.

Furthermore, there are techniques that anonymize data by grouping multiple records using k-anonymization, but simply applying them to the data shown in FIG. 1 does not necessarily achieve sufficient anonymization.

JP 2011-100116 A
JP 2011-128862 A
JP 2012-022315 A

L. Sweeney. Achieving k-Anonymity Privacy Protection using Generalization and Suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, Vol. 10, No. 5, pp. 571-588, 2002.
K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Incognito: Efficient Full-Domain k-Anonymity. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 49-60, 2005.
K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Mondrian Multidimensional k-Anonymity. In Proceedings of the 22nd International Conference on Data Engineering, pp. 25-, 2006.
A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam. l-Diversity: Privacy Beyond k-Anonymity. ACM Transactions on Knowledge Discovery from Data, Vol. 1, Issue 1, Article No. 3, 2007.

Accordingly, in one aspect, an object of the present technology is to provide a technique for anonymizing data that includes numerical attribute values.

An anonymized data generation method according to the present technology includes: (A) for each of a plurality of data blocks stored in a data storage unit that stores a plurality of data blocks each including a confidential attribute value and a numerical attribute value, generating a set of data blocks whose distance from that data block, defined according to the numerical attribute and computed from the numerical attribute values, is within a threshold; (B) for each of the plurality of data blocks, determining whether the frequency distribution of the confidential attribute values over that data block and the data blocks included in the set generated for it satisfies a predetermined condition; (C) for each data block whose frequency distribution satisfies the predetermined condition, generating a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for that data block; and (D) replacing the numerical attribute value of each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for that data block.

Data including numerical attribute values can be anonymized.

FIG. 1 is a diagram illustrating an example of data to be anonymized.
FIG. 2 is a diagram illustrating an example in which data is plotted on a map.
FIG. 3 is a functional block diagram of the information processing apparatus according to the embodiment.
FIG. 4 is a diagram illustrating a main processing flow according to the embodiment.
FIG. 5 is a diagram illustrating a processing flow of the grouping process.
FIG. 6 is a diagram for explaining the outline of the present embodiment.
FIG. 7 is a diagram for explaining the outline of the present embodiment.
FIG. 8 is a diagram illustrating an example of data stored in the group data storage unit.
FIG. 9 is a diagram illustrating a processing flow of the matching determination process for the frequency distribution pattern.
FIG. 10 is a diagram illustrating an example of auxiliary data stored in the second data storage unit.
FIG. 11 is a diagram illustrating a processing flow of the matching determination process for the frequency distribution pattern.
FIG. 12 is a diagram illustrating an example of data stored in the second data storage unit.
FIG. 13 is a diagram illustrating an example of data stored in the second data storage unit.
FIG. 14 is a diagram illustrating a processing flow of the post-conversion data generation process.
FIG. 15 is a diagram illustrating an example of data stored in the second data storage unit.
FIG. 16 is a diagram for explaining the post-conversion data generation process.
FIG. 17 is a diagram for explaining the post-conversion data generation process.
FIG. 18 is a diagram for explaining the post-conversion data generation process.
FIG. 19 is a diagram illustrating a processing flow of the data conversion process.
FIG. 20 is a diagram illustrating an example of data stored in the anonymized data storage unit.
FIG. 21 is a diagram illustrating an output example.
FIG. 22 is a functional block diagram of a computer.

FIG. 3 illustrates a configuration example of the information processing apparatus 100 according to the embodiment of the present technology. The information processing apparatus 100 includes a first data storage unit 11, a grouping unit 12, a setting data storage unit 13, a group data storage unit 14, a determination unit 15, a generation unit 16, a second data storage unit 17, a conversion processing unit 18, an anonymized data storage unit 19, and an output unit 20.

The first data storage unit 11 stores data to be anonymized, for example as shown in FIG. 1. The setting data storage unit 13 stores data designating which column is confidential information (here, the ID), threshold data for distance and time, the frequency distribution pattern that serves as the condition for anonymization, and data on the designated conversion method.

The grouping unit 12 groups the records stored in the first data storage unit 11 according to the data stored in the setting data storage unit 13. Grouping is performed to speed up processing by limiting the records for which distances are calculated. The group data storage unit 14 stores the processing result of the grouping unit 12.

The determination unit 15 uses the group data storage unit 14 to identify, among the records stored in the first data storage unit 11, records that satisfy the condition stored in the setting data storage unit 13, and causes the generation unit 16 to generate converted position data and time data for the position data and time data (that is, the numerical attribute values) of each such record. In response to an instruction from the determination unit 15, the generation unit 16 generates the converted position data and time data according to the conversion method data stored in the setting data storage unit 13. The second data storage unit 17 stores the processing results of the determination unit 15 and the generation unit 16.

The conversion processing unit 18 processes the records stored in the first data storage unit 11 using the data stored in the second data storage unit 17 and the data stored in the setting data storage unit 13, and stores the processing result in the anonymized data storage unit 19. The output unit 20 outputs the data stored in the anonymized data storage unit 19, for example in response to a request.

Next, the processing performed by the information processing apparatus 100 illustrated in FIG. 3 will be described with reference to FIGS. 4 to 21. First, the grouping unit 12 performs the grouping process on the records stored in the first data storage unit 11 according to the data stored in the setting data storage unit 13, and stores the processing result in the group data storage unit 14 (FIG. 4: step S1). The grouping process will be described with reference to FIGS. 5 to 8.

The grouping unit 12 generates a group ID generation function based on the data stored in the setting data storage unit 13 (FIG. 5: step S11).

In the present embodiment, as shown in FIG. 6, it is determined whether the records included in a predetermined range A, which is specified from the position data and time data of a given record and the distance thresholds, match a predetermined frequency distribution pattern. However, checking for each record whether every other record lies within the predetermined range A would make the processing time long. Therefore, the space-time is divided into regions whose widths are the distance thresholds (th1, th2) defined for the position data and the distance (that is, time difference) threshold (th3) defined for the time data (for example, FIG. 7), and the region containing each record is identified. Such a region is called a group. In this case, a record in any region other than the immediately adjacent regions (including diagonally adjacent regions) can never fall within the predetermined range A of the record being processed. Once the region to which each record belongs has been identified in this way, the number of records that actually have to be tested for membership in the predetermined range A can be limited.

Therefore, in the present embodiment, the following group ID generation function is generated. Here it is assumed that the distance threshold defined for the time data is 0:01 and the distance threshold defined for the position data is 12.
f(R) = (floor(time / 0:01), floor(X / 12), floor(Y / 12))
Here, floor(Q / S) rounds the quotient Q / S down to an integer, that is, it gives the index of the largest multiple of S that does not exceed Q. f(R) is a function that calculates the group ID of a record R containing a time, a latitude X, and a longitude Y.
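The group ID generation function can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `group_id`, the representation of time as whole minutes, and the constant names are assumptions made for the example; only the thresholds (0:01 and 12) and the worked values come from the description.

```python
import math

# Thresholds from the description: 0:01 (one minute) for time, 12 for X and Y.
TIME_THRESHOLD_MIN = 1   # assumption: times are expressed in minutes
SPACE_THRESHOLD = 12

def group_id(time_min, x, y):
    """Return the (time, X, Y) cell index of a record (hypothetical helper)."""
    return (
        math.floor(time_min / TIME_THRESHOLD_MIN),
        math.floor(x / SPACE_THRESHOLD),
        math.floor(y / SPACE_THRESHOLD),
    )

# Fourth record of the worked example: time 0:03, X = 23, Y = 7.
print(group_id(3, 23, 7))  # (3, 1, 0)
```

Because each cell is exactly one threshold wide along every axis, two records closer than the thresholds can differ by at most one in each component of the group ID.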

Then, the grouping unit 12 generates an empty group table (step S13). After that, the grouping unit 12 identifies one unprocessed record in the first data storage unit 11 (step S15). The grouping unit 12 then generates the group ID of the identified record using the group ID generation function (step S17).

For example, f(fourth record) = (floor(0:03 / 0:01), floor(23 / 12), floor(7 / 12)) = (3, 1, 0) is obtained.

Then, the grouping unit 12 registers the record number in the group table in association with the group ID (step S19). If the group ID has not yet been registered, the group ID is first registered in the group table and then the record number is registered as well.

When this processing is performed on data such as that shown in FIG. 1, a group table such as that shown in FIG. 8 is generated and stored in the group data storage unit 14. In the example of FIG. 8, the record numbers belonging to each group are registered for each group ID. In this particular example, however, every record belongs to a different group.

After that, the grouping unit 12 determines whether an unprocessed record remains in the first data storage unit 11 (step S21). If an unprocessed record remains, the processing returns to step S15. If no unprocessed record remains, the processing returns to the calling process.

In this way, the records are grouped in a manner that takes into account the thresholds defined for the position data and the time data, which makes the subsequent processing more efficient.
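The loop of steps S13 to S21 amounts to building a mapping from group ID to record numbers. The sketch below is one possible rendering, with hypothetical names (`build_group_table`, `records`). Records 2, 3, and 4 use the time and position values quoted in the description; the position of record 5 is an assumption chosen to be consistent with the average quoted in the description, and record 1 is omitted because its values are not given in the text.

```python
from collections import defaultdict

def group_id(time_min, x, y, time_th=1, space_th=12):
    # Integer floor division gives the cell index along each axis.
    return (time_min // time_th, x // space_th, y // space_th)

def build_group_table(records):
    """Map each group ID to the record numbers falling in that cell."""
    table = defaultdict(list)              # the empty group table (step S13)
    for number, (time_min, x, y) in records.items():   # steps S15 and S21
        gid = group_id(time_min, x, y)     # step S17
        table[gid].append(number)          # step S19
    return dict(table)

# record number -> (time in minutes, X, Y); record 5 is an assumed position.
records = {2: (2, 14, 12), 3: (4, 8, 7), 4: (3, 23, 7), 5: (4, 19, 3)}
group_table = build_group_table(records)
print(group_table)
```

A `defaultdict` covers the "register the group ID first if it is missing" branch of step S19 without an explicit check.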

Returning to the description of the processing in FIG. 4, the determination unit 15 and related units execute the matching determination process for the frequency distribution pattern (step S3). This process will be described with reference to FIGS. 9 to 20.

The determination unit 15 identifies one unprocessed record among the records stored in the first data storage unit 11 (FIG. 9: step S31). The determination unit 15 then acquires the group ID of the identified record R from the group data storage unit 14 (step S33). For example, when the fourth record of the data in FIG. 1 is processed, (3, 1, 0) is obtained, as can be seen from FIG. 8.

After that, the determination unit 15 generates the union of the record number sets for all neighboring group IDs of the acquired group ID (step S35). A neighboring group ID is a group ID obtained by adding -1, 0, or +1 to each element of the group ID. For (3, 1, 0), the following 27 group IDs are identified:
(2, 0, -1), (2, 0, 0), (2, 0, 1), (3, 0, -1), (3, 0, 0), (3, 0, 1), (4, 0, -1), (4, 0, 0), (4, 0, 1), (2, 1, -1), (2, 1, 0), (2, 1, 1), (3, 1, -1), (3, 1, 0), (3, 1, 1), (4, 1, -1), (4, 1, 0), (4, 1, 1), (2, 2, -1), (2, 2, 0), (2, 2, 1), (3, 2, -1), (3, 2, 0), (3, 2, 1), (4, 2, -1), (4, 2, 0), (4, 2, 1)

Among these, record number "2" is obtained for (2, 1, 1), record number "4" for (3, 1, 0), record number "3" for (4, 0, 0), and record number "5" for (4, 1, 0).
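The enumeration of neighboring group IDs and the union of step S35 can be illustrated with a short Python sketch. The function names are hypothetical, and the group table below contains only the four entries stated in the description.

```python
from itertools import product

def neighbor_group_ids(gid):
    """All group IDs obtained by adding -1, 0, or +1 to each element of gid."""
    return [
        tuple(g + d for g, d in zip(gid, delta))
        for delta in product((-1, 0, 1), repeat=len(gid))
    ]

def candidate_records(gid, group_table):
    """Union of the record number sets over all neighboring group IDs."""
    numbers = set()
    for nid in neighbor_group_ids(gid):
        numbers.update(group_table.get(nid, ()))
    return numbers

# Group table entries quoted in the description.
group_table = {(2, 1, 1): [2], (3, 1, 0): [4], (4, 0, 0): [3], (4, 1, 0): [5]}
neighbors = neighbor_group_ids((3, 1, 0))
candidates = candidate_records((3, 1, 0), group_table)
print(len(neighbors), sorted(candidates))  # 27 [2, 3, 4, 5]
```

The offset (0, 0, 0) is included, so the group of the record itself is part of the 27 neighbors, as in the list above.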

Then, the determination unit 15 identifies one unprocessed record from the generated union (step S37). For example, suppose the record with record number "2" is identified.

After that, the determination unit 15 calculates the distance between the record R (the record with record number "4") and the identified record (step S39).

In the present embodiment, the Euclidean distance is used as the distance in space. The spatial distance is therefore sqrt((14 - 23)² + (12 - 7)²) = sqrt(106), approximately 10.3. The Manhattan distance may be used instead. The distance in time (here, the time difference) is |0:02 - 0:03| = 0:01. A combined space-time distance may also be calculated according to a predetermined distance formula.

Data such as that shown in FIG. 10 is stored, for example, in the second data storage unit 17. In the example of FIG. 10, the spatial distance and the time distance are registered for each combination of records, so that they can be reused in later processing. If only the spatial distance is reused later, only the spatial distance may be stored.

The processing proceeds to the processing of FIG. 11 via terminal A, and the determination unit 15 determines whether the distance calculated in step S39 is within the threshold (step S41). For example, when the threshold for the spatial distance is 12 and the threshold for the time distance is 0:01, the record with record number "2" is determined, from the calculation above, to be within the thresholds.

In contrast, the time distance between the third and fourth records of the data in FIG. 1 is |0:04 - 0:03| = 0:01, but the spatial distance is sqrt((8 - 23)² + (7 - 7)²) = 15, which exceeds the threshold of 12, so this pair is not within the thresholds.
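The two distance checks worked through above can be reproduced with a small Python sketch. The function names are hypothetical, times are expressed in minutes, and requiring both the spatial and the time threshold to hold at once is an assumption of this sketch; the description also allows a single combined space-time distance formula.

```python
import math

def spatial_distance(p, q):
    """Euclidean distance between two (X, Y) positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def within_thresholds(a, b, space_th=12, time_th=1):
    """a and b are (time in minutes, X, Y); both thresholds must hold (assumption)."""
    time_ok = abs(a[0] - b[0]) <= time_th
    space_ok = spatial_distance(a[1:], b[1:]) <= space_th
    return time_ok and space_ok

# Values quoted in the description for records 2, 3, and 4.
r2, r3, r4 = (2, 14, 12), (4, 8, 7), (3, 23, 7)

print(within_thresholds(r4, r2))  # True: sqrt(106) is about 10.3, time difference 0:01
print(within_thresholds(r4, r3))  # False: the spatial distance is 15
```

Swapping `math.hypot` for a sum of absolute differences would give the Manhattan-distance variant mentioned in the description.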

When the distance is determined to be within the threshold, the determination unit 15 registers the record number of the identified record in association with the record R (step S43). For example, data such as that shown in FIG. 12 is stored in the second data storage unit 17. In the example of FIG. 12, a copy of the data shown in FIG. 1 is stored, and record number "2" is registered as an element of the record number set in association with the fourth record. The processing then proceeds to step S45. If it is determined in step S41 that the distance is not within the threshold, the processing also proceeds to step S45.

Then, the determination unit 15 determines whether an unprocessed record remains in the union (step S45). If an unprocessed record remains in the union, the processing returns to step S37 of FIG. 9 via terminal B. If no unprocessed record remains in the union, the processing proceeds to step S47.

When the processing has been executed this far, the distance determination for the fourth record is complete: the second, fourth, and fifth records have been determined to be within the threshold, and the record number set is {2, 4, 5}. Processing the other records in the same way yields, as shown in FIG. 13, the set of record numbers determined to be within the threshold for each record.
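Steps S33 to S45 taken together can be sketched as a single function that returns the record number set of a given record. This is an illustrative reconstruction under assumptions: all names are hypothetical, the position of record 5 is an assumed value consistent with the average quoted in the description, record 1 is omitted because its values are not given in the text, and the two thresholds are applied jointly.

```python
import math
from itertools import product

SPACE_TH, TIME_TH = 12, 1   # thresholds from the description (time in minutes)

def group_id(rec):
    t, x, y = rec
    return (t // TIME_TH, x // SPACE_TH, y // SPACE_TH)

def record_number_set(target, records):
    """Record numbers within both thresholds of records[target]."""
    table = {}
    for n, rec in records.items():                      # group table
        table.setdefault(group_id(rec), []).append(n)
    t0, x0, y0 = records[target]
    gid = group_id(records[target])                     # step S33
    result = set()
    for delta in product((-1, 0, 1), repeat=3):         # step S35
        nid = tuple(g + d for g, d in zip(gid, delta))
        for n in table.get(nid, ()):                    # steps S37 and S45
            t, x, y = records[n]
            near_in_time = abs(t - t0) <= TIME_TH
            near_in_space = math.hypot(x - x0, y - y0) <= SPACE_TH  # S39, S41
            if near_in_time and near_in_space:
                result.add(n)                           # step S43
    return result

# Records 2 to 4 as quoted in the description; record 5 is an assumed position.
records = {2: (2, 14, 12), 3: (4, 8, 7), 4: (3, 23, 7), 5: (4, 19, 3)}
print(sorted(record_number_set(4, records)))  # [2, 4, 5]
```

Only records in the 27 neighboring cells are ever compared, which is the saving that the grouping step is meant to provide.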

After that, the determination unit 15 determines whether the frequency distribution of the IDs over the record number set associated with the record R matches the frequency distribution pattern stored in the setting data storage unit 13 (step S47). In the present embodiment, it is determined whether the distribution matches whichever of the following patterns has been designated: l-diversity, that is, a frequency distribution pattern in which at least l distinct IDs are included; or a frequency distribution pattern in which, for the top l IDs in descending order of frequency, each ID has a frequency at least b times (b being a real number not greater than 1) the frequency of the ID ranked immediately above it.

For example, suppose l = 3 and b = 0.5. For the fourth record, three record numbers are registered in the record number set; the ID of the second record is c1, the ID of the fourth record is c2, and the ID of the fifth record is c3, so the frequency distribution is {c1:1, c2:1, c3:1}. For the three IDs, each frequency is therefore at least 0.5 times the frequency ranked immediately above it, and the distribution matches the designated frequency distribution pattern.

In contrast, a record whose record number set does not contain at least three record numbers cannot match the designated frequency distribution pattern. In the case of FIG. 13, for example, the first and third records do not match the designated frequency distribution pattern. For the second record, the frequency distribution is {c1:2, c3:1}, which does not satisfy the condition l = 3, so it does not match the designated frequency distribution pattern either.
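The two frequency distribution patterns can be expressed as small predicates. The Python sketch below is one reading of the description, not a definitive implementation: the function names are hypothetical, and the second pattern is interpreted as requiring at least l distinct IDs, which is what the rejection of {c1:2, c3:1} for l = 3 suggests.

```python
from collections import Counter

def is_l_diverse(ids, l):
    """Pattern 1: at least l distinct IDs appear in the set."""
    return len(set(ids)) >= l

def matches_ratio_pattern(ids, l, b):
    """Pattern 2: among the top l IDs by frequency, each frequency is at
    least b times the frequency ranked immediately above it."""
    counts = sorted(Counter(ids).values(), reverse=True)
    if len(counts) < l:           # fewer than l distinct IDs: no match
        return False
    top = counts[:l]
    return all(lower >= b * upper for upper, lower in zip(top, top[1:]))

# Fourth record: {c1:1, c2:1, c3:1}; second record: {c1:2, c3:1}.
print(matches_ratio_pattern(["c1", "c2", "c3"], l=3, b=0.5))  # True
print(matches_ratio_pattern(["c1", "c1", "c3"], l=3, b=0.5))  # False
```

With b close to 1 the second pattern demands nearly uniform frequencies among the top l IDs, while a small b lets it approach plain l-diversity.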

After that, when it is determined that the frequency distribution matches the designated frequency distribution pattern (step S49: Yes route), the determination unit 15 outputs the record number of the record R being processed and the data of the records included in its record number set to the generation unit 16, and causes the generation unit 16 to execute the post-conversion data generation process (step S51). The post-conversion data generation process will be described with reference to FIGS. 14 to 18. The generation unit 16 performs the following processing according to the conversion method data stored in the setting data storage unit 13; that is, if a different conversion method is designated, different processing is performed.

The generation unit 16 deletes from the record number set the record numbers of records having the same ID as the record R (step S61). For the record number set {2, 4, 5}, the record numbers of records having the same ID as the ID "c2" of record number "4" are deleted, leaving the record number set {2, 5}.

The generation unit 16 also identifies, among the records included in the record number set processed in step S61, the record NP that is spatially closest to the record R (step S63). If data such as that shown in FIG. 10 has been retained, the record NP with the smallest spatial distance can be identified from it. If several records have the same spatial distance, NP may be selected based on the time distance.

Furthermore, the generation unit 16 calculates the average value AP of the positions and times of the records included in the record number set (step S65). In the example above, (time, X, Y) = (0:03, 16.5, 7.5) is obtained.

The generation unit 16 then generates the converted data CR by randomly selecting a point on the line segment connecting NP and AP, and stores it in the second data storage unit 17 in association with the record R (step S67). At this time, the spatial distance threshold and the like are also registered. For example, data such as that shown in FIG. 15 is stored in the second data storage unit 17. In the example of FIG. 15, the ID, time, latitude and longitude X and Y, speed, converted data CR, and spatial distance threshold are registered. As described above, only the fourth and fifth records match the frequency distribution pattern, and the example shows a case where CR = AP for both in step S67.

Taking the average of records with different IDs has the effect of converting strongly characteristic values into less characteristic ones. However, an average is highly reversible: if no record has been deleted, the original data may be restored. Introducing a random component, that is, randomly selecting one point on the line segment connecting NP and AP, mitigates this problem.

NP (the spatially nearest point) is used as one of the end points so that the existence of a strongly characteristic point is, as far as possible, not inferred. Specifically, a strongly characteristic point tends to be spatially isolated and its NP is unlikely to be a strongly characteristic point, so the converted value is unlikely to land near a strongly characteristic point.
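Steps S61 to S67 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Record` type, the `generate_cr` function, the encoding of time as minutes, and the use of Euclidean distance are all assumptions made for the example, and the coordinates in the usage note are hypothetical values, not the data of FIG. 1.

```python
import random
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Record:
    num: int    # record number
    rid: str    # confidential ID such as "c1"
    t: float    # time, assumed here to be minutes since 0:00
    x: float    # latitude
    y: float    # longitude


def generate_cr(target, neighbours, rng=random):
    """Return a converted (t, x, y) for `target`, or None if no other-ID record remains.

    `neighbours` is the record set already found to be within the distance
    threshold of `target`; it may include `target` itself.
    """
    # Step S61: drop every record that shares the target's ID.
    others = [r for r in neighbours if r.rid != target.rid]
    if not others:
        return None

    # Step S63: NP is the spatially nearest remaining record;
    # ties are broken by time distance.
    np_rec = min(
        others,
        key=lambda r: (hypot(r.x - target.x, r.y - target.y), abs(r.t - target.t)),
    )
    np_pt = (np_rec.t, np_rec.x, np_rec.y)

    # Step S65: AP is the mean time and position of the remaining records.
    n = len(others)
    ap = (
        sum(r.t for r in others) / n,
        sum(r.x for r in others) / n,
        sum(r.y for r in others) / n,
    )

    # Step S67: pick a random point on the segment NP-AP
    # (u = 0 gives NP, u = 1 gives AP).
    u = rng.random()
    return tuple(a + u * (b - a) for a, b in zip(np_pt, ap))
```

For instance, with hypothetical records `Record(2, "c1", 2.0, 16.0, 7.0)`, `Record(4, "c2", 3.0, 16.4, 7.2)` and `Record(5, "c3", 4.0, 17.0, 8.0)`, converting record 4 excludes only itself, takes record 2 as NP, and yields a point between record 2 and the mean (3.0, 16.5, 7.5).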

For example, suppose that, in a record distribution such as that shown in FIG. 16, records R1 and R2 are converted within the distance range indicated by a circle. Each × represents the spatial position of one record, and "c1" to "c3" indicate the ID of each record.

When the processing described above is performed on the record R1, records whose ID is c1 are excluded, so NP is identified and AP is calculated from the records whose IDs are c2 and c3, as shown in FIG. 17. The record R1 tends to be spatially isolated, but when CR is selected between NP and AP, that tendency is reduced by the conversion. That is, R1 is converted into a record on a road where many records exist.

On the other hand, when NP is identified and AP is calculated for the record R2, the result is as shown in FIG. 18. The record R2 is, among the records whose ID differs from that of the record R1, the one closest to R1 (that is, the NP of the record R1). Even after conversion, however, it does not move too close to the record R1 and remains on the road where many records exist.

In this way, drawing records that tend to be isolated toward records with other IDs makes strongly characteristic points harder to infer. Meanwhile, records that are not isolated are not drawn strongly toward the isolated records, so the conversion does not make the isolated records easier to infer.

Returning to the description of the processing in FIG. 11, after step S51, or when the frequency distribution does not match the specified frequency distribution pattern in step S49, the determination unit 15 determines whether an unprocessed record exists in the first data storage unit 11 (step S53). If no unprocessed record exists, the processing returns to the calling process. If an unprocessed record exists, the processing returns to step S31 in FIG. 9 via terminal C.

As shown in FIG. 15, CR and the spatial distance threshold are registered for each record for which a frequency distribution matching the specified frequency distribution pattern was obtained.

Returning to the description of the processing in FIG. 4, the conversion processing unit 18 executes a data conversion process on the data stored in the first data storage unit 11, using the data stored in the second data storage unit 17 (step S5). This data conversion process will be described with reference to FIGS. 19 and 20.

The conversion processing unit 18 identifies one unprocessed record in the first data storage unit 11 (step S71). The conversion processing unit 18 then determines whether converted data CR is set for the identified record in the second data storage unit 17 (step S73). If no converted data CR is set for the identified record, the conversion processing unit 18 deletes the identified record (step S75), and the processing proceeds to step S79.

On the other hand, if converted data CR is set for the identified record, the conversion processing unit 18 replaces the position and time of the identified record with the position and time of the converted data CR and stores the result in the anonymized data storage unit 19 (step S77). At this time, the spatial distance threshold value is also added.

The conversion processing unit 18 then determines whether an unprocessed record exists in the first data storage unit 11 (step S79). If an unprocessed record exists, the processing returns to step S71.

When such processing is executed, data such as that shown in FIG. 20 is stored in the anonymized data storage unit 19. The example of FIG. 20 includes the ID, the converted time, the converted latitude and longitude X and Y, the speed, and the spatial distance threshold. In the example above, only the fourth and fifth records are stored.

Furthermore, the conversion processing unit 18 deletes, from the data stored in the anonymized data storage unit 19, the ID, which is the confidential attribute value specified in the setting data storage unit 13 (step S81). This completes the anonymization. The processing then returns to the calling process.
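The data conversion loop of steps S71 to S81 can be sketched as below. The dictionary layout, the field names (`id`, `time`, `x`, `y`, `speed`, `threshold`) and the function name `convert_records` are hypothetical choices for illustration; the source only specifies which items are stored, not how.

```python
def convert_records(records, converted):
    """Apply converted data CR to the original records and strip the ID.

    records:   dict mapping record number -> dict with keys
               "id", "time", "x", "y", "speed" (assumed layout)
    converted: dict mapping record number -> (time, x, y, threshold),
               present only for records that matched the frequency
               distribution pattern
    """
    stored = []
    for num, rec in records.items():          # S71 / S79: visit every record once
        cr = converted.get(num)               # S73: is CR set for this record?
        if cr is None:
            continue                          # S75: records without CR are deleted
        t, x, y, threshold = cr
        # S77: replace time and position with CR and attach the threshold.
        stored.append({**rec, "time": t, "x": x, "y": y, "threshold": threshold})

    # S81: remove the confidential attribute (the ID) from every stored record.
    return [{k: v for k, v in rec.items() if k != "id"} for rec in stored]
```

Removing the ID as a separate final pass mirrors the order of the flow described above, in which the ID is still present in the anonymized data storage unit until step S81.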

Returning to the description of the processing in FIG. 4, the output unit 20 outputs the data stored in the anonymized data storage unit 19 to an output device such as a display device, or to another computer or the like in response to a request (step S7).

For example, a display such as that shown in FIG. 21 is produced. FIG. 21 shows an example in which the records stored in the anonymized data storage unit 19 are plotted as × marks on map data. IDs are shown in this example for convenience, but they have been deleted and are therefore not output. In addition, no records appear in the c1 parking lot or on the road leading to it, which reduces the risk that the ID is inferred. The circle in FIG. 21 indicates the spatial distance threshold set for the c3 record, which makes it possible to present the range in which the true value exists. For example, when the user selects c3, the display is as shown in FIG. 21; when the user selects c2, a circle of the spatial distance threshold centered on c2 may be presented to show the range in which the true value of c2 exists.

By executing the processing described above, data including numerical attribute values can be anonymized in such a way that confidential information such as IDs cannot be inferred.

Although an embodiment of the present technology has been described above, the present technology is not limited to it. For example, the functional block diagram is only an example and may not match the program module configuration. As for the processing flow, the order of the processing steps may be changed, or steps may be executed in parallel, as long as the processing result does not change.

Although the case where the spatial distance and the time distance are handled separately has been described, a distance that integrates the two may be used instead.

The information processing apparatus 100 described above is, for example, a computer apparatus in which, as shown in FIG. 22, a memory 2501, a CPU 2503, a hard disk drive (HDD) 2505, a display control unit 2507 connected to a display device 2509, a drive device 2513 for a removable disk 2511, an input device 2515, and a communication control unit 2517 for connecting to a network are connected by a bus 2519. An operating system (OS) and an application program for executing the processing of this embodiment are stored in the HDD 2505 and are read from the HDD 2505 into the memory 2501 when executed by the CPU 2503. The CPU 2503 controls the display control unit 2507, the communication control unit 2517, and the drive device 2513 according to the processing content of the application program, and causes them to perform predetermined operations. Data in the middle of processing is mainly stored in the memory 2501, but may be stored in the HDD 2505. In the embodiment of the present technology, the application program for executing the processing described above is stored and distributed on the computer-readable removable disk 2511 and installed from the drive device 2513 onto the HDD 2505. The program may also be installed onto the HDD 2505 via a network such as the Internet and the communication control unit 2517. Such a computer apparatus realizes the various functions described above through the cooperation of hardware such as the CPU 2503 and the memory 2501 with programs such as the OS and the application program.

The embodiment described above can be summarized as follows.

The anonymized data generation method according to the present embodiment includes processing of: (A) generating, for each of a plurality of data blocks stored in a data storage unit that stores the plurality of data blocks, each including a confidential attribute value and a numerical attribute value, a set of data blocks whose distance, defined according to the numerical attribute and based on the numerical attribute value included in the data block, is within a threshold; (B) determining, for each of the plurality of data blocks, whether the frequency distribution of the confidential attribute values of the data block and the data blocks included in the set generated for the data block satisfies a predetermined condition; (C) generating, for each data block whose frequency distribution satisfies the predetermined condition, a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for the data block; and (D) replacing the numerical attribute value of each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for the data block.

This makes it difficult to infer the confidential attribute value from the numerical attribute value, so the data is appropriately anonymized. The anonymized data generation method may further include processing of (E) deleting each data block whose frequency distribution does not satisfy the predetermined condition.

The processing of generating the set described above may include: (a1) identifying, for each of the plurality of data blocks, the identifier of the region, among a plurality of regions determined based on the threshold, into which the numerical attribute value of the data block falls; (a2) extracting, for each of the plurality of data blocks, candidate data blocks for which the distance is to be calculated, based on the identifier identified for the data block; and (a3) determining, for each of the plurality of data blocks, whether the distance to each of the candidate data blocks extracted for the data block is within the threshold.

Performing such processing speeds up the processing.
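One way to realize steps (a1) to (a3) is a uniform grid index, sketched below. The choice of square cells whose side equals the threshold, the Euclidean metric, and the function name `neighbour_sets` are assumptions for illustration; the source states only that the regions are determined based on the threshold.

```python
from collections import defaultdict
from math import floor, hypot


def neighbour_sets(points, threshold):
    """Map each record number to the set of record numbers within `threshold`.

    points: dict mapping record number -> (x, y)
    The returned set for a record includes the record itself.
    """
    def cell(p):
        # (a1) Region identifier: index of the grid cell containing the point.
        return (floor(p[0] / threshold), floor(p[1] / threshold))

    grid = defaultdict(list)
    for num, p in points.items():
        grid[cell(p)].append(num)

    result = {}
    for num, p in points.items():
        cx, cy = cell(p)
        # (a2) Candidates: records in the same cell or one of the 8 adjacent cells.
        # With cell side == threshold, no record within range can lie elsewhere.
        candidates = [
            m
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for m in grid.get((cx + dx, cy + dy), ())
        ]
        # (a3) Exact distance test against the threshold, candidates only.
        result[num] = {
            m
            for m in candidates
            if hypot(points[m][0] - p[0], points[m][1] - p[1]) <= threshold
        }
    return result
```

The speed-up comes from step (a2): the exact distance is computed only for records in nearby cells rather than for every pair of records.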

The anonymized data generation method may further include processing of deleting the confidential attribute value from each data block whose numerical attribute value has been replaced with the second numerical attribute value. This achieves complete anonymization.

Furthermore, the numerical attribute value described above may include position data. In this case, the distance defined according to the numerical attribute may be a distance in space, a Euclidean distance, or a Manhattan distance. This allows position data to be processed appropriately.

The numerical attribute value described above may also include time data. In this case, the distance defined according to the numerical attribute may include a time interval. In this way, the distance may be calculated separately as a spatial distance and a time distance.
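The distance options named above can be sketched as follows. The record layout `(t, x, y)`, the function names, and the rule that both the spatial and the time threshold must hold are assumptions for illustration; the source says only that the two distances may be calculated separately.

```python
from math import hypot


def euclidean(a, b):
    """Straight-line distance between two (x, y) positions."""
    return hypot(a[0] - b[0], a[1] - b[1])


def manhattan(a, b):
    """Sum of the absolute coordinate differences between two (x, y) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def is_near(r, s, space_threshold, time_threshold, metric=euclidean):
    """True if records r and s, given as (t, x, y), are close in both space and time.

    The spatial distance and the time interval are evaluated separately,
    each against its own threshold (assumed combination rule).
    """
    spatial = metric(r[1:], s[1:])
    temporal = abs(r[0] - s[0])
    return spatial <= space_threshold and temporal <= time_threshold
```

Passing `metric=manhattan` switches the spatial measure without touching the time comparison.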

The predetermined condition described above may include a condition that the number of kinds of confidential attribute values is equal to or greater than a predetermined number, or a condition that, for each of a predetermined number of the most frequent confidential attribute values taken in descending order of frequency, the frequency is at least a predetermined multiple of the next higher frequency. This makes it difficult to infer the confidential attribute value.
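The two conditions can be sketched as below. This is one possible reading of the text, not a definitive one: the function names, the treatment of the top-ranked value (which has no higher frequency to compare against), and the decision to fail when fewer than `top_n` distinct values exist are all assumptions.

```python
from collections import Counter


def has_enough_kinds(ids, min_kinds):
    """Condition 1: at least `min_kinds` distinct confidential attribute values."""
    return len(set(ids)) >= min_kinds


def top_frequencies_balanced(ids, top_n, ratio):
    """Condition 2 (assumed reading): among the `top_n` most frequent values,
    each value's frequency is at least `ratio` times the frequency of the
    value ranked immediately above it.
    """
    counts = sorted(Counter(ids).values(), reverse=True)[:top_n]
    if len(counts) < top_n:
        return False  # assumption: too few distinct values fails the condition
    # Compare each frequency with the one ranked directly above it.
    return all(lower >= ratio * upper for upper, lower in zip(counts, counts[1:]))
```

Under this reading, a neighbourhood dominated by one ID fails the second condition even if several IDs are present, because the second-ranked frequency is too small relative to the first.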

Furthermore, the processing of generating the second numerical attribute value described above may include generating the second numerical attribute value from (i) the numerical attribute value of the data block with the shortest distance within a data block group, the data block group being obtained by excluding from the set the data blocks having the same confidential attribute value as the data block being processed, and (ii) a statistic of the numerical attribute values of the data block group. This suppresses the influence of strongly characteristic data blocks.

The statistic described above may be an average value. In this case, the second numerical attribute value may be specified by a random number as a numerical attribute value lying between the numerical attribute value of the data block with the shortest distance and the average of the numerical attribute values of the data block group. This achieves sufficient anonymization.

The anonymized data generation method may further include processing of adding, to each data block whose numerical attribute value has been replaced with the second numerical attribute value, data representing the range in which the data block may exist. This makes it possible to grasp the range in which the data block may exist.

A program for causing a computer to execute the processing described above can be created, and the program is stored in a computer-readable storage medium or storage device such as a flexible disk, an optical disk such as a CD-ROM, a magneto-optical disk, a semiconductor memory (for example, a ROM), or a hard disk. Data in the middle of processing is temporarily stored in a storage device such as a RAM.

The following appendices are further disclosed with respect to the embodiment, including the examples above.

(Appendix 1)
An anonymized data generation method executed by a computer, the method including processing of:
generating, for each of a plurality of data blocks stored in a data storage unit that stores the plurality of data blocks, each including a confidential attribute value and a numerical attribute value, a set of data blocks whose distance, defined according to the numerical attribute and based on the numerical attribute value included in the data block, is within a threshold;
determining, for each of the plurality of data blocks, whether a frequency distribution of the confidential attribute values of the data block and the data blocks included in the set generated for the data block satisfies a predetermined condition;
generating, for each data block whose frequency distribution satisfies the predetermined condition, a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for the data block; and
replacing the numerical attribute value of each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for the data block.

(Appendix 2)
The anonymized data generation method according to Appendix 1, further including processing of deleting each data block whose frequency distribution does not satisfy the predetermined condition.

(Appendix 3)
The anonymized data generation method according to Appendix 1 or 2, wherein the processing of generating the set includes:
identifying, for each of the plurality of data blocks, the identifier of the region, among a plurality of regions determined based on the threshold, into which the numerical attribute value of the data block falls;
extracting, for each of the plurality of data blocks, candidate data blocks for which the distance is to be calculated, based on the identifier identified for the data block; and
determining, for each of the plurality of data blocks, whether the distance to each of the candidate data blocks extracted for the data block is within the threshold.

(Appendix 4)
The anonymized data generation method according to any one of Appendices 1 to 3, further including processing of deleting the confidential attribute value from each data block whose numerical attribute value has been replaced with the second numerical attribute value.

(Appendix 5)
The anonymized data generation method according to any one of Appendices 1 to 4, wherein
the numerical attribute value includes position data, and
the distance defined according to the numerical attribute is a distance in space, a Euclidean distance, or a Manhattan distance.

(Appendix 6)
The anonymized data generation method according to Appendix 5, wherein
the numerical attribute value includes time data, and
the distance defined according to the numerical attribute includes a time interval.

(Appendix 7)
The anonymized data generation method according to any one of Appendices 1 to 6, wherein the predetermined condition includes a condition that the number of kinds of confidential attribute values is equal to or greater than a predetermined number, or a condition that, for each of a predetermined number of the most frequent confidential attribute values taken in descending order of frequency, the frequency is at least a predetermined multiple of the next higher frequency.

(Appendix 8)
The anonymized data generation method according to any one of Appendices 1 to 7, wherein the processing of generating the second numerical attribute value includes
generating the second numerical attribute value from the numerical attribute value of the data block with the shortest distance within a data block group, the data block group being obtained by excluding from the set the data blocks having the same confidential attribute value as the data block being processed, and a statistic of the numerical attribute values of the data block group.

(Appendix 9)
The anonymized data generation method according to Appendix 8, wherein
the statistic is an average value, and
the second numerical attribute value is specified by a random number as a numerical attribute value lying between the numerical attribute value of the data block with the shortest distance and the average of the numerical attribute values of the data block group.

(Appendix 10)
The anonymized data generation method according to any one of Appendices 1 to 9, further including processing of adding, to each data block whose numerical attribute value has been replaced with the second numerical attribute value, data representing the range in which the data block may exist.

(Appendix 11)
An anonymized data generation program for causing a computer to execute processing of:
generating, for each of a plurality of data blocks stored in a data storage unit that stores the plurality of data blocks, each including a confidential attribute value and a numerical attribute value, a set of data blocks whose distance, defined according to the numerical attribute and based on the numerical attribute value included in the data block, is within a threshold;
determining, for each of the plurality of data blocks, whether a frequency distribution of the confidential attribute values of the data block and the data blocks included in the set generated for the data block satisfies a predetermined condition;
generating, for each data block whose frequency distribution satisfies the predetermined condition, a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for the data block; and
replacing the numerical attribute value of each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for the data block.

(Appendix 12)
An information processing apparatus including:
means for generating, for each of a plurality of data blocks stored in a data storage unit that stores the plurality of data blocks, each including a confidential attribute value and a numerical attribute value, a set of data blocks whose distance, defined according to the numerical attribute and based on the numerical attribute value included in the data block, is within a threshold;
means for determining, for each of the plurality of data blocks, whether a frequency distribution of the confidential attribute values of the data block and the data blocks included in the set generated for the data block satisfies a predetermined condition;
means for generating, for each data block whose frequency distribution satisfies the predetermined condition, a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for the data block; and
means for replacing the numerical attribute value of each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for the data block.

11 First data storage unit
12 Grouping unit
13 Setting data storage unit
14 Group data storage unit
15 Determination unit
16 Generation unit
17 Second data storage unit
18 Conversion processing unit
19 Anonymized data storage unit
20 Output unit

Claims

1. An anonymized data generation method executed by a computer, the method comprising:
generating, for each of a plurality of data blocks stored in a data storage unit that stores the plurality of data blocks, each including a confidential attribute value and a numerical attribute value, a set of data blocks whose distance, defined according to the numerical attribute and based on the numerical attribute value included in the data block, is within a threshold;
determining, for each of the plurality of data blocks, whether a frequency distribution of the confidential attribute values of the data block and the data blocks included in the set generated for the data block satisfies a predetermined condition;
generating, for each data block whose frequency distribution satisfies the predetermined condition, a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for the data block; and
replacing the numerical attribute value of each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for the data block.

2. The anonymized data generation method according to claim 1, further comprising deleting each data block whose frequency distribution does not satisfy the predetermined condition.

3. The anonymized data generation method according to claim 1, wherein the process of generating the set includes:
identifying, for each of the plurality of data blocks, the identifier of the region, among a plurality of regions determined based on the threshold, into which the numerical attribute value of the data block falls;
extracting, for each of the plurality of data blocks, candidate data blocks for which the distance is to be calculated, based on the identifier identified for the data block; and
determining, for each of the plurality of data blocks, whether the distance to each of the candidate data blocks extracted for the data block is within the threshold.

4. The anonymized data generation method according to claim 1, further comprising deleting the confidential attribute value from each data block whose numerical attribute value has been replaced with the second numerical attribute value.

The anonymized data generation method according to any one of claims 1 to 4, wherein the numerical attribute value includes position data, and the distance defined according to the numerical attribute is a distance in space, a Euclidean distance, or a Manhattan distance.

The anonymized data generation method according to claim 5, wherein the numerical attribute value includes time data, and the distance defined according to the numerical attribute includes a time interval.
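
The distances named for position data and time data can be sketched as follows. The claims do not state how a spatial distance and a time interval are combined, so the predicate `within`, which requires both to be within their own limits, is only one possible reading; the function names and the `(x, y, t)` record layout are assumptions.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.dist(p, q)

def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def within(a, b, max_dist, max_seconds, dist=euclidean):
    """a, b: (x, y, t) records (hypothetical layout, t in seconds).

    Assumed combination rule: the spatial distance and the time interval
    must each be within their own threshold.
    """
    return dist(a[:2], b[:2]) <= max_dist and abs(a[2] - b[2]) <= max_seconds
```

For example, `within((0, 0, 100), (3, 4, 160), 5, 60)` holds under the Euclidean distance (distance 5, interval 60 seconds) but not under the Manhattan distance (distance 7).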

The anonymized data generation method according to any one of claims 1 to 6, wherein the predetermined condition includes a condition that the number of kinds of confidential attribute values is equal to or greater than a predetermined number, or a condition that, for each of a predetermined number of top-ranked confidential attribute values in descending order of frequency, the frequency is equal to or greater than a predetermined multiple of the highest frequency.
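
The two alternative conditions on the frequency distribution might be checked as in the sketch below. The wording of the second condition is open to interpretation; this sketch assumes it means that each of the `n` most frequent confidential values occurs at least `ratio` times as often as the most frequent one. The function names and parameters are hypothetical.

```python
from collections import Counter

def enough_kinds(ids, k):
    """True if at least k distinct confidential values are present."""
    return len(set(ids)) >= k

def top_frequencies_balanced(ids, n, ratio):
    """True if each of the n most frequent confidential values has a
    frequency of at least ratio * (highest frequency).

    Assumed interpretation; returns False when fewer than n distinct
    values exist.
    """
    counts = sorted(Counter(ids).values(), reverse=True)
    if len(counts) < n:
        return False
    return all(c >= ratio * counts[0] for c in counts[:n])
```

The first check guards against a neighbourhood dominated by a single identifier; the second additionally rejects neighbourhoods in which the other identifiers are present only marginally.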

The anonymized data generation method according to claim 1, wherein the process of generating the second numerical attribute value includes generating the second numerical attribute value from (a) the numerical attribute value of the data block having the shortest distance within a data block group obtained by excluding, from the set, data blocks having the same confidential attribute value as the data block to be processed, and (b) a statistic of the numerical attribute values of the data block group.

The anonymized data generation method according to claim 8, wherein the statistic is an average value, and the second numerical attribute value is specified by a random number between the numerical attribute value of the data block having the shortest distance and the average value of the numerical attribute values of the data block group.
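
The generation of the second numerical attribute value from the nearest data block and the average can be sketched as follows. The claims do not say how the random number is applied to multi-dimensional values; this sketch assumes a single random fraction along the line segment from the nearest value to the average. The name `second_value`, the record layout, and the `None` return for an empty group are assumptions.

```python
import math
import random

def second_value(target, neighbors, rng=random):
    """target: (confidential_id, (x, y)); neighbors: the set generated for
    the target, as a list of (confidential_id, (x, y)) tuples.

    Returns a point between the nearest neighbour with a different
    confidential value and the mean of all such neighbours, or None if
    there is no such neighbour.
    """
    cid, pos = target
    # Exclude blocks sharing the target's confidential value.
    group = [p for c, p in neighbors if c != cid]
    if not group:
        return None
    nearest = min(group, key=lambda p: math.dist(pos, p))
    mean = tuple(sum(v) / len(group) for v in zip(*group))
    # Assumed rule: one random fraction t in [0, 1) along the segment.
    t = rng.random()
    return tuple(n + t * (m - n) for n, m in zip(nearest, mean))
```

Passing a seeded `random.Random` instance as `rng` makes the output reproducible, which is convenient for testing; in actual use an unseeded source would presumably be preferred so that the replacement cannot be reversed.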

The anonymized data generation method according to claim 1, further comprising: adding, to a data block in which the numerical attribute value has been replaced with the second numerical attribute value, data representing a range in which the data block can exist.

An anonymized data generation program for causing a computer to execute processes of:
generating, for each of a plurality of data blocks stored in a data storage unit, each data block including a confidential attribute value and a numerical attribute value, a set of data blocks whose distance from the data block, defined according to the numerical attribute and calculated from the numerical attribute values, is within a threshold;
determining, for each of the plurality of data blocks, whether the frequency distribution of the confidential attribute values of the data block and of the data blocks included in the set generated for the data block satisfies a predetermined condition;
generating, for each data block whose frequency distribution satisfies the predetermined condition, a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for the data block; and
replacing the numerical attribute value in each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for the data block.

An information processing apparatus comprising:
means for generating, for each of a plurality of data blocks stored in a data storage unit, each data block including a confidential attribute value and a numerical attribute value, a set of data blocks whose distance from the data block, defined according to the numerical attribute and calculated from the numerical attribute values, is within a threshold;
means for determining, for each of the plurality of data blocks, whether the frequency distribution of the confidential attribute values of the data block and of the data blocks included in the set generated for the data block satisfies a predetermined condition;
means for generating, for each data block whose frequency distribution satisfies the predetermined condition, a second numerical attribute value from the numerical attribute values of the data blocks included in the set generated for the data block; and
means for replacing the numerical attribute value in each data block whose frequency distribution satisfies the predetermined condition with the second numerical attribute value generated for the data block.