JP5974858B2

JP5974858B2 - Anonymization processing method and apparatus

Info

Publication number: JP5974858B2
Application number: JP2012258555A
Authority: JP
Inventors: 芽生恵牛田; 伊藤　孝一; 孝一伊藤; 津田　宏; 宏津田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-11-27
Filing date: 2012-11-27
Publication date: 2016-08-23
Anticipated expiration: 2032-11-27
Also published as: JP2014106691A

Description

The present technology relates to techniques for anonymizing information.

It is expected that analyzing data held by hospitals and similar institutions, such as individuals' medical histories, will yield useful findings, for example "people in their ○○s who live in district ×× are prone to disease △△".

Ideally, such data analysis would be outsourced to a third party, such as an external cloud computing service, that has experts with data analysis know-how and large-scale resources capable of accepting large volumes of data from multiple hospitals and the like. However, medical history data and other sensitive personal privacy information cannot be disclosed as-is to an external third party. Anonymization techniques are therefore used to protect personal privacy information.

The most basic anonymization method is to remove information that identifies an individual, such as a name (that is, an ID), from each record containing multiple data item values about an information provider. For example, assume that data such as that shown in FIG. 1 exists. The example of FIG. 1 shows three records, each containing a data item value for each of the data items ID, age, sex, address, and disease. Removing the ID from these records yields the data shown in FIG. 2.

However, there is an attack that starts from data from which the IDs have been removed, links an individual to data (for example, a disease) by using data items that may identify the individual when combined (address, age, and so on), and thereby obtains the individual's privacy information. For example, suppose an attacker knows that A, a 24-year-old man, attends hospital X. If the anonymized data published by hospital X (FIG. 2) contains only one "24-year-old male" record, that record is easily identified as A's. By examining that record, the attacker learns A's medical history and other sensitive privacy information.

To address this, k-anonymization has been proposed: data item values are generalized until at least k records with the same data item values exist in the record set, so that an individual cannot be uniquely identified from data items that may identify an individual when combined.

For example, with the data of FIG. 2, generalizing A's age "24" to "20s" and B's age "26" to "20s" achieves k-anonymization satisfying k = 2. In the example of FIG. 3, the addresses are also anonymized by generalizing "Saitama" and "Tokyo" to "Kanto".

Because at least k records share the same data item values, an attacker cannot narrow the records that may belong to a given person down to fewer than k. Since the anonymized data published by hospital X contains at least k "male in his 20s" records matching the 24-year-old man A, the attacker cannot determine which of them is A's record and cannot obtain A's sensitive privacy information.
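The k-anonymity condition described above can be checked mechanically. The following is a minimal Python sketch under stated assumptions, not the implementation of any cited technique: the function names, the field names, B's sex, and the disease values are hypothetical, and only the ages 24 and 26 and the generalization to "20s" come from the example above.

```python
from collections import Counter


def generalize_age(age: int) -> str:
    """Generalize an exact age to its decade, e.g. 24 -> '20s'."""
    return f"{(age // 10) * 10}s"


def is_k_anonymous(records, quasi_ids, k):
    """Return True if every combination of quasi-identifier values occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(c >= k for c in counts.values())


# Hypothetical records with the ID column already removed.
raw = [
    {"age": 24, "sex": "male", "disease": "gastritis"},
    {"age": 26, "sex": "male", "disease": "influenza"},
]

# Generalize only the age column; the other columns are left unchanged.
generalized = [{**r, "age": generalize_age(r["age"])} for r in raw]

raw_ok = is_k_anonymous(raw, ["age", "sex"], k=2)
generalized_ok = is_k_anonymous(generalized, ["age", "sex"], k=2)
```

Before generalization each (age, sex) pair is unique, so the check fails for k = 2; after generalization both records share ("20s", "male") and the check passes.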

The privacy level of a k-anonymized record set is determined by the value of the parameter k. In general, the larger k is, the stronger the privacy protection. However, information loss due to anonymization also increases, because the data becomes over-generalized, or because k matching records cannot be assembled and the records themselves have to be deleted.

Attitudes toward privacy, however, differ from person to person, and some information providers do not require such a high level of privacy. With existing k-anonymization techniques, useful information from those providers is lost as well.

JP 2005-78138 A; JP 2009-31900 A; JP 2006-339895 A; JP 2012-3440 A

Accordingly, in one aspect, an object of the present technology is to provide an anonymization processing technique that suppresses information loss.

An anonymization processing method according to one aspect of the present technology uses a data storage unit that stores a plurality of data blocks, each including a first data item value, a second data item value that can be generalized in order to anonymize the first data item value, and the number of data blocks that should be grouped for generalization of the second data item value. The method includes:
(A) selecting, from the data storage unit, a first data block that represents a group;
(B) extracting, from the data storage unit, second data blocks whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is equal to or greater than a threshold;
(C) grouping at least part of the second data blocks together with the first data block if the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped included in the first data block;
(D) executing the selecting, the extracting, and the grouping on the data blocks stored in the data storage unit in descending order of the number of data blocks to be grouped; and
(E) generalizing, for each group, the second data item values in the data blocks included in the group according to a predetermined rule.

According to one aspect, information loss in anonymization processing can be suppressed.

FIG. 1 is a diagram illustrating an example of records.
FIG. 2 is a diagram illustrating an example of records from which the ID has been removed.
FIG. 3 is a diagram illustrating an example of a k-anonymization result.
FIG. 4 is a diagram illustrating an overview of a system according to an embodiment of the present technology.
FIG. 5 is a functional block diagram of the information collector device.
FIG. 6 is a diagram illustrating an example of data stored in the first data storage unit.
FIG. 7 is a diagram illustrating an example of a generalization hierarchy tree stored in the setting data storage unit.
FIG. 8 is a diagram illustrating an example of a generalization hierarchy tree stored in the setting data storage unit.
FIG. 9 is a diagram for explaining a variation of the generalization hierarchy tree.
FIG. 10 is a diagram illustrating an example of definition data stored in the setting data storage unit.
FIG. 11 is a diagram illustrating the main processing flow according to the embodiment.
FIG. 12 is a diagram illustrating the processing flow of the record search processing.
FIG. 13 is a diagram for explaining k-anonymization processing.
FIG. 14 is a diagram for explaining k-anonymization processing.
FIG. 15 is a diagram for explaining k-anonymization processing.
FIG. 16 is a diagram for explaining k-anonymization processing.
FIG. 17 is a diagram for explaining the outline of the present embodiment.
FIG. 18 is a diagram for explaining the outline of the present embodiment.
FIG. 19 is a diagram illustrating the processing flow of the record search processing.
FIG. 20 is a diagram illustrating the processing flow of the distance calculation processing.
FIG. 21 is a diagram for explaining the distance calculation processing.
FIG. 22 is a diagram for explaining the distance calculation processing.
FIG. 23 is a diagram illustrating the processing flow of the record group generation processing.
FIG. 24 is a diagram illustrating the processing flow of the integration or deletion processing.
FIG. 25 is a diagram illustrating the processing flow of the generalization processing.
FIG. 26 is a diagram for explaining the generalization processing.
FIG. 27 is a diagram illustrating an example of a processing result.
FIG. 28 is a diagram for explaining processing in the information analyst device.
FIG. 29 is a diagram for explaining processing in the information analyst device.
FIG. 30 is a functional block diagram of a computer.

In the embodiment of the present technology, a desired anonymization level k is set individually for each information provider, and anonymization processing is executed based on these per-provider levels. Information loss is suppressed by ensuring that records of information providers with a large k and records of information providers with a small k are not generalized together for anonymization.

Specifically, FIG. 4 shows a configuration example of a system that performs the processing according to the present embodiment. As shown in FIG. 4, a network 1 such as the Internet connects a plurality of information provider devices (information provider devices A and B in FIG. 4) that provide, for example, patient data for analysis; an information collector device 3 that executes anonymization processing for the data analysis; and an information analyst device 5 that receives anonymized data from the information collector device 3 and executes analysis processing. Each of these devices is an information processing device having a storage device, a communication function, and a computation function.

An information provider may be a person who provides his or her own personal data, or an entity such as a hospital that provides data on multiple patients. Communication over the network 1 is assumed to be kept confidential by encryption or similar means.

The information provider devices A and B transmit the data they hold to the information collector device 3 via the network 1. The transmitted data includes an identifier (ID), data that may identify an individual when combined (referred to as a qID, or quasi-ID), privacy data or sensitive data, and a desired anonymization level. The desired anonymization level is a value corresponding to k in k-anonymization; the qID values are to be generalized so that at least k records have the same values.

The information collector device 3 performs the anonymization processing described below and holds the resulting anonymized data. In response to a request from the information analyst device 5, the information collector device 3 transmits the anonymized data to the information analyst device 5. The information analyst device 5 performs predetermined analysis processing using the anonymized data and outputs an analysis result.

The information collector device 3, which performs the main processing according to the present embodiment, has the configuration shown in FIG. 5. The information collector device 3 includes a receiving unit 31, a first data storage unit 32, a grouping processing unit 33, a setting data storage unit 34, a second data storage unit 35, a generalization processing unit 36, a third data storage unit 37, and a transmission unit 38.

The receiving unit 31 receives data from the information provider devices A and B and stores it in the first data storage unit 32. The first data storage unit 32 stores data such as that shown in FIG. 6. In the example of FIG. 6, each record (also called a data block) contains a name, which is the ID; an age and an address, which are qIDs; a disease and a weight, which are sensitive data; and a desired anonymization level k. The desired anonymization level k may differ from record to record.

The grouping processing unit 33 groups the records stored in the first data storage unit 32 based on the data stored in the setting data storage unit 34, and stores the result in the second data storage unit 35. For each group stored in the second data storage unit 35, the generalization processing unit 36 generalizes the qID data according to the data stored in the setting data storage unit 34, and stores the result in the third data storage unit 37. The transmission unit 38 transmits the data stored in the third data storage unit 37 to the information analyst device 5 or the like, for example in response to a request from that device.

The setting data storage unit 34 holds, for each qID, data of a generalization hierarchy tree such as that shown in FIG. 7. FIG. 7 shows part of the generalization hierarchy tree for the qID "address": "Japan" is at the top level; "East Japan" and "West Japan" are at the second level; and at the third level, nodes such as "Tohoku" and "Kanto" are placed under "East Japan" and nodes such as "Kansai" and "Kyushu" under "West Japan". For addresses, the hierarchy is prepared in advance so that, for example, the municipalities among the addresses that can appear become the leaf nodes. Such generalization hierarchy trees are used to calculate distances between records and to perform generalization.

FIG. 8 shows part of the generalization hierarchy tree for the qID "age", namely the subtree rooted at the node "20s". Under the node "20s" are the nodes "early 20s" and "late 20s"; under "early 20s" are the nodes for ages 20 to 24, and under "late 20s" are the nodes for ages 25 to 29. In this case, the leaf nodes are the values that actually appear as the qID.
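The subtree of FIG. 8 as described above can be represented in many ways. The following Python sketch uses nested dictionaries; this representation, the node labels as strings, and the helper names `parent_map` and `ancestors` are assumptions made for illustration, not the storage format of the setting data storage unit 34.

```python
# Subtree under "20s": each key is a node, each value is a dict of its children.
# Leaves (the exact ages) have an empty dict as their children.
AGE_TREE = {
    "20s": {
        "early 20s": {str(age): {} for age in range(20, 25)},
        "late 20s": {str(age): {} for age in range(25, 30)},
    }
}


def parent_map(tree, parent=None, out=None):
    """Flatten the nested tree into a {node: parent} mapping (the root maps to None)."""
    out = {} if out is None else out
    for node, children in tree.items():
        out[node] = parent
        parent_map(children, node, out)
    return out


def ancestors(value, parents):
    """Return the path from a node up to the root, starting with the node itself."""
    path = [value]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path


PARENTS = parent_map(AGE_TREE)
path_for_23 = ancestors("23", PARENTS)
```

Each step up the returned path is one level of generalization, so the path for age 23 runs through "early 20s" to "20s".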

As also described below, the present embodiment calculates distance from the depth of the hierarchy, so blank intermediate nodes may be inserted to adjust the depth and thereby weight the distance. For example, as shown in FIG. 9, under the node "young people", the node "under 10" and the node "teens" are not placed side by side; instead, to deepen the hierarchy for the teens, a blank node {blank} is placed alongside the node "under 10". In this way, generalization hierarchy tree data designed appropriately for generalization and distance calculation is prepared in advance.

In the present embodiment, the setting data storage unit 34 also stores definition data such as that shown in FIG. 10. The example of FIG. 10 indicates that the column to be treated as the ID is the "name" column, the columns to be treated as qIDs are "age" and "address", and the columns to be treated as sensitive data are "disease" and "weight". In the processing below, the ID is removed and the qIDs are used for distance calculation and generalization, so this definition data is used to distinguish these columns from columns of other attributes.

Next, the processing performed by the information collector device 3 is described with reference to FIGS. 11 to 29. It is assumed that the receiving unit 31 has already received data from the information provider devices A and B and the like and stored it in the first data storage unit 32.

The grouping processing unit 33 sets the records to be processed, which are stored in the first data storage unit 32, as a set X of records r (FIG. 11: step S1). For example, each record r is defined as containing the data item values other than the ID, that is, r = {qID_1, ..., qID_n, SD_1, ..., SD_m, k}, where SD denotes sensitive data, n is the number of qIDs, and m is the number of sensitive data items. For example, for patient A in FIG. 6, r = {23, Tokyo, gastritis, 67, 1}, where 23 and Tokyo are qIDs, gastritis and 67 are SD, and 1 is k.

Next, the grouping processing unit 33 sets a set Y = {} (the empty set) of record groups e (step S3). In the present embodiment, a record group e is represented by the representative record of the group and the set R of records included in the group, for example e = {r_e, R = {r_1, ..., r_l}}. Here, l is an integer equal to or greater than the largest desired anonymization level k among the records included in R.

The grouping processing unit 33 then selects, from the records in the set X that have the largest desired anonymization level, one record as the representative record r_max (step S5). The larger the desired anonymization level, the harder a record is to group, so records are grouped preferentially in descending order of desired anonymization level.
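Steps S1 to S5 can be illustrated with a small Python sketch. The `Record` class and `select_representative` function are hypothetical names, and every record other than patient A's (taken from the FIG. 6 example quoted above) is invented sample data; the source does not say how ties between records with the same largest k are broken, so this sketch simply takes the first one.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Record:
    """A record r with the ID already removed."""
    qids: Tuple[object, ...]       # quasi-identifier values, e.g. (age, address)
    sensitive: Tuple[object, ...]  # sensitive data values, e.g. (disease, weight)
    k: int                         # desired anonymization level


def select_representative(records):
    """Return one record with the largest desired anonymization level k (step S5).

    max() returns the first maximal element, which is this sketch's tie-break.
    """
    return max(records, key=lambda r: r.k)


# Step S1: the set X of records to process.
X = [
    Record((23, "Tokyo"), ("gastritis", 67), 1),    # patient A in the FIG. 6 example
    Record((25, "Saitama"), ("influenza", 58), 3),  # hypothetical
    Record((31, "Osaka"), ("asthma", 72), 2),       # hypothetical
]

# Step S3: the set Y of record groups starts empty.
Y = []

r_max = select_representative(X)
```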

After that, the grouping processing unit 33 executes record search processing based on the record r_max (step S7). The record search processing is described with reference to FIGS. 12 to 22.

First, the grouping processing unit 33 generates a record group e = {r_max, R = {r_max}} for the record r_max (step S21). The grouping processing unit 33 also sets a set Rc = {} (the empty set) of records that should be included in the group (step S23).

The grouping processing unit 33 further calculates k_a = α × k_max from the desired anonymization level k_max of r_max (step S25). During grouping, k_a is used as a threshold on the desired anonymization level of the other records. The coefficient α is stored in advance, for example, in the setting data storage unit 34.

The grouping processing unit 33 also calculates a distance threshold Δ_a = β × k_max (step S27). The coefficient β is likewise stored in advance, for example, in the setting data storage unit 34.

Here, the difference from ordinary k-anonymization is illustrated concretely. For example, when records containing the qIDs age and address as shown in FIG. 13 are k-anonymized with k = 3, groups (dotted circles in FIG. 14) are formed from k records with similar qID values, as shown schematically in FIG. 14, and the values are generalized so that the records within a group have the same qIDs. As a result, the qIDs age and address are generalized as shown in FIG. 15. In ordinary k-anonymization, k is the same for all records. When a desired anonymization level k is set per record, as in the present embodiment, trying to form groups consisting only of records with the same k increases information loss. For example, as shown in FIG. 16, if there are only four circles representing records with k = 5, those records are deleted, as indicated by the crosses. Meanwhile, there are three circles representing records with k = 3, but they are far apart, and generalizing their qIDs to the same values would over-generalize the information. Furthermore, one record with k = 3 lies near the four records to be deleted, yet it is not grouped with them.

In the present embodiment, a record whose k is equal to or greater than k_a, which is calculated from the k of the representative record r_max, is put into the same group even if its k differs, provided that the distance described below is equal to or less than Δ_a. When a record set like that of FIG. 16 is grouped, as shown schematically in FIG. 17, the four records with k = 5 and one record with k = 3 are combined into one group (dotted circle) because their k values are close and they are also close in distance. In this case, the group is formed from five records in accordance with the k = 5 records. Likewise, two records with k = 3 and one record with k = 1 are combined into one group (dotted circle) because their k values are close and they are also close in distance. In this case, the group is formed from three records in accordance with the k = 3 records.

More specifically, as shown in FIG. 18, the records that fall within the distance threshold Δ_a of the representative record 100 (the range of the dash-dot line) are candidates for grouping. However, if the desired anonymization level k of the representative record 100 is 3 and α = 0.6, only records with k equal to or greater than k_a = 1.8 are grouped, so the record 101 with k = 1 is not put into the same group.
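The two thresholds and the level check in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the value β = 0.5 is an assumption chosen only for illustration; the text gives α = 0.6 and k = 3 but no value for β.

```python
import math


def level_threshold(k_max: int, alpha: float) -> float:
    """k_a = alpha * k_max (step S25)."""
    return alpha * k_max


def distance_threshold(k_max: int, beta: float) -> float:
    """Delta_a = beta * k_max (step S27)."""
    return beta * k_max


def meets_level(k: int, k_a: float) -> bool:
    """A record is eligible for the group only if its own k is at least k_a."""
    return k >= k_a


k_a = level_threshold(k_max=3, alpha=0.6)         # about 1.8 (floating point)
delta_a = distance_threshold(k_max=3, beta=0.5)   # beta is an assumed value

record_101_eligible = meets_level(1, k_a)  # record 101 has k = 1
k2_eligible = meets_level(2, k_a)
```

Because k is an integer, a threshold of about 1.8 admits records with k of 2 or more and excludes record 101 with k = 1, matching the description above.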

Initially, more records than the representative record's desired anonymization level k may be found within Δ_a. For example, for the representative record 110, four records are found in the range 111 within the distance threshold Δ_a, but the record 112, which is the farthest, is excluded from the group for the time being. However, if grouping is then attempted with the record 112 as the representative record and records numbering at least its desired anonymization level k = 2 cannot be found, the record 112 is put into the group of the representative record 110. A record such as the record 102 that is isolated, that is, for which no other records satisfying the distance, desired anonymization level, and count requirements can be found, is deleted.

Returning to the processing flow of FIG. 12, the specific processing is described below. The grouping processing unit 33 selects one unprocessed record r_j included in the set X (step S29). The processing then proceeds via terminal A to the processing of FIG. 19.

Turning to the processing of FIG. 19, the grouping processing unit 33 determines whether the desired anonymization level of the record r_j is equal to or greater than k_a (step S31). If the desired anonymization level of the record r_j is less than k_a, the processing proceeds to step S39.

If, on the other hand, the desired anonymization level of the record r_j is equal to or greater than k_a, the grouping processing unit 33 executes distance calculation processing to calculate the distance Δ_j between the representative record r_max and the record r_j (step S33). The distance calculation processing is described with reference to FIGS. 20 to 22.

First, the grouping processing unit 33 sets the distance Δ to 0 (FIG. 20: step S41). The grouping processing unit 33 then selects an unprocessed qID_i from among all the qIDs (step S43). After that, from the generalization hierarchy tree for qID_i in the setting data storage unit 34, the grouping processing unit 33 identifies the depth Depth_p of the smallest subtree that contains the qID_i values of both records r0 and r1 whose distance is to be calculated (step S45).

Suppose the subtree of FIG. 8 is the whole generalization hierarchy tree. As shown in FIG. 21, if the age (qID_i) of record r0 is 20 and the age (qID_i) of record r1 is 24, the subtree containing the values of both r0 and r1 is the one enclosed by the dotted line X. The depth of this subtree is the depth indicated by arrow A, so Depth_p = 1. The depth of the generalization hierarchy tree itself is 2.

On the other hand, as shown in FIG. 22, if the age (qID_i) of record r0 is 20 and the age (qID_i) of record r1 is 29, the subtree containing the values of both r0 and r1 is the entire generalization hierarchy tree enclosed by the dotted line Y. The depth of this subtree is the depth indicated by arrow B, so Depth_p = 2.

The grouping processing unit 33 then calculates the distance for qID_i as Δ_i = Depth_p / (depth of the generalization hierarchy tree for qID_i) (step S47). The grouping processing unit 33 further calculates Δ = Δ + Δ_i (step S49). That is, the sum of the distances Δ_i calculated for the individual qID_i is the distance between the records.

The grouping processing unit 33 then determines whether an unprocessed qID_i remains (step S51). If an unprocessed qID_i remains, the processing returns to step S43. If no unprocessed qID_i remains, the processing returns to the calling processing.
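The distance calculation of steps S41 to S51 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the nested-dictionary tree, the helper names, and the requirement that node names be unique within a tree are all assumptions, and only the age subtree of FIG. 8 is modeled.

```python
# Age subtree of FIG. 8: keys are nodes, values are dicts of child nodes.
AGE_TREE = {
    "20s": {
        "early 20s": {str(age): {} for age in range(20, 25)},
        "late 20s": {str(age): {} for age in range(25, 30)},
    }
}


def index_tree(tree, parent=None, index=None):
    """Map each node name to (parent name, children dict). Assumes unique node names."""
    index = {} if index is None else index
    for name, children in tree.items():
        index[name] = (parent, children)
        index_tree(children, name, index)
    return index


def height(children):
    """Number of edges from a node down to its deepest leaf, given its children dict."""
    if not children:
        return 0
    return 1 + max(height(c) for c in children.values())


def smallest_subtree_root(a, b, index):
    """Root of the smallest subtree that contains both values a and b."""
    ancestors_of_a = set()
    node = a
    while node is not None:
        ancestors_of_a.add(node)
        node = index[node][0]
    node = b
    while node not in ancestors_of_a:
        node = index[node][0]
    return node


def qid_distance(a, b, tree):
    """Delta_i = Depth_p / (depth of the whole tree) for one quasi-identifier (step S47)."""
    index = index_tree(tree)
    root = next(iter(tree))
    sub_root = smallest_subtree_root(a, b, index)
    return height(index[sub_root][1]) / height(tree[root])


def record_distance(r0, r1, trees):
    """Sum of the per-qID distances over all quasi-identifiers (steps S43 to S51)."""
    return sum(qid_distance(a, b, t) for a, b, t in zip(r0, r1, trees))
```

With this sketch, `qid_distance("20", "24", AGE_TREE)` evaluates 1/2 as in the FIG. 21 example, and `qid_distance("20", "29", AGE_TREE)` evaluates 2/2 as in the FIG. 22 example.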

Returning to the description of the processing in FIG. 19: once the distance between the representative record r_max and the record r_j has been calculated, the grouping processing unit 33 determines whether Δ_j ≤ Δ_a (step S35), that is, whether the distance Δ_j between r_max and r_j is at or below the threshold Δ_a. If Δ_j exceeds Δ_a, the processing proceeds to step S39.

If, on the other hand, the distance Δ_j is at or below the threshold Δ_a, the grouping processing unit 33 registers the record r_j and the distance Δ_j in the set Rc (step S37). The grouping processing unit 33 then determines whether any unprocessed record remains in the set X (step S39). If no unprocessed record remains in the set X, the processing returns to step S29 in FIG. 12 via terminal B.

In this way, each record that satisfies the conditions on distance and anonymization level is registered in the set Rc together with its distance.
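The candidate collection that fills the set Rc can be sketched as below. The `Record` class, the `pos` field standing in for the quasi-identifier values, the injected `dist` function, and the parameter name `k_floor` (the lower bound on the desired anonymization level) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str   # label used only in this sketch
    k: int      # desired anonymization level
    pos: float  # stand-in for the quasi-identifier values


def collect_candidates(r_max, X, dist, delta_a, k_floor):
    """Return Rc as (distance, record) pairs for records close to r_max."""
    Rc = []
    for r_j in X:
        if r_j is r_max or r_j.k < k_floor:
            continue  # skip the representative and records whose level is too low
        d = dist(r_max, r_j)
        if d <= delta_a:         # step S35
            Rc.append((d, r_j))  # step S37
    return Rc


records = [
    Record("rep", 5, 0.0),
    Record("near", 4, 0.3),
    Record("far", 5, 2.0),
    Record("low", 1, 0.1),
]
dist = lambda a, b: abs(a.pos - b.pos)
Rc = collect_candidates(records[0], records, dist, delta_a=0.5, k_floor=3)
print([(d, rec.name) for d, rec in Rc])  # [(0.3, 'near')]
```

Passing the distance function as an argument keeps the sketch independent of how the generalized hierarchical trees are stored.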

Returning to the description of the processing in FIG. 11, the grouping processing unit 33 determines whether the number of elements in Rc is k_max − 1 or more (step S9). If it is, the grouping processing unit 33 executes the record group generation process (step S11), which is described with reference to FIG. 23. When the record group generation process ends, the processing proceeds to step S15.

The grouping processing unit 33 extracts k_max − 1 records r_j from the set Rc in ascending order of Δ_j and adds the extracted records r_j to the set R (step S61). So that as many records as possible end up in some group, the set R initially holds only k_max records, including the representative record r_max. As a result, the group may ultimately contain k_max or more records.

The grouping processing unit 33 also adds the record group e to the set Y (step S63). The data for the set Y is stored in the second data storage unit 35.

Further, the grouping processing unit 33 removes the records included in the set R from the set X (step S65). The processing then returns to the calling process.

Through this processing, one record group e whose records are to be generalized together has been formed. Records may, however, still be added to the record group e later.
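Steps S61 to S65 can be sketched as follows. The dictionary layout of a record group and the use of plain strings as records are assumptions chosen to keep the example short.

```python
def generate_group(r_max, k_max, Rc, X, Y):
    """Steps S61-S65: form a group of k_max records around r_max."""
    # S61: take the k_max - 1 candidates closest to the representative.
    nearest = sorted(Rc, key=lambda pair: pair[0])[: k_max - 1]
    R = [r_max] + [rec for _, rec in nearest]
    # S63: register the new record group e in Y.
    Y.append({"representative": r_max, "records": R})
    # S65: grouped records leave the working set X (modified in place).
    X[:] = [rec for rec in X if rec not in R]
    return R


X = ["a", "b", "c", "d", "e"]
Y = []
Rc = [(0.5, "b"), (0.25, "c"), (0.75, "d")]  # (distance, record) pairs
R = generate_group("a", 3, Rc, X, Y)
print(R, X)  # ['a', 'c', 'b'] ['d', 'e']
```

Only the closest k_max − 1 candidates are taken, so the remaining candidate "d" stays in X and can still seed or join another group.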

Returning to the description of the processing in FIG. 11: if, on the other hand, the number of elements in the set Rc is less than k_max − 1, the grouping processing unit 33 executes the integration or deletion process (step S13), which is described with reference to FIG. 24. When this process ends, the processing proceeds to step S15.

The grouping processing unit 33 identifies the record group e_min for which the distance Δ(r_o, r_max) between the representative record r_max and the representative record r_o of a record group e included in the set Y is smallest (FIG. 24: step S71).

The grouping processing unit 33 then determines whether Δ(r_o, r_max) ≤ β × {desired anonymization level of the representative record r_o} (step S73). Even a representative record r_max that did not have enough records in its own vicinity may be a record that, for example, was passed over in the selection at step S61 of FIG. 23, provided that its shortest distance to the representative record of another record group falls within the range of that representative record (β × {desired anonymization level of the representative record r_o}). The check is therefore repeated here.

In this processing flow, the record group e_min with the smallest distance is identified first. However, there is no significant problem in adding the record r_max to any record group that satisfies the condition that the distance Δ(r_o, r_max) is at or below β × {desired anonymization level of the representative record r_o}.

If Δ(r_o, r_max) ≤ β × {desired anonymization level of the representative record r_o} does not hold, the processing proceeds to step S77. That is, the representative record r_max is deleted.

If, on the other hand, Δ(r_o, r_max) ≤ β × {desired anonymization level of the representative record r_o} holds, the grouping processing unit 33 adds the record r_max being processed to the record set R_min of the record group e_min (step S75).

The grouping processing unit 33 then removes the record r_max from the set X (step S77). The record r_max is thereby excluded from all subsequent processing. The processing then returns to the calling process.
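The integration or deletion process of steps S71 to S77 can be sketched as below. Records are modeled as dictionaries with a desired anonymization level `k` and a one-dimensional `pos` value; that layout, the `dist` function, and the handling of an empty set Y (treated as deletion) are assumptions of this sketch.

```python
def merge_or_delete(r_max, X, Y, dist, beta):
    """Steps S71-S77; return True if r_max joined a group, False if dropped."""
    merged = False
    if Y:
        # S71: record group whose representative is closest to r_max.
        e_min = min(Y, key=lambda e: dist(e["representative"], r_max))
        rep = e_min["representative"]
        # S73: is r_max within range of that representative?
        if dist(rep, r_max) <= beta * rep["k"]:
            e_min["records"].append(r_max)  # S75
            merged = True
    X.remove(r_max)  # S77: either way, r_max leaves the working set
    return merged


dist = lambda a, b: abs(a["pos"] - b["pos"])
rep1, rep2 = {"k": 3, "pos": 0.0}, {"k": 2, "pos": 10.0}
Y = [
    {"representative": rep1, "records": [rep1]},
    {"representative": rep2, "records": [rep2]},
]
near, far = {"k": 3, "pos": 1.0}, {"k": 3, "pos": 4.0}
X = [near, far]
print(merge_or_delete(near, X, Y, dist, beta=0.5))  # True
print(merge_or_delete(far, X, Y, dist, beta=0.5))   # False
```

In the example the range of `rep1` is 0.5 × 3 = 1.5, so the record at distance 1.0 is merged and the record at distance 4.0 is discarded.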

Returning to the description of the processing in FIG. 11, the grouping processing unit 33 determines whether the set X is empty (step S15). If the set X is not empty, the grouping processing unit 33 sets as the new r_max the record r that is farthest from the current r_max among the records in the set X with the highest desired anonymization level (step S17). The processing then returns to step S7.

Extracting the record r that is farthest from the current r_max in this way makes group generation for the record r efficient, because more records can be expected to remain in the vicinity of the record r. However, processing is still possible if any record located beyond the distance Δ_a from the current r_max is selected as the new r_max in this step.
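The selection in step S17 can be sketched as follows. As in the other sketches, the dictionary record layout and the `dist` function are illustrative assumptions.

```python
def next_representative(X, current, dist):
    """Step S17: among records with the highest level, take the farthest one."""
    top_k = max(rec["k"] for rec in X)
    candidates = [rec for rec in X if rec["k"] == top_k]
    return max(candidates, key=lambda rec: dist(current, rec))


dist = lambda a, b: abs(a["pos"] - b["pos"])
current = {"k": 5, "pos": 0.0}
X = [{"k": 5, "pos": 1.0}, {"k": 5, "pos": 3.0}, {"k": 2, "pos": 9.0}]
print(next_representative(X, current, dist))  # {'k': 5, 'pos': 3.0}
```

The record at position 9.0 is farther away but is not chosen, because the desired anonymization level takes priority over distance.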

If, on the other hand, the set X is empty, the generalization processing unit 36 executes the generalization process using the data stored in the setting data storage unit 34 and stores the result in the third data storage unit 37 (step S19). The generalization process is described with reference to FIGS. 25 to 27.

The generalization processing unit 36 identifies an unprocessed record group among the record groups e included in the set Y (FIG. 25: step S81). The generalization processing unit 36 then selects one unprocessed qIDi (step S83).

Next, the generalization processing unit 36 identifies, in the generalized hierarchical tree, the common parent of the qIDi values of the records included in the set R of the record group e, and replaces the qIDi values of those records with the value of the common parent (step S85). For example, as shown in FIG. 26, when the set R contains records whose age (qIDi) values are "20 years old", "24 years old", and "25 years old", those values are replaced with the common parent "20s".

The generalization processing unit 36 then determines whether any unprocessed qIDi remains (step S87). If one remains, the processing returns to step S83. If none remains, the generalization processing unit 36 determines whether any unprocessed record group e remains in the set Y (step S89). If one remains, the processing returns to step S81. If none remains, the generalization processing unit 36 stores, in the third data storage unit 37, all the record groups e included in the set Y together with the data of the records included in the set R of each record group e (step S91).
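The generalization loop of steps S81 to S91 can be sketched for a single record group as follows. The child-to-parent table `AGE_PARENT`, its node labels, and the sample disease values are assumptions made for illustration; only the replacement of the ages 20, 24, and 25 by "20s" comes from the example of FIG. 26.

```python
# Hypothetical generalized hierarchical tree for "age" (child -> parent).
AGE_PARENT = {
    "20": "20-24", "24": "20-24",
    "25": "25-29", "29": "25-29",
    "20-24": "20s", "25-29": "20s",
}


def ancestors(value, parent):
    """Return [value, parent, grandparent, ..., root]."""
    chain = [value]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def common_parent(values, parent):
    """Lowest node covering every value (the value itself if all are equal)."""
    chains = [ancestors(v, parent) for v in values]
    shared = set(chains[0]).intersection(*map(set, chains[1:]))
    for node in chains[0]:  # walk upward from the first value
        if node in shared:
            return node
    raise ValueError("values do not share a hierarchy")


def generalize_group(R, trees):
    """Step S85 for every quasi-identifier: overwrite with the common parent."""
    for qid, parent in trees.items():
        shared = common_parent([rec[qid] for rec in R], parent)
        for rec in R:
            rec[qid] = shared
    return R


# The disease values are made-up sample data; non-qID fields stay untouched.
R = [
    {"age": "20", "disease": "flu"},
    {"age": "24", "disease": "cold"},
    {"age": "25", "disease": "flu"},
]
generalize_group(R, {"age": AGE_PARENT})
print([rec["age"] for rec in R])  # ['20s', '20s', '20s']
```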

By executing the processing described above, records can be anonymized while keeping information loss low, even when the desired anonymization level k differs from record to record.

For example, applying the processing described above to data such as that in FIG. 6 yields data such as that shown in FIG. 27. In the example of FIG. 27, the qIDs, namely age and address, have been generalized.

After this processing, the data shown in FIG. 27 is transmitted by the transmission unit 38 to the information analyst device 5, for example in response to a request from the information analyst device 5.

However, as shown in FIG. 28 for example, although the data is anonymized so as to lose as little information as possible, analysis becomes harder when the granularity of the data differs, as with addresses of "Tokyo" and "Kanto" or ages of "20s" and "23 years old". To address this, as shown in FIG. 29, each attribute is split according to the levels of the generalized hierarchical tree, and attribute values that are unknown because they have been generalized are set to "blank". That is, when the address is "Kanto" and the prefecture is unknown, the prefecture attribute becomes "blank". Likewise, when the age is "20s" and the exact age is unknown, the age attribute becomes "blank". Performing this processing in the information analyst device 5 makes data analysis such as correlation extraction easier.
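The attribute splitting described above can be sketched as follows. The two-level hierarchies and the column names ("region", "prefecture", "decade", "age") are assumptions; the actual columns of FIG. 29 are not reproduced here.

```python
def ancestors(value, parent):
    """Return [value, parent, grandparent, ..., root]."""
    chain = [value]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def split_levels(value, parent, level_names):
    """Spread one value over per-level columns, coarsest level first.

    Levels finer than the given value are unknown and stay "blank".
    """
    path = ancestors(value, parent)[::-1]  # root ... value
    row = {name: "blank" for name in level_names}
    for name, node in zip(level_names, path):
        row[name] = node
    return row


ADDRESS_PARENT = {"Tokyo": "Kanto"}  # hypothetical prefecture -> region
AGE_PARENT = {"23": "20s"}           # hypothetical age -> decade

print(split_levels("Kanto", ADDRESS_PARENT, ["region", "prefecture"]))
# {'region': 'Kanto', 'prefecture': 'blank'}
print(split_levels("Tokyo", ADDRESS_PARENT, ["region", "prefecture"]))
# {'region': 'Kanto', 'prefecture': 'Tokyo'}
print(split_levels("20s", AGE_PARENT, ["decade", "age"]))
# {'decade': '20s', 'age': 'blank'}
```

After the split, every row has the same columns, so values generalized to different levels can be compared column by column.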

Although an embodiment of the present technology has been described above, the present technology is not limited to it. For example, the functional block diagram of the information collector device 3 shown in FIG. 5 may not match the actual program module configuration. In the processing flows as well, steps may be reordered or executed in parallel as long as the processing result does not change.

Medical information has been used as the example, but this is only one example, and any kind of data may be used.

The information provider devices A and B, the information collector device 3, and the information analyst device 5 described above are computer devices. As shown in FIG. 30, in each of them a memory 2501, a CPU (Central Processing Unit) 2503, a hard disk drive (HDD) 2505, a display control unit 2507 connected to a display device 2509, a drive device 2513 for a removable disk 2511, an input device 2515, and a communication control unit 2517 for connecting to a network are connected by a bus 2519. The operating system (OS) and the application program for carrying out the processing of this embodiment are stored in the HDD 2505 and are read from the HDD 2505 into the memory 2501 when executed by the CPU 2503. The CPU 2503 controls the display control unit 2507, the communication control unit 2517, and the drive device 2513 according to the processing of the application program so that they perform predetermined operations. Data being processed is stored mainly in the memory 2501, but may be stored in the HDD 2505. In the embodiment of the present technology, the application program for carrying out the processing described above is stored on and distributed via the computer-readable removable disk 2511, and is installed from the drive device 2513 onto the HDD 2505. It may also be installed onto the HDD 2505 via a network such as the Internet and the communication control unit 2517. Such a computer device realizes the various functions described above through the organic cooperation of hardware such as the CPU 2503 and the memory 2501 with programs such as the OS and the application program.

The embodiment described above can be summarized as follows.

The anonymization processing method according to the present embodiment includes: (A) a process of selecting a first data block representing a group from a data storage unit that stores a plurality of data blocks, each including a first data item value, a second data item value that can be generalized to anonymize the first data item value, and a number of data blocks to be grouped for generalizing the second data item value; (B) a process of extracting, from the data storage unit, second data blocks whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is equal to or greater than a threshold; (C) a process of grouping the first data block and at least some of the second data blocks if the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped included in the first data block; (D) a process of executing the selecting process, the extracting process, and the grouping process on the data blocks stored in the data storage unit in descending order of the number of data blocks to be grouped; and (E) a process of generalizing, for each group, the second data item values in the data blocks included in the group according to a predetermined rule.

Executing such processing makes it possible to anonymize appropriately while keeping information loss low, even when the number of data blocks to be grouped (corresponding to the desired anonymization level in the embodiment) is set individually for each data block.

In the executing process, a data block whose distance from the most recently selected first data block is greatest, or a data block whose distance from the most recently selected first data block exceeds the above threshold, may be selected. The former allows grouping to be performed efficiently.

The distance may be calculated from the number of levels in the smallest subtree containing the two second data item values, within the data of a hierarchical tree that generalizes second data item values step by step, and the number of levels of the hierarchical tree. This allows the distance to be defined appropriately.

The threshold for the distance may be determined according to the number of data blocks to be grouped included in the first data block. For example, the threshold is made larger when that number is large and smaller when it is small, so that the range over which data blocks are grouped is adjusted appropriately.

Further, the threshold for the number of data blocks may be determined according to the number of data blocks to be grouped included in the first data block. If data blocks whose numbers of data blocks to be grouped differ too widely are grouped together, the data is generalized too far, so this restriction is imposed to keep information loss low.

Further, the anonymization processing method may also include a process of adding the first data block selected in the selecting process to another group if, when the total number of the first data block and the second data blocks is less than the number of data blocks to be grouped included in the first data block, the distance from the first data block included in that other group is equal to or less than a threshold. This reduces the number of data blocks that are discarded. A condition that "the shortest distance is equal to or less than the threshold" may also be imposed as an additional condition.

Further, the anonymization processing method may also include a process of discarding the first data block when the distance from the first data block included in the other group exceeds the threshold.

A program for causing a computer to execute the processing described above can be created, and the program is stored in a computer-readable storage medium or storage device such as a flexible disk, an optical disk such as a CD-ROM, a magneto-optical disk, a semiconductor memory (for example, ROM), or a hard disk.

The following appendices are further disclosed with regard to the embodiments, including the examples above.

(Appendix 1)
An anonymization processing method executed by a computer, the method comprising:
a process of selecting a first data block representing a group from a data storage unit that stores a plurality of data blocks, each including a first data item value, a second data item value that can be generalized to anonymize the first data item value, and a number of data blocks to be grouped for generalizing the second data item value;
a process of extracting, from the data storage unit, second data blocks whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is equal to or greater than a threshold;
a process of grouping the first data block and at least some of the second data blocks if the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped included in the first data block;
a process of executing the selecting process, the extracting process, and the grouping process on the data blocks stored in the data storage unit in descending order of the number of data blocks to be grouped; and
a process of generalizing, for each group, the second data item values in the data blocks included in the group according to a predetermined rule.

(Appendix 2)
The anonymization processing method according to Appendix 1, wherein, in the executing process,
a data block whose distance from the most recently selected first data block is greatest, or a data block whose distance from the most recently selected first data block exceeds the threshold, is selected.

(Appendix 3)
The anonymization processing method according to Appendix 1 or 2, wherein the distance
is calculated from the number of levels in the smallest subtree containing two second data item values, within the data of a hierarchical tree that generalizes second data item values step by step, and the number of levels of the hierarchical tree.

(Appendix 4)
The anonymization processing method according to any one of Appendices 1 to 3, wherein the threshold for the distance
is determined according to the number of data blocks to be grouped included in the first data block.

(Appendix 5)
The anonymization processing method according to any one of Appendices 1 to 4, wherein the threshold for the number of data blocks
is determined according to the number of data blocks to be grouped included in the first data block.

(Appendix 6)
The anonymization processing method according to any one of Appendices 1 to 5, further comprising
a process of adding the first data block selected in the selecting process to another group if, when the total number of the first data block and the second data blocks is less than the number of data blocks to be grouped included in the first data block, the distance from the first data block included in that other group is equal to or less than a threshold.

(Appendix 7)
The anonymization processing method according to Appendix 6, further comprising
a process of discarding the first data block when the distance from the first data block included in the other group exceeds the threshold.

(Appendix 8)
An anonymization processing program for causing a computer to execute:
a process of selecting a first data block representing a group from a data storage unit that stores a plurality of data blocks, each including a first data item value, a second data item value that can be generalized to anonymize the first data item value, and a number of data blocks to be grouped for generalizing the second data item value;
a process of extracting, from the data storage unit, second data blocks whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is equal to or greater than a threshold;
a process of grouping the first data block and at least some of the second data blocks if the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped included in the first data block;
a process of executing the selecting process, the extracting process, and the grouping process on the data blocks stored in the data storage unit in descending order of the number of data blocks to be grouped; and
a process of generalizing, for each group, the second data item values in the data blocks included in the group according to a predetermined rule.

(Appendix 9)
An information processing apparatus comprising:
a first processing unit that executes, on the data blocks stored in a data storage unit in descending order of the number of data blocks to be grouped, a process of selecting a first data block representing a group from the data storage unit, which stores a plurality of data blocks each including a first data item value, a second data item value that can be generalized to anonymize the first data item value, and a number of data blocks to be grouped for generalizing the second data item value, a process of extracting, from the data storage unit, second data blocks whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is equal to or greater than a threshold, and a process of grouping the first data block and at least some of the second data blocks if the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped included in the first data block; and
a second processing unit that generalizes, for each group, the second data item values in the data blocks included in the group according to a predetermined rule.

3 information collector device
31 receiving unit
32 first data storage unit
33 grouping processing unit
34 setting data storage unit
35 second data storage unit
36 generalization processing unit
37 third data storage unit
38 transmission unit

Claims

An anonymization processing method executed by a computer, the method comprising:
a process of selecting a first data block representing a group from a data storage unit that stores a plurality of data blocks, each including a first data item value, a second data item value that can be generalized to anonymize the first data item value, and a number of data blocks to be grouped for generalizing the second data item value;
a process of extracting, from the data storage unit, second data blocks whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is equal to or greater than a threshold;
a process of grouping the first data block and at least some of the second data blocks if the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped included in the first data block;
a process of executing the selecting process, the extracting process, and the grouping process on the data blocks stored in the data storage unit in descending order of the number of data blocks to be grouped; and
a process of generalizing, for each group, the second data item values in the data blocks included in the group according to a predetermined rule.

The anonymization processing method according to claim 1, wherein, in the executing process,
a data block whose distance from the most recently selected first data block is greatest, or a data block whose distance from the most recently selected first data block exceeds the threshold, is selected.

The anonymization processing method according to claim 1 or 2, wherein the distance
is calculated from the number of levels in the smallest subtree containing two second data item values, within the data of a hierarchical tree that generalizes second data item values step by step, and the number of levels of the hierarchical tree.

The anonymization processing method according to claim 1, wherein the threshold for the distance
is determined according to the number of data blocks to be grouped included in the first data block.

The anonymization processing method according to claim 1, wherein the threshold for the number of data blocks
is determined according to the number of data blocks to be grouped included in the first data block.

The anonymization processing method according to any one of claims 1 to 5, further comprising: a process of adding the first data block selected in the selecting process to another group when the total number of the first data block and the second data blocks is less than the number of data blocks to be grouped that is included in the first data block and the distance from the first data block included in the other group is less than or equal to a threshold.

The anonymization processing method according to claim 6, further comprising: a process of discarding the selected first data block when its distance from the first data block included in the other group exceeds the threshold.
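
The fallback described in the two claims above (add the block to another group if it is close enough to that group's first data block, otherwise discard it) might look like the following Python sketch. The function name `place_leftover`, the representation of a group as a list whose first element is its first data block, and the tie-breaking choice of the nearest qualifying group are assumptions made for illustration.

```python
def place_leftover(value, groups, distance, threshold):
    """Try to add a leftover block to an existing group.

    `groups` is a list of lists; element 0 of each group stands for that
    group's first (representative) data block. Returns True when the
    block was added to a group and False when it has to be discarded.
    """
    best_group, best_distance = None, None
    for group in groups:
        d = distance(value, group[0])
        if d <= threshold and (best_distance is None or d < best_distance):
            best_group, best_distance = group, d
    if best_group is None:
        return False  # no group is close enough: discard
    best_group.append(value)
    return True
```

Discarding a block that fits no group trades a small loss of data for not having to over-generalize an existing group.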

A process of selecting a first data block representing a group from a data storage unit that stores a plurality of data blocks, each including a first data item value, a second data item value that can be generalized to anonymize the first data item value, and the number of data blocks to be grouped for generalization of the second data item value;
A process of extracting, from the data storage unit, a second data block whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is greater than or equal to a threshold;
A process of grouping at least some of the second data blocks with the first data block when the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped that is included in the first data block;
A process of executing the selecting process, the extracting process, and the grouping process on the data blocks stored in the data storage unit in descending order of the number of data blocks to be grouped;
A process of generalizing, for each group, the second data item values in the data blocks included in the group according to a predetermined rule;
An anonymization processing program for causing a computer to execute the above processes.

A first processing unit that executes, on the data blocks stored in a data storage unit in descending order of the number of data blocks to be grouped: a process of selecting a first data block representing a group from the data storage unit, which stores a plurality of data blocks each including a first data item value, a second data item value that can be generalized to anonymize the first data item value, and the number of data blocks to be grouped for generalization of the second data item value; a process of extracting, from the data storage unit, a second data block whose distance from the first data block, calculated based on the second data item values, is less than a threshold and whose number of data blocks to be grouped is equal to or greater than a threshold; and a process of grouping at least some of the second data blocks with the first data block when the total number of the first data block and the second data blocks is equal to or greater than the number of data blocks to be grouped that is included in the first data block;
A second processing unit that generalizes, for each group, the second data item values in the data blocks included in the group according to a predetermined rule;
An information processing apparatus comprising the above processing units.