JP6148370B1

JP6148370B1 - Grouping device, grouping method, and computer program

Info

Publication number: JP6148370B1
Application number: JP2016066129A
Authority: JP
Inventors: 優一真田; 悠佑榎本; 柳本　清; 清柳本; 浩鞍留; 寛寺門
Original assignee: Nippon Telegraph and Telephone West Corp
Current assignee: Nippon Telegraph and Telephone West Corp
Priority date: 2016-03-29
Filing date: 2016-03-29
Publication date: 2017-06-14
Anticipated expiration: 2036-03-29
Also published as: JP2017182304A

Abstract

[Problem] To maintain both anonymity and usefulness. [Solution] A grouping device includes an equalization processing unit that divides a plurality of records, which are non-anonymized information having non-anonymized attribute values, into a plurality of groups so that the number of records in each group is equalized. The plurality of records consist of time-series information. The equalization processing unit divides the unprocessed records among the plurality of records into the plurality of groups based on one or more groups that include a record, determined from among the records, serving as the reference for a group. [Selected drawing] FIG. 1

Description

The present invention relates to information anonymization technology.

Conventionally, large amounts of information are collected as big data and analyzed to obtain new information. Big data also includes information, such as personal information, that cannot be analyzed as it is. Anonymization processing is therefore applied to the collected information so that it can be put to secondary use.

Japanese Unexamined Patent Application Publication No. 2015-046030

However, it has been difficult for conventional anonymization processing to strike an appropriate balance between anonymity and usefulness.

In view of the above circumstances, an object of the present invention is to provide an anonymization technique capable of maintaining both anonymity and usefulness.

One aspect of the present invention is a grouping device including an equalization processing unit that divides a plurality of records, which are non-anonymized information having non-anonymized attribute values, into a plurality of groups so that the number of records in each group is equalized, wherein the plurality of records consist of time-series information, and the equalization processing unit divides the unprocessed records among the plurality of records into the plurality of groups based on one or more groups that include a record, determined from among the records, serving as the reference for a group.

One aspect of the present invention is the grouping device described above, wherein the equalization processing unit uses an unprocessed record and the records included in each group to calculate, for each group, the average of the inter-record distances over the pairs of records, each inter-record distance being the difference in time-series information between two records expressed as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data, and assigns the unprocessed record to the group with the smallest average.

One aspect of the present invention is the grouping device described above, wherein the equalization processing unit creates all combinations for the optimum number of groups obtained based on the plurality of records and the minimum number of records to be included in one group; calculates, for each created combination, the total of the inter-record distances, each inter-record distance being the difference in time-series information between two records expressed as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data; and determines the combination with the largest calculated total as the records serving as the references for the groups.

One aspect of the present invention is a grouping method performed by a grouping device that groups a plurality of records that are non-anonymized information having non-anonymized attribute values. The method includes an equalization processing step in which the grouping device divides the plurality of records into a plurality of groups so that the number of records in each group is equalized. The plurality of records consist of time-series information. In the equalization processing step, the grouping device divides the unprocessed records among the plurality of records into the plurality of groups based on one or more groups that include a record, determined from among the records, serving as the reference for a group. Also in the equalization processing step, the grouping device creates all combinations for the optimum number of groups obtained based on the plurality of records and the minimum number of records to be included in one group; calculates, for each created combination, the total of the inter-record distances, each inter-record distance being the difference in time-series information between two records expressed as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data; and determines the combination with the largest calculated total as the records serving as the references for the groups.

One aspect of the present invention is a computer program for causing a computer to execute an equalization processing step of dividing a plurality of records, which are non-anonymized information having non-anonymized attribute values, into a plurality of groups so that the number of records in each group is equalized, wherein the plurality of records consist of time-series information, and, in the equalization processing step, the unprocessed records among the plurality of records are divided into the plurality of groups based on one or more groups that include a record, determined from among the records, serving as the reference for a group.

The present invention makes it possible to maintain both anonymity and usefulness.

FIG. 1 is a system configuration diagram showing the system configuration of the anonymization system 1.
FIG. 2 is a diagram showing a specific example of processing target records.
FIG. 3 is a flowchart showing the flow of processing of the grouping device 20.
FIG. 4 is a flowchart showing the flow of processing of the grouping device 20.
FIG. 5 is a diagram showing a specific example of group information.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a system configuration diagram showing the system configuration of the anonymization system 1. The anonymization system 1 includes a non-anonymized information storage unit 10, a grouping device 20, a group information storage unit 30, an anonymization processing unit 40, and an anonymized information storage unit 50.
The non-anonymized information storage unit 10 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. The non-anonymized information storage unit 10 stores information that has not been anonymized (hereinafter referred to as "non-anonymized information"). The non-anonymized information includes at least one attribute. Hereinafter, a unit of information in which one attribute is expressed as time-series data is called a record. For example, a record indicating the presence or absence of a user has a value for each time slot of each day of the week (for example, day/AM and day/PM). Hereinafter, each time slot (AM or PM) of each day of the week is referred to as time data. In other words, a record is composed of a plurality of time data. Although the example here uses the value of each time slot (AM or PM) of each day of the week as time data, the time data may instead be the value at each time of day on each day of the week. Non-anonymized information may be expressed as such records. Part of the information stored in the non-anonymized information storage unit 10 may already have been anonymized.

The non-anonymized information storage unit 10 further stores condition information. Condition information is information that defines conditions concerning the non-anonymized information. A specific example of condition information is a connection definition, which is information defining how individual items of numerical information are joined into a record.

The grouping device 20 classifies into a plurality of groups the plurality of records of the attribute targeted for anonymization (hereinafter referred to as the "anonymization target attribute") among the non-anonymized information stored in the non-anonymized information storage unit 10. The grouping device 20 is configured using an information processing device such as a mainframe, a workstation, or a personal computer. The grouping device 20 includes a CPU (Central Processing Unit), a memory, an auxiliary storage device, and the like connected by a bus. By executing a grouping program, the grouping device 20 functions as a device including a condition information acquisition unit 201 and an equalization processing unit 202. All or some of the functions of the grouping device 20 may be implemented in hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).

The condition information acquisition unit 201 acquires condition information concerning the anonymization target attribute from the non-anonymized information storage unit 10. The condition information acquisition unit 201 outputs the acquired condition information to the equalization processing unit 202.
The equalization processing unit 202 acquires, from the non-anonymized information storage unit 10, the plurality of records of non-anonymized information that are the target of the anonymization process (hereinafter referred to as "processing target records"). The equalization processing unit 202 executes equalization processing on the acquired processing target records. Through the equalization processing, the equalization processing unit 202 divides the acquired processing target records into a plurality of groups so that the number of records in each group is equalized. In the equalization processing, the equalization processing unit 202 groups the records so that the number of records in each group does not fall below a minimum number specified in advance. The minimum number is, for example, the value of "k" in k-anonymization. This embodiment describes the case where "k" is 4, but "k" may be any other natural number. The equalization processing unit 202 records information indicating the result of the equalization processing (hereinafter referred to as "group information") in the group information storage unit 30. The group information indicates the records belonging to each group generated by the equalization processing. The group information includes, for example, a group number, information indicating the definition of each group (hereinafter referred to as "group definition information"), and information indicating the number of records belonging to each group (hereinafter referred to as "record number information").

The group information storage unit 30 is configured using a storage device such as a magnetic hard disk device or a semiconductor storage device. The group information storage unit 30 stores the group information generated by the grouping device 20.
The anonymization processing unit 40 performs anonymization processing on the non-anonymized information stored in the non-anonymized information storage unit 10 based on the group information stored in the group information storage unit 30. For example, the anonymization processing unit 40 anonymizes the records belonging to each group by replacing their anonymization target attribute values with a value obtained by generalizing the values of the records in that group. For example, when the anonymization target attribute values of the records belonging to a certain group are 10, 11, 13, and 14, anonymization is performed by replacing them with a value such as "10-15", which indicates a range, or "12", which is the median or the average. As another example, the anonymization processing unit 40 performs anonymization by dropping one of the values (the values of the individual time data) of the anonymization target attribute of the records belonging to each group, that is, by dropping the information for a certain time slot. By executing such anonymization processing, the anonymization processing unit 40 generates information in which the anonymization target attribute values of the non-anonymized information have been anonymized (hereinafter referred to as "anonymized information").
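The generalization described above can be illustrated with a minimal sketch. The function name `generalize` and the returned dictionary keys are hypothetical and do not come from the patent. The sketch reports the tightest range that covers the values (10-14), whereas the text uses the wider bucket "10-15"; how the range is widened is not specified in the source, so that part is left out.

```python
from statistics import mean, median


def generalize(values):
    """Return candidate generalized replacements for one group's values.

    Hypothetical helper: the patent only says the values are replaced by a
    range, a median, or an average, not how the choice is made.
    """
    return {
        "range": f"{min(values)}-{max(values)}",  # tightest covering range
        "median": median(values),
        "mean": mean(values),
    }


group_values = [10, 11, 13, 14]  # the example values from the text
summary = generalize(group_values)
print(summary)
```

Every record in the group would then carry the same replacement value, so the records in a group become indistinguishable on that attribute.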

FIG. 2 is a diagram showing a specific example of processing target records.
As shown in FIG. 2, the processing target records include a plurality of records of non-anonymized information for one attribute. In FIG. 2, the processing target records include 23 records. In FIG. 2, each record shows the presence or absence of a user in each time slot (for example, AM and PM) of each day of the week. That is, one record has 14-dimensional information. A "1" in a record indicates that the user was present in that time slot, and a "0" indicates that the user was absent in that time slot.

In FIG. 2, the top record of the processing target records has a NO value of "1", a Monday AM value of "1", a Monday PM value of "0", a Tuesday AM value of "1", a Tuesday PM value of "0", ..., a Saturday AM value of "0", a Saturday PM value of "0", a Sunday AM value of "0", and a Sunday PM value of "0". That is, the top record shows that the user identified by the record with NO "1" was present on Monday AM and Tuesday AM and was absent on Monday PM, Tuesday PM, Saturday AM, and Saturday PM.

FIGS. 3 and 4 are flowcharts showing the flow of processing of the grouping device 20. At the start of the processing of FIGS. 3 and 4, the equalization processing unit 202 first uses the processing target records to calculate the optimum number of groups for making the number of records in each group k (for example, 4) (step S101). Specifically, the equalization processing unit 202 takes the value obtained by dividing the number of extracted processing target records by k as the optimum number of groups. Taking FIG. 2 as an example, with 23 processing target records and k = 4, the optimum number of groups is 5.
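As a small illustration of step S101, the sketch below computes the group count by integer division. The source does not state the rounding rule; discarding the fractional part is an assumption inferred from the example (23 / 4 = 5.75, and the stated result is 5). The function name `optimal_group_count` is hypothetical.

```python
def optimal_group_count(num_records: int, k: int) -> int:
    """Number of groups so that every group can hold at least k records.

    Assumption: the quotient is truncated, which matches the 23 -> 5 example.
    """
    if k < 1:
        raise ValueError("k must be a natural number")
    return num_records // k


print(optimal_group_count(23, 4))
```

Truncating rather than rounding up keeps the k-anonymity minimum achievable: 5 groups of at least 4 records need only 20 of the 23 records, while 6 groups would need 24.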

Next, the equalization processing unit 202 selects, from the processing target records, as many records as the calculated optimum number of groups, and creates every possible combination (step S102). Taking FIG. 2 as an example with an optimum number of groups of 5, the number of combinations for selecting, from the 23 records, the records that serve as the reference for each group (hereinafter referred to as "reference records") is C(23, 5) = 33,649.
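The count in step S102 can be checked directly. The sketch below enumerates every way of picking 5 of the 23 record numbers using the Python standard library; the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

record_numbers = range(1, 24)  # NO 1 to 23, as in FIG. 2
group_count = 5

# Every candidate set of reference records, one tuple per combination.
candidate_sets = list(combinations(record_numbers, group_count))

print(len(candidate_sets), comb(23, 5))
```

Exhaustive enumeration is cheap at this size, but the count grows quickly with the number of records, so this brute-force form is practical only for small inputs.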

Next, the equalization processing unit 202 calculates, for every combination, the total of the two-record distances and the standard deviation of the two-record distances (step S103). The two-record distance is calculated using, for example, the Manhattan distance. The two-record distance need not be limited to the Manhattan distance; any method may be used as long as it can calculate the difference in time-series information between two records as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data. In the above example, each combination consists of 5 records, so for each combination the total and the standard deviation are calculated over C(5, 2) = 10 two-record distances. The equalization processing unit 202 calculates the total and the standard deviation of the two-record distances based on the information for the same time slot in the two records of each pair. The equalization processing unit 202 selects the combination with the largest total of two-record distances (step S104). The equalization processing unit 202 determines whether exactly one combination has been selected (step S105).
If exactly one combination has been selected (step S105: YES), the equalization processing unit 202 determines each record included in the selected combination to be a reference record (step S106).

On the other hand, if more than one combination has been selected (step S105: NO), the equalization processing unit 202 selects, from among them, the combination with the smallest standard deviation of the two-record distances (step S107). The equalization processing unit 202 determines each record included in the selected combination to be a reference record (step S106). The equalization processing unit 202 then assigns a different group number to each reference record (step S108).
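Steps S103 to S107 can be sketched as a single search over combinations. This is an illustrative reading, not the patented implementation: the names `manhattan` and `pick_reference_records` are hypothetical, the population standard deviation is assumed because the source does not say which form is used, and any tie that survives both criteria is resolved by enumeration order, which the source does not specify.

```python
from itertools import combinations
from statistics import pstdev


def manhattan(a, b):
    """Manhattan distance between two equal-length records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pick_reference_records(records, group_count):
    """Pick record numbers with the largest total pairwise distance.

    records: dict mapping record number -> tuple of time data values.
    Ties on the total are broken by the smallest standard deviation.
    """
    best_key, best_combo = None, None
    for combo in combinations(sorted(records), group_count):
        dists = [manhattan(records[i], records[j])
                 for i, j in combinations(combo, 2)]
        spread = pstdev(dists) if dists else 0.0
        key = (-sum(dists), spread)  # larger total first, then smaller spread
        if best_key is None or key < best_key:
            best_key, best_combo = key, combo
    return best_combo


# Toy one-dimensional records (not taken from FIG. 2).
toy = {1: (0,), 2: (4,), 3: (4,), 4: (2,)}
print(pick_reference_records(toy, 3))
```

In the toy data, three combinations share the largest total (8), and the one whose distances are most even, records 1, 2, and 4, is chosen, which spreads the reference records evenly across the data.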

The equalization processing unit 202 extracts an unprocessed record from the processing target records (step S109). Here, an unprocessed record is a record to which no group number has been assigned. For example, the equalization processing unit 202 extracts the unprocessed record with the smallest number (NO in FIG. 2). The equalization processing unit 202 determines whether the number of groups containing fewer than k records is 0 (step S110). If no group contains fewer than k records (step S110: YES), the equalization processing unit 202 calculates, for each group, the average of the two-record distances with the record extracted in step S109 included (step S111). The equalization processing unit 202 assigns to the record extracted in step S109 the group number of the group with the smallest average (step S112).

The equalization processing unit 202 determines whether any unprocessed record remains among the processing target records (step S113). If an unprocessed record remains (step S113: YES), the equalization processing unit 202 executes the processing from step S109 onward.
On the other hand, if no unprocessed record remains (step S113: NO), the equalization processing unit 202 calculates the number of records in each group (step S114). Next, the equalization processing unit 202 determines the condition for each group (step S115). Specifically, the equalization processing unit 202 determines, as the group condition, the average value of each item of time-series data across all records in the group. However, the equalization processing unit 202 uses for the group condition only those items of time-series information whose spread of values among the records in the group is less than a predetermined value. The spread of values may be the standard deviation of the values of the records in the group, or the difference between the maximum and minimum values of the records in the group. The equalization processing unit 202 then outputs group information including the group number, the group definition information, and the record number information of the group to the group information storage unit 30 (step S116).
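Step S115 can be sketched as a per-column summary of a group. The sketch assumes that "each item of time-series data" means each time slot (one column of the record table), and it measures the spread as the difference between the maximum and minimum, which is one of the two options the text allows. The function name `group_condition`, the slot labels, and the threshold value are hypothetical.

```python
from statistics import mean


def group_condition(vectors, slot_names, max_spread):
    """Average of each time slot whose values vary by less than max_spread.

    vectors: the records of one group, each a tuple of time data values.
    Slots whose max-min spread reaches max_spread are left out.
    """
    condition = {}
    for name, column in zip(slot_names, zip(*vectors)):
        if max(column) - min(column) < max_spread:
            condition[name] = mean(column)
    return condition


# Toy group of four records over three slots (not taken from FIG. 2).
group = [(1, 0, 1), (1, 1, 1), (1, 0, 1), (1, 1, 1)]
slots = ["Mon AM", "Mon PM", "Tue AM"]
print(group_condition(group, slots, max_spread=1))
```

Here "Mon PM" is dropped because the group disagrees on it, so the group definition keeps only the slots that characterize every member.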

If, in step S110, the number of groups containing fewer than k records is not 0 (step S110: NO), the equalization processing unit 202 calculates, for each group containing fewer than k records, the average of the two-record distances with the record extracted in step S109 included (step S117). The equalization processing unit 202 assigns to the record extracted in step S109 the group number of the group with the smallest average (step S112). This processing completes a grouping such as the one shown in FIG. 5. FIG. 5 is a diagram showing a specific example of group information. As shown in FIG. 5, the numbers of records in the groups are 5, 4, 4, 4, and 5, which satisfies "k" in k-anonymization.
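The assignment loop of steps S109 to S113 and S117 can be sketched as follows. Several details are assumptions: the average is taken over all pairs among the group's current members plus the candidate record, the Manhattan distance is used, and a tie between groups goes to the lowest group number. The names `mean_pair_distance` and `assign_remaining` are hypothetical.

```python
from itertools import combinations
from statistics import mean


def manhattan(a, b):
    """Manhattan distance between two equal-length records."""
    return sum(abs(x - y) for x, y in zip(a, b))


def mean_pair_distance(vectors):
    """Average distance over every pair of the given records."""
    return mean(manhattan(a, b) for a, b in combinations(vectors, 2))


def assign_remaining(records, reference_ids, k):
    """Assign each non-reference record to a group, smallest number first."""
    groups = {g: [rid] for g, rid in enumerate(reference_ids, start=1)}
    for rid in sorted(records):
        if rid in reference_ids:
            continue
        # While any group is below k, only those groups are candidates.
        short = [g for g in groups if len(groups[g]) < k]
        candidates = short or list(groups)
        best = min(
            candidates,
            key=lambda g: mean_pair_distance(
                [records[m] for m in groups[g]] + [records[rid]]
            ),
        )
        groups[best].append(rid)
    return groups


# Toy one-dimensional records (not taken from FIG. 2); 1 and 2 are references.
toy = {1: (0,), 2: (10,), 3: (1,), 4: (2,)}
print(assign_remaining(toy, reference_ids=(1, 2), k=2))
```

Record 4 is closer to group 1, but group 1 already holds k records while group 2 does not, so record 4 goes to group 2. This is how the procedure fills every group to the minimum before similarity alone decides.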

The grouping device 20 configured as described above makes it possible to maintain both anonymity and usefulness. Specifically, through the equalization processing, the grouping device 20 groups the records evenly so that the number of records in each group does not fall below the minimum number specified in advance. This equalizes the number of records across groups and thereby ensures anonymity. In addition, the grouping device 20 treats each time-series record as a single piece of information and performs clustering in which the difference in time-series information between two records is a distance in a vector space whose vector elements are the values of the time data constituting the time-series data. This allows each record to be classified into a group whose time-series information is similar, which ensures usefulness. In this way, the grouping device 20 can increase anonymity by equalizing the number of records while maintaining the usefulness of the data.

<Modification>
The anonymization system 1 may be configured without any one or more of the non-anonymized information storage unit 10, the group information storage unit 30, and the anonymized information storage unit 50. In that case, a component corresponding to each omitted storage unit is provided outside the anonymization system 1. The components included in the anonymization system 1 (the grouping device 20 and the anonymization processing unit 40) communicate via a network with the external components corresponding to the storage units and acquire the stored information.
In this embodiment, the case where "1" and "0" are used as the numerical values in the processing target records has been described as an example, but the values need not be limited to these. Other numerical values may be used in the processing target records. In addition, although this embodiment shows a configuration in which one record has 14-dimensional information, one record may have n-dimensional information (n is an integer of 2 or more).

The embodiment of the present invention has been described in detail above with reference to the drawings. However, the specific configuration is not limited to this embodiment and also includes designs and the like that do not depart from the gist of the present invention.

Description of reference signs: 10: non-anonymized information storage unit, 20: grouping device, 30: group information storage unit, 40: anonymization processing unit, 50: anonymized information storage unit, 201: condition information acquisition unit, 202: equalization processing unit

Claims

1. A grouping device comprising an equalization processing unit that divides a plurality of records, which are non-anonymized information having non-anonymized attribute values, into a plurality of groups so that the number of records in each group is equalized, wherein
the plurality of records consist of time-series information,
the equalization processing unit divides the unprocessed records among the plurality of records into the plurality of groups based on one or more groups that include a record, determined from among the records, serving as the reference for a group, and
the equalization processing unit creates all combinations for the optimum number of groups obtained based on the plurality of records and the minimum number of records to be included in one group, calculates, for each created combination, the total of the inter-record distances, each inter-record distance being the difference in time-series information between two records expressed as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data, and determines the combination with the largest calculated total as the records serving as the references for the groups.

2. The grouping device according to claim 1, wherein the equalization processing unit uses an unprocessed record and the records included in each group to calculate, for each group, the average of the inter-record distances over the pairs of records, each inter-record distance being the difference in time-series information between two records expressed as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data, and assigns the unprocessed record to the group with the smallest average.

3. A grouping method performed by a grouping device that groups a plurality of records that are non-anonymized information having non-anonymized attribute values, the method comprising
an equalization processing step in which the grouping device divides the plurality of records, which are non-anonymized information having non-anonymized attribute values, into a plurality of groups so that the number of records in each group is equalized, wherein
the plurality of records consist of time-series information,
in the equalization processing step, the grouping device divides the unprocessed records among the plurality of records into the plurality of groups based on one or more groups that include a record, determined from among the records, serving as the reference for a group, and
in the equalization processing step, the grouping device creates all combinations for the optimum number of groups obtained based on the plurality of records and the minimum number of records to be included in one group, calculates, for each created combination, the total of the inter-record distances, each inter-record distance being the difference in time-series information between two records expressed as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data, and determines the combination with the largest calculated total as the records serving as the references for the groups.

4. A computer program for causing a computer to execute
an equalization processing step of dividing a plurality of records, which are non-anonymized information having non-anonymized attribute values, into a plurality of groups so that the number of records in each group is equalized, wherein
the plurality of records consist of time-series information,
in the equalization processing step, the unprocessed records among the plurality of records are divided into the plurality of groups based on one or more groups that include a record, determined from among the records, serving as the reference for a group, and
in the equalization processing step, all combinations for the optimum number of groups obtained based on the plurality of records and the minimum number of records to be included in one group are created, the total of the inter-record distances is calculated for each created combination, each inter-record distance being the difference in time-series information between two records expressed as a distance in a vector space whose vector elements are the values of the time data constituting the time-series data, and the combination with the largest calculated total is determined as the records serving as the references for the groups.