JP2016133872A

JP2016133872A - Information anonymity method, information anonymity program and information anonymity device

Info

Publication number: JP2016133872A
Application number: JP2015006586A
Authority: JP
Inventors: Takanori Oikawa; Koichi Ito; Yuji Yamaoka
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2015-01-16
Filing date: 2015-01-16
Publication date: 2016-07-25

Abstract

PROBLEM TO BE SOLVED: To suppress a reduction in the amount of information disclosed.
SOLUTION: In an information anonymization method according to an embodiment, a computer extracts, from log data associated with each of a plurality of users, character strings that are common to k or more users. The computer groups the extracted character strings based on their inclusion relationships and assigns an attribute to each group. For each attribute, the computer associates each character string in the attribute with the users whose log data contains that character string. The computer then anonymizes the character strings of each attribute so that the number of users having the same combination of attribute character strings is k or more.
SELECTED DRAWING: Figure 1

Description

Embodiments of the present invention relate to an information anonymization method, an information anonymization program, and an information anonymization device.

There are cases in which log data or the like that stores each person's information in each row (record) is to be converted so that as much information as possible remains while each person's privacy is protected, and is then put to secondary use such as market analysis. Data anonymization is one technique aimed at protecting privacy when such log data is converted. In anonymization, each unit of data (row, record, etc.) is processed by replacement, deletion, and the like so that it has no personal identifiability. Personal identifiability means that a knowledgeable person (for example, the data owner) can uniquely identify his or her own data when viewing the processed data. Specifically, the data is processed so that, when one unit of the anonymized data is viewed, the owner of the data can be narrowed down only to a group of k or more people.

k-anonymization replaces and deletes values so as to achieve k-anonymity in a two-dimensional table in which each data owner occupies one row and the data is classified into attributes that form the columns. k-anonymity means that the rows to which any combination of values of the attributes set as QIs (quasi-identifiers) can correspond belong to k or more owners.
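The k-anonymity condition described above can be sketched as a simple check over a table. This is a minimal illustration, not the method of the embodiment: the function name `is_k_anonymous`, the column names, and the sample rows are all hypothetical, and each row is assumed to belong to a distinct owner.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    is shared by at least k rows (one row per owner is assumed)."""
    combos = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

# Hypothetical table: one row per data owner.
table = [
    {"owner": "A", "age": "30s", "zip": "100*"},
    {"owner": "B", "age": "30s", "zip": "100*"},
    {"owner": "C", "age": "40s", "zip": "200*"},
]

print(is_k_anonymous(table[:2], ["age", "zip"], 2))
print(is_k_anonymous(table, ["age", "zip"], 2))
```

In the second call, owner C is the only row with the combination ("40s", "200*"), so the table as a whole does not satisfy k-anonymity for k = 2.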

International Publication No. 2011/145401

However, when data is anonymized for privacy protection and anonymity is sufficiently secured, the amount of data concealed by the anonymization increases, and the amount of information disclosed may decrease. If a large amount of data is concealed by anonymization, the anonymized data may lose much of its value for secondary use.

In one aspect, an object is to provide an information anonymization method, an information anonymization program, and an information anonymization device capable of suppressing a reduction in the amount of information disclosed.

In a first proposal, in an information anonymization method, a computer executes a process of extracting, from log data relating to each of a plurality of users, character strings common to k or more users. The computer also executes a process of grouping the extracted character strings based on their inclusion relationships and assigning an attribute to each group. The computer also executes a process of associating, for each attribute, each character string in the attribute with the users whose log data contains that character string. The computer further executes a process of anonymizing the character strings of each attribute so that the number of users having the same combination of attribute character strings is k or more.

According to one embodiment of the present invention, a reduction in the amount of information disclosed can be suppressed.

FIG. 1 is a block diagram illustrating the configuration of the information anonymization device according to the embodiment.
FIG. 2 is a flowchart illustrating the operation of the information anonymization device according to the embodiment.
FIG. 3 is an explanatory diagram illustrating the enumeration of word sequences.
FIG. 4 is an explanatory diagram illustrating conversion into table data.
FIG. 5 is an explanatory diagram illustrating the setting of values in the table data.
FIG. 6A is an explanatory diagram illustrating the setting of values in the table data.
FIG. 6B is an explanatory diagram illustrating the setting of values in the table data.
FIG. 7A is an explanatory diagram illustrating the creation of a generalized hierarchical tree.
FIG. 7B is an explanatory diagram illustrating the creation of a generalized hierarchical tree.
FIG. 8 is an explanatory diagram illustrating the generation of anonymized table data.
FIG. 9 is an explanatory diagram illustrating conversion into log data.
FIG. 10 is an explanatory diagram illustrating conversion into log data.
FIG. 11 is an explanatory diagram illustrating an example of a computer that executes an information anonymization program.
FIG. 12 is an explanatory diagram illustrating how the disclosure amount in anonymization differs depending on the threshold.

Hereinafter, an information anonymization method, an information anonymization program, and an information anonymization device according to embodiments are described with reference to the drawings. In the embodiments, components having the same function are denoted by the same reference numerals, and redundant description is omitted. The information anonymization method, information anonymization program, and information anonymization device described in the following embodiments are merely examples and do not limit the embodiments. The following embodiments may also be combined as appropriate to the extent that they do not contradict one another.

FIG. 1 is a block diagram illustrating the configuration of the information anonymization device 1 according to the embodiment. As shown in FIG. 1, the information anonymization device 1 is an information processing device such as a PC (Personal Computer), and includes an input unit 10, a control unit 20, and an output unit 30.

The input unit 10 receives data from an input device such as a keyboard or mouse, a reading device that reads data from a storage medium such as a CD-ROM or DVD, a communication device that communicates with other equipment, or the like (none of these devices is shown). The data accepted by the input unit 10 includes, for example, log data D10 and a threshold k. The control unit 20 refers to the input log data D10 and threshold k.

The log data D10 is data describing a history or the like for each of a plurality of users. The log data D10 is, for example, data exchanged between companies or between a company and its users, such as log data from applications and network devices or output data from sensor devices, that can be handled as text data. Log data D10 that can be handled as text data can be divided into elements in units such as lines or records. In this embodiment, each line of the log data D10 is assumed to describe one history entry (for example, one action performed by one user). When the information anonymization device 1 anonymizes the log data D10, it processes the combination of element units (rows, records, etc.) of the log data D10 so that the combination has no personal identifiability.

The threshold k is set in accordance with the applicable privacy protection standard so that, when the log data D10 has been anonymized to remove personal identifiability, a viewer of the anonymized data can narrow down the owner of the data only to a group of k or more people. For example, the threshold k is set to a value satisfying 2 ≤ k ≤ (the number of users in the input log data D10).

The control unit 20 has an internal memory for storing programs that define various processing procedures and the data they require, and a CPU (Central Processing Unit) that sequentially executes the programs, and executes various processes using these. The control unit 20 includes a generation unit 21, an anonymization processing unit 22, and a log data conversion unit 23.

The generation unit 21 includes a table data conversion unit 211 and a generalized hierarchical tree generation unit 212, and generates table data D21 and a generalized hierarchical tree D22, which are the intermediate data used by the anonymization processing unit 22 to k-anonymize the log data D10. k-anonymization can be applied to the table data D21, a two-dimensional table in which each data owner (user) occupies one row and the data is classified into attributes that form the columns. k-anonymity means that the rows to which any combination of values of the attributes set as QIs (quasi-identifiers) can correspond belong to k or more owners, and k-anonymization replaces and deletes values in the table data D21 so as to achieve k-anonymity. Value replacement uses the generalized hierarchical tree D22, which represents the inclusion relationships between attributes as a tree structure and defines the replacement values. Compared with deletion, replacement using the generalized hierarchical tree D22 can drop information (anonymize) step by step along the tree structure, so less information is lost.

Here, a portion of the anonymized data that remains identical to the original data is called a disclosed portion, and the amount of disclosed portions is called the disclosure amount. The disclosure amount is expressed as a number of characters or a number of words. When anonymized log data sets that all satisfy k-anonymity are compared, the one with the larger disclosure amount is more useful for secondary use. An anonymization that yields a larger disclosure amount is therefore more useful for secondary use.

The table data conversion unit 211 refers to the log data D10 and the threshold k and converts the log data D10 into the table data D21. Specifically, the table data conversion unit 211 splits the log data D10 into words using an existing technique such as splitting at delimiters, splitting by character type, or morphological analysis. For example, with character-type splitting, the line "name=db.path=c:/data" in the log data D10 is split into {name, =, db, ., path, =, c, :, /, data}.
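Character-type splitting of the example line can be sketched as follows. This is only an illustration under stated assumptions: the description does not define the character classes, so the sketch assumes that runs of ASCII letters and digits form one word and that every other non-space character is a word of its own. The function name `split_by_char_type` is hypothetical.

```python
import re

# Assumption: letters and digits are grouped into runs; any other
# non-whitespace character is emitted as a single-character word.
_WORD_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def split_by_char_type(line):
    """Split one log line into words by character type."""
    return _WORD_PATTERN.findall(line)

words = split_by_char_type("name=db.path=c:/data")
print(words)
```

Under these assumptions the example line yields the same ten words as the splitting shown in the description.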

Based on the word splitting of the log data D10, the table data conversion unit 211 sets word sequences that appear in the log data D10 of k or more owners as attributes (columns) of the table data D21. A word sequence is an ordered combination of words. A word sequence distinguishes whether the words between the beginning and the end of a line are adjacent or non-adjacent. For example, when consecutive words in the sequence are not adjacent in the line, the position between them is treated as a non-adjacent portion and is represented by the special character string "[GAP]".

For example, in the word sequence {name, [GAP], db}, "[GAP]" indicates that "name" and "db" are not adjacent and that one or more words lie between them. Any character string may be used as the special character string indicating a non-adjacent portion, as long as it can be identified later.

Word sequences may also have an inclusion relationship with one another. For example, {name, [GAP], db} and {name, =, db} have an inclusion relationship, in which {name, [GAP], db} includes {name, =, db}. In this embodiment, the including side ({name, [GAP], db} in the above example) is treated as the higher side.
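One way to test the inclusion relationship between two word sequences is sketched below. It is an interpretation rather than the embodiment's definition: it assumes that "[GAP]" in the higher sequence can stand for one or more consecutive elements of the lower sequence, and that an ordinary word only matches the identical word. The function name `includes` is hypothetical.

```python
GAP = "[GAP]"

def includes(upper, lower):
    """Return True if word sequence `upper` includes `lower`.

    Assumption: each [GAP] in `upper` absorbs one or more consecutive
    elements of `lower`; ordinary words must match exactly.
    """
    def match(i, j):
        if i == len(upper):
            return j == len(lower)
        if j == len(lower):
            return False
        if upper[i] == GAP:
            # Try every possible non-empty span for this gap.
            return any(match(i + 1, end) for end in range(j + 1, len(lower) + 1))
        return upper[i] == lower[j] and match(i + 1, j + 1)
    return match(0, 0)

print(includes(["name", GAP, "db"], ["name", "=", "db"]))
print(includes(["name", "=", "db"], ["name", GAP, "db"]))
```

With this definition a sequence also includes itself, so a caller that needs strict inclusion would additionally check that the two sequences differ.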

The table data conversion unit 211 enumerates all word sequences that can be extracted from each line of the word-split log data D10. Specifically, starting from the word sequence obtained by word splitting, the table data conversion unit 211 treats each word as taking one of two states, "as is" or "[GAP]", and outputs every possible combination of states as a word sequence. This outputs 2^(number of words) word sequences.

For example, for {name, =, db, .}, the word sequences {name, =, db, .}, {[GAP], =, db, .}, {[GAP], [GAP], db, .}, ... are output. When [GAP] occurs consecutively in an output word sequence, the occurrences are merged into one. For example, {name, =, [GAP], [GAP]} becomes {name, =, [GAP]}. A word sequence that reduces to {[GAP]} is excluded.
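The enumeration described above, including the merging of consecutive [GAP] entries and the exclusion of {[GAP]}, can be sketched as follows. The function name `enumerate_sequences` is hypothetical, and the sketch enumerates all 2^n states directly, which is only practical for short lines.

```python
from itertools import product

GAP = "[GAP]"

def enumerate_sequences(words):
    """Enumerate every word sequence obtainable from `words` by keeping
    each word as is or turning it into [GAP]."""
    sequences = set()
    for mask in product((False, True), repeat=len(words)):
        seq = []
        for word, hidden in zip(words, mask):
            token = GAP if hidden else word
            if token == GAP and seq and seq[-1] == GAP:
                continue  # merge consecutive [GAP] entries into one
            seq.append(token)
        if seq and seq != [GAP]:  # drop the all-gap sequence
            sequences.add(tuple(seq))
    return sequences

result = enumerate_sequences(["name", "=", "db", "."])
print(len(result))
```

For four distinct words, the 16 raw states reduce to 15 sequences once the all-gap state is excluded.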

Next, the table data conversion unit 211 extracts the word sequences that appear for k or more owners and sets them as attributes (columns) of the table data D21. It then sets each line of the log data D10 in a row of the table data D21 as the value of the attribute that is highest in the inclusion relationship among the attributes the line matches. In an inclusion relationship, the including side is the higher side. When there are multiple highest attributes, the higher attribute is determined by a criterion such as "character string length / number of words" or "number of owners in which it appears". The table data conversion unit 211 excludes attributes for which a value has already been set and sets the line as the value of the highest attribute among those that have no value yet. When a line of the log data D10 matches no attribute, that line is not set as a value. The table data conversion unit 211 outputs the table data D21, with its attributes (columns) and rows set, as the result of converting the log data D10.

For each attribute of the table data D21, the generalized hierarchical tree generation unit 212 generates a generalized hierarchical tree D22 that represents the inclusion relationships between attributes as a tree structure. Specifically, the generalized hierarchical tree generation unit 212 generates a tree-structured generalized hierarchical tree D22 whose root node is the word sequence of the target attribute of the table data D21 and whose other nodes are the attributes that have an inclusion relationship with the word sequence of the root node. For attributes in an inclusion relationship, the generalized hierarchical tree generation unit 212 makes the including side the parent node. The generalized hierarchical tree generation unit 212 treats all lines of the input log data D10 and all attributes of the table data D21 as node candidates. When a node being added has multiple parent node candidates, the generalized hierarchical tree generation unit 212 selects one according to a criterion such as the "character string length / number of words" or "number of owners in which it appears" of the parent node's word sequence.

Based on the table data D21 and the generalized hierarchical tree D22 generated by the generation unit 21, the anonymization processing unit 22 performs k-anonymization on the table data D21, anonymizing the character strings of each attribute so that the number of users (owners) having the same combination of attribute character strings is k or more, and generates anonymized table data D23. Specifically, the anonymization processing unit 22 anonymizes the character strings of each attribute, in order from the attributes of the higher groups in the generalized hierarchical tree D22, so that the number of owners for whom the word sequences of the attributes associated with the owner in the table data D21 are identical is k or more.

The log data conversion unit 23 converts each row of the generated anonymized table data D23 into log data. Specifically, the log data conversion unit 23 first extracts all values remaining in the anonymized table data D23. Next, the log data conversion unit 23 performs character string replacement on the original log data D10 of the corresponding owner using the extracted values. More specifically, for each line of the original log data D10, it selects, from the values extracted from the anonymized table data D23 that match the line, the value that yields the largest disclosure amount. The value that yields the largest disclosure amount is the value with the greatest character string length or number of words. It then replaces the character strings of the target line that correspond to [GAP] in the selected value. Replacement methods include "redaction" (blacking out), which replaces every such portion with the same character string; "tokenization", which replaces identical replaced portions with identical character strings; and "encryption", which replaces portions with character strings that only a user holding the decryption key can read. The replacement method is assumed to be selected and set by the user in consideration of safety and usability. In this embodiment, "tokenization" is used for replacement.

The output unit 30 outputs the log data D30 anonymized by the control unit 20. Specifically, the output unit 30 outputs the log data D30 to a file or the like, by displaying it on a display (not shown), or by transmitting it to another device connected via a communication device (not shown).

The operation of the information anonymization device 1 is now described in detail. FIG. 2 is a flowchart illustrating the operation of the information anonymization device 1 according to the embodiment.

As shown in FIG. 2, when processing starts, the input unit 10 accepts input of the log data D10 and the threshold k. Next, the table data conversion unit 211 splits the accepted log data D10 into words using an existing technique such as splitting at delimiters, splitting by character type, or morphological analysis (S1).

Next, the table data conversion unit 211 enumerates all word sequences from the word-split log data D10 (S2). FIG. 3 is an explanatory diagram illustrating the enumeration of word sequences. As shown in FIG. 3, for the line "name=db." in the log data D10, all word sequences are enumerated, such as {name, =, db, .}, {name, [GAP]}, and so on.

Next, based on all the enumerated word sequences and the input threshold k, the table data conversion unit 211 extracts the word sequences that appear for k or more owners (S3) and sets the extracted word sequences as attributes (columns) of the table in the table data D21 (S4).
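Steps S2 to S4 can be sketched as counting, for each enumerated word sequence, the number of distinct owners whose log lines produce it. The sketch below is illustrative only: the function names and the four-owner data set are hypothetical (loosely modeled on the owner A to owner D example, whose full contents are not reproduced in the text), and lines are assumed to be already split into words.

```python
from collections import defaultdict
from itertools import product

GAP = "[GAP]"

def enumerate_sequences(words):
    """All word sequences of one line, with consecutive gaps merged."""
    sequences = set()
    for mask in product((False, True), repeat=len(words)):
        seq = []
        for word, hidden in zip(words, mask):
            token = GAP if hidden else word
            if not (token == GAP and seq and seq[-1] == GAP):
                seq.append(token)
        if seq and seq != [GAP]:
            sequences.add(tuple(seq))
    return sequences

def frequent_sequences(logs_by_owner, k):
    """Word sequences that appear in the logs of at least k owners."""
    owners_of = defaultdict(set)
    for owner, lines in logs_by_owner.items():
        for words in lines:
            for seq in enumerate_sequences(words):
                owners_of[seq].add(owner)  # count owners, not occurrences
    return {seq for seq, owners in owners_of.items() if len(owners) >= k}

# Hypothetical word-split logs for four owners.
logs = {
    "A": [["User", "=", "Abe"], ["A", "is", "started"]],
    "B": [["User", "=", "Sato"], ["A", "is", "started"]],
    "C": [["User", "=", "Sato"], ["B", "is", "stopped"]],
    "D": [["User", "=", "Oda"], ["C", "is", "stopped"]],
}
attributes = frequent_sequences(logs, k=2)
print(("User", "=", "Sato") in attributes, ("User", "=", "Abe") in attributes)
```

Counting owners rather than raw occurrences matters here: a sequence repeated many times by a single owner would still identify that owner.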

FIG. 4 is an explanatory diagram illustrating conversion into the table data D21. As shown in FIG. 4, assume that the log data D10 of owner A through owner D and k = 2 have been input. In this case, word sequences that appear for k = 2 or more owners, such as "User=Sato" and "A is started", are extracted and set as attributes (columns) of the table data D21.

Next, the table data conversion unit 211 performs a first loop process (S5 to S10) over the log data of each owner (Di (i = 0 to u)). In the first loop process, the table data conversion unit 211 adds a row (Ri) for Di to the table of the table data D21 and performs a second loop process (S7 to S9) over all lines of the log data (Li (i = 0 to m)).

In the second loop process, the table data conversion unit 211 sets the log data line (Li) as the value of the matching attribute in the row (Ri) for Di (S8).

FIG. 5 is an explanatory diagram illustrating the setting of values in the table data. More specifically, it illustrates how values are set for the log data D10 and the table data D21 shown in FIG. 4.

As shown in FIG. 5, the table data conversion unit 211 performs the first loop process on the log data of owner A through owner D, thereby setting the log data of owner A through owner D in the table data D21 in turn. For example, for owner A, the log data D10 contains a line (L1) reading "User=Abe" and a line (L2) reading "A is Started". Through the second loop process, these lines (L1) and (L2) are set as values of the matching attributes, such as "User=[GAP]" and "[GAP] is Started".

The table data conversion unit 211 excludes attributes for which a value has already been set and sets the line in the attribute that is highest in the inclusion relationship among the matching attributes.

FIGS. 6A and 6B are explanatory diagrams illustrating the setting of values in the table data. More specifically, FIG. 6A illustrates the case in which attributes that already have a value are excluded, and FIG. 6B illustrates the case in which the value is set in the attribute that is highest in the inclusion relationship.

As shown in FIG. 6A, when "K is started" in the log data D10 of owner A is set, the highest matching attribute is "[GAP] is started", so the value is set in that attribute. Next, when "A is started" is set, the highest matching attribute is again "[GAP] is started", but a value has already been set in this attribute. "A is started" is therefore set as the value of "A is started", the next attribute in the inclusion relationship.

As shown in FIG. 6B, when "A is started" in the log data D10 of owner A is set, three attributes match: "A is started", "A is [GAP]", and "[GAP] is started". Comparing the inclusion relationships of the three attributes, "A is [GAP]" is higher than "A is started", and "[GAP] is started" is higher than "A is started". There is no inclusion relationship between "A is [GAP]" and "[GAP] is started". In this case, the table data conversion unit 211 selects the highest attribute in the inclusion relationship according to a criterion such as "character string length / number of words" or "number of owners in which it appears". In the illustrated example, "[GAP] is started" is treated as the higher attribute based on the number of words.

Returning to FIG. 2, following the first loop process (S5 to S10), the generalized hierarchical tree generation unit 212 performs a third loop process (S11 to S13) over each attribute (Aj (j = 0 to r)) of the table data D21.

In the third loop process, the generalized hierarchical tree generation unit 212 creates a tree-structured generalized hierarchical tree D22 whose root node is the attribute (Aj) of the table data D21 and whose other nodes are the attributes that have an inclusion relationship with the word sequence of the root node (S12).

FIGS. 7A and 7B are explanatory diagrams illustrating the creation of the generalized hierarchical tree D22. More specifically, FIG. 7A illustrates the creation of a generalized hierarchical tree D22 with a single level of inclusion, and FIG. 7B illustrates the creation of a generalized hierarchical tree D22 with multiple (two) levels of inclusion.

As shown in FIG. 7A, when "User=[GAP]" is the root node, a generalized hierarchical tree D22 is created in which "User=Abe", "User=Sato", and "User=Oda", which are lower than the root node in the inclusion relationship, are the child nodes.

As shown in FIG. 7B, when "[GAP] is [GAP]" is the root node, a generalized hierarchical tree D22 is created in which "[GAP] is started" and "[GAP] is stopped", which are one level below the root node in the inclusion relationship, are the nodes one level down. Under "[GAP] is started", "A is started", which is one level lower in the inclusion relationship, becomes a node one further level down. Under "[GAP] is stopped", "C is stopped" and "B is stopped", which are one level lower in the inclusion relationship, become nodes one further level down.
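The two-level tree of FIG. 7B can be sketched as a child-to-parent mapping. This is a minimal sketch under assumptions, not the embodiment's procedure: when a node has several parent candidates, the sketch picks the candidate with the most ordinary (non-[GAP]) words, which is only one possible reading of the "number of words" criterion. The names `includes`, `build_hierarchy`, and `specificity` are hypothetical.

```python
GAP = "[GAP]"

def includes(upper, lower):
    """True if `upper` includes `lower`; [GAP] absorbs one or more elements."""
    def match(i, j):
        if i == len(upper):
            return j == len(lower)
        if j == len(lower):
            return False
        if upper[i] == GAP:
            return any(match(i + 1, end) for end in range(j + 1, len(lower) + 1))
        return upper[i] == lower[j] and match(i + 1, j + 1)
    return match(0, 0)

def specificity(seq):
    """Number of ordinary words in a sequence (assumed tie-break criterion)."""
    return sum(1 for token in seq if token != GAP)

def build_hierarchy(root, candidates):
    """Map each node under `root` to its parent node."""
    nodes = [c for c in candidates if c != root and includes(root, c)]
    parent = {}
    for node in nodes:
        uppers = [p for p in [root] + nodes if p != node and includes(p, node)]
        # Choose the most specific including sequence as the parent.
        parent[node] = max(uppers, key=lambda p: (specificity(p), p))
    return parent

root = (GAP, "is", GAP)
candidates = [
    (GAP, "is", "started"), (GAP, "is", "stopped"),
    ("A", "is", "started"), ("B", "is", "stopped"), ("C", "is", "stopped"),
    ("User", "=", "Abe"),  # not included by the root, so it is ignored
]
tree = build_hierarchy(root, candidates)
print(tree[("A", "is", "started")])
```

Storing the tree as a child-to-parent map keeps generalization simple, because anonymizing a value by one step is a single dictionary lookup.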

Returning to FIG. 2, following the third loop process (S11 to S13), the anonymization processing unit 22 performs k-anonymization on the table data D21 based on the table data D21 and the generalized hierarchical tree D22, anonymizing the character strings of each attribute so that the number of users having the same combination of attribute character strings is k or more (S14). The anonymization processing unit 22 then generates and outputs the anonymized table data D23 obtained by k-anonymizing the table data D21.

FIG. 8 is an explanatory diagram illustrating the generation of the anonymized table data D23. As shown in FIG. 8, for attributes such as “User = [GAP]” and “[GAP] = is started”, the values are anonymized with reference to the generalized hierarchical tree D22 so that the number of users is k or more. For example, for the attribute “User = [GAP]”, the values “User = Abe” and “User = Oda” do not satisfy k-anonymity with k = 2, so they are anonymized to “User = [GAP]”.
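The generalization step in the “User = [GAP]” example can be sketched as follows. This is a simplified illustration under stated assumptions: it handles a single attribute column, whereas S14 requires that the combination of all attribute values be shared by k or more users. The function name `generalize_column`, the `parent` map, and the sample column are hypothetical and are not taken from FIG. 8.

```python
from collections import Counter

def generalize_column(values, parent, k):
    """Replace values held by fewer than k users with their parent node.

    `parent` maps each value to its parent in the generalized
    hierarchical tree (None for the root). The loop repeats until every
    remaining value is held by at least k users or is already a root.
    A root held by fewer than k users is left as-is in this sketch.
    """
    values = list(values)
    while True:
        counts = Counter(values)
        rare = {
            v for v, n in counts.items()
            if n < k and parent.get(v) is not None
        }
        if not rare:
            return values
        values = [parent[v] if v in rare else v for v in values]

# Hypothetical tree and column (one entry per user) for illustration.
parent = {
    "User = [GAP]": None,
    "User = Abe": "User = [GAP]",
    "User = Sato": "User = [GAP]",
    "User = Oda": "User = [GAP]",
}
column = ["User = Sato", "User = Sato", "User = Abe", "User = Oda"]
anonymized = generalize_column(column, parent, k=2)
```

In this sample, “User = Abe” and “User = Oda” each occur once and are generalized to “User = [GAP]”, while “User = Sato” already occurs twice and is kept.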

Returning to FIG. 2, following S14, the log data conversion unit 23 performs a fourth loop process (S15 to S17) on each row (Rj (j = 0 to u)) of the anonymized table data D23.

In the fourth loop process, the log data conversion unit 23 converts the row (Rj) of the anonymized table data D23 into log data (S16). Specifically, the log data conversion unit 23 performs a character string replacement process, using the values of Rj, on the original log data D10 of the owner corresponding to the row (Rj) of the anonymized table data D23. Following the fourth loop process (S15 to S17), the output unit 30 outputs the anonymized log data D30 (S18).

FIG. 9 is an explanatory diagram illustrating the conversion into log data. As shown in FIG. 9, suppose that the original log data D10 is “B is started” and the value of Rj in the anonymized table data D23 is “[GAP] is started”. In this case, the log data conversion unit 23 uses “[GAP] is started” to convert “B is started” into “toke01 is started”. As a result, the log data D10 is converted into log data D30 in which the character string “B” is anonymized to “toke01”.

FIG. 10 is an explanatory diagram illustrating the conversion into log data, more specifically the conversion when a plurality of “values” match a given line. As shown in FIG. 10, the only value that “K is started” matches is “[GAP] is started”. Accordingly, “K is started” is converted into “toke01 is started” by applying the matching value. The values that “A is started” matches are “[GAP] is started” and “A is started”. In this case, “A is started”, which discloses more information, is applied, and the line remains “A is started”.
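The conversion rule of FIGS. 9 and 10 can be sketched as follows. This is a minimal sketch under several assumptions: “[GAP]” is treated as a wildcard over one or more characters, the value with the fewest “[GAP]” markers is taken to be the one with the largest disclosure amount, and every gap is replaced by the single token “toke01” that appears in the FIG. 9 example. The names `matches` and `convert_line` are hypothetical, and the behaviour when no value matches is not described in the text, so the sketch simply returns `None`.

```python
import re

GAP = "[GAP]"
TOKEN = "toke01"  # replacement token as it appears in the FIG. 9 example

def matches(value: str, line: str) -> bool:
    """True if the log line fits `value`, with [GAP] as a wildcard."""
    regex = ".+".join(re.escape(part) for part in value.split(GAP))
    return re.fullmatch(regex, line) is not None

def convert_line(line, values):
    """Convert one log line using the values of an anonymized row Rj."""
    candidates = [v for v in values if matches(v, line)]
    if not candidates:
        return None  # assumption: unmatched lines are not handled here
    # Assumption: fewer gaps means a larger disclosure amount.
    best = min(candidates, key=lambda v: v.count(GAP))
    return best.replace(GAP, TOKEN)

values = ["[GAP] is started", "A is started"]
converted_k = convert_line("K is started", values)
converted_a = convert_line("A is started", values)
```

Here `converted_k` is “toke01 is started”, because only “[GAP] is started” matches, and `converted_a` stays “A is started”, because the exact value discloses more than the generalized one.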

The components of each illustrated unit do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the units is not limited to the illustrated one, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, the generation unit 21, the anonymization processing unit 22, the log data conversion unit 23, and the like of the above embodiment may be integrated.

Furthermore, all or any part of the various processing functions performed by each device may be executed on a CPU (or a microcomputer such as an MPU or MCU (Micro Controller Unit)). It goes without saying that all or any part of the various processing functions may also be executed on a program analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU), or on hardware using wired logic.

The various processes described in the above embodiment can be realized by executing a program prepared in advance on a computer. An example of a computer that executes a program having the same functions as the above embodiment is therefore described below. FIG. 11 is an explanatory diagram illustrating an example of a computer 300 that executes an information anonymization program.

As shown in FIG. 11, the computer 300 includes a CPU 301 that executes various arithmetic processes, an input device 302 that receives data input, and a monitor 303. The computer 300 also includes a medium reading device 304 that reads programs and the like from a storage medium, an interface device 305 for connecting to various devices, and a communication device 306 for connecting to other devices by wire or wirelessly. The computer 300 further includes a RAM 307 that temporarily stores various types of information, and a hard disk device 308. The devices 301 to 308 are connected to a bus 309.

The hard disk device 308 stores an information anonymization program having the same functions as the processing units described in the above embodiment. The hard disk device 308 also stores various data for realizing the information anonymization program. The input device 302 receives, for example, input from a user. The monitor 303 displays an operation screen for receiving input from the user and displays various types of information. The interface device 305 is connected to, for example, a printing device. The communication device 306 is connected to, for example, a network.

The CPU 301 performs various processes by reading each program stored in the hard disk device 308, loading it into the RAM 307, and executing it. These programs can cause the computer 300 to function in the same way as the processing units described in the above embodiment.

Note that the information anonymization program does not necessarily have to be stored in the hard disk device 308. For example, the computer 300 may read and execute a program stored in a storage medium readable by the computer 300. Storage media readable by the computer 300 include, for example, portable recording media such as a CD-ROM, a DVD disc, and a USB (Universal Serial Bus) memory, semiconductor memories such as a flash memory, and hard disk drives. Alternatively, the information anonymization program may be stored in a device connected to a public line, the Internet, a LAN, or the like, and the computer 300 may read the information anonymization program from that device and execute it.

As described above, the information anonymization apparatus 1 extracts character strings common to k or more users from log data relating to each of a plurality of users. The information anonymization apparatus 1 groups the extracted character strings based on an inclusion relationship and assigns an attribute to each group. For each attribute, the information anonymization apparatus 1 associates a character string in the attribute with the users whose log data includes that character string. The information anonymization apparatus 1 then anonymizes the character string of each attribute so that the number of users having the same combination of attribute character strings is k or more. Because the information anonymization apparatus 1 can adjust the degree of replacement for each attribute, it can suppress a reduction in the disclosure amount.
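The first step of this summary, extracting character strings common to k or more users, can be sketched as follows. This is a deliberately simplified illustration: it assumes that whole log lines are compared for exact equality, whereas the embodiment also works with character strings containing “[GAP]”. The function name `common_strings` and the sample data are hypothetical.

```python
from collections import defaultdict

def common_strings(logs_by_user, k):
    """Return the strings that appear in the logs of at least k users.

    `logs_by_user` maps a user identifier to that user's log lines.
    Each user is counted at most once per string.
    """
    users_per_string = defaultdict(set)
    for user, lines in logs_by_user.items():
        for line in lines:
            users_per_string[line].add(user)
    return {s for s, users in users_per_string.items() if len(users) >= k}

# Hypothetical log data for three users.
logs_by_user = {
    "u1": ["A is started", "B is stopped"],
    "u2": ["A is started"],
    "u3": ["C is stopped"],
}
shared = common_strings(logs_by_user, k=2)
```

With k = 2, only “A is started” is shared by two users, so it is the only string carried forward to the grouping step in this sample.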

FIG. 12 is an explanatory diagram illustrating how the disclosure amount in anonymization differs depending on the threshold (v). Here, the threshold (v) is a value set so that v ≥ k, and is used during k-anonymization to generate intermediate data (a table) by narrowing down in stages. FIG. 12 illustrates the case where log data D300 satisfying k-anonymity with k = 2 is generated from the input log data D100.

As shown in FIG. 12, the disclosure amount is larger when the intermediate data is generated with v = 3 or 4 (> k) and then converted than when the intermediate data (table) is generated with v = 2 (= k) and converted so as to satisfy k-anonymity based on that table. Specifically, the disclosure amount is larger to the extent that “*** is started” is output in place of “***”. Thus, when anonymization is performed after generating intermediate data with a threshold (v) of v ≥ k, the disclosure amount after anonymization changes with the threshold (v), so an appropriate threshold (v) must be set to increase the disclosure amount.

In contrast, in the present embodiment, as shown in FIG. 8, using the generalized hierarchical tree D22 allows the degree of replacement to be adjusted for each attribute, so anonymization with a large disclosure amount can be performed without setting the threshold (v). Specifically, by using the anonymization processing unit 22, “User = [GAP]” is anonymized as a word sequence with v = 2, and “[GAP] is started” is anonymized as a word sequence with v = 4.

DESCRIPTION OF SYMBOLS
1 ... Information anonymization apparatus
10 ... Input unit
20 ... Control unit
30 ... Output unit
21 ... Generation unit
22 ... Anonymization processing unit
23 ... Log data conversion unit
300 ... Computer
D10, D30, D100, D300 ... Log data
D21 ... Table data
D22 ... Generalized hierarchical tree
D23 ... Anonymized table data
k ... Threshold

Claims

1. An information anonymization method in which a computer executes a process comprising:
extracting character strings common to k or more users from log data relating to each of a plurality of users;
grouping the extracted character strings based on an inclusion relationship and assigning an attribute to each group;
associating, for each attribute, a character string in the attribute with a user whose log data includes the character string; and
anonymizing the character string of each attribute so that the number of users having the same combination of attribute character strings is k or more.

2. The information anonymization method according to claim 1, wherein the assigning process treats, among the attribute character strings, the attribute of an inclusive character string as the attribute of an upper group relative to the attribute of a general character string, and generates a hierarchical tree in which the attribute of the upper group is a root node and the attribute of a lower group is a lower node.

3. The information anonymization method according to claim 2, wherein the associating process associates a character string in the attribute with a user whose log data includes the character string, in order from the attribute of an upper group in the hierarchical tree.

4. The information anonymization method according to claim 3, wherein the anonymizing process anonymizes the character strings, in order from the attribute of an upper group in the hierarchical tree, so that the number of users having the same combination of character strings of the attributes associated with the users of the log data is k or more.

5. An information anonymization program that causes a computer to execute a process comprising:
extracting character strings common to k or more users from log data relating to each of a plurality of users;
grouping the extracted character strings based on an inclusion relationship and assigning an attribute to each group;
associating, for each attribute, a character string in the attribute with a user whose log data includes the character string; and
anonymizing the character string of each attribute so that the number of users having the same combination of attribute character strings is k or more.

6. An information anonymization apparatus comprising:
a processing unit that extracts character strings common to k or more users from log data relating to each of a plurality of users, groups the extracted character strings based on an inclusion relationship, assigns an attribute to each group, and associates, for each attribute, a character string in the attribute with a user whose log data includes the character string; and
an anonymization unit that anonymizes the character string of each attribute so that the number of users having the same combination of attribute character strings is k or more.