JP6339403B2

JP6339403B2 - Information processing apparatus, information processing method, and information processing program

Info

Publication number: JP6339403B2
Application number: JP2014078134A
Authority: JP
Inventors: 秀暢小栗
Original assignee: 富士通クラウドテクノロジーズ株式会社
Priority date: 2014-04-04
Filing date: 2014-04-04
Publication date: 2018-06-06
Anticipated expiration: 2034-04-04
Also published as: JP2015200972A

Description

The present invention relates to information processing technology for anonymizing or diversifying personal information.

As information processing technology has developed, information is collected in many everyday situations and then processed. For example, when a consumer becomes a member of a store and buys products, the consumer's name, age, gender, address, e-mail address, and so on are often registered at sign-up. When the consumer then purchases a product, the store's system records the purchase in association with that consumer. Accumulating and analyzing this purchase information makes it possible to estimate the consumer's preferences and to offer services such as sending direct mail when a new product the consumer is likely to want goes on sale. Furthermore, analyzing the information of many consumers yields insights such as which products women in their 20s prefer or which products are popular in the Kanto area, and these are used for marketing and similar purposes.

This information can also be valuable beyond the store itself: manufacturers of the products and other companies can use it to develop new products, improve safety, and so on.

However, a store cannot provide its consumers' personal information to others without each consumer's consent. When information about consumers is provided to others, it must therefore be anonymized so that individuals cannot be identified.

For example, if a member list that records ages contains only one 25-year-old, anyone who learns that a 25-year-old acquaintance is a member can identify that person's entry. In other words, when only one person has the attribute "25-year-old member", the individual is likely to be identifiable by cross-referencing other information.

Suppose instead that the ages in the member list are abstracted into 10-year bands so that several people share the same attribute, for example three people in their 20s. It is then no longer possible to tell which of the three a given entry belongs to. A state in which k or more people share the same attributes is said to satisfy "k-anonymity", and processing data to reach that state is called "k-anonymization".

JP 2012-133451 A
JP 2011-108195 A
JP 2011-128862 A
JP 2012-78932 A

FIG. 19 shows an example of the history data (flow data) recorded on a management server when users pass through a station's automatic ticket gates with an IC card and pay their fares. The history data 91 in FIG. 19 associates a user ID with the date and time of use, the station used, the type of use, the fare, and so on. By referring to user information 92, which associates each user ID with the user's surname, age, and gender, each user in the history data 91 can be identified.

When this history data 91 is provided to another business, one option is to delete the user information 92 that associates user IDs with surnames and other details, or to manage it so that it cannot be referenced, so that individuals cannot be identified from the user ID (a pseudonymized state).

Even in the pseudonymized state, however, where a name cannot be determined from the user ID, re-identification may still be possible from the information associated with the user ID, such as the stations used, if that information is limited to a single individual, that is, if no other user has matching information. For example, if the user with ID = A001 used Shinjuku Station, Akihabara Station, and Ningyocho, and nobody else used the same stations, then anyone who knows this user's movements can re-identify the user with ID = A001 from the history data.
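The re-identification risk described above can be illustrated with a minimal sketch. The rows, the function name `reidentifiable_users`, and the extra users A002 and A003 are hypothetical and are not taken from FIG. 19; the sketch only flags pseudonymous IDs whose set of stations is shared with no other user.

```python
from collections import defaultdict

# Hypothetical pseudonymized history rows: (user_id, station).
history = [
    ("A001", "Shinjuku"), ("A001", "Akihabara"), ("A001", "Ningyocho"),
    ("A002", "Shinjuku"), ("A002", "Akihabara"),
    ("A003", "Shinjuku"), ("A003", "Akihabara"),
]


def reidentifiable_users(rows):
    """Return the user IDs whose set of stations is shared with no other user."""
    stations_by_user = defaultdict(set)
    for user_id, station in rows:
        stations_by_user[user_id].add(station)

    # Group users by their exact station pattern.
    users_by_pattern = defaultdict(list)
    for user_id, stations in stations_by_user.items():
        users_by_pattern[frozenset(stations)].append(user_id)

    # A pattern held by exactly one user singles that user out.
    return sorted(
        user_id
        for users in users_by_pattern.values()
        if len(users) == 1
        for user_id in users
    )


unique_users = reidentifiable_users(history)
```

With this sample, only A001 has a station pattern of its own, so it is the only ID that an observer who knows the user's movements could single out.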

For example, suppose that n = 42.47 million users choose among m = 9262 stations with a uniform distribution. Solving Formula 1 for the number of stations S needed for re-identification,
mS = n ... (Formula 1)
gives S = 2.237, which shows that a user can be re-identified if three stations are recorded in the history data.

IC card history data may also include shopping information, in which case the possibility of re-identification is even higher.

One approach is therefore to abstract the value of each item and anonymize the data so that no combination of item values is limited to a single individual. However, data such as action histories tends to be very large. For so-called big data, for example the action histories of more than 100,000 users, performing the abstraction by hand is not realistic.

Abstraction could also be performed mechanically, but a mechanically abstracted result is not necessarily useful data even if it satisfies anonymity. For example, if abstraction continues until no combination of item values is limited to a single individual and the item values (words) end up so abstract that they have no practical value, satisfying anonymity is pointless. For this reason, even with mechanical abstraction, a person has to check the result and, if the data is not useful, change settings such as which items are abstracted and run the abstraction again, which leads to repeated trials.

Simply repeating trials is inefficient, however. For big data in particular, the abstraction processing and the anonymity test take a great deal of time, so it has been difficult to carry out enough trials.

Moreover, in an action history (flow data) like the one above, data is inserted whenever a user acts, and both the number and the timing of the inserted records differ from user to user, which has made it difficult to enforce anonymity rigorously.

The present invention therefore provides a technique that makes it possible to test the anonymization of the action histories of a plurality of users.

An information processing apparatus according to the present invention includes:
an item candidate generation unit that takes, as target data, data associating user identification information identifying a user with information relating to that user, extracts a plurality of words contained in the target data, and, based on the number of occurrences of each word, makes at least some of the plurality of words candidates for data items;
an item selection unit that selects a predetermined number of candidates from the data item candidates based on statistical information; and
an anonymous candidate generation unit that obtains the values of the selected data items from the data associated with the user identification information in the target data, and associates the data item values with each piece of user identification information to form anonymous candidate data.

The item candidate generation unit may abstract the words extracted from the target data and, based on the number of occurrences of the abstracted words, make at least some of the plurality of words and the abstracted words candidates for data items.

The anonymous candidate generation unit may generate the anonymous candidate data separately for each time period, region, or predetermined category of the target data.

The information processing apparatus may include a test unit that tests the anonymous candidate data on the condition that no combination of item values in the anonymous candidate data is limited to a single individual in the target data.

In an information processing method according to the present invention, a computer executes the steps of:
taking, as target data, data associating user identification information identifying a user with information relating to that user, extracting a plurality of words contained in the target data, and, based on the number of occurrences of each word, making at least some of the plurality of words candidates for data items;
selecting a predetermined number of candidates from the data item candidates based on statistical information; and
obtaining the values of the selected data items from the data associated with the user identification information in the target data, and associating the data item values with each piece of user identification information to form anonymous candidate data.

The present invention may also be an anonymization program for causing a computer to execute the above information processing method. The anonymization program may further be recorded on a computer-readable recording medium.

Here, a computer-readable recording medium is a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and from which a computer can read that information. Examples of such media that are removable from the computer include flexible disks, magneto-optical disks, CD-ROM, CD-R/W, DVD, DAT, 8 mm tape, and memory cards.

Recording media fixed in the computer include hard disks and ROM (read-only memory).

The present invention can provide a technique that makes it possible to test the anonymization of the action histories of a plurality of users.

FIG. 1 is an explanatory diagram of anonymization processing.
FIG. 2 is a diagram showing an example of users' action histories.
FIG. 3 is a diagram showing the schematic configuration of the anonymization device.
FIG. 4 is a diagram showing the hardware configuration of the information processing apparatus.
FIG. 5 is a diagram showing the process of analyzing the target data and selecting items.
FIG. 6 is a diagram showing the process of generating anonymous candidate data.
FIG. 7 is a diagram showing the process of testing anonymity.
FIG. 8 is a diagram showing an example of the process of obtaining word occurrence counts by natural-language analysis of the target data.
FIG. 9 is a diagram showing an example of the process of obtaining word occurrence counts by natural-language analysis of the target data.
FIG. 10 is a diagram showing an example of the process of obtaining word occurrence counts by natural-language analysis of the target data.
FIG. 11 is a diagram showing an example of the process of obtaining word occurrence counts by natural-language analysis of the target data.
FIG. 12 is a diagram showing an example of the process of obtaining word occurrence counts by natural-language analysis of the target data.
FIG. 13 is a diagram showing an example of the process of obtaining word occurrence counts by natural-language analysis of the target data.
FIG. 14 is a diagram showing an example of the process of obtaining word occurrence counts by natural-language analysis of the target data.
FIG. 15 is a diagram showing the process of analyzing the target data and selecting items.
FIG. 16 is a diagram showing the process of generating anonymous candidate data.
FIG. 17 is a diagram showing data partitioned for January and data partitioned for February.
FIG. 18 is a diagram showing an example in which anonymized data is created in a plurality of partitions.
FIG. 19 is a diagram showing an example of users' action histories.

Embodiments for carrying out the present invention are described below with reference to the drawings. The configurations of the following embodiments are examples, and the present invention is not limited to them.

<Embodiment 1>
FIG. 1 is an explanatory diagram of anonymization processing. FIG. 1(A) shows an example in which the surname item has been deleted from member information containing surname, age, and gender items. As FIG. 1(A) shows, if the member information records ages and contains only one 16-year-old woman, that person's entry can be identified as soon as it becomes known that a 16-year-old woman is a member. In other words, when only one person has the attributes "16 years old, female", the individual may be identifiable by cross-referencing other information.

In FIG. 1(B), the ages in the member list are abstracted into age bands: 0s (under 10), 10s, and 20s. Even so, there is only one woman in her teens, so the individual can be identified just as in FIG. 1(A), and the anonymization is insufficient.

In FIG. 1(C), the ages are abstracted further by changing the band boundaries to "teens or younger" (19 or under) and "20s". In FIG. 1(C) there are two women in the "teens or younger" band, so the attribute combination [teens or younger] and [female] is no longer unique. Even if it becomes known, as described above, that a 16-year-old woman is a member, it cannot be determined which of the two entries is hers. A state in which k or more people share the same attributes is said to satisfy "k-anonymity", and processing data to reach that state is called "k-anonymization".
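The progression from FIG. 1(B) to FIG. 1(C) can be sketched as a k-anonymity check over two age generalizations. The member records below are hypothetical (the actual contents of FIG. 1 are not reproduced here), and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical member records (age, gender) after the surname column is removed.
members = [(16, "F"), (8, "F"), (25, "M"), (27, "M"), (22, "F"), (29, "F")]


def by_decade(age):
    """Generalize an age to its 10-year band, e.g. 16 -> '10s'."""
    return f"{age // 10 * 10}s"


def teens_or_younger(age):
    """Merge everyone aged 19 or under into a single band."""
    return "19 or younger" if age < 20 else by_decade(age)


def satisfies_k_anonymity(records, generalize, k=2):
    """True if every (generalized age, gender) combination occurs at least k times."""
    groups = Counter((generalize(age), gender) for age, gender in records)
    return min(groups.values()) >= k


decade_ok = satisfies_k_anonymity(members, by_decade)         # one woman in her 10s
merged_ok = satisfies_k_anonymity(members, teens_or_younger)  # two women aged 19 or under
```

With this sample, the per-decade bands leave one woman alone in her group, while merging the lower bands gives every combination at least two members, which mirrors the change of band boundaries described for FIG. 1(C).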

In the data of FIG. 1, the value of each item of information about a user is associated with that user. In other words, each row records the data of one user, and each column records the value of one item of information about that user. The data in FIG. 1 is thus spreadsheet-type data made up of one row per user and one column per item.

FIG. 2 shows an example of users' action histories. The data in FIG. 2 associates a user ID (user identification information) identifying a user with information relating to that user (action data). In the example of FIG. 2, the action data consists of the date on which the user purchased a product, the store, and the content. In the data of FIG. 2, each row records one piece of a user's action data, and each column records the value of one item of information about the user. That is, the data in FIG. 2 is flow-type data in which a record is made for each user action.

In the flow-type data shown in FIG. 2, action data associated with a user ID is added as needed, so a single user's ID appears with action data on multiple rows, and the number of rows and the timing at which they are added differ between users. Anonymity therefore cannot be tested in the same way as for the spreadsheet-type data of FIG. 1.

The information processing apparatus (anonymization device) 10 of Embodiment 1 therefore makes an anonymity test possible by converting flow-type target data into spreadsheet-type data. The anonymization device 10 is described below.

FIG. 3 shows the schematic configuration of the anonymization device 10. As shown in FIG. 3, the anonymization device 10 includes an item candidate generation unit 11, an item selection unit 12, an anonymous candidate generation unit 13, an anonymity test unit 14, and a data output unit 15.

The item candidate generation unit 11 takes, as target data, data associating user identification information identifying a user with information relating to that user, extracts a plurality of words contained in the target data, and, based on the number of occurrences of each word, makes at least some of the plurality of words candidates for data items. The item candidate generation unit 11 also abstracts the words extracted from the target data and, based on the number of occurrences of the abstracted words, makes at least some of the plurality of words and the abstracted words candidates for data items.

The item selection unit 12 selects a predetermined number of candidates from the data item candidates based on statistical information (value data). The item selection unit 12 acquires the value data from a search information accumulation DB 42. The item selection unit 12 also has a data request function, which requests value data from another device when it is not registered in the search information accumulation DB 42 and registers the acquired value data in the DB, and a data crawler function, which periodically visits other devices to acquire the latest value data and updates the value data registered in the search information accumulation DB 42. In this embodiment, statistical information on each word is received from a search engine 90 as the value data. The statistical information on each word is, for example, the SEM advertising unit price (cost per click), click-through rate, average ad position, number of impressions per day, or number of clicks per day. The source of the value data is not limited to a search engine and may be web pages, an SNS, or the like. When word statistics are used as value data in this way, the frequency with which each word is used on web pages or an SNS, for example, may serve as the value data.

The anonymous candidate generation unit 13 obtains the values of the selected data items from the data associated with the user identification information in the target data, and associates the data item values with each piece of user identification information to form anonymous candidate data.

The anonymity test unit 14 tests the anonymous candidate data on the condition that the combination of item values corresponding to any one individual is not unique within that anonymous candidate data. For example, the anonymity test unit 14 tests whether the anonymous candidate data satisfies k-anonymity or l-diversity. Anonymous candidate data that passes the anonymity test is treated as anonymized information.
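The text names l-diversity without defining it. As a minimal sketch, the check below uses the common "distinct l-diversity" reading, in which every group of records sharing the same quasi-identifier values must contain at least l distinct sensitive values. That reading, the record layout, and the function name `is_l_diverse` are assumptions for illustration, not details taken from this document.

```python
from collections import defaultdict

# Hypothetical records: (quasi-identifier tuple, sensitive value).
records = [
    (("20s", "F"), "dining"),
    (("20s", "F"), "bookstore"),
    (("20s", "M"), "dining"),
    (("20s", "M"), "dining"),
]


def is_l_diverse(rows, l):
    """True if every quasi-identifier group holds at least l distinct sensitive values."""
    sensitive_by_group = defaultdict(set)
    for quasi_id, sensitive in rows:
        sensitive_by_group[quasi_id].add(sensitive)
    return all(len(values) >= l for values in sensitive_by_group.values())


two_diverse = is_l_diverse(records, 2)
```

In this sample both groups have two members, so the data is 2-anonymous, yet the ("20s", "M") group has only one distinct sensitive value and therefore fails the 2-diversity check.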

The data output unit 15 reads out and outputs the anonymized information that satisfies anonymity. Here, output of the anonymized information means display on a display device, printing on a printer, transmission to another computer, writing to a storage medium, and so on.

FIG. 4 shows the hardware configuration of the information processing apparatus. The anonymization device 10 is a computer having a CPU 1, a memory 2, a communication control unit 3, a storage device 4, and an input/output interface 5.

The CPU 1 provides the functions of the item candidate generation unit 11, the item selection unit 12, the anonymous candidate generation unit 13, the anonymity test unit 14, and the data output unit 15 described above by executing a program loaded into the memory 2 in executable form.

The memory 2 can also be called the main storage device (main memory). The memory 2 stores, for example, the program executed by the CPU 1, data received via the communication control unit 3, data read from the storage device 4, and other data.

The communication control unit (CCU) 3 connects to other devices via a network and controls communication with them. Output means such as a display device and a printer, input means such as a keyboard and a pointing device, and input/output means such as a drive device are connected to the input/output interface 5 as appropriate. The drive device is a read/write device for removable storage media, for example an input/output device for flash memory cards or a USB adapter for connecting USB memory. The removable storage medium may also be a disc medium such as a CD (Compact Disc), a DVD (Digital Versatile Disk), or a Blu-ray (registered trademark) Disc. The drive device reads a program from the removable storage medium and stores it in the storage device 4.

The storage device 4 can also be called an external storage device. The storage device 4 may be an SSD (Solid State Drive), an HDD, or the like. The storage device 4 exchanges data with the drive device; for example, it stores the information processing program installed from the drive device. The storage device 4 also reads out the program and passes it to the memory 2. In this embodiment, the storage device 4 stores the target data, the search information accumulation DB 42, and the anonymization test DB 45.

FIGS. 5 to 7 are explanatory diagrams of the information processing method that the anonymization device 10 executes in accordance with the information processing program of this embodiment. FIG. 5 shows the process of analyzing the target data and selecting items, FIG. 6 shows the process of generating anonymous candidate data, and FIG. 7 shows the process of testing anonymity.

The anonymization device 10 executes the processes of FIGS. 5 and 6 as preprocessing for anonymization, either periodically or when triggered by, for example, an operator's instruction. First, the anonymization device 10 acquires the target data from another computer or a storage device, extracts the words contained in the target data by morphological analysis, and obtains the number of occurrences and the occurrence rate of each word (step S10). The words may be extracted from all of the target data, but this is not a requirement; they may instead be extracted from part of the target data, such as the first 10,000 rows or 5% of the whole.

Next, the anonymization device 10 abstracts the words extracted in step S10 (step S20) and obtains the number of occurrences and the occurrence rate of the abstracted words (step S30). The anonymization device 10 then applies a cutoff, deleting item candidates whose number of occurrences or occurrence rate is below the required predetermined value, for example below the value of k (step S35).
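Steps S10 to S35 can be sketched as follows. This is an illustration under stated assumptions: splitting on a hyphen stands in for real morphological analysis, and the sample strings, the `ABSTRACTION` dictionary, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical "content" strings taken from the target data.
contents = [
    "A Book-Nakano",
    "B Stro-Shinjuku",
    "C Restaurant-Shinjuku",
    "A Book-Shinjuku",
]

# Hypothetical abstraction dictionary: word -> more general word.
ABSTRACTION = {
    "A Book": "bookstore",
    "B Stro": "dining",
    "C Restaurant": "dining",
    "Nakano": "Tokyo",
    "Shinjuku": "Tokyo",
}


def extract_words(text):
    """Stand-in for morphological analysis: split the string on hyphens."""
    return [word.strip() for word in text.split("-") if word.strip()]


def item_candidates(rows, k):
    """Count words (S10), abstract and count them (S20, S30), apply the cutoff (S35)."""
    words = Counter(w for row in rows for w in extract_words(row))
    abstracted = Counter()
    for word, count in words.items():
        if word in ABSTRACTION:
            abstracted[ABSTRACTION[word]] += count
    merged = words + abstracted
    # Drop candidates that occur fewer than k times.
    return {word: count for word, count in merged.items() if count >= k}


candidates = item_candidates(contents, k=2)
```

Here "Nakano", "B Stro", and "C Restaurant" each occur once and are cut, while their abstractions "Tokyo" and "dining" survive because the abstracted counts reach the threshold.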

The anonymization device 10 then takes the words extracted in step S10 and the words abstracted in step S20 as item candidates, acquires value data from the search information accumulation DB 42 to weight each item candidate (step S40), and selects a predetermined number of items based on the weights (step S50). For example, it multiplies each word's number of occurrences or occurrence rate by its value data (SEM price) to form an index, and selects a predetermined number of items in descending order of the index.
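The weighting in steps S40 and S50 can be sketched as a count-times-price index followed by a top-n selection. The counts, the prices, and the name `select_items` are hypothetical, and treating a word with no value data as having price 0 is an assumption of this sketch.

```python
def select_items(counts, value_data, n):
    """Rank candidates by occurrences x value data and keep the top n (S40, S50)."""
    # Assumption: a candidate without value data gets an index of 0.
    index = {word: count * value_data.get(word, 0.0) for word, count in counts.items()}
    # Sort by descending index; ties are broken alphabetically for determinism.
    ranked = sorted(index, key=lambda word: (-index[word], word))
    return ranked[:n]


# Hypothetical occurrence counts and SEM prices per word.
counts = {"Shinjuku": 3, "Tokyo": 4, "dining": 2, "bookstore": 2}
sem_price = {"Shinjuku": 50.0, "Tokyo": 20.0, "dining": 120.0, "bookstore": 30.0}

selected = select_items(counts, sem_price, n=2)
```

In this sample "Tokyo" occurs most often but has a low price, so "dining" (index 240) and "Shinjuku" (index 150) are selected ahead of it.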

Next, the anonymization device 10 executes the process of FIG. 6. It reads one row of data from the target data (step S60) and extracts the words contained in that row by morphological analysis (step S70). From the extracted words, the anonymization device 10 finds those that correspond to values of the items generated in the process of FIG. 5 (step S80), and stores those item values in the anonymization test DB 45 in association with the user ID of the row read in step S60 (step S90). The anonymization device 10 then determines whether there is further processing (step S100). If there is, it returns to step S60, reads the next row of data, and repeats steps S60 to S100; if there is none in step S100, it ends the process of FIG. 6. The anonymous candidate data is the result: the values of the items in each column, extracted from the target data and stored in the anonymization test DB 45 in association with the user ID of each row.
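The loop of steps S60 to S100 amounts to pivoting flow-type rows into one row per user ID. The sketch below assumes that an item's value is the number of rows in which the item appears for that user; the text does not specify how the value is encoded, so that choice, the sample rows, and the names used are all hypothetical. Hyphen splitting again stands in for morphological analysis.

```python
from collections import defaultdict

ITEMS = ["Shinjuku", "dining"]  # hypothetical items selected in FIG. 5
ABSTRACTION = {"B Stro": "dining", "C Restaurant": "dining"}  # hypothetical

# Hypothetical flow-type target data: (user_id, content).
flow_rows = [
    ("A001", "A Book-Nakano"),
    ("A001", "B Stro-Shinjuku"),
    ("A002", "C Restaurant-Shinjuku"),
    ("A002", "B Stro-Shinjuku"),
]


def words_in(text):
    """Stand-in for morphological analysis, plus the abstracted form of each word."""
    words = [word.strip() for word in text.split("-")]
    return words + [ABSTRACTION[word] for word in words if word in ABSTRACTION]


def to_candidate_table(rows, items):
    """Build one row per user ID with one column per selected item."""
    table = defaultdict(lambda: dict.fromkeys(items, 0))
    for user_id, content in rows:          # S60: read one row
        found = words_in(content)          # S70: extract its words
        for item in items:                 # S80: look for the selected items
            table[user_id][item] += found.count(item)  # S90: store per user ID
    return dict(table)


candidate = to_candidate_table(flow_rows, ITEMS)
```

Each user ID ends up with exactly one row regardless of how many action records it had, which is the spreadsheet-type shape that the anonymity test needs.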

Next, the anonymization device 10 executes the process of FIG. 7. It reads the anonymous candidate data created in the process of FIG. 6 from the anonymization test DB 45 and tests it on the condition that the number of occurrences, within the anonymous candidate data, of the combination of item values corresponding to one individual is not at or below a predetermined number, for example 1 (step S110). That is, it determines whether the anonymous candidate data satisfies k-anonymity. If the anonymous candidate data does not satisfy anonymity, the process returns to step S110, reads other anonymous candidate data, and repeats the test (steps S110 to S120).

On the other hand, if anonymity is satisfied in step S120, the anonymization device 10 determines whether there is further processing to do (step S130). If there is, it returns to step S110, reads the next anonymous candidate data, and repeats steps S110 to S130. If there is not, it outputs the anonymous candidate data that passed the test as anonymized data (step S140) and ends the process of FIG. 7.
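The test of step S110 can be illustrated with a small k-anonymity check. This is a sketch under the assumption that the anonymous candidate data is held as a mapping from user ID to item values; the function name and sample table are hypothetical. Requiring every combination to occur at least k = 2 times corresponds to the example condition that no combination occurs only once.

```python
from collections import Counter

def satisfies_k_anonymity(table, k):
    """True if every combination of item values is shared by at least k users."""
    combos = Counter(tuple(sorted(row.items())) for row in table.values())
    return all(n >= k for n in combos.values())

# Hypothetical anonymous candidate data.
table = {
    "A001": {"Tokyo": 1, "ramen": 0},
    "A002": {"Tokyo": 1, "ramen": 0},
    "A003": {"Tokyo": 0, "ramen": 1},  # unique combination
}
passes = satisfies_k_anonymity(table, 2)

# Without the unique row, the remaining data passes the test.
reduced = {u: r for u, r in table.items() if u != "A003"}
passes_reduced = satisfies_k_anonymity(reduced, 2)
```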

Next, the process of creating the anonymous candidate data will be described concretely with reference to FIGS. 8 to 14.

FIG. 8 is a diagram illustrating an example of a process that analyzes the target data as natural language and obtains the number of occurrences of each word. In FIG. 8, the target data 51 associates a user ID (user identification information) that identifies a user with information about that user (behavior data); the behavior data consists of the date on which the user purchased a product, the store, and the content. The target data 51 holds natural-language entries such as "A Book-Nakano", "B Stro-Shinjuku", and "C Restaurant-Shinjuku".

Data 52 is the result of performing natural language analysis on the target data 51, extracting the words it contains, and counting the occurrences of each word. Data 52 associates each word extracted from the target data 51, such as Shinjuku, net, and C Restaurant, with its part of speech and number of occurrences. Although data 52 associates each word with its number of occurrences, the appearance rate may be used instead; for example, when a word occurs n times and the total number of words is s, the appearance rate Aq is n/s. An appearance rate ratio (tf (term frequency) / idf (inverse document frequency)) may also be used.
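As a rough illustration of data 52, the sketch below counts words and computes the appearance rate Aq = n/s. Splitting on hyphens is only a stand-in for natural language analysis, parts of speech are omitted, and the sample records are assumptions.

```python
from collections import Counter

def occurrence_stats(records):
    """Return (counts, rates), where rates[w] = n / s for each word w."""
    # Stand-in tokenizer (assumption): split each record on "-".
    words = [w.strip() for text in records for w in text.split("-") if w.strip()]
    counts = Counter(words)
    s = len(words)  # total number of words
    rates = {word: n / s for word, n in counts.items()}
    return counts, rates

records = ["A Book-Nakano", "B Stro-Shinjuku", "C Restaurant-Shinjuku"]
counts, rates = occurrence_stats(records)
```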

The extracted words are also abstracted by replacing them with superordinate concepts, using an abstraction dictionary 53 that associates each base word with its abstracted word (superordinate concept). For example, Shinjuku is converted to Tokyo, ramen to Chinese food, and E Store to convenience store. The number of occurrences is also obtained for the abstracted words, and data 54 is data 52 with these added.
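The step from data 52 to data 54 can be sketched with a plain dictionary standing in for the abstraction dictionary 53. The mapping entries follow the examples in the description; the counts and the function name are hypothetical.

```python
from collections import Counter

# Stand-in for the abstraction dictionary 53: base word -> superordinate concept.
ABSTRACTION = {
    "Shinjuku": "Tokyo",
    "Nakano": "Tokyo",
    "ramen": "Chinese food",
    "E Store": "convenience store",
}

def add_abstracted_counts(counts, dictionary):
    """Keep the original counts and add counts for the abstracted words."""
    merged = Counter(counts)
    for word, n in counts.items():
        parent = dictionary.get(word)
        if parent is not None:
            merged[parent] += n
    return merged

counts_52 = {"Shinjuku": 2, "Nakano": 1, "ramen": 3}  # hypothetical data 52
counts_54 = add_abstracted_counts(counts_52, ABSTRACTION)
```

Because several base words map to the same concept, the abstracted word accumulates a higher count than any of its base words, which is what makes it easier to satisfy k-anonymity.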

When performing morphological analysis, the natural language analysis may be tailored to the attribute of the target data 51. For example, as shown in FIG. 9, the data in the store field of the target data 51 may be analyzed using a store dictionary in which store names are registered preferentially, yielding data 52a; the data in the content field of the target data 51 may be analyzed using a purchase-content dictionary in which product names are registered preferentially, yielding data 52b; and the two may be combined into the analysis result 52.

Furthermore, as shown in FIGS. 10 and 11, the store-attribute morphological analysis result 52a is abstracted with the store abstraction dictionary 53a to create abstracted data 55a, and the words of the morphological analysis result 52a are combined with the words of the abstracted data 55a. Similarly, the content-attribute morphological analysis result 52b is abstracted with a content abstraction dictionary (not shown) to create abstracted data 55b, and the words of the morphological analysis result 52b are combined with the words of the abstracted data 55b.
For example, the following four patterns are created:
(1) store-attribute morphological analysis result 52a and content-attribute morphological analysis result 52b,
(2) store-attribute morphological analysis result 52a and content-attribute abstraction result 55b,
(3) store-attribute abstraction result 55a and content-attribute abstraction result 55b,
(4) store-attribute abstraction result 55a and content-attribute morphological analysis result 52b.
Data 56 is part of the result of this combination.
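One way to picture the combined words is as pairs drawn from a store-side word list and a content-side word list. The sketch below assumes that the four patterns are meant to cover every pairing of a store-side list (52a or 55a) with a content-side list (52b or 55b); that reading, the function name, and the sample words are assumptions.

```python
from itertools import product

def cross_patterns(store_words, store_abstract, content_words, content_abstract):
    """Pair every store-side word with every content-side word, over all four patterns."""
    pairs = set()
    for stores, contents in product(
        (store_words, store_abstract), (content_words, content_abstract)
    ):
        pairs.update(product(stores, contents))
    return pairs

pairs = cross_patterns(
    ["E Store"],            # stand-in for 52a
    ["convenience store"],  # stand-in for 55a
    ["ramen"],              # stand-in for 52b
    ["Chinese food"],       # stand-in for 55b
)
```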

Data 57 (FIG. 12) is the combination of the store-attribute morphological analysis result 52a, the content-attribute morphological analysis result 52b, the store-attribute abstraction result 55a, the content-attribute abstraction result 55b, and the combination result 56.

In data 57 of FIG. 12, each word is associated with its number of occurrences, but the appearance rate may be used instead; for example, when a word occurs n times and the total number of words is s, the appearance rate Aq is n/s. In FIG. 12, the level indicates the degree to which a word has been processed, in other words its distance from the original data. In this example, level 1 is an original word, level 2 is an abstracted word, and level 3 is a combined word.
For a combined word, with
number of occurrences of attribute 1: An(1)
appearance rate of attribute 1: Aq(1)
number of occurrences of attribute 2: An(2)
appearance rate of attribute 2: Aq(2)
An(1) > An(2)
the value is taken to be An(1) × (An(1) / An(1)).
Alternatively, An(1) × ((An(1) / s) × (Aq(2) / s)) may be used.

Note that this calculation of the appearance rate is only an example, and the invention is not limited to it.

The number of occurrences in data 57, which contains the item candidates, indicates how frequent each candidate is within this group of information. It follows that when anonymity must be verified for a combination of two or more pieces of information, options spanning multiple elements cannot be combined so as to satisfy k-anonymity unless each has at least two occurrences. Even with two or more occurrences, a small number of occurrences makes it less likely that k-anonymity will be satisfied. Candidates are therefore cut off at the required value of k to create the required group of information. In the example of FIG. 13, the minimum number of occurrences is set to 3, and data 58 is what remains after candidates with fewer than 3 occurrences are excluded (cut off). The number of occurrences in data 58 is then multiplied by the value data to form an index, producing the index table 59. The value data by which the number of occurrences is multiplied is not limited to the SEM price; it may be POS sales data or a weighting coefficient set by a user or operator.
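The cutoff and the index table can be sketched in a few lines. The minimum of 3 occurrences follows the FIG. 13 example; the counts, the value data, and the function names are hypothetical.

```python
def cut_off(counts, min_count):
    """Drop candidates whose number of occurrences is below min_count."""
    return {word: n for word, n in counts.items() if n >= min_count}

def build_index_table(counts, value):
    """Multiply each remaining count by its value data (SEM price, POS sales, etc.)."""
    return {word: n * value.get(word, 0.0) for word, n in counts.items()}

counts_57 = {"Tokyo": 8, "Shinjuku": 5, "Nakano": 2, "ramen": 3}    # hypothetical data 57
value_data = {"Tokyo": 30.0, "Shinjuku": 50.0, "ramen": 80.0}      # hypothetical values

counts_58 = cut_off(counts_57, 3)                    # corresponds to data 58
index_59 = build_index_table(counts_58, value_data)  # corresponds to index table 59
```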

As shown in data 61 of FIG. 14, each row of the spreadsheet holds the data of one user, and the item candidates are assigned to the columns of the spreadsheet in descending order of the index in the index table 59.

The target data 51 is then morphologically analyzed to extract words, the words are abstracted, and the results are entered into the corresponding items of data 61. For example, from the entry "A Book-Nakano" for ID = A001 in data 51, Nakano is extracted and converted to Tokyo by referring to the abstraction dictionary, and the Tokyo item for ID = A001 in data 61 is incremented. The value entered for an item in data 61 may be the number of matching entries in the target data 51 as is, or a flag that is 1 if the number exceeds a predetermined threshold and 0 otherwise. It may also be entered as a numerical range, such as 1 to 5 or 6 to 10.
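The three ways of entering an item value (raw count, flag, or range) can be illustrated as follows. The function names, the default range width of 5, and the treatment of a zero count are assumptions made for this sketch.

```python
def as_flag(count, threshold):
    """1 if the count exceeds the threshold, otherwise 0."""
    return 1 if count > threshold else 0

def as_range(count, width=5):
    """Map a positive count to a range label such as '1-5' or '6-10'."""
    if count <= 0:
        return "0"  # assumption: counts of zero get their own label
    low = ((count - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"

raw = 7  # hypothetical number of matching entries for one user and item
flag = as_flag(raw, 5)
label = as_range(raw)
```

Coarser encodings such as the flag or the range reduce the number of distinct value combinations, which generally makes the k-anonymity condition easier to meet at the cost of detail.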

As described above, according to Embodiment 1, even when the target data is of the flow data type, converting it into spreadsheet-type data allows the anonymization test to be performed reliably.

Also, according to Embodiment 1, the items used when converting to spreadsheet-type data are selected based on the number of occurrences of words and on value data, so an appropriate conversion result is obtained even when the conversion is performed mechanically.

Furthermore, when converting to spreadsheet-type data, item candidates can be selected without processing all of the target data, which reduces the processing load.

<< Embodiment 2 >>
Embodiment 2 differs from Embodiment 1 described above mainly in that the anonymous candidate data is created in a plurality of separate sections. Elements identical to those of Embodiment 1 are given the same reference numerals, and their description is not repeated.

The anonymization device 10 executes the processes of FIGS. 15 and 16 as preprocessing for anonymization, either periodically or when triggered by an operator's instruction or the like. First, the anonymization device 10 acquires the target data from another computer or storage device, extracts the words contained in the target data by morphological analysis, and obtains the number of occurrences and appearance rate of each word (step S10). Words may be extracted from all of the target data, but this is not required; they may instead be extracted from part of the target data, such as the first 10,000 rows or 5% of the whole.

Next, the anonymization device 10 abstracts the words extracted in step S10 (step S20) and obtains the number of occurrences and appearance rate of the abstracted words (step S30). The anonymization device 10 also performs a cutoff, deleting item candidates whose number of occurrences is below the required k value (step S35).

The anonymization device 10 then takes the words extracted in step S10 and the words abstracted in step S20 as item candidates, acquires value data from the search information accumulation DB 42, and weights each item candidate (step S40).

Next, the anonymization device 10 accepts the section settings made by the operator (the analyst) (step S43). For example, sections by time such as year/month/day, time of day, time slot, or season; sections by region such as prefecture; and sections by predetermined category such as food/books are set.

It is also specified whether the processing adds to existing data or creates new data (step S46). For example, the operator chooses between creating new anonymous candidate data for February and adding data to anonymous candidate data created earlier for the Tokyo section.

According to this setting, a predetermined number of items are set in the existing or new anonymous candidate data based on the weighting determined in step S40 (step S50). For example, the number of occurrences or appearance rate of each word is multiplied by the value data (SEM price) to form an index, and a predetermined number of items are set in descending order of the index.

Next, the anonymization device 10 executes the process of FIG. 16: it reads one row of data from the target data (step S60) and extracts the words contained in that row by morphological analysis (step S70). The anonymization device 10 reads the section conditions set in step S43 and determines which anonymous candidate data to use (step S76).

From the extracted words, the anonymization device 10 looks up the words that correspond to the values of the items generated in the process of FIG. 15 (step S80), and stores each item value in association with the user ID of the row read in step S60 (step S90). The anonymization device 10 then determines whether there is further processing to do (step S100). If there is, it determines whether there is other anonymous candidate data into which the data of the current row should be written in parallel (step S103). If there is such a write to other anonymous candidate data, the process moves to step S76 to determine the anonymous candidate data to use, and then to step S80.

If, on the other hand, there is no write to other anonymous candidate data in step S103, the process returns to step S60 and repeats steps S60 to S103. If there is no further processing in step S100, a management table for the anonymous candidate data is created and the process of FIG. 16 ends.

FIG. 17 is a diagram showing data 62, sectioned for January, and data 63, sectioned for February. Because the target data 51 of this embodiment is flow data that grows continually, when new information is added every month, for example, it can be managed by adding a group of tables that maintains consistency with the past.
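Sectioning by month, as in data 62 and data 63, might be sketched as below. The sketch assumes each input row already carries a user ID, a date string in YYYY-MM-DD form, and one extracted word; the function name, the date format, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def partition_by_month(rows, items):
    """Return {month: {user_id: {item: count}}}, one table per month."""
    tables = defaultdict(lambda: defaultdict(lambda: dict.fromkeys(items, 0)))
    for user_id, date, word in rows:
        month = date[:7]               # "YYYY-MM" prefix selects the section
        row = tables[month][user_id]   # creates the user's row if missing
        if word in row:
            row[word] += 1
    return {month: dict(table) for month, table in tables.items()}

rows = [
    ("A001", "2014-01-05", "Tokyo"),
    ("A002", "2014-01-20", "ramen"),
    ("A001", "2014-02-10", "Tokyo"),
]
tables = partition_by_month(rows, ["Tokyo", "ramen"])
```

Because every monthly table uses the same item columns, a new month can be added as a new table without touching the earlier ones.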

Because the nature of anonymization processing varies greatly with the period covered, there are many cases in which, for example, one month of data cannot be anonymized but a year of aggregated data can. Accumulating anonymized data month by month therefore makes it possible to obtain long-term insights without the effort of recalculating a full year of data.
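Month-by-month accumulation can be pictured as adding each new monthly table to a running total instead of recomputing from the raw data. The sketch assumes tables of the form {user_id: {item: count}}; the function name and the sample values are hypothetical.

```python
def accumulate(total, monthly):
    """Return a new table with the monthly counts added to the running total."""
    merged = {user: dict(row) for user, row in total.items()}
    for user, row in monthly.items():
        target = merged.setdefault(user, {})
        for item, n in row.items():
            target[item] = target.get(item, 0) + n
    return merged

january = {"A001": {"Tokyo": 1}, "A002": {"Tokyo": 0}}
february = {"A001": {"Tokyo": 2}, "A003": {"Tokyo": 1}}
running_total = accumulate(january, february)
```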

Classifying the tables by region, for example, also simplifies subsequent statistical processing.

Furthermore, as shown in FIG. 18, anonymized data may be created for January, for example, and in February the same item names may be reused as they are while each user's data is entered from the target data. This makes it possible to compare whether figures that were high in January are seasonal or steady.

Information that is steadily present has a large number of occurrences and therefore high anonymity, so it can be divided finely as information. In this embodiment, the items of the anonymous candidate data are determined based on the number of occurrences of words, so the anonymous candidate data can be created appropriately without needlessly abstracting words that occur frequently.

In addition, apart from the anonymous candidate data created from the top-ranked words of January, anonymous candidate data may be created from the words that ranked highly in February and used, for example, to compare how much the words differ from the previous month. If a table that manages such multidimensional anonymous candidate data also exists, it can serve as a guide for anonymization processing.

<Others>
The present invention is not limited to the illustrated examples described above, and various modifications can of course be made without departing from the gist of the present invention.

2 Memory
3 Communication control unit
4 Storage device
5 Input/output interface
10 Anonymization device
11 Item candidate generation unit
12 Item selection unit
13 Anonymous candidate generation unit
14 Anonymity test unit
15 Data output unit

Claims

1. An information processing apparatus comprising:
an item candidate generation unit that takes, as target data, data associating user identification information for identifying a user with information related to the user, extracts a plurality of words included in the target data, and makes at least some of the plurality of words data item candidates based on the number of occurrences of each word;
an item selection unit that selects a predetermined number of candidates from the data item candidates based on statistical information; and
an anonymous candidate generation unit that obtains the values of the selected data items from the data associated with the user identification information in the target data, and makes the values of the data items, associated with each piece of user identification information, anonymous candidate data.

2. The information processing apparatus according to claim 1, wherein the item candidate generation unit abstracts the words extracted from the target data and, based on the number of occurrences of the abstracted words, makes at least some of the plurality of words and the abstracted words data item candidates.

3. The information processing apparatus according to claim 1 or 2, wherein the anonymous candidate generation unit generates the anonymous candidate data for each time, region, or predetermined category of the target data.

4. The information processing apparatus according to any one of claims 1 to 3, further comprising a test unit that tests the anonymous candidate data on the condition that a combination of item values of the anonymous candidate data is not limited to a single individual in the target data.

5. An information processing method in which a computer executes the steps of:
taking, as target data, data associating user identification information for identifying a user with information related to the user, extracting a plurality of words included in the target data, and making at least some of the plurality of words data item candidates based on the number of occurrences of each word;
selecting a predetermined number of candidates from the data item candidates based on statistical information; and
obtaining the values of the selected data items from the data associated with the user identification information in the target data, and making the values of the data items, associated with each piece of user identification information, anonymous candidate data.

6. An information processing program for causing a computer to execute the steps of:
taking, as target data, data associating user identification information for identifying a user with information related to the user, extracting a plurality of words included in the target data, and making at least some of the plurality of words data item candidates based on the number of occurrences of each word;
selecting a predetermined number of candidates from the data item candidates based on statistical information; and
obtaining the values of the selected data items from the data associated with the user identification information in the target data, and making the values of the data items, associated with each piece of user identification information, anonymous candidate data.