JP7053987B2

JP7053987B2 - Data processing apparatus, data processing method and data processing program

Info

Publication number: JP7053987B2
Application number: JP2017255119A
Authority: JP
Inventors: 裕司山岡
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2022-04-13
Anticipated expiration: 2037-12-29
Also published as: JP2019121138A

Description

The present invention relates to a data processing apparatus, a data processing method, and a data processing program.

An organization may collect and hold data that is highly valuable to other organizations. In that case, the organization may provide the data to specific other organizations on the condition that they manage it confidentially so that it does not leak to the outside. For example, a business operator that collects DNA data representing the base sequences of DNA (deoxyribonucleic acid) of various organisms may provide the DNA data to research institutes that analyze DNA.

An organization may provide similar data to a plurality of other organizations. However, if confidentiality management of the data is not performed properly at any one of the destination organizations, the data may leak to the outside in violation of the contract. The providing organization may become aware of the leak by obtaining the data from an information source other than the destination organizations, such as the Internet. In that case, it is preferable that the providing organization be able to identify which of the destination organizations leaked the data. Because the organization that leaked the data may itself be unaware of the leak, identifying the leak source is useful for preventing recurrence.

However, if identical data is provided as is to a plurality of other organizations, it is difficult to identify the leak source from the leaked data. Methods have therefore been proposed that process the data differently for each destination organization when the data is provided.

For example, a transmission device has been proposed that uses digital watermarks to make the leak source of multicast-distributed data identifiable. The proposed transmission device copies the original data to generate two versions of copy data carrying different digital watermarks, and divides each version into a plurality of data sections in time series. For each data section, the transmission device selects one of the two versions, thereby assigning a different time-series pattern of digital watermarks to each receiving device.

An information processing apparatus has also been proposed that makes the leak source of tabular data containing numerical columns, such as height and weight, identifiable. When providing the tabular data, the proposed information processing apparatus adds noise whose sum is zero to the numerical values in a numerical column. The information processing apparatus generates the noise from the identifier of the destination, thereby associating different destinations with different noise patterns.

A character information editing device has also been proposed that embeds, in character information, information for detecting unauthorized use. The proposed device holds a character dictionary listing homoglyph pairs, that is, pairs of characters that are similar in shape but have different character codes. The device searches the character information for one character of a homoglyph pair and replaces each character found with the other character of the pair.

A document processing device has also been proposed that embeds invisible control information in a document image. The proposed device represents the control information as a binary bit string and assigns each bit to the space between two adjacent characters. The device adjusts the character spacing so that the width of the space differs depending on whether the bit value is "0" or "1".

Japanese Unexamined Patent Publication No. 2009-212799
Japanese Unexamined Patent Publication No. 2013-191121
Japanese Unexamined Patent Publication No. 2000-352928
Japanese Unexamined Patent Publication No. 2010-124451

However, when noise is added to the provided data, the authenticity of the values in the data is no longer guaranteed, and depending on the characteristics of the data and how it is used, the value of the data to the destination may be greatly reduced. For example, if part of a base sequence in DNA data is altered, the value of the DNA data as a whole may be greatly reduced.

In one aspect, an object of the present invention is to provide a data processing apparatus, a data processing method, and a data processing program that reduce the impact of leak countermeasures on the usefulness of data.

In one aspect, a data processing apparatus having a storage unit and a processing unit is provided. The processing unit generates a plurality of deletion patterns, each indicating records to be deleted from data containing a plurality of records, and generates a plurality of subsets of the data by deleting records from the data based on each of the deletion patterns. The storage unit stores correspondence information that associates each of the subsets with a destination identifier indicating the destination of that subset.

In one aspect, a data processing method executed by a computer is also provided. In one aspect, a data processing program to be executed by a computer is also provided.

In one aspect, the impact of leak countermeasures on the usefulness of data is reduced.

A diagram illustrating the data processing apparatus of the first embodiment.
A diagram showing a hardware example of the information processing apparatus of the second embodiment.
A block diagram showing a functional example of the information processing apparatus.
A diagram showing an example of the original data table.
A diagram showing an example of the parameter table.
A diagram showing an example of the frequency distribution table.
A diagram showing an example of the group table.
A diagram showing an example of the identifiable data table.
A diagram showing an example of the correspondence table.
A diagram showing an example of the corrected frequency distribution table.
A diagram showing an example of the provided data table.
A diagram showing an example of the leaked data table.
A flowchart showing a procedure example of original data analysis.
A flowchart showing a procedure example of provided data generation.
A flowchart showing a procedure example of leak source estimation.

Hereinafter, the present embodiment will be described with reference to the drawings.
[First Embodiment]
The first embodiment will be described.

FIG. 1 is a diagram illustrating the data processing apparatus of the first embodiment.
The data processing apparatus 10 of the first embodiment generates, from given data, a plurality of subsets to be provided to a plurality of destinations. When a data leak is discovered after the processed data has been provided, the data processing apparatus 10 also estimates the leak source from among those destinations. However, the data processing and the leak source estimation may be performed by different apparatuses. The data processing apparatus 10 may be a client computer or a server computer.

The data processing apparatus 10 has a storage unit 11 and a processing unit 12. The storage unit 11 may be a volatile semiconductor memory such as a RAM (Random Access Memory) or non-volatile storage such as an HDD (Hard Disk Drive). The processing unit 12 is, for example, a processor such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor). However, the processing unit 12 may include an application-specific electronic circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The processor executes a program stored in a memory such as a RAM. A set of multiple processors may be referred to as a "multiprocessor" or simply a "processor".

The storage unit 11 stores correspondence information 13. The correspondence information 13 associates a plurality of subsets of the data with a plurality of destination identifiers such as destination identifiers 14a, 14b, and 14c. The subsets of the data are specified, for example, by a plurality of deletion patterns such as deletion patterns 15a, 15b, and 15c. Each deletion pattern indicates the records to be deleted from data containing a plurality of records. The records to be deleted are some of the records in the data and are specified, for example, by record identifiers that identify each record within the data. Each destination identifier indicates an individual or organization to which the data is provided.

The processing unit 12 determines the records to be deleted from the data and generates a plurality of different deletion patterns. The processing unit 12 generates a plurality of subsets of the data by deleting some of the records in the data based on each of the deletion patterns.

The destination indicated by a given destination identifier is provided with the subset generated from the data based on the deletion pattern associated with that destination identifier. The provided subset does not contain the records that the deletion pattern designates for deletion. It is preferable that different destination identifiers be associated with different deletion patterns. However, deletion patterns count as different as long as their sets of records to be deleted are not exactly identical; some of the records to be deleted may overlap.

For example, the deletion pattern 15a is associated with the destination identifier 14a. The deletion pattern 15a designates records #4, #5, and #12 in the data for deletion. The deletion pattern 15b is associated with the destination identifier 14b. The deletion pattern 15b designates records #3, #7, and #10 in the data for deletion. The deletion pattern 15c is associated with the destination identifier 14c. The deletion pattern 15c designates records #2, #8, and #9 in the data for deletion.
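The example above can be sketched in code. The following is a minimal illustration, not the patented implementation: it assumes the data holds exactly twelve records numbered 1 to 12 (the text only implies that record #12 exists), and the names `records`, `correspondence`, and `make_subset` are hypothetical. The keys "14a", "14b", and "14c" reuse the reference numerals purely as labels.

```python
from typing import Dict, FrozenSet

# Assumed stand-in for the data: record number -> record value.
records: Dict[int, str] = {i: f"record-{i}" for i in range(1, 13)}

# Correspondence information: destination identifier -> deletion pattern
# (the set of record numbers withheld from that destination).
correspondence: Dict[str, FrozenSet[int]] = {
    "14a": frozenset({4, 5, 12}),
    "14b": frozenset({3, 7, 10}),
    "14c": frozenset({2, 8, 9}),
}


def make_subset(data: Dict[int, str], deletion_pattern: FrozenSet[int]) -> Dict[int, str]:
    """Return the records of `data` that the deletion pattern does not remove."""
    return {rid: value for rid, value in data.items() if rid not in deletion_pattern}


# One subset per destination; record values themselves are left unmodified.
subsets = {dest: make_subset(records, pattern) for dest, pattern in correspondence.items()}

for dest, subset in subsets.items():
    print(dest, sorted(subset))
```

Each destination receives nine unaltered records, and the only thing that distinguishes the subsets is which records are missing.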

The processing unit 12 acquires leaked data 16. The leaked data 16 is, for example, input from outside the data processing apparatus 10 and stored in the storage unit 11. The leaked data 16 contains some of the records included in the original data. The leaked data 16 is a set of records suspected of having leaked from one of the destinations, and is all or part of the subset provided to that destination. The leaked data 16 is obtained from an information source other than the destinations, such as the Internet or a list broker.

Upon acquiring the leaked data 16, the processing unit 12 searches the deletion patterns indicated by the correspondence information 13 for a deletion pattern that designates none of the records in the leaked data 16 for deletion. This is because a record designated for deletion was never provided and therefore cannot have leaked. The processing unit 12 extracts from the correspondence information 13 the destination identifier associated with the deletion pattern found, and estimates that the destination indicated by the extracted destination identifier is the leak source of the leaked data 16. The processing unit 12 generates and outputs leak source information 17 indicating the estimated leak source. The leak source information 17 includes, for example, the extracted destination identifier.

For example, suppose the leaked data 16 contains records #2, #4, #5, and #8. The deletion pattern 15a registered in the correspondence information 13 designates records #4 and #5 for deletion. The destination of the destination identifier 14a associated with the deletion pattern 15a therefore cannot have leaked the leaked data 16 and is excluded from the leak source candidates. The deletion pattern 15c designates records #2 and #8 for deletion. The destination of the destination identifier 14c associated with the deletion pattern 15c therefore cannot have leaked the leaked data 16 and is likewise excluded. In contrast, the deletion pattern 15b designates none of records #2, #4, #5, and #8 for deletion. The destination of the destination identifier 14b associated with the deletion pattern 15b could have leaked the leaked data 16 and is therefore estimated to be the leak source.
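The elimination step in this example amounts to a set-disjointness test. The sketch below is illustrative only; the function name `estimate_leak_sources` and the dictionary layout are assumptions, and the keys reuse the reference numerals 14a to 14c as labels.

```python
from typing import Dict, FrozenSet, Iterable, List


def estimate_leak_sources(
    correspondence: Dict[str, FrozenSet[int]],
    leaked_record_ids: Iterable[int],
) -> List[str]:
    """Return destination identifiers whose deletion pattern contains
    none of the leaked records, i.e. destinations that could have leaked them."""
    leaked = set(leaked_record_ids)
    return [
        dest
        for dest, deletion_pattern in correspondence.items()
        if deletion_pattern.isdisjoint(leaked)
    ]


# Deletion patterns 15a, 15b, 15c keyed by destination identifiers 14a, 14b, 14c.
correspondence = {
    "14a": frozenset({4, 5, 12}),
    "14b": frozenset({3, 7, 10}),
    "14c": frozenset({2, 8, 9}),
}

candidates = estimate_leak_sources(correspondence, [2, 4, 5, 8])
print(candidates)  # ['14b']
```

The function returns a list because a small leak may be consistent with more than one destination; with an empty leak, for instance, no destination can be excluded.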

However, the leak source estimation may be performed by a unit of the data processing apparatus 10 other than the processing unit 12, or by an information processing apparatus other than the data processing apparatus 10.
According to the data processing apparatus 10 of the first embodiment, when data is provided to a plurality of destinations, the correspondence information 13 indicating the correspondence between the subsets and the destination identifiers is generated. When the leaked data 16 is acquired, the correspondence information 13 is searched for a destination identifier for which none of the records in the leaked data 16 is designated for deletion, and the destination identifier found is output as information indicating the leak source of the leaked data 16.

This makes it possible to estimate the leak source from the leaked data 16 even when there are a plurality of destinations. Data protection can therefore be strengthened, for example by requiring the leak source to prevent recurrence. In addition, compared with methods that add noise to record values, the authenticity of the records is preserved and the impact on the usefulness of the data is reduced.

[Second Embodiment]
Next, a second embodiment will be described.
The information processing apparatus 100 of the second embodiment processes original data held by an organization to generate provided data for each of a plurality of other organizations. When leaked data is acquired, the information processing apparatus 100 also analyzes the leaked data and estimates the organization that is the leak source. The description of the second embodiment assumes that the original data is DNA data containing a plurality of base sequence records. The providing organization is assumed to be a business operator that collects a large amount of DNA data, and the destination organizations are assumed to be research institutes and the like that conduct research and development using DNA data. The DNA data is provided, for a fee or free of charge, on the condition that it is not leaked to the outside.

FIG. 2 is a diagram showing a hardware example of the information processing apparatus of the second embodiment.
The information processing apparatus 100 has a CPU 101, a RAM 102, an HDD 103, an image signal processing unit 104, an input signal processing unit 105, a medium reader 106, and a communication interface 107. The information processing apparatus 100 corresponds to the data processing apparatus 10 of the first embodiment. The CPU 101 corresponds to the processing unit 12 of the first embodiment. The RAM 102 or the HDD 103 corresponds to the storage unit 11 of the first embodiment. Depending on the processing, the processing unit 12 may be realized by the CPUs of a plurality of information processing apparatuses connected to each other via a network or the like.

The CPU 101 is a processor that executes program instructions. The CPU 101 loads at least part of the programs and data stored in the HDD 103 into the RAM 102 and executes the programs. The CPU 101 may include a plurality of processor cores, the information processing apparatus 100 may have a plurality of processors, and the processing described below may be executed in parallel using a plurality of processors or processor cores. A set of multiple processors may be referred to as a "multiprocessor" or simply a "processor".

The RAM 102 is a volatile semiconductor memory that temporarily stores programs executed by the CPU 101 and data used by the CPU 101 for computation. The information processing apparatus 100 may include a type of memory other than RAM, and may include a plurality of memories.

The HDD 103 is a non-volatile storage device that stores data and software programs such as an OS (Operating System) and middleware. The information processing apparatus 100 may include other types of storage devices such as a flash memory or an SSD (Solid State Drive), and may include a plurality of non-volatile storage devices.

The image signal processing unit 104 outputs images to a display 111 connected to the information processing apparatus 100 in accordance with instructions from the CPU 101. Any type of display can be used as the display 111, such as a CRT (Cathode Ray Tube) display, a liquid crystal display (LCD: Liquid Crystal Display), a plasma display, or an organic EL (OEL: Organic Electro-Luminescence) display.

The input signal processing unit 105 acquires an input signal from an input device 112 connected to the information processing apparatus 100 and outputs it to the CPU 101. Any type of input device can be used as the input device 112, such as a mouse, a touch panel, a touch pad, a trackball, a keyboard, a remote controller, or a button switch. A plurality of types of input devices may be connected to the information processing apparatus 100.

The medium reader 106 is a reading device that reads programs and data recorded on a recording medium 113. As the recording medium 113, for example, a magnetic disk, an optical disc, a magneto-optical disk (MO: Magneto-Optical disk), or a semiconductor memory can be used. Magnetic disks include flexible disks (FD: Flexible Disk) and HDDs. Optical discs include CDs (Compact Discs) and DVDs (Digital Versatile Discs).

The medium reader 106 copies, for example, programs and data read from the recording medium 113 to another recording medium such as the RAM 102 or the HDD 103. The read programs are executed, for example, by the CPU 101. The recording medium 113 may be a portable recording medium and may be used to distribute programs and data. The recording medium 113 and the HDD 103 may be referred to as computer-readable recording media.

The communication interface 107 is an interface that communicates with other information processing apparatuses via a network 114. The communication interface 107 may be a wired communication interface connected to a wired communication device such as a switch or a router, or a wireless communication interface connected to a wireless communication device such as a base station or an access point.

FIG. 3 is a block diagram showing a functional example of the information processing apparatus.
The information processing apparatus 100 has an original data storage unit 121, a parameter storage unit 122, a provided data storage unit 123, a correspondence table storage unit 124, and a leaked data storage unit 125. These storage units are implemented, for example, using storage areas of the RAM 102 or the HDD 103. The information processing apparatus 100 also has a deleted record count calculation unit 131, an identifiable record extraction unit 132, a grouping unit 133, a deletion pattern generation unit 134, a record deletion unit 135, a record frequency reduction unit 136, and a leak source estimation unit 137. These processing units are implemented, for example, using programs executed by the CPU 101.

The original data storage unit 121 stores original data containing a plurality of records, each representing a DNA base sequence. A DNA base sequence is a sequence made up of four types of bases: A (adenine), G (guanine), C (cytosine), and T (thymine). The base sequences in the original data need not all have the same length. In the original data, each base sequence may be represented using the four characters above, or using corresponding numerical values or other symbols.

The parameter storage unit 122 stores parameter values used when generating provided data from the original data. The parameter values are, for example, input to the information processing apparatus 100 by a user. The information processing apparatus 100 generates a subset of the original data as provided data by deleting some records from the original data. The parameter values are used to determine the number of records to delete. The parameter values include the maximum number of destinations, which is the largest number of destinations anticipated, and the estimation ability, which is the probability that the leak source can be uniquely identified from leaked data. The larger the maximum number of destinations, the larger the number of deleted records. Likewise, the higher the estimation ability desired by the user, the larger the number of deleted records. However, the user may also directly input the number of deleted records as a parameter value.

The provided data storage unit 123 stores provided data generated by deleting some records from the original data. The provided data differs from destination to destination. The provided data storage unit 123 can store a plurality of sets of provided data corresponding to a plurality of destinations. The provided data stored in the provided data storage unit 123 may be transmitted to the information processing apparatus of the destination via the network 114. Alternatively, the provided data may be written to a portable recording medium and sent to the destination.

The correspondence table storage unit 124 stores a correspondence table that associates each destination identifier, which identifies a destination, with a deletion pattern indicating the records deleted from the original data. The correspondence table is created and retained when the provided data is generated, and is referenced when the leak source is estimated.

The leaked data storage unit 125 stores leaked data obtained from an information source other than the destination organizations. The leaked data is, for example, obtained by a user and input to the information processing apparatus 100. The leaked data may be obtained from a server device exposed on a network such as the Internet, or from an information vendor such as a so-called list broker. The leaked data has leaked from one of the destinations and corresponds to part or all of the provided data for that destination. Provided data may leak through misconduct by an employee inside the destination, or through a security attack from outside the destination. The more records the leaked data contains, the easier it is to estimate the leak source.

削除レコード数算出部１３１は、パラメータ記憶部１２２に記憶されたパラメータ値を用いて削除レコード数を算出する。例えば、提供データの中から任意の数のレコードがランダムに漏洩すると仮定する。すなわち、提供データから抽出可能なサブセットの全てのパターンが、等確率で漏洩対象となるものと仮定する。すると、削除レコード数ｒ、最大提供先数ｐおよび推定能力ｃの間に、近似的に次の不等式が成立すると考えることができる。ｒ≧ｌｏｇ_２（（ｐ－１）／（１－ｃ））。削除レコード数算出部１３１は、削除レコード数ｒとして、この不等式を満たす最小の整数を採用すればよい。削除レコード数算出部１３１は、算出した削除レコード数をグルーピング部１３３および削除パターン生成部１３４に通知する。 The deleted record number calculation unit 131 calculates the number of deleted records using the parameter value stored in the parameter storage unit 122. For example, assume that any number of records are randomly leaked from the provided data. That is, it is assumed that all patterns of the subset that can be extracted from the provided data are subject to leakage with equal probability. Then, it can be considered that the following inequality is approximately established between the number of deleted records r, the maximum number of destinations p, and the estimation capacity c. r ≧ log ₂ ((p-1) / (1-c)). The deleted record number calculation unit 131 may adopt the smallest integer satisfying this inequality as the deleted record number r. The deleted record number calculation unit 131 notifies the grouping unit 133 and the deletion pattern generation unit 134 of the calculated number of deleted records.
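As a minimal sketch of the calculation described above, the following Python function returns the smallest integer r satisfying r ≧ log₂((p − 1) / (1 − c)). The function name `required_deletions`, the argument checks, and the handling of p = 1 are assumptions made for illustration and are not part of the described apparatus.

```python
import math


def required_deletions(p: int, c: float) -> int:
    """Smallest non-negative integer r with r >= log2((p - 1) / (1 - c)).

    p: maximum number of provision destinations.
    c: required estimation capacity, assumed here to satisfy 0 <= c < 1.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    if not 0.0 <= c < 1.0:
        raise ValueError("c must satisfy 0 <= c < 1")
    if p == 1:
        # Assumption: with a single destination nothing has to be told apart.
        return 0
    return max(0, math.ceil(math.log2((p - 1) / (1 - c))))
```

For p = 3 and c = 0.75 the argument of the logarithm is 8, so the sketch returns 3.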

識別可能レコード抽出部１３２は、元データ記憶部１２１に記憶された元データに含まれる複数のレコードを比較し、同じ塩基配列の出現数（度数）をカウントして度数分布を生成する。識別可能レコード抽出部１３２は、度数が１である塩基配列をもつレコード、すなわち、元データの中で１回しか出現しない塩基配列をもつレコードを識別可能レコードとして取り扱う。識別可能レコードは、塩基配列自体によって一意に識別できる。一方、識別可能レコード抽出部１３２は、識別可能レコード以外のレコード、すなわち、度数が１より大きい塩基配列をもつレコードを識別不能レコードとして取り扱う。 The identifiable record extraction unit 132 compares a plurality of records included in the original data stored in the original data storage unit 121, counts the number of appearances (frequency) of the same base sequence, and generates a frequency distribution. The identifiable record extraction unit 132 handles a record having a base sequence having a frequency of 1, that is, a record having a base sequence that appears only once in the original data as an identifiable record. The identifiable record can be uniquely identified by the base sequence itself. On the other hand, the identifiable record extraction unit 132 treats a record other than the identifiable record, that is, a record having a base sequence having a frequency greater than 1, as an unidentifiable record.

識別可能レコード抽出部１３２は、識別可能レコードをグルーピング部１３３に通知する。また、識別可能レコード抽出部１３２は、生成した度数分布をレコード削除部１３５に通知する。なお、ＤＮＡデータに含まれる各塩基配列は長いことが多く、完全に同一の塩基配列をもつレコードが出現する確率は小さい。よって、元データに含まれるレコードの多くは識別可能レコードであり、識別不能レコードは少数であると期待される。 The identifiable record extraction unit 132 notifies the grouping unit 133 of the identifiable records. Further, the identifiable record extraction unit 132 notifies the record deletion unit 135 of the generated frequency distribution. It should be noted that each base sequence contained in the DNA data is often long, and the probability that a record having the same base sequence will appear is small. Therefore, it is expected that most of the records included in the original data are identifiable records and the number of unidentifiable records is small.

グルーピング部１３３は、識別可能レコード抽出部１３２が元データから抽出した識別可能レコードを複数のグループに分類する。グルーピングの基準は予めグルーピング部１３３に設定されていてもよいし、ユーザが情報処理装置１００に適宜入力してもよい。 The grouping unit 133 classifies the identifiable records extracted from the original data by the identifiable record extraction unit 132 into a plurality of groups. The grouping reference may be set in advance in the grouping unit 133, or may be appropriately input by the user to the information processing apparatus 100.

グルーピングでは、提供先の組織が興味をもつ可能性のある識別可能レコードの集合を１つのグループとすることが好ましい。すなわち、提供先の組織は、提供データに含まれる全てのレコードに興味があるとは限らず、提供データから一部のレコードを抽出して研究開発などに使用することがある。抽出しなかったレコードは提供先の組織において破棄されることがあり、抽出したレコードに比べて漏洩する可能性が低い。そこで、グルーピング部１３３は、特定のグループのみからレコードが漏洩した場合でも漏洩元を推定できるように、識別可能レコードを提供先の興味を基準にしてグルーピングする。 In grouping, it is preferable to group a set of identifiable records that may be of interest to the organization to which they are provided. That is, the organization to which the information is provided may not be interested in all the records contained in the data provided, and may extract some records from the data provided and use them for research and development. Records that have not been extracted may be discarded by the organization to which they are provided, and are less likely to be leaked than the extracted records. Therefore, the grouping unit 133 groups the identifiable records based on the interest of the provider so that the leakage source can be estimated even if the records are leaked from only a specific group.

例えば、ある種の研究開発では塩基配列の先頭の塩基が重要な意味をもつため、塩基配列の先頭が特定の塩基であるレコードのみ使用される可能性があると仮定する。この場合、グルーピング部１３３は、識別可能レコードを塩基配列の先頭の塩基に基づいて最大４つのグループに分類することが考えられる。すなわち、塩基配列の先頭がＡのグループ、Ｇのグループ、ＣのグループおよびＴのグループが生成され得る。このようなレコードの使用方法に関する知見は、例えば、ユーザによって入力される。 For example, suppose that in some research and development, the base at the beginning of the base sequence is important, so only records with a specific base at the beginning of the base sequence may be used. In this case, the grouping unit 133 may classify the identifiable records into a maximum of four groups based on the first base of the base sequence. That is, a group whose base sequence starts with A, a group G, a group C, and a group T can be generated. Knowledge of how to use such records is entered, for example, by the user.

グルーピング部１３３は、グルーピング結果と削除レコード数算出部１３１から通知された削除レコード数とに基づいて、複数の識別可能レコードをソートする。識別可能レコードのソートでは、できる限り同じグループに属する識別可能レコードが連続して出現しないように並べ替えられる。これにより、ある提供先に対応する提供データを生成する際に、できる限り異なるグループに属する識別可能レコードが削除対象レコードとして選択されるようになる。グルーピング部１３３は、ソートした識別可能レコードのリストを削除パターン生成部１３４に通知する。 The grouping unit 133 sorts the identifiable records based on the grouping result and the number of deleted records notified from the deleted record number calculation unit 131. In this sort, the identifiable records are reordered so that, as far as possible, records belonging to the same group do not appear consecutively. As a result, when the provided data for a given provision destination is generated, identifiable records belonging to different groups are, as far as possible, selected as the records to be deleted. The grouping unit 133 notifies the deletion pattern generation unit 134 of the sorted list of identifiable records.

削除パターン生成部１３４は、提供先の組織を識別する提供先識別子を取得する。提供先識別子は、例えば、ユーザが情報処理装置１００に入力する。削除パターン生成部１３４は、グルーピング部１３３から通知された識別可能レコードのリストの中から、削除レコード数算出部１３１から通知された削除レコード数だけ識別可能レコードを選択する。例えば、削除パターン生成部１３４は、識別可能レコードのリストの先頭から順に、削除レコード数だけ識別可能レコードを選択する。削除パターン生成部１３４は、選択した識別可能レコードを削除対象レコードとして決定し、削除対象レコードの集合を示す削除パターンを生成する。削除パターン生成部１３４は、提供先識別子と削除パターンとを対応付けて、対応表記憶部１２４に記憶された対応表に登録する。また、削除パターン生成部１３４は、生成した削除パターンをレコード削除部１３５に通知する。 The deletion pattern generation unit 134 acquires a provider identifier that identifies the provider organization. The provider identifier is, for example, entered by the user into the information processing apparatus 100. The deletion pattern generation unit 134 selects as many identifiable records as the number of deleted records notified by the deletion record number calculation unit 131 from the list of identifiable records notified by the grouping unit 133. For example, the deletion pattern generation unit 134 selects identifiable records by the number of deleted records in order from the beginning of the list of identifiable records. The deletion pattern generation unit 134 determines the selected identifiable record as the deletion target record, and generates a deletion pattern indicating a set of deletion target records. The deletion pattern generation unit 134 associates the provision destination identifier with the deletion pattern and registers it in the correspondence table stored in the correspondence table storage unit 124. Further, the deletion pattern generation unit 134 notifies the record deletion unit 135 of the generated deletion pattern.

レコード削除部１３５は、塩基配列と度数を対応付けた度数分布を識別可能レコード抽出部１３２から取得し、削除パターン生成部１３４から通知された削除パターンが示す塩基配列を、取得した度数分布から削除して修正度数分布を生成する。レコード削除部１３５は、修正度数分布をレコード度数低減部１３６に通知する。 The record deletion unit 135 acquires, from the identifiable record extraction unit 132, the frequency distribution that associates base sequences with frequencies, and generates a modified frequency distribution by deleting from it the base sequences indicated by the deletion pattern notified from the deletion pattern generation unit 134. The record deletion unit 135 notifies the record frequency reduction unit 136 of the modified frequency distribution.

レコード度数低減部１３６は、レコード削除部１３５から修正度数分布を取得し、修正度数分布の中から度数が１より大きい塩基配列、すなわち、識別不能レコードの塩基配列を検出する。レコード度数低減部１３６は、検出した塩基配列の度数を、識別可能レコードの削除割合と整合するように低減させる。識別可能レコードのみが削除対象となり識別不能レコードが削除対象にならないとすると、提供データに含まれる識別不能レコードの割合が元データよりも大きくなってしまうためである。識別不能レコードに相当する塩基配列の度数を低減させることで、識別可能レコードの削除の影響を軽減でき、提供データがもつ特性を元データの特性に近付けることができる。 The record frequency reduction unit 136 acquires a correction frequency distribution from the record deletion unit 135, and detects a base sequence having a frequency greater than 1 from the correction frequency distribution, that is, a base sequence of an unidentifiable record. The record frequency reduction unit 136 reduces the frequency of the detected base sequence so as to match the deletion rate of the identifiable record. This is because if only the identifiable records are to be deleted and the unidentifiable records are not to be deleted, the ratio of the unidentifiable records included in the provided data will be larger than that of the original data. By reducing the frequency of the base sequence corresponding to the unidentifiable record, the influence of the deletion of the identifiable record can be reduced, and the characteristics of the provided data can be brought closer to the characteristics of the original data.

例えば、元データに含まれる識別可能レコードの数をｎ、削除レコード数をｒ、ある塩基配列の度数をｎ_ｔ（ｎ_ｔは１より大きい整数）とすると、低減度数ｒ_ｔが次のように算出される。ｒ_ｔ＝ｒ／ｎ×ｎ_ｔ。ただし、この等式の右辺が割り切れない場合、四捨五入や切り捨てや切り上げなどにより低減度数ｒ_ｔを整数にする。レコード度数低減部１３６は、修正度数分布における度数ｎ_ｔをｎ_ｔ－ｒ_ｔに変更する。レコード度数低減部１３６は、度数調整後の修正度数分布を提供データとして提供データ記憶部１２３に格納する。このとき、修正度数分布にレコードＩＤが含まれている場合はレコードＩＤが削除される。 For example, let n be the number of identifiable records contained in the original data, r the number of deleted records, and n_t the frequency of a given base sequence (n_t is an integer greater than 1). The reduction frequency r_t is then calculated as r_t = r / n × n_t. If the right-hand side of this equation is not an integer, r_t is made an integer by rounding, rounding down, or rounding up. The record frequency reduction unit 136 changes the frequency n_t in the modified frequency distribution to n_t − r_t. The record frequency reduction unit 136 stores the modified frequency distribution after the frequency adjustment in the provided data storage unit 123 as the provided data. At this time, if the modified frequency distribution contains record IDs, the record IDs are deleted.

なお、削除レコード数算出部１３１、識別可能レコード抽出部１３２およびグルーピング部１３３の処理は、１つの元データに対して１回実行すればよい。一方、削除パターン生成部１３４、レコード削除部１３５およびレコード度数低減部１３６の処理は、１つの提供先識別子に対して１回実行される。３つの提供先が存在する場合、削除パターン生成部１３４、レコード削除部１３５およびレコード度数低減部１３６の処理が３回実行され、３つの提供先に対応する３セットの提供データが生成される。 The processes of the deleted record number calculation unit 131, the identifiable record extraction unit 132, and the grouping unit 133 may be executed once for one original data. On the other hand, the processing of the deletion pattern generation unit 134, the record deletion unit 135, and the record frequency reduction unit 136 is executed once for one provider identifier. When three provision destinations exist, the processing of the deletion pattern generation unit 134, the record deletion unit 135, and the record frequency reduction unit 136 is executed three times, and three sets of provision data corresponding to the three provision destinations are generated.

漏洩元推定部１３７は、漏洩データ記憶部１２５に記憶された漏洩データを分析して漏洩元を推定する。漏洩元推定部１３７は、元データ記憶部１２１に記憶された元データまたは識別可能レコード抽出部１３２が生成した度数分布を参照して、漏洩データに含まれる識別可能レコードを特定する。漏洩元推定部１３７は、対応表記憶部１２４に記憶された対応表に含まれる削除パターンと、漏洩データに出現する識別可能レコードとを照合して、漏洩元の候補を絞り込む。提供データに含まれていなかった識別可能レコードは当該提供データを受け取った組織からは漏洩し得ない。このため、漏洩データに出現する識別可能レコードが削除パターンに含まれている提供先は漏洩元から除外することができる。 The leak source estimation unit 137 analyzes the leak data stored in the leak data storage unit 125 and estimates the leak source. The leakage source estimation unit 137 identifies the identifiable record included in the leakage data by referring to the original data stored in the original data storage unit 121 or the frequency distribution generated by the identifiable record extraction unit 132. The leak source estimation unit 137 collates the deletion pattern included in the correspondence table stored in the correspondence table storage unit 124 with the identifiable record appearing in the leak data, and narrows down the leak source candidates. Identifiable records that were not included in the provided data cannot be leaked from the organization that received the provided data. Therefore, the provider whose deletion pattern includes the identifiable record appearing in the leaked data can be excluded from the leak source.

漏洩元推定部１３７は、推定した漏洩元を示す情報を出力する。例えば、漏洩元推定部１３７は、漏洩元と推定された提供先を示す提供先識別子を出力する。漏洩元推定部１３７は、漏洩元を示す情報をディスプレイ１１１に表示してもよいし、漏洩元を示す情報をネットワーク１１４を介して他の情報処理装置に送信してもよい。なお、漏洩元を１つに絞り込むことができなかった場合、漏洩元推定部１３７は、推定失敗を示すメッセージを出力してもよいし、漏洩元の候補を示す情報を出力してもよい。 The leak source estimation unit 137 outputs information indicating the estimated leak source. For example, the leak source estimation unit 137 outputs a provider identifier indicating a provider estimated to be a leak source. The leak source estimation unit 137 may display information indicating the leak source on the display 111, or may transmit information indicating the leak source to another information processing device via the network 114. If the leak source cannot be narrowed down to one, the leak source estimation unit 137 may output a message indicating an estimation failure or may output information indicating a leak source candidate.

なお、図３では、漏洩元推定部１３７は情報処理装置１００に組み込まれているが、情報処理装置１００から切り離し、例えば、漏洩元を推定する装置として他の情報処理装置に組み込まれ、ネットワーク経由で情報処理装置１００に接続されてもよい。 In FIG. 3, the leak source estimation unit 137 is incorporated in the information processing apparatus 100. However, it may be separated from the information processing apparatus 100; for example, it may be incorporated in another information processing apparatus that serves as a leak source estimation device and is connected to the information processing apparatus 100 via a network.

図４は、元データテーブルの例を示す図である。
元データテーブル１４１は、元データ記憶部１２１に記憶される。元データテーブル１４１は、レコードＩＤおよび塩基配列の項目を含む。レコードＩＤの項目には、レコードを識別する識別番号が登録される。塩基配列の項目には、塩基配列の各塩基をＡ，Ｇ，Ｃ，Ｔまたはこれに対応する数字や記号を用いて表した文字列が登録される。 FIG. 4 is a diagram showing an example of the original data table.
The original data table 141 is stored in the original data storage unit 121. The original data table 141 includes items of record ID and base sequence. An identification number for identifying a record is registered in the record ID item. In the base sequence item, a character string representing each base of the base sequence using A, G, C, T or a corresponding number or symbol is registered.

ここでは説明を簡単にするため、全ての塩基配列を長さ３の配列としている。ただし、塩基配列は長さがより大きくてもよく、長さが統一されていなくてもよい。また、ここではレコードＩＤとして連番の識別番号を使用している。ただし、レコードＩＤとして塩基配列のハッシュ値など他の数値や記号を使用してもよい。また、ここでは元データテーブル１４１の各レコードがレコードＩＤを含んでいる。ただし、レコードＩＤと塩基配列とが対応付けられていれば、各レコードが明示的にレコードＩＤを含まなくてもよい。例えば、元データテーブル１４１の先頭からのオフセットなど、各レコードの位置から当該レコードのレコードＩＤを特定できるようにしてもよい。 Here, for the sake of simplicity, all the base sequences are sequences of length 3. However, the base sequence may have a larger length and may not have a uniform length. Further, here, the identification number of the serial number is used as the record ID. However, other numerical values or symbols such as the hash value of the base sequence may be used as the record ID. Further, here, each record in the original data table 141 includes a record ID. However, as long as the record ID and the base sequence are associated with each other, each record does not have to explicitly include the record ID. For example, the record ID of the record may be specified from the position of each record, such as the offset from the beginning of the original data table 141.

また、第２の実施の形態では元データとしてＤＮＡデータを使用しているが、他の種類の元データを情報処理装置１００が扱うことも可能である。元データとしては、公開されることが好ましくない秘密情報が使用され得る。例えば、年齢、性別、住所などの複数の個人情報項目を含み、１つのレコードが１人分の個人情報を表している個人情報セットが挙げられる。また、経度、緯度、時刻などの複数の測定情報項目を含み、１つのレコードが人または物の一時点の位置を表している位置情報セットが挙げられる。 Further, although the DNA data is used as the original data in the second embodiment, the information processing apparatus 100 can handle other types of original data. Confidential information that is not desirable to be disclosed may be used as the original data. For example, a personal information set containing a plurality of personal information items such as age, gender, and address, in which one record represents personal information for one person. It also includes a set of position information that includes a plurality of measurement information items such as longitude, latitude, and time, and one record represents the position of a person or object at one point in time.

元データは、好ましくは、対象ドメインを構成する全ての人または物に関する情報を網羅的に含んでいる全数データではなく、一部の人または物に関する情報をサンプルとして含んでいるサンプルデータである。元データがサンプルデータであれば、元データから少数のレコードを削除しても提供データの価値は低下しないと期待される。 The original data is preferably sample data that includes information about some people or things as a sample, rather than 100% data that comprehensively contains information about all people or things that make up the target domain. If the original data is sample data, it is expected that the value of the provided data will not decrease even if a small number of records are deleted from the original data.

図５は、パラメータテーブルの例を示す図である。
パラメータテーブル１４２は、パラメータ記憶部１２２に記憶される。パラメータテーブル１４２は、パラメータ名とパラメータ値とを対応付けて記憶する。パラメータには、最大提供先数ｐ、推定能力ｃおよび削除レコード数ｒが含まれる。 FIG. 5 is a diagram showing an example of a parameter table.
The parameter table 142 is stored in the parameter storage unit 122. The parameter table 142 stores the parameter name and the parameter value in association with each other. The parameters include the maximum number of destinations p, the estimation capacity c, and the number of deleted records r.

最大提供先数ｐと推定能力ｃはユーザによって入力される。削除レコード数ｒは最大提供先数ｐと推定能力ｃから算出され、ｒ≧ｌｏｇ_２（（ｐ－１）／（１－ｃ））を満たす最小の整数である。ただし、ユーザが削除レコード数ｒを直接入力してもよい。一例として、最大提供先数ｐ＝３、推定能力ｃ＝０．７５がユーザから入力される。これは、提供先の組織が最大で３つ存在すること、および、７５％の確率で漏洩データから漏洩元を特定できる推定能力が要求されることを示している。この場合、最大提供先数ｐ＝３のもとで推定能力ｃ＝０．７５を達成するために削除レコード数ｒ＝３が決定される。 The maximum number of destinations p and the estimation capacity c are input by the user. The number of deleted records r is calculated from p and c as the smallest integer satisfying r ≧ log₂((p − 1) / (1 − c)). However, the user may directly input the number of deleted records r. As an example, the user inputs the maximum number of destinations p = 3 and the estimation capacity c = 0.75. This indicates that there are at most three provision-destination organizations, and that the required estimation capacity is such that the leak source can be identified from the leaked data with a probability of 75%. In this case, log₂((3 − 1) / (1 − 0.75)) = log₂ 8 = 3, so the number of deleted records r = 3 is determined in order to achieve the estimation capacity c = 0.75 under the maximum number of destinations p = 3.

図６は、度数分布テーブルの例を示す図である。
度数分布テーブル１４３は、識別可能レコード抽出部１３２によって元データテーブル１４１から生成される。度数分布テーブル１４３は、塩基配列、度数およびレコードＩＤの項目を含む。塩基配列の項目には、塩基配列を示す文字列が登録される。度数の項目には、元データテーブル１４１に当該塩基配列が出現した回数が登録される。レコードＩＤの項目には、当該塩基配列を含むレコードのレコードＩＤが登録される。ただし、度数が１である塩基配列を含む識別可能レコードのレコードＩＤを登録すればよく、度数が１より大きい塩基配列を含む識別不能レコードのレコードＩＤは省略してもよい。 FIG. 6 is a diagram showing an example of a frequency distribution table.
The frequency distribution table 143 is generated from the original data table 141 by the identifiable record extraction unit 132. The frequency distribution table 143 includes items of base sequence, frequency and record ID. A character string indicating the base sequence is registered in the base sequence item. In the frequency item, the number of times the base sequence appears in the original data table 141 is registered. In the record ID item, the record ID of the record including the base sequence is registered. However, the record ID of the identifiable record including the base sequence having a frequency of 1 may be registered, and the record ID of the unidentifiable record containing the base sequence having a frequency greater than 1 may be omitted.

例えば、元データテーブル１４１に含まれるレコード＃１～＃１２のうち、レコード＃１，＃６，＃１１はレコードＩＤ以外の値が同一である識別不能レコードである。レコード＃１，＃６，＃１１は塩基配列「ＡＡＡ」を含むため、塩基配列「ＡＡＡ」の度数は３である。一方、レコード＃１～＃１２のうち、レコード＃２～＃５，＃７～＃１０，＃１２はレコードＩＤ以外の値も同一のものが存在しない識別可能レコードである。これら９個のレコードそれぞれに含まれる塩基配列の度数は１である。よって、度数分布テーブル１４３には、度数３の塩基配列１個と、度数１の塩基配列９個が登録される。 For example, among the records # 1 to # 12 included in the original data table 141, the records # 1, # 6, and # 11 are unidentifiable records having the same value other than the record ID. Since the records # 1, # 6 and # 11 include the base sequence "AAA", the frequency of the base sequence "AAA" is 3. On the other hand, among the records # 1 to # 12, records # 2 to # 5, # 7 to # 10, and # 12 are identifiable records in which the same value other than the record ID does not exist. The frequency of the base sequence contained in each of these nine records is 1. Therefore, one base sequence of frequency 3 and nine base sequences of frequency 1 are registered in the frequency distribution table 143.
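The counting step in this example can be sketched in Python as follows. Only the sequence "AAA" for records #1, #6, and #11 is taken from the text; all other sequences, the function name `split_records`, and the dictionary layout are hypothetical placeholders chosen for illustration.

```python
from collections import Counter


def split_records(records: dict[int, str]):
    """Return (frequency distribution, sorted IDs of identifiable records).

    A record is treated as identifiable when its sequence occurs exactly once.
    """
    freq = Counter(records.values())
    identifiable = sorted(rid for rid, seq in records.items() if freq[seq] == 1)
    return freq, identifiable


# Hypothetical 12-record table: only "AAA" for records 1, 6 and 11 comes
# from the text; the remaining sequences are invented placeholders.
records = {
    1: "AAA", 2: "CAT", 3: "GAT", 4: "CCG", 5: "TAC", 6: "AAA",
    7: "TGG", 8: "CGA", 9: "GCC", 10: "CTT", 11: "AAA", 12: "GTA",
}
freq, identifiable = split_records(records)
```

With this input the distribution holds one sequence of frequency 3 and nine sequences of frequency 1, as in the example above.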

図７は、グループテーブルの例を示す図である。
グループテーブル１４４は、グルーピング部１３３によって度数分布テーブル１４３から生成される。グループテーブル１４４は、グループＩＤ、レコードＩＤおよび塩基配列の項目を含む。グループＩＤの項目には、グループを識別する識別子が登録される。レコードＩＤの項目には、識別可能レコードのレコードＩＤが登録される。塩基配列の項目には、識別可能レコードに含まれる塩基配列を示す文字列が登録される。 FIG. 7 is a diagram showing an example of a group table.
The group table 144 is generated from the frequency distribution table 143 by the grouping unit 133. The group table 144 includes items of group ID, record ID and base sequence. An identifier that identifies the group is registered in the item of the group ID. The record ID of the identifiable record is registered in the record ID item. In the base sequence item, a character string indicating the base sequence included in the identifiable record is registered.

グルーピングでは、度数分布テーブル１４３から度数が１である識別可能レコードのみ抽出され、識別可能レコードが複数のグループに分類される。１つのグループは１以上の識別可能レコードを含む。よって、グループテーブル１４４では、１つのグループＩＤに対して１以上のレコードＩＤが対応付けられる。 In the grouping, only the identifiable records having a frequency of 1 are extracted from the frequency distribution table 143, and the identifiable records are classified into a plurality of groups. One group contains one or more identifiable records. Therefore, in the group table 144, one or more record IDs are associated with one group ID.

ここでは、塩基配列の先頭の塩基に基づいて識別可能レコードを分類するよう、グルーピング部１３３が設定されていると仮定する。グループＧ_Ｃは、先頭の塩基が「Ｃ」である塩基配列を示すグループであり、レコード＃２，＃４，＃８，＃１０の４個の識別可能レコードを含む。グループＧ_Ｇは、先頭の塩基が「Ｇ」である塩基配列を示すグループであり、レコード＃３，＃９，＃１２の３個の識別可能レコードを含む。グループＧ_Ｔは、先頭の塩基が「Ｔ」である塩基配列を示すグループであり、レコード＃５，＃７の２個の識別可能レコードを含む。ここでは、先頭の塩基が「Ａ」である塩基配列を含む識別可能レコードが存在しないため、「Ａ」に対応するグループは生成されていない。 Here, it is assumed that the grouping unit 133 is set to classify the identifiable records based on the first base of the base sequence. Group G_C is the group of base sequences whose first base is "C", and contains four identifiable records: records #2, #4, #8, and #10. Group G_G is the group of base sequences whose first base is "G", and contains three identifiable records: records #3, #9, and #12. Group G_T is the group of base sequences whose first base is "T", and contains two identifiable records: records #5 and #7. Since no identifiable record contains a base sequence whose first base is "A", no group corresponding to "A" is generated.
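A minimal Python sketch of this first-base grouping is given below. The sequences are hypothetical; only their leading bases follow the example. The function name `group_by_first_base` is likewise an assumption, and every sequence is assumed to be non-empty.

```python
from collections import defaultdict


def group_by_first_base(identifiable: dict[int, str]) -> dict[str, list[int]]:
    """Map each leading base to the IDs of the identifiable records starting with it."""
    groups = defaultdict(list)
    for rid, seq in sorted(identifiable.items()):
        groups[seq[0]].append(rid)  # assumes seq is not empty
    return dict(groups)


# Hypothetical sequences; only the leading bases follow the example in the text.
identifiable = {
    2: "CAT", 3: "GAT", 4: "CCG", 5: "TAC", 7: "TGG",
    8: "CGA", 9: "GCC", 10: "CTT", 12: "GTA",
}
groups = group_by_first_base(identifiable)
```

Because groups are created only when a record is assigned to them, no group appears for a leading base that no identifiable record has.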

なお、グルーピング部１３３は、元データの種類に応じて様々なグルーピング基準を採用することが可能である。ユーザの知見に基づいて、適切なグルーピング基準が選択される。例えば、個人情報セットの識別可能レコードは、１０代、２０代、３０代といった年齢層で分類することもできるし、住所に含まれる行政区画名で分類することもできる。また、例えば、位置情報セットの識別可能レコードは、１０時、１１時、１２時といった時間帯で分類することもできるし、経度および緯度の範囲で分類することもできる。 The grouping unit 133 can adopt various grouping criteria according to the type of the original data. Appropriate grouping criteria are selected based on the user's knowledge. For example, the identifiable records of the personal information set can be classified by the age group such as teens, 20s, and 30s, or can be classified by the administrative division name included in the address. Further, for example, the identifiable records of the location information set can be classified by a time zone such as 10 o'clock, 11 o'clock, 12 o'clock, or can be classified by a range of longitude and latitude.

図８は、識別可能データテーブルの例を示す図である。
識別可能データテーブル１４５は、グルーピング部１３３によってグループテーブル１４４から生成される。識別可能データテーブル１４５は、識別可能レコードについてレコードＩＤおよび塩基配列を含む。識別可能データテーブル１４５では、グループテーブル１４４に基づいて複数の識別可能レコードがソートされている。できる限り同じグループに属する識別可能レコードが連続しないように並べられている。 FIG. 8 is a diagram showing an example of an identifiable data table.
The identifiable data table 145 is generated from the group table 144 by the grouping unit 133. The identifiable data table 145 contains a record ID and a base sequence for identifiable records. In the identifiable data table 145, a plurality of identifiable records are sorted based on the group table 144. As much as possible, the identifiable records belonging to the same group are arranged so as not to be consecutive.

例えば、識別可能データテーブル１４５の先頭から順に、グループＧ_Ｃのレコード＃４、グループＧ_Ｇのレコード＃１２、グループＧ_Ｔのレコード＃５、グループＧ_Ｃのレコード＃１０、グループＧ_Ｇのレコード＃３、グループＧ_Ｔのレコード＃７と並んでいる。これは、グループＧ_Ｃ，Ｇ_Ｇ，Ｇ_Ｔから１つずつ順番に（巡回的に）識別可能レコードを選択したものである。これに続いて、グループＧ_Ｃのレコード＃８、グループＧ_Ｇのレコード＃９、グループＧ_Ｃのレコード＃２と並んでいる。 For example, from the top of the identifiable data table 145, the records are arranged in the following order: record #4 of group G_C, record #12 of group G_G, record #5 of group G_T, record #10 of group G_C, record #3 of group G_G, and record #7 of group G_T. This is the result of selecting identifiable records one at a time from groups G_C, G_G, and G_T in turn (cyclically). These are followed by record #8 of group G_C, record #9 of group G_G, and record #2 of group G_C.
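The cyclic selection described above can be sketched in Python as a round-robin interleave. The function name `interleave_groups` is an assumption. The order of records inside each group is copied from the example, since the text does not state how that order is chosen.

```python
from itertools import zip_longest


def interleave_groups(groups: list[list[int]]) -> list[int]:
    """Take one record from each group in turn until every group is exhausted."""
    missing = object()  # sentinel marking a group that has run out of records
    ordered = []
    for one_round in zip_longest(*groups, fillvalue=missing):
        ordered.extend(rid for rid in one_round if rid is not missing)
    return ordered


# Within-group orders follow the example; how they are chosen is not specified.
g_c, g_g, g_t = [4, 10, 8, 2], [12, 3, 9], [5, 7]
order = interleave_groups([g_c, g_g, g_t])
```

For these inputs the result is 4, 12, 5, 10, 3, 7, 8, 9, 2, which is the order listed for the identifiable data table 145.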

図９は、対応表の例を示す図である。
対応表１４６は、対応表記憶部１２４に記憶される。対応表１４６は、提供先識別子および削除パターンの項目を含む。提供先識別子の項目には、ユーザから入力された識別子が登録される。提供先識別子として提供先の組織の名称を用いてもよい。削除パターンの項目には、削除する識別可能レコードのレコードＩＤが列挙される。 FIG. 9 is a diagram showing an example of a correspondence table.
The correspondence table 146 is stored in the correspondence table storage unit 124. Correspondence table 146 includes items of destination identifier and deletion pattern. The identifier input by the user is registered in the item of the provider identifier. The name of the organization of the provider may be used as the identifier of the provider. The record ID of the identifiable record to be deleted is listed in the deletion pattern item.

削除パターンは、識別可能データテーブル１４５の上から順に削除レコード数ｒだけ識別可能レコードを抽出することで生成できる。例えば、最初に提供先識別子Ｘが入力されると、識別可能データテーブル１４５の上から順にレコード＃４，＃１２，＃５が選択され、削除パターン「４，５，１２」が生成される。次に、提供先識別子Ｙが入力されると、識別可能データテーブル１４５の上から順にレコード＃１０，＃３，＃７が選択され、削除パターン「３，７，１０」が生成される。次に、提供先識別子Ｚが入力されると、識別可能データテーブル１４５の上から順にレコード＃８，＃９，＃２が選択され、削除パターン「２，８，９」が生成される。 The deletion pattern can be generated by extracting identifiable records by the number of deleted records r in order from the top of the identifiable data table 145. For example, when the destination identifier X is first input, records # 4, # 12, and # 5 are selected in order from the top of the identifiable data table 145, and the deletion pattern “4,5,12” is generated. Next, when the destination identifier Y is input, records # 10, # 3, # 7 are selected in order from the top of the identifiable data table 145, and the deletion pattern “3, 7, 10” is generated. Next, when the provider identifier Z is input, records # 8, # 9, and # 2 are selected in order from the top of the identifiable data table 145, and the deletion pattern “2,8,9” is generated.
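As an illustrative sketch, the assignment of consecutive blocks of r records to successive provision destinations could be written in Python as follows. The function name `assign_patterns` and the error raised when the list is too short are assumptions; the text does not describe that case.

```python
def assign_patterns(ordered_ids: list[int], recipients: list[str], r: int) -> dict[str, list[int]]:
    """Give each recipient the next r IDs from the sorted identifiable list."""
    if r * len(recipients) > len(ordered_ids):
        # Assumption: fail rather than reuse records when the list is too short.
        raise ValueError("not enough identifiable records for disjoint patterns")
    table = {}
    for i, recipient in enumerate(recipients):
        table[recipient] = sorted(ordered_ids[i * r:(i + 1) * r])
    return table


# Sorted list of the identifiable data table and r = 3, as in the example.
table = assign_patterns([4, 12, 5, 10, 3, 7, 8, 9, 2], ["X", "Y", "Z"], 3)
```

The resulting dictionary plays the role of the correspondence table, with X mapped to 4, 5, 12, Y to 3, 7, 10, and Z to 2, 8, 9.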

なお、第２の実施の形態では異なる提供先の間で削除対象レコードが重複しないように削除パターンを生成している。ただし、異なる提供先の間で削除パターンが完全に一致しなければよく、一部の削除対象レコードが重複してもよい。また、識別可能レコード数ｎが十分に大きい場合、すなわち、識別可能レコード数ｎが削除レコード数ｒと最大提供先数ｐの積よりも十分に大きい場合、簡易的な方法で削除対象レコードを選択することもできる。例えば、ｎ個の識別可能レコードの中からｒ個をランダムに選択する方法によっても、異なる提供先の間で削除対象レコードが重複しない可能性が高い。 In the second embodiment, the deletion pattern is generated so that the deletion target records do not overlap between different providers. However, it is sufficient that the deletion patterns do not exactly match between different providers, and some records to be deleted may be duplicated. Further, when the number of identifiable records n is sufficiently large, that is, when the number of identifiable records n is sufficiently larger than the product of the number of deleted records r and the maximum number of destinations p, the record to be deleted is selected by a simple method. You can also do it. For example, even by a method of randomly selecting r records from n identifiable records, there is a high possibility that the records to be deleted do not overlap between different providers.
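The simplified random selection mentioned above might look like the following Python sketch. It redraws whenever a pattern exactly matches one already issued, which enforces only the weaker condition that patterns are not completely identical. The function name `random_patterns`, the `seed` parameter, and the redraw strategy are assumptions for illustration.

```python
import math
import random


def random_patterns(identifiable_ids, recipients, r, seed=None):
    """Pick r random IDs per recipient, redrawing on an exact duplicate pattern."""
    pool = list(identifiable_ids)
    if len(recipients) > math.comb(len(pool), r):
        raise ValueError("more recipients than distinct patterns")
    rng = random.Random(seed)
    used = set()
    table = {}
    for recipient in recipients:
        while True:
            pattern = frozenset(rng.sample(pool, r))
            if pattern not in used:
                break
        used.add(pattern)
        table[recipient] = sorted(pattern)
    return table
```

When n is much larger than r × p, redraws are rare and partial overlaps between patterns are unlikely, although this sketch does not rule them out.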

図１０は、修正度数分布テーブルの例を示す図である。
修正度数分布テーブル１４７は、レコード削除部１３５によって度数分布テーブル１４３と対応表１４６から生成される。修正度数分布テーブル１４７は、度数分布テーブル１４３をコピーして幾つかの識別可能レコードを削除したものである。修正度数分布テーブル１４７は提供先識別子毎に生成される。ユーザから提供先識別子Ｘが入力されたとき、度数分布テーブル１４３がコピーされると共に、対応表１４６から提供先識別子Ｘに対応する削除パターン「４，５，１２」が検索される。度数分布テーブル１４３のコピーからレコード＃４，＃５，＃１２を削除したものが修正度数分布テーブル１４７となる。 FIG. 10 is a diagram showing an example of a correction frequency distribution table.
The modified frequency distribution table 147 is generated from the frequency distribution table 143 and the correspondence table 146 by the record deletion unit 135. The modified frequency distribution table 147 is a copy of the frequency distribution table 143 with some identifiable records deleted. The correction frequency distribution table 147 is generated for each provider identifier. When the provider identifier X is input by the user, the frequency distribution table 143 is copied, and the deletion pattern "4,5,12" corresponding to the provider identifier X is searched from the correspondence table 146. The modified frequency distribution table 147 is obtained by deleting records # 4, # 5, and # 12 from the copy of the frequency distribution table 143.

図１１は、提供データテーブルの例を示す図である。
提供データテーブル１４８は、レコード度数低減部１３６によって修正度数分布テーブル１４７から生成され、提供データ記憶部１２３に格納される。提供データテーブル１４８は提供先識別子毎に生成される。修正度数分布テーブル１４７から提供データテーブル１４８を生成するにあたって、塩基配列と度数を残してレコードＩＤが削除される。ただし、提供データテーブル１４８にレコードＩＤを残すようにしてもよい。 FIG. 11 is a diagram showing an example of the provided data table.
The provided data table 148 is generated from the modified frequency distribution table 147 by the record frequency reducing unit 136 and stored in the provided data storage unit 123. The provision data table 148 is generated for each provision destination identifier. In generating the provided data table 148 from the modified frequency distribution table 147, the record ID is deleted leaving the base sequence and the frequency. However, the record ID may be left in the provided data table 148.

また、修正度数分布テーブル１４７から提供データテーブル１４８を生成するにあたって、識別不能レコードに相当する塩基配列の度数が修正される。ある塩基配列の度数ｎ_ｔ、識別可能レコード数ｎ、削除レコード数ｒから低減度数ｒ_ｔ＝ｒ／ｎ×ｎ_ｔが算出され、度数ｎ_ｔがｎ_ｔ－ｒ_ｔに変更される。例えば、塩基配列「ＡＡＡ」の度数ｎ_ｔ＝３、識別可能レコード数ｎ＝９、削除レコード数ｒ＝３から低減度数ｒ_ｔ＝１が算出され、度数ｎ_ｔ＝３がｎ_ｔ－ｒ_ｔ＝３－１＝２に変更される。識別可能レコードについての塩基配列と度数は、修正度数分布テーブル１４７と同じである。 Further, in generating the provided data table 148 from the modified frequency distribution table 147, the frequencies of the base sequences corresponding to unidentifiable records are adjusted. From the frequency n_t of a given base sequence, the number of identifiable records n, and the number of deleted records r, the reduction frequency r_t = r / n × n_t is calculated, and the frequency n_t is changed to n_t − r_t. For example, for the base sequence "AAA", the reduction frequency r_t = 3 / 9 × 3 = 1 is calculated from the frequency n_t = 3, the number of identifiable records n = 9, and the number of deleted records r = 3, and the frequency n_t = 3 is changed to n_t − r_t = 3 − 1 = 2. The base sequences and frequencies of the identifiable records are the same as in the modified frequency distribution table 147.
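The frequency adjustment can be sketched in Python as below. The text allows rounding, rounding down, or rounding up; this sketch arbitrarily picks round-half-up and uses integer arithmetic to avoid floating-point error. The function name `reduce_frequencies` and the dictionary representation are assumptions.

```python
def reduce_frequencies(freq: dict[str, int], n: int, r: int) -> dict[str, int]:
    """Lower each frequency above 1 by r_t = r / n * n_t, rounded half up.

    freq: modified frequency distribution (sequence -> frequency).
    n: number of identifiable records in the original data.
    r: number of deleted records.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    adjusted = {}
    for seq, n_t in freq.items():
        if n_t > 1:
            # Integer form of floor(r * n_t / n + 1/2).
            r_t = (2 * r * n_t + n) // (2 * n)
            adjusted[seq] = n_t - r_t
        else:
            adjusted[seq] = n_t  # identifiable records are left unchanged
    return adjusted
```

For the frequency 3 of "AAA" with n = 9 and r = 3, the sketch gives r_t = 1 and an adjusted frequency of 2.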

図１２は、漏洩データテーブルの例を示す図である。
漏洩データテーブル１４９は、漏洩データ記憶部１２５に記憶される。漏洩データテーブル１４９は、塩基配列、度数およびレコードＩＤの項目を含む。ただし、レコードＩＤは入手した漏洩データ自体には含まれておらず、情報処理装置１００が塩基配列から特定して付加したものである。漏洩元推定部１３７は、元データテーブル１４１または度数分布テーブル１４３から塩基配列に対応するレコードＩＤを検索して、漏洩データテーブル１４９の塩基配列に対して付加する。塩基配列に対応するレコードＩＤが複数検索された場合、すなわち、識別不能レコードに含まれる塩基配列である場合、当該塩基配列に対しては度数分布テーブル１４３と同様にレコードＩＤを省略することができる。 FIG. 12 is a diagram showing an example of a leaked data table.
The leaked data table 149 is stored in the leaked data storage unit 125. The leaked data table 149 includes the items base sequence, frequency, and record ID. However, the record ID is not included in the obtained leaked data itself; it is identified from the base sequence and added by the information processing apparatus 100. The leak source estimation unit 137 searches the original data table 141 or the frequency distribution table 143 for the record ID corresponding to each base sequence and adds it to that base sequence in the leaked data table 149. When a plurality of record IDs are found for a base sequence, that is, when the base sequence belongs to unidentifiable records, the record ID can be omitted for that base sequence, as in the frequency distribution table 143.

漏洩データテーブル１４９に出現する識別可能レコードと対応表１４６から、漏洩元を推定することができる。例えば、漏洩データテーブル１４９には識別可能レコードとしてレコード＃２，＃４，＃５，＃８が出現している。提供先識別子Ｘに対応する削除パターンにはレコード＃４，＃５が含まれる。よって、提供先識別子Ｘが示す提供先は漏洩元である可能性が低い。同様に、提供先識別子Ｚに対応する削除パターンにはレコード＃２，＃８が含まれる。よって、提供先識別子Ｚが示す提供先は漏洩元である可能性が低い。一方、提供先識別子Ｙに対応する削除パターンにはレコード＃２，＃４，＃５，＃８の何れも含まれていない。よって、提供先識別子Ｙが示す提供先はこれらの識別可能レコードを漏洩させることが可能であり、漏洩元である可能性がある。 The leak source can be estimated from the identifiable records appearing in the leaked data table 149 and the correspondence table 146. For example, records #2, #4, #5, and #8 appear as identifiable records in the leaked data table 149. The deletion pattern corresponding to the destination identifier X includes records #4 and #5, which that destination never received. Therefore, the destination indicated by the destination identifier X is unlikely to be the leak source. Similarly, the deletion pattern corresponding to the destination identifier Z includes records #2 and #8. Therefore, the destination indicated by the destination identifier Z is unlikely to be the leak source. On the other hand, the deletion pattern corresponding to the destination identifier Y includes none of the records #2, #4, #5, and #8. Therefore, the destination indicated by the destination identifier Y was able to leak these identifiable records and may be the leak source.

このように、情報処理装置１００は漏洩データ自体から漏洩元を推定することができる。また、上記の漏洩データは、提供先識別子Ｙが示す提供先が受け取ったレコード＃２，＃４，＃５，＃８，＃９，＃１２のうちの一部のみ含んでいる。しかし、漏洩データが提供データのサブセットであっても漏洩元を推定し得る。 In this way, the information processing apparatus 100 can estimate the leak source from the leaked data itself. The leaked data above contains only some of the records #2, #4, #5, #8, #9, and #12 received by the destination indicated by the destination identifier Y. Even so, the leak source can be estimated when the leaked data is only a subset of the provided data.
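The exclusion reasoning in this example can be sketched as a set operation: a destination stays a candidate only if its deletion pattern shares no record with the leaked data. The function name `estimate_leak_sources` is hypothetical. The text states only that pattern X contains #4 and #5 and that pattern Z contains #2 and #8; the third member of each pattern below (#12 and #9) is an assumption inferred from the order of the identifiable data table, and pattern Y follows from the records that Y is said to have received.

```python
def estimate_leak_sources(leaked_ids, correspondence):
    """Return destinations whose deletion pattern contains no leaked record."""
    leaked = set(leaked_ids)
    return [
        dest
        for dest, pattern in correspondence.items()
        if leaked.isdisjoint(pattern)  # the destination received every leaked record
    ]


# Correspondence table: destination identifier -> deleted record IDs.
# The members #12 (in X) and #9 (in Z) are assumed, not stated in the text.
correspondence = {
    "X": {4, 12, 5},
    "Y": {10, 3, 7},
    "Z": {8, 9, 2},
}

print(estimate_leak_sources([2, 4, 5, 8], correspondence))  # prints ['Y']
```

The same check still isolates Y if the leak contains fewer records, for example only #4 and #8, which is why a partial leak can be attributed.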

次に、情報処理装置１００の処理手順について説明する。
図１３は、元データ分析の手順例を示すフローチャートである。
元データ分析は、１セットの元データに対して１回実行される。 Next, the processing procedure of the information processing apparatus 100 will be described.
FIG. 13 is a flowchart showing an example of the procedure for analyzing the original data.
The source data analysis is performed once for a set of source data.

（Ｓ１０）削除レコード数算出部１３１は、パラメータ記憶部１２２に記憶されたパラメータテーブル１４２から最大提供先数ｐと推定能力ｃを取得する。
（Ｓ１１）削除レコード数算出部１３１は、ステップＳ１０で取得した最大提供先数ｐと推定能力ｃから削除レコード数ｒを決定する。削除レコード数ｒは、例えば、ｒ≧ｌｏｇ_２（（ｐ－１）／（１－ｃ））を満たす最小の整数である。 (S10) The deleted record number calculation unit 131 acquires the maximum number of provided destinations p and the estimation capacity c from the parameter table 142 stored in the parameter storage unit 122.
(S11) The deleted record number calculation unit 131 determines the number of deleted records r from the maximum number of destinations p and the estimation ability c acquired in step S10. The number of deleted records r is, for example, the smallest integer satisfying r ≥ log_2((p − 1) / (1 − c)).
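Step S11 can be sketched as a direct evaluation of the inequality. The function name `deleted_record_count` and the input checks are assumptions, and the values p = 3 and c = 0.75 are illustrative only; they are not taken from the parameter table 142.

```python
import math


def deleted_record_count(p: int, c: float) -> int:
    """Smallest non-negative integer r with r >= log2((p - 1) / (1 - c)).

    p: maximum number of destinations (assumed to be at least 2 here)
    c: desired estimation ability, assumed to lie in [0, 1)
    """
    if p < 2 or not (0.0 <= c < 1.0):
        raise ValueError("expected p >= 2 and 0 <= c < 1")
    return max(0, math.ceil(math.log2((p - 1) / (1 - c))))


# Illustrative values: log2((3 - 1) / (1 - 0.75)) = log2(8) = 3
print(deleted_record_count(p=3, c=0.75))  # prints 3
```

Because the result depends on a floating-point logarithm, inputs whose exact value lands on an integer boundary may need exact arithmetic in a production setting.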

（Ｓ１２）識別可能レコード抽出部１３２は、元データ記憶部１２１に記憶された元データテーブル１４１から度数分布テーブル１４３を生成する。このとき、識別可能レコード抽出部１３２は、元データテーブル１４１に含まれるレコード同士を比較して塩基配列毎の度数をカウントする。また、識別可能レコード抽出部１３２は、度数が１である塩基配列に対しては当該塩基配列を含むレコードのレコードＩＤを付加する。 (S12) The identifiable record extraction unit 132 generates the frequency distribution table 143 from the original data table 141 stored in the original data storage unit 121. At this time, the identifiable record extraction unit 132 compares the records included in the original data table 141 with each other and counts the frequency for each base sequence. Further, the identifiable record extraction unit 132 adds the record ID of the record including the base sequence to the base sequence having a frequency of 1.

（Ｓ１３）識別可能レコード抽出部１３２は、度数分布テーブル１４３から度数が１である塩基配列とレコードＩＤを識別可能レコードとして抽出する。
（Ｓ１４）グルーピング部１３３は、所定のグルーピング基準に従って、ステップＳ１３で抽出された識別可能レコードを複数のグループに分類し、グループテーブル１４４を生成する。例えば、グルーピング部１３３は、塩基配列の先頭の塩基が同一である識別可能レコードを集めてグループを形成する。 (S13) The identifiable record extraction unit 132 extracts the base sequence having a frequency of 1 and the record ID from the frequency distribution table 143 as identifiable records.
(S14) The grouping unit 133 classifies the identifiable records extracted in step S13 into a plurality of groups according to a predetermined grouping standard, and generates a group table 144. For example, the grouping unit 133 collects identifiable records having the same first base in the base sequence to form a group.
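Steps S12 to S14 can be sketched as follows: count the frequency of each base sequence, keep the records whose sequence occurs exactly once, and group those records by the first base. The function name `analyze` and the sample records are hypothetical; they do not reproduce the original data table 141.

```python
from collections import Counter, defaultdict


def analyze(records):
    """records: dict mapping record ID -> base sequence."""
    # S12: frequency of each base sequence
    freq = Counter(records.values())
    # S13: identifiable records are those whose sequence has frequency 1
    identifiable = {rid: seq for rid, seq in records.items() if freq[seq] == 1}
    # S14: group identifiable records by the first base of the sequence
    groups = defaultdict(list)
    for rid, seq in identifiable.items():
        groups[seq[0]].append(rid)
    return freq, identifiable, dict(groups)


# Hypothetical sample data
records = {1: "AAA", 2: "CGT", 3: "AAA", 4: "GTA", 5: "CCA", 6: "AAA"}
freq, identifiable, groups = analyze(records)
print(freq["AAA"])  # prints 3
print(groups)       # prints {'C': [2, 5], 'G': [4]}
```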

（Ｓ１５）グルーピング部１３３は、グループテーブル１４４に識別可能レコードが残っているか、すなわち、ステップＳ１４で生成した複数のグループの少なくとも１つに識別可能レコードが残っているか判断する。グループテーブル１４４に識別可能レコードがある場合はステップＳ１６に処理が進み、無い場合は元データ分析が終了する。 (S15) The grouping unit 133 determines whether an identifiable record remains in the group table 144, that is, whether an identifiable record remains in at least one of the plurality of groups generated in step S14. If there is an identifiable record in the group table 144, the process proceeds to step S16, and if not, the original data analysis ends.

（Ｓ１６）グルーピング部１３３は、ステップＳ１４で生成した複数のグループを残っている識別可能レコードの数の降順にソートする。グルーピング部１３３は、識別可能レコード数が多い順にグループ名をＧ_１，Ｇ_２，…と変更する。 (S16) The grouping unit 133 sorts the plurality of groups generated in step S14 in descending order of the number of remaining identifiable records. The grouping unit 133 renames the groups G_1, G_2, ... in descending order of the number of identifiable records.

（Ｓ１７）グルーピング部１３３は、変数ｉ＝１に初期化する。
（Ｓ１８）グルーピング部１３３は、変数ｉの値がステップＳ１１で決定された削除レコード数ｒ以下であり、かつ、グループＧ_ｉに識別可能レコードが残っているか（グループＧ_ｉが空集合でないか）判断する。この条件を満たす場合、ステップＳ１９に処理が進む。この条件を満たさない場合、すなわち、変数ｉの値が削除レコード数ｒより大きいかまたはグループＧ_ｉが空集合である場合、ステップＳ１５に処理が進む。 (S17) The grouping unit 133 initializes the variable i to 1.
(S18) The grouping unit 133 determines whether the value of the variable i is equal to or less than the number of deleted records r determined in step S11 and an identifiable record remains in the group G_i (that is, whether the group G_i is not an empty set). If this condition is satisfied, the process proceeds to step S19. If this condition is not satisfied, that is, if the value of the variable i is larger than the number of deleted records r or the group G_i is an empty set, the process proceeds to step S15.

（Ｓ１９）グルーピング部１３３は、グループＧ_ｉから識別可能レコードを１つ抽出し、抽出した識別可能レコードを識別可能データテーブル１４５の末尾に登録する。グループＧ_ｉから抽出する識別可能レコードは、例えば、ランダムに選択される。 (S19) The grouping unit 133 extracts one identifiable record from the group G_i and registers the extracted identifiable record at the end of the identifiable data table 145. The identifiable record to be extracted from the group G_i is selected, for example, at random.

（Ｓ２０）グルーピング部１３３は、変数ｉの値を１だけ増加させる。そして、ステップＳ１８に処理が進む。
例えば、図７のグループテーブル１４４の場合、｜Ｇ_Ｃ｜＝４，｜Ｇ_Ｇ｜＝３，｜Ｇ_Ｔ｜＝２であるため、Ｇ_１＝Ｇ_Ｃ，Ｇ_２＝Ｇ_Ｇ，Ｇ_３＝Ｇ_Ｔとソートされる。内側ループにおいてｉ＝１のときにグループＧ_１＝Ｇ_Ｃからレコード＃４が選択され、ｉ＝２のときにグループＧ_２＝Ｇ_Ｇからレコード＃１２が選択され、ｉ＝３のときにグループＧ_３＝Ｇ_Ｔからレコード＃５が選択される。ｉ＝４になるとｉ＞ｒであるため内側ループが打ち切られる。 (S20) The grouping unit 133 increases the value of the variable i by one. Then, the process proceeds to step S18.
For example, in the case of the group table 144 in FIG. 7, since |G_C| = 4, |G_G| = 3, and |G_T| = 2, the groups are sorted as G_1 = G_C, G_2 = G_G, G_3 = G_T. In the inner loop, record #4 is selected from group G_1 = G_C when i = 1, record #12 is selected from group G_2 = G_G when i = 2, and record #5 is selected from group G_3 = G_T when i = 3. When i = 4, the inner loop is terminated because i > r.

次に、｜Ｇ_Ｃ｜＝３，｜Ｇ_Ｇ｜＝２，｜Ｇ_Ｔ｜＝１であるため、Ｇ_１＝Ｇ_Ｃ，Ｇ_２＝Ｇ_Ｇ，Ｇ_３＝Ｇ_Ｔとソートされる。上記と同様に、内側ループにおいてｉ＝１のときにグループＧ_１＝Ｇ_Ｃからレコード＃１０が選択され、ｉ＝２のときにグループＧ_２＝Ｇ_Ｇからレコード＃３が選択され、ｉ＝３のときにグループＧ_３＝Ｇ_Ｔからレコード＃７が選択される。ｉ＝４になるとｉ＞ｒであるため内側ループが打ち切られる。 Next, since |G_C| = 3, |G_G| = 2, and |G_T| = 1, the groups are sorted as G_1 = G_C, G_2 = G_G, G_3 = G_T. As above, in the inner loop, record #10 is selected from group G_1 = G_C when i = 1, record #3 is selected from group G_2 = G_G when i = 2, and record #7 is selected from group G_3 = G_T when i = 3. When i = 4, the inner loop is terminated because i > r.

次に、｜Ｇ_Ｃ｜＝２，｜Ｇ_Ｇ｜＝１，｜Ｇ_Ｔ｜＝０であるため、Ｇ_１＝Ｇ_Ｃ，Ｇ_２＝Ｇ_Ｇ，Ｇ_３＝Ｇ_Ｔとソートされる。内側ループにおいてｉ＝１のときにグループＧ_１＝Ｇ_Ｃからレコード＃８が選択され、ｉ＝２のときにグループＧ_２＝Ｇ_Ｇからレコード＃９が選択される。ｉ＝３になるとグループＧ_３＝Ｇ_Ｔ＝φであるため内側ループが打ち切られる。次に、｜Ｇ_Ｃ｜＝１，｜Ｇ_Ｇ｜＝｜Ｇ_Ｔ｜＝０であるため、Ｇ_１＝Ｇ_Ｃ，Ｇ_２＝Ｇ_Ｇ，Ｇ_３＝Ｇ_Ｔとソートされる。内側ループにおいてｉ＝１のときにグループＧ_１＝Ｇ_Ｃからレコード＃２が選択される。ｉ＝２になるとグループＧ_２＝Ｇ_Ｇ＝φであるため内側ループが打ち切られる。そして、｜Ｇ_Ｃ｜＝｜Ｇ_Ｇ｜＝｜Ｇ_Ｔ｜＝０であるため、元データ分析が終了する。 Next, since |G_C| = 2, |G_G| = 1, and |G_T| = 0, the groups are sorted as G_1 = G_C, G_2 = G_G, G_3 = G_T. In the inner loop, record #8 is selected from group G_1 = G_C when i = 1, and record #9 is selected from group G_2 = G_G when i = 2. When i = 3, the inner loop is terminated because group G_3 = G_T = φ. Next, since |G_C| = 1 and |G_G| = |G_T| = 0, the groups are sorted as G_1 = G_C, G_2 = G_G, G_3 = G_T. In the inner loop, record #2 is selected from group G_1 = G_C when i = 1. When i = 2, the inner loop is terminated because group G_2 = G_G = φ. Finally, since |G_C| = |G_G| = |G_T| = 0, the original data analysis ends.
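The loop of steps S15 to S20 can be sketched as below. The function name `order_identifiable` is hypothetical. Two choices are assumptions made so that the walkthrough above is reproducible: each group is consumed in list order, whereas the text says the record is chosen, for example, at random, and an extra bound check stops the inner loop when r exceeds the number of groups.

```python
def order_identifiable(groups, r):
    """Build the identifiable data table as a list of record IDs.

    groups: dict mapping group name -> list of identifiable record IDs
    r:      number of deleted records per deletion pattern (r >= 1)
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    groups = {name: list(members) for name, members in groups.items()}
    table = []
    while any(groups.values()):                                   # S15
        ranked = sorted(groups.values(), key=len, reverse=True)  # S16
        i = 1                                                     # S17
        while i <= r and i <= len(ranked) and ranked[i - 1]:      # S18
            table.append(ranked[i - 1].pop(0))                    # S19 (list order assumed)
            i += 1                                                # S20
    return table


# Group contents taken from the walkthrough, listed in the order selected there
groups = {"C": [4, 10, 8, 2], "G": [12, 3, 9], "T": [5, 7]}
print(order_identifiable(groups, r=3))
# prints [4, 12, 5, 10, 3, 7, 8, 9, 2]
```

Every run of r consecutive entries in the resulting table draws from as many different groups as possible, which is what lets a leak confined to one group still intersect most deletion patterns.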

図１４は、提供データ生成の手順例を示すフローチャートである。
提供データ生成は、１つの提供先識別子に対して１回実行される。
（Ｓ３０）削除パターン生成部１３４は、提供先識別子ｄを取得する。提供先識別子ｄは、例えば、１セットの提供データを生成したいときにユーザから指定される。 FIG. 14 is a flowchart showing an example of the procedure for generating the provided data.
The provided data generation is executed once for one provided destination identifier.
(S30) The deletion pattern generation unit 134 acquires the destination identifier d. The destination identifier d is specified by the user, for example, when the user wants to generate a set of provided data.

（Ｓ３１）削除パターン生成部１３４は、対応表記憶部１２４に記憶された対応表１４６に提供先識別子ｄが登録されているか判断する。提供先識別子ｄが登録されている場合はステップＳ３２に処理が進み、登録されていない場合はステップＳ３３に処理が進む。 (S31) The deletion pattern generation unit 134 determines whether the destination identifier d is registered in the correspondence table 146 stored in the correspondence table storage unit 124. If the destination identifier d is registered, the process proceeds to step S32; if it is not registered, the process proceeds to step S33.

（Ｓ３２）削除パターン生成部１３４は、対応表１４６から提供先識別子ｄに対応する削除パターンＰを検索する。そして、ステップＳ３５に処理が進む。
（Ｓ３３）削除パターン生成部１３４は、識別可能データテーブル１４５の上から順に削除レコード数ｒだけ識別可能レコードを抽出する。 (S32) The deletion pattern generation unit 134 searches the correspondence table 146 for the deletion pattern P corresponding to the destination identifier d. Then, the process proceeds to step S35.
(S33) The deletion pattern generation unit 134 extracts identifiable records by the number of deleted records r in order from the top of the identifiable data table 145.

（Ｓ３４）削除パターン生成部１３４は、ステップＳ３３で抽出したｒ個の識別可能レコードのレコードＩＤを列挙した削除パターンＰを生成する。削除パターン生成部１３４は、提供先識別子ｄと削除パターンＰを対応付けて対応表１４６に登録する。 (S34) The deletion pattern generation unit 134 generates a deletion pattern P that lists the record IDs of the r identifiable records extracted in step S33. The deletion pattern generation unit 134 registers the destination identifier d and the deletion pattern P in association with each other in the correspondence table 146.
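Steps S31 to S34 can be sketched as a lookup followed by a take-from-the-top operation. The function name `get_deletion_pattern` is hypothetical. The sketch assumes that extracted records are removed from the identifiable data table so that successive destinations receive disjoint patterns; the text says only that r records are extracted in order from the top.

```python
def get_deletion_pattern(dest, r, identifiable_table, correspondence):
    """Return the deletion pattern for a destination, creating it if needed.

    identifiable_table: list of record IDs, consumed from the front (assumption)
    correspondence:     dict mapping destination identifier -> deletion pattern
    """
    if dest in correspondence:                       # S31 -> S32: reuse the pattern
        return correspondence[dest]
    if len(identifiable_table) < r:
        raise ValueError("not enough identifiable records left")
    pattern = [identifiable_table.pop(0) for _ in range(r)]  # S33
    correspondence[dest] = pattern                   # S34: register d with P
    return pattern


table = [4, 12, 5, 10, 3, 7, 8, 9, 2]  # order produced by the walkthrough above
correspondence = {}
for dest in ["X", "Y", "Z", "X"]:
    print(dest, get_deletion_pattern(dest, 3, table, correspondence))
# prints X [4, 12, 5], then Y [10, 3, 7], then Z [8, 9, 2], then X [4, 12, 5]
```

Requesting the same destination twice returns the registered pattern, so regenerated data for one destination always omits the same records.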

（Ｓ３５）レコード削除部１３５は、度数分布テーブル１４３をコピーする。レコード削除部１３５は、ステップＳ３２またはステップＳ３４の削除パターンＰに従って、度数分布テーブル１４３のコピーから一部の識別可能レコードを削除する。すなわち、度数分布テーブル１４３のコピーから、削除パターンＰに列挙されたレコードＩＤをもつ識別可能レコードを削除する。これにより、修正度数分布テーブル１４７が生成される。 (S35) The record deletion unit 135 copies the frequency distribution table 143. The record deletion unit 135 deletes some identifiable records from the copy of the frequency distribution table 143 according to the deletion pattern P in step S32 or step S34. That is, the identifiable record having the record ID listed in the deletion pattern P is deleted from the copy of the frequency distribution table 143. As a result, the modified frequency distribution table 147 is generated.

（Ｓ３６）レコード度数低減部１３６は、修正度数分布テーブル１４７から識別不能レコードに相当する塩基配列、すなわち、度数が１より大きい塩基配列を１つ選択する。
（Ｓ３７）レコード度数低減部１３６は、識別可能レコード数ｎ、削除レコード数ｒおよび選択した塩基配列の度数ｎ_ｔから低減度数ｒ_ｔを決定する。低減度数ｒ_ｔは、例えば、ｒ_ｔ＝ｒ／ｎ×ｎ_ｔによって算出される非負整数である。 (S36) The record frequency reduction unit 136 selects one base sequence corresponding to the unidentifiable record, that is, a base sequence having a frequency greater than 1, from the modified frequency distribution table 147.
(S37) The record frequency reduction unit 136 determines the reduction frequency r_t from the number of identifiable records n, the number of deleted records r, and the frequency n_t of the selected base sequence. The reduction frequency r_t is, for example, a non-negative integer calculated by r_t = r / n × n_t.

（Ｓ３８）レコード度数低減部１３６は、修正度数分布テーブル１４７における選択した塩基配列の度数をｎ_ｔからｎ_ｔ－ｒ_ｔに変更する。
（Ｓ３９）レコード度数低減部１３６は、修正度数分布テーブル１４７から識別不能レコードに相当する塩基配列の全てを選択したか判断する。該当の全ての塩基配列を選択した場合はステップＳ４０に処理が進み、それ以外の場合はステップＳ３６に処理が進む。 (S38) The record frequency reduction unit 136 changes the frequency of the selected base sequence in the modified frequency distribution table 147 from n_t to n_t − r_t.
(S39) The record frequency reduction unit 136 determines whether or not all of the base sequences corresponding to the unidentifiable records have been selected from the modified frequency distribution table 147. If all the relevant base sequences are selected, the process proceeds to step S40, and if not, the process proceeds to step S36.

（Ｓ４０）レコード度数低減部１３６は、修正度数分布テーブル１４７からレコードＩＤの項目を削除することで提供データテーブル１４８を生成する。レコード度数低減部１３６は、提供データテーブル１４８を提供データ記憶部１２３に格納する。 (S40) The record frequency reduction unit 136 generates the provided data table 148 by deleting the item of the record ID from the modified frequency distribution table 147. The record frequency reduction unit 136 stores the provided data table 148 in the provided data storage unit 123.
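Steps S35 to S40 can be sketched as a single pass over the frequency distribution table. The function name `build_provided_table`, the tuple layout, and the sample rows are hypothetical, and rounding r_t down is an assumption because the text gives no rounding rule.

```python
def build_provided_table(freq_table, pattern, n, r):
    """Generate provided data from a frequency distribution table.

    freq_table: list of (base sequence, frequency, record ID or None);
                the record ID is present only when the frequency is 1
    pattern:    record IDs listed in the deletion pattern P
    n, r:       number of identifiable records and number of deleted records
    """
    deleted = set(pattern)
    provided = []
    for seq, freq, record_id in freq_table:
        if record_id is not None and record_id in deleted:
            continue                    # S35: drop records named in the pattern
        if freq > 1:
            freq -= (r * freq) // n     # S36-S38: n_t - r_t, floor assumed
        provided.append((seq, freq))    # S40: the record ID item is removed
    return provided


# Hypothetical table with three identifiable records (n = 3), one deleted (r = 1)
freq_table = [("AAA", 3, None), ("CGT", 1, 2), ("GTA", 1, 4), ("CCA", 1, 5)]
print(build_provided_table(freq_table, pattern=[4], n=3, r=1))
# prints [('AAA', 2), ('CGT', 1), ('CCA', 1)]
```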

図１５は、漏洩元推定の手順例を示すフローチャートである。
漏洩元推定は、１セットの漏洩データに対して１回実行される。
（Ｓ５０）漏洩元推定部１３７は、漏洩データ記憶部１２５に記憶された漏洩データテーブル１４９を取得する。この時点で漏洩データにはレコードＩＤが含まれていない。 FIG. 15 is a flowchart showing an example of a procedure for estimating the leakage source.
The leak source estimation is performed once for one set of leaked data.
(S50) The leakage source estimation unit 137 acquires the leakage data table 149 stored in the leakage data storage unit 125. At this point, the leaked data does not include the record ID.

（Ｓ５１）漏洩元推定部１３７は、対応表記憶部１２４に記憶された対応表１４６を参照して、これまでＤＮＡデータを提供した提供先の集合を特定する。ここで特定される提供先の集合は、初期状態における漏洩元候補の集合となる。 (S51) The leak source estimation unit 137 refers to the correspondence table 146 stored in the correspondence table storage unit 124 and identifies the set of destinations to which DNA data has been provided so far. The set of destinations identified here is the initial set of leak source candidates.

（Ｓ５２）漏洩元推定部１３７は、漏洩データテーブル１４９に含まれる複数の塩基配列の中から１つの塩基配列を選択する。
（Ｓ５３）漏洩元推定部１３７は、選択した塩基配列を含むレコードのレコードＩＤを、元データテーブル１４１または度数分布テーブル１４３から検索する。 (S52) The leak source estimation unit 137 selects one base sequence from the plurality of base sequences included in the leak data table 149.
(S53) The leakage source estimation unit 137 searches the original data table 141 or the frequency distribution table 143 for the record ID of the record including the selected base sequence.

（Ｓ５４）漏洩元推定部１３７は、ステップＳ５３においてレコードＩＤが１つに特定されたか、すなわち、選択した塩基配列が識別可能レコードに相当するか判断する。レコードＩＤが１つに特定された場合はステップＳ５５に処理が進み、レコードＩＤが２以上検索された場合はステップＳ５２に処理が進む。 (S54) The leakage source estimation unit 137 determines whether the record ID is specified as one in step S53, that is, whether the selected base sequence corresponds to the identifiable record. If one record ID is specified, the process proceeds to step S55, and if two or more record IDs are searched, the process proceeds to step S52.

（Ｓ５５）漏洩元推定部１３７は、特定されたレコードＩＤを含む削除パターンを対応表１４６から検索する。漏洩元推定部１３７は、ステップＳ５１で特定した漏洩元候補の集合から、検索された削除パターンに対応する提供先を除外する。 (S55) The leak source estimation unit 137 searches the correspondence table 146 for deletion patterns that include the identified record ID. The leak source estimation unit 137 excludes the destinations corresponding to the deletion patterns found from the set of leak source candidates identified in step S51.

（Ｓ５６）漏洩元推定部１３７は、漏洩データテーブル１４９に含まれる全ての塩基配列を選択したか判断する。全ての塩基配列を選択した場合はステップＳ５７に処理が進み、未選択の塩基配列がある場合はステップＳ５２に処理が進む。 (S56) The leakage source estimation unit 137 determines whether or not all the base sequences included in the leakage data table 149 have been selected. If all the base sequences are selected, the process proceeds to step S57, and if there are unselected base sequences, the process proceeds to step S52.

（Ｓ５７）漏洩元推定部１３７は、ステップＳ５５で除外されずに残った提供先を漏洩元と推定する。好ましくは、この時点で漏洩元候補は１つに絞り込まれている。漏洩元推定部１３７は、漏洩元の提供先識別子など漏洩元を示す漏洩元情報を出力する。例えば、漏洩元推定部１３７は、ディスプレイ１１１に漏洩元情報を表示させる。 (S57) The leak source estimation unit 137 estimates that the destinations remaining without being excluded in step S55 are the leak source. Preferably, the leak source candidates have been narrowed down to one at this point. The leak source estimation unit 137 outputs leak source information indicating the leak source, such as the destination identifier of the leak source. For example, the leak source estimation unit 137 causes the display 111 to display the leak source information.
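The procedure of steps S50 to S57 can be sketched end to end, starting from leaked base sequences that carry no record IDs. The function name `estimate_from_sequences` and the sample data are hypothetical. Skipping a leaked sequence that matches no original record is an assumption; the text covers only the cases of one match and of two or more matches.

```python
def estimate_from_sequences(leaked_sequences, original, correspondence):
    """Estimate leak sources from leaked base sequences.

    original:       dict mapping record ID -> base sequence
    correspondence: dict mapping destination identifier -> deleted record IDs
    """
    ids_by_sequence = {}
    for record_id, seq in original.items():
        ids_by_sequence.setdefault(seq, []).append(record_id)

    candidates = set(correspondence)                     # S51
    for seq in leaked_sequences:                         # S52, S56
        ids = ids_by_sequence.get(seq, [])               # S53
        if len(ids) != 1:                                # S54: not identifiable
            continue
        record_id = ids[0]
        candidates -= {                                  # S55
            dest for dest, pattern in correspondence.items() if record_id in pattern
        }
    return sorted(candidates)                            # S57


# Hypothetical data: "AAA" is shared by two records, so it carries no evidence
original = {1: "AAA", 2: "CGT", 3: "AAA", 4: "GTA", 5: "CCA"}
correspondence = {"X": [4], "Y": [5], "Z": [2]}
print(estimate_from_sequences(["AAA", "CGT", "GTA"], original, correspondence))
# prints ['Y']
```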

第２の実施の形態の情報処理装置１００によれば、複数の組織にデータを提供した場合であっても、漏洩データ自体から漏洩元の組織を推定することができる。よって、漏洩元の組織に再発防止を要求するなどデータの保護を強化することが可能となる。また、提供するデータに識別性を付与する方法として、データから少数のレコードを削除する方法が用いられる。よって、レコードの値にノイズを付加する方法と比べて、レコードの真正性を確保することができ、提供先にとってのデータの価値が低下するのを防止できる。また、空白の長さなどの修飾情報を変化させる方法と比べて、提供先の組織がデータを正規化するなど修飾情報が消えてしまうような操作を行っても、識別性が喪失しない。 According to the information processing apparatus 100 of the second embodiment, even when data has been provided to a plurality of organizations, the organization that is the leak source can be estimated from the leaked data itself. This makes it possible to strengthen data protection, for example by requiring the leak source organization to prevent recurrence. In addition, deleting a small number of records from the data is used as the method of making the provided data distinguishable. Compared with a method that adds noise to record values, this preserves the authenticity of the records and prevents the value of the data to the destination from decreasing. Furthermore, compared with a method that varies formatting information such as the length of whitespace, distinguishability is not lost even if the destination organization performs an operation that erases such formatting information, such as normalizing the data.

また、所望の推定能力から削除レコード数が決定されるため、漏洩元を一意に特定できる確率とデータの有用性とのバランスを調整することが容易となる。また、提供先の組織によるデータの使用方法の観点からレコードが複数のグループに分類され、異なるグループに属するレコードが混在して削除される。よって、提供先の組織が１つのグループに属するレコードのみを使用しており当該１つのグループに属するレコードのみを漏洩させた場合であっても、漏洩元を特定できる可能性が高くなる。また、識別可能レコードの削除割合と整合するように識別不能レコードも削除されるため、削除後のデータが削除前のデータに近い特性を維持することが可能となる。 Further, since the number of deleted records is determined from the desired estimation ability, it becomes easy to adjust the balance between the probability that the leakage source can be uniquely identified and the usefulness of the data. In addition, records are classified into a plurality of groups from the viewpoint of how the data is used by the organization to which the records are provided, and records belonging to different groups are mixedly deleted. Therefore, even if the providing destination organization uses only the records belonging to one group and leaks only the records belonging to the one group, there is a high possibility that the leakage source can be identified. In addition, since the unidentifiable records are also deleted so as to be consistent with the deletion ratio of the identifiable records, it is possible to maintain the characteristics that the deleted data is close to the data before the deletion.

１０データ加工装置
１１記憶部
１２処理部
１３対応情報
１４ａ，１４ｂ，１４ｃ提供先識別子
１５ａ，１５ｂ，１５ｃ削除パターン
１６漏洩データ
１７漏洩元情報
10 Data processing apparatus
11 Storage unit
12 Processing unit
13 Correspondence information
14a, 14b, 14c Destination identifier
15a, 15b, 15c Deletion pattern
16 Leaked data
17 Leak source information

Claims

A data processing apparatus comprising:
a processing unit that generates a plurality of deletion patterns, each indicating records to be deleted from data containing a plurality of records, and generates a plurality of subsets of the data by deleting records contained in the data based on each of the plurality of deletion patterns; and
a storage unit that stores correspondence information in which each of the plurality of deletion patterns is associated with a destination identifier indicating the destination of the subset generated based on that deletion pattern,
wherein, in generating the plurality of deletion patterns, the processing unit accepts the number of destinations and an estimable probability that a leak source can be estimated from leaked data, and determines, based on the number of destinations and the estimable probability, the number of records that each deletion pattern designates for deletion.

The data processing apparatus according to claim 1,
wherein, in generating the plurality of deletion patterns, the processing unit extracts from the data a plurality of identifiable records, each containing a value not shared with any other record, and selects the records that each deletion pattern designates for deletion from among the plurality of identifiable records.

The data processing apparatus according to claim 2,
wherein, in generating the plurality of subsets, the processing unit additionally deletes some of a plurality of unidentifiable records, which are the records other than the plurality of identifiable records among the plurality of records, in accordance with the number of records designated for deletion.

The data processing apparatus according to claim 1,
wherein, in generating the plurality of deletion patterns, the processing unit classifies at least some of the plurality of records into a plurality of groups based on a predetermined criterion, and mixes records belonging to different groups among the two or more records that each deletion pattern designates for deletion.

A data processing apparatus comprising:
a processing unit that generates a plurality of deletion patterns, each indicating records to be deleted from data containing a plurality of records, and generates a plurality of subsets of the data by deleting records contained in the data based on each of the plurality of deletion patterns; and
a storage unit that stores correspondence information in which each of the plurality of deletion patterns is associated with a destination identifier indicating the destination of the subset generated based on that deletion pattern,
wherein the processing unit acquires leaked data, searches the plurality of deletion patterns indicated by the correspondence information for a deletion pattern that does not designate for deletion any record included in the leaked data, and extracts the destination identifier corresponding to the deletion pattern found.

A data processing method executed by a computer, the method comprising:
generating a plurality of deletion patterns, each indicating records to be deleted from data containing a plurality of records;
generating a plurality of subsets of the data by deleting records contained in the data based on each of the plurality of deletion patterns; and
storing, in a storage unit, correspondence information in which each of the plurality of deletion patterns is associated with a destination identifier indicating the destination of the subset generated based on that deletion pattern,
wherein the generating of the plurality of deletion patterns includes accepting the number of destinations and an estimable probability that a leak source can be estimated from leaked data, and determining, based on the number of destinations and the estimable probability, the number of records that each deletion pattern designates for deletion.

A data processing program that causes a computer to execute a process comprising:
generating a plurality of deletion patterns, each indicating records to be deleted from data containing a plurality of records;
generating a plurality of subsets of the data by deleting records contained in the data based on each of the plurality of deletion patterns; and
storing, in a storage unit, correspondence information in which each of the plurality of deletion patterns is associated with a destination identifier indicating the destination of the subset generated based on that deletion pattern,
wherein the generating of the plurality of deletion patterns includes accepting the number of destinations and an estimable probability that a leak source can be estimated from leaked data, and determining, based on the number of destinations and the estimable probability, the number of records that each deletion pattern designates for deletion.