WO2024079795A1 - Risk calculation device, risk calculation method, and risk calculation program - Google Patents
Risk calculation device, risk calculation method, and risk calculation program
- Publication number
- WO2024079795A1 (PCT/JP2022/037926)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample data
- risk calculation
- shadow models
- losses
- unit
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- the present invention relates to a risk calculation device, a risk calculation method, and a risk calculation program for calculating the privacy risk of a machine learning model.
- Machine learning technologies such as Deep Neural Networks (DNNs) have been pointed out as posing privacy risks due to their tendency to memorize training data. Specifically, it has been shown that it is possible to infer from the output of a trained model whether or not specific data was included in the training data. Therefore, consideration must be given to privacy risks when handling data that users do not want others to know, such as medical data or web browsing history.
- conventionally, a method has been proposed in which an attack is performed on a trained model to determine whether or not certain data is included in the training data, and the privacy risk is calculated based on the degree to which the attack succeeds (see Non-Patent Documents 1, 2, and 3).
- the present invention aims to solve the above problems and to calculate the privacy risk of a machine learning model in a realistic setting correctly, without underestimating it.
- the present invention is characterized by comprising a first model construction unit that constructs multiple first shadow models using a predetermined dataset, a first selection unit that selects, from sample data, the sample data for which the average loss of the multiple first shadow models is equal to or greater than a predetermined value and the variance is equal to or less than a predetermined value, a second model construction unit that constructs multiple second shadow models using the selected sample data and the dataset, a distance calculation unit that calculates the distance between the distribution of losses of the multiple first shadow models and the distribution of losses of the multiple second shadow models for the sample data, a second selection unit that selects the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating the privacy risk of a machine learning model, and an output processing unit that outputs the selected sample data.
- the present invention makes it possible to calculate the privacy risk of a machine learning model in a realistic setting without underestimating it (that is, correctly).
- FIG. 1 is a diagram for explaining an overview of a risk calculation device.
- FIG. 2 is a diagram illustrating an example of the configuration of the risk calculation device.
- FIG. 3 is a flowchart illustrating an example of a processing procedure executed by the risk calculation device.
- FIG. 4 is a flowchart showing an example of calculating a privacy risk using the risk calculation device.
- FIG. 5 is a diagram illustrating a computer that executes a risk calculation program.
- the risk calculation device selects sample data (target samples) used for calculating a privacy risk of a trained model (machine learning model) in the following manner.
- the risk calculation device constructs multiple shadow models (first shadow models) using a previously prepared data set.
- the risk calculation device inputs sample data (data separate from the data set) to each of the constructed multiple first shadow models.
- the risk calculation device calculates the loss of each of the constructed multiple first shadow models, and selects sample data in which the average of the calculated losses is sufficiently large and the variance is small (selection 1).
- the risk calculation device constructs multiple shadow models (second shadow models) using the data obtained by adding the sample data selected in selection 1 to the above data set. The risk calculation device then calculates the loss of each of the multiple second shadow models for the sample data.
- the risk calculation device selects sample data in which the difference between the distribution of losses of each of the multiple first shadow models and the distribution of losses of each of the multiple second shadow models becomes large as a target sample (selection 2).
- sample data in which the average loss of multiple shadow models is sufficiently large and the variance is small can be considered to be sample data with a high privacy risk.
- if adding the sample data to the training data of a shadow model significantly changes the distribution of the model's losses, that sample data can be considered easy to infer, and therefore data with a high privacy risk.
- the risk calculation device therefore first selects, as target sample candidates, sample data for which the average loss of the multiple shadow models is sufficiently large and the variance is small (Selection 1). Next, the risk calculation device selects, from among the target sample candidates, the samples that, when added to the training data of the shadow models, significantly change the distribution of the shadow models' losses (Selection 2).
- in this way, the risk calculation device can select sample data with a high privacy risk as target samples. The risk calculation device can then calculate the privacy risk of the machine learning model using the target samples, and thus avoid underestimating (i.e., correctly calculate) the privacy risk.
- the risk calculation device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.
- the input/output unit 11 is an interface that handles the input and output of various data.
- the input/output unit 11 accepts, for example, input of a dataset and sample data.
- the dataset is a dataset used to construct the first shadow model and the second shadow model.
- the sample data is candidate data (target sample) used to calculate the privacy risk of the machine learning model.
- the storage unit 12 stores data, programs, etc. that are referenced when the control unit 13 executes various processes.
- the storage unit 12 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.
- the storage unit 12 stores a data set, sample data, etc. input via the input/output unit 11. Also, for example, the storage unit 12 stores parameters of a first shadow model and parameters of a second shadow model constructed by the control unit 13, etc.
- the control unit 13 is responsible for controlling the entire risk calculation device 10.
- the functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the storage unit 12.
- the control unit 13 includes, for example, a first model construction unit 131, a first selection unit 132, a second model construction unit 133, a distance calculation unit 134, a second selection unit 135, a risk calculation unit 136, and an output processing unit 137.
- a label manipulation unit 138 and a noise addition unit 139 shown by dashed lines may or may not be provided, and the cases in which they are provided will be described later.
- the first model construction unit 131 constructs multiple first shadow models using the above dataset as training data. Note that the settings used by the first model construction unit 131 when constructing the first shadow models are the same as the settings of the machine learning model for which privacy risks are to be calculated.
- the first selection unit 132 inputs sample data to the multiple first shadow models constructed by the first model construction unit 131, and selects sample data for which the average loss of the multiple first shadow models is sufficiently large and the variance is small.
- the first selection unit 132 first calculates the loss of each of the first shadow models for the sample data. Next, the first selection unit 132 calculates the average and variance of the calculated losses. Then, the first selection unit 132 selects sample data with a small score s_out,i, given by formula (1).
- in formula (1), μ_out,i is the average of the losses of the multiple first shadow models for the sample data (x_i, y_i), σ²_out,i is the variance of those losses, and k is a sufficiently large loss constant (for example, around k = 20).
- the first selection unit 132 may instead select sample data for which the average loss of the multiple first shadow models is large and the variance is small without using the constant k; in this case, the first selection unit 132 selects sample data for which s_out,i given by formula (2) is small.
- the first selection unit 132 selects a predetermined number (for example, about 100 to 1000) of sample data in ascending order of s_out,i calculated by formula (1) or (2).
- the first selection unit 132 may also select sample data whose s_out,i is equal to or smaller than a predetermined value.
- the first selection unit 132 may also select sample data for which the average loss of multiple first shadow models for the sample data is equal to or greater than a predetermined value and the variance of the loss is equal to or less than a predetermined value.
- the second model construction unit 133 constructs multiple second shadow models using the sample data selected by the first selection unit 132 and the dataset (the dataset used to construct the first shadow model) as training data. Note that the settings used by the second model construction unit 133 when constructing the second shadow model are the same as the settings of the machine learning model for which privacy risk is calculated.
- the distance calculation unit 134 calculates the degree to which the distribution of losses of the multiple first shadow models differs from the distribution of losses of the multiple second shadow models for the sample data.
- the distance calculation unit 134 calculates, based on equation (3), the distance d_i between the distribution of losses of the multiple first shadow models and the distribution of losses of the multiple second shadow models for the sample data selected by the first selection unit 132.
- in equation (3), μ_in,i is the average of the losses of the multiple second shadow models for the sample data (x_i, y_i), and σ²_in,i is the variance of those losses.
- the second selection unit 135 selects sample data having a large distance d_i between the distribution of losses of the multiple first shadow models and the distribution of losses of the multiple second shadow models, as calculated by the distance calculation unit 134.
- the second selection unit 135 selects a predetermined number (for example, about 1 to 100) of sample data in descending order of the distance d_i.
- the second selection unit 135 may also select sample data whose distance d_i is equal to or greater than a predetermined value.
- the risk calculation unit 136 calculates the privacy risk of the machine learning model using the sample data (target sample) selected by the second selection unit 135.
- data (D_1) including the target sample and data (D_0) not including the target sample are prepared.
- the risk calculation unit 136 repeats, 1000 times, a game based on formula (4) of guessing whether the data (D_1) including the target sample was used to train the machine learning model.
- the risk calculation unit 136 then calculates the privacy risk of the machine learning model from the accuracy rate of the game.
- the output processing unit 137 outputs the processing result by the control unit 13. For example, the output processing unit 137 outputs the calculation result of the privacy risk of the machine learning model by the risk calculation unit 136.
- the calculation of the privacy risk using the target sample selected by the second selection unit 135 may be performed by a device (external device) other than the risk calculation device 10.
- the risk calculation device 10 outputs the target sample selected by the second selection unit 135 to the external device. Then, the external device calculates the privacy risk using the target sample.
- the first model construction unit 131 of the risk calculation device 10 constructs a plurality of first shadow models using a data set (S1).
- the first selection unit 132 selects sample data based on the average and variance of the losses of the multiple first shadow models constructed in S1 (S2). For example, the first selection unit 132 selects, from the sample data, the sample data for which the average of the losses of the multiple first shadow models is sufficiently large and the variance is small.
- the second model construction unit 133 constructs multiple second shadow models using the sample data selected in S2 and the dataset (the dataset used to construct the first shadow model) (S3).
- the distance calculation unit 134 calculates the distance between the distribution of losses of the multiple first shadow models constructed in S1 and the distribution of losses of the multiple second shadow models constructed in S3 (S4).
- the second selection unit 135 selects sample data (target samples) whose distance calculated in S4 is equal to or greater than a predetermined threshold (S5).
- the risk calculation unit 136 uses the sample data (target samples) selected in S5 to calculate the privacy risk of the machine learning model (S6).
- the output processing unit 137 then outputs the calculation result of the privacy risk obtained in S6 (S7).
- the risk calculation device 10 can thereby calculate the privacy risk of the machine learning model in a realistic setting without underestimating it (i.e., correctly).
- the risk calculation device 10 may further include a label manipulation unit 138 (see FIG. 2).
- the label manipulation unit 138 creates new sample data by performing label manipulation on the sample data.
- the label operation unit 138 creates new sample data (x, y_p) from sample data (x, y), where y is the label of x and y_p ≠ y.
- when creating the new sample data (x, y_p), the label operation unit 138 may create sample data for all classes other than the correct label y. For example, if the shadow model classifies input data into 10 classes, the label operation unit 138 may create sample data to which each of the nine labels other than the correct label y is assigned as y_p.
- the first selection unit 132 selects sample data in which the average loss of the multiple first shadow models is sufficiently large and the variance is small from the sample data to which new sample data has been added by the label operation unit 138.
- the second selection unit 135 selects a target sample from the sample data selected by the first selection unit 132.
- the risk calculation device 10 may further include a noise addition unit 139 (see FIG. 2).
- the noise addition unit 139 creates sample data that is vulnerable to attacks by adding noise to the sample data.
- the noise addition unit 139 adds noise to the sample data such that the average loss of the multiple first shadow models is sufficiently large and the variance is small. For example, the noise addition unit 139 optimizes the noise so that the losses shown in the following formulas (5) and (6) are small. Then, the noise addition unit 139 adds the optimized noise to the sample data.
- the noise addition unit 139 may also add a condition that the added noises are not similar to each other. For example, the noise addition unit 139 adds noise such that the loss L_sim given by formula (7) is small. Note that x_i^t in formula (7) is the sample data to which noise has been added at step t of the noise optimization.
- the noise addition unit 139 creates sample data that is vulnerable to attacks, and the risk calculation device 10 can select target samples with a higher privacy risk. This allows the risk calculation device 10 to more accurately calculate the privacy risk of a machine learning model in a realistic setting.
- the administrator of the risk calculation device 10 designs the neural network (NN) whose privacy risk is to be calculated (S11).
- for example, the NN is trained to satisfy differential privacy using the optimization method DP-SGD (Differentially Private Stochastic Gradient Descent).
- the risk calculation device 10 then selects target samples from the sample data (S12), and calculates the privacy risk using the target samples selected in S12 (S13). For example, the risk calculation device 10 repeats the conventional privacy-risk game 1,000 times using the target samples selected in S12, and calculates the privacy risk of the machine learning model from the accuracy rate.
- each component of each part shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure.
- the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or a part of it can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc.
- each processing function performed by each device can be realized in whole or in any part by a CPU and a program executed by the CPU, or can be realized as hardware using wired logic.
- the risk calculation device 10 can be implemented by installing a program (risk calculation program) as package software or online software on a desired computer. For example, by executing the above program on an information processing device, the information processing device can function as the risk calculation device 10.
- the information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone Systems), as well as terminals such as PDAs (Personal Digital Assistants).
- FIG. 5 is a diagram showing an example of a computer that executes a risk calculation program.
- the computer 1000 has, for example, a memory 1010 and a CPU 1020.
- the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.
- the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012.
- the ROM 1011 stores a boot program such as a BIOS (Basic Input Output System).
- the hard disk drive interface 1030 is connected to a hard disk drive 1090.
- the disk drive interface 1040 is connected to a disk drive 1100.
- a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100.
- the serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example.
- the video adapter 1060 is connected to a display 1130, for example.
- the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define each process executed by the risk calculation device 10 are implemented as program modules 1093 in which computer-executable code is written.
- the program modules 1093 are stored, for example, in the hard disk drive 1090.
- a program module 1093 for executing processes similar to the functional configuration of the risk calculation device 10 is stored in the hard disk drive 1090.
- the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
- the data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.
- the program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.
Abstract
This risk calculation device constructs a plurality of first shadow models using a predetermined dataset. Then, the risk calculation device selects, from a set of sample data, pieces of sample data for which the average loss of the plurality of first shadow models is sufficiently large and the variance thereof is small. Subsequently, the risk calculation device constructs a plurality of second shadow models using the selected pieces of sample data and the dataset. Next, the risk calculation device calculates, for the pieces of sample data, a distance between the distribution of losses of the plurality of first shadow models and the distribution of losses of the plurality of second shadow models. Then, the risk calculation device selects sample data for which the distance is equal to or larger than a predetermined threshold value as sample data (target sample) to be used for calculating a privacy risk of a machine learning model.
Description
The present invention relates to a risk calculation device, a risk calculation method, and a risk calculation program for calculating the privacy risk of a machine learning model.
Machine learning technologies such as Deep Neural Networks (DNNs) have been pointed out as posing privacy risks due to their tendency to memorize training data. Specifically, it has been shown that it is possible to infer from the output of a trained model whether or not specific data was included in the training data. Therefore, consideration must be given to privacy risks when handling data that users do not want others to know, such as medical data or web browsing history.
Conventionally, a method has been proposed in which an attack is performed on a trained model to determine whether or not certain data is included in the training data, and the privacy risk is calculated based on the degree to which the attack succeeds (see Non-Patent Documents 1, 2, and 3).
However, conventional privacy risk calculation methods are based only on weak or unrealistic attacks, and therefore tend to underestimate the privacy risk in realistic settings. Calculating the privacy risk based on a weak attack means, for example, calculating the average privacy risk using randomly selected target samples. Unrealistic attacks include, for example, attacks that assume the attacker has access to the model during training, or attacks that assume the attacker can manipulate the training data.
The present invention therefore aims to solve the above problems and to calculate the privacy risk of a machine learning model in a realistic setting correctly, without underestimating it.
In order to solve the above-mentioned problems, the present invention is characterized by comprising a first model construction unit that constructs multiple first shadow models using a predetermined dataset, a first selection unit that selects, from sample data, the sample data for which the average loss of the multiple first shadow models is equal to or greater than a predetermined value and the variance is equal to or less than a predetermined value, a second model construction unit that constructs multiple second shadow models using the selected sample data and the dataset, a distance calculation unit that calculates the distance between the distribution of losses of the multiple first shadow models and the distribution of losses of the multiple second shadow models for the sample data, a second selection unit that selects the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating the privacy risk of a machine learning model, and an output processing unit that outputs the selected sample data.
The present invention makes it possible to calculate the privacy risk of a machine learning model in a realistic setting without underestimating it (that is, correctly).
Below, an embodiment for carrying out the present invention will be described with reference to the drawings. The present invention is not limited to this embodiment.
[Overview]
First, an overview of the risk calculation device of this embodiment will be described with reference to Fig. 1. The risk calculation device selects the sample data (target samples) used for calculating the privacy risk of a trained model (machine learning model) in the following manner.
First, the risk calculation device constructs multiple shadow models (first shadow models) using a previously prepared dataset. Next, the risk calculation device inputs sample data (data separate from the dataset) to each of the constructed first shadow models. Then, the risk calculation device calculates the loss of each of the first shadow models and selects the sample data for which the average of the calculated losses is sufficiently large and the variance is small (Selection 1).
Then, the risk calculation device constructs multiple shadow models (second shadow models) using the data obtained by adding the sample data selected in Selection 1 to the above dataset. The risk calculation device then calculates the loss of each of the second shadow models for the sample data.
After that, the risk calculation device selects, as target samples, the sample data for which the difference between the distribution of losses of the first shadow models and the distribution of losses of the second shadow models becomes large (Selection 2).
In other words, sample data for which the average loss of the multiple shadow models is sufficiently large and the variance is small can be considered sample data with a high privacy risk. Furthermore, if adding sample data to the training data of a shadow model significantly changes the distribution of losses of that model, the sample data can be considered easy to infer; that is, data with a high privacy risk.
The risk calculation device therefore first selects, as target sample candidates, sample data with a sufficiently large average loss over the multiple shadow models and a small variance (Selection 1). Next, the risk calculation device selects, from among the target sample candidates, the samples that, when added to the training data of the shadow models, significantly change the distribution of the shadow models' losses (Selection 2).
In this way, the risk calculation device can select sample data with a high privacy risk as target samples. The risk calculation device can then calculate the privacy risk of the machine learning model using the target samples, and thus calculate the privacy risk without underestimating it (i.e., correctly).
[Configuration example]
Next, a configuration example of the risk calculation device 10 will be described with reference to Fig. 2. The risk calculation device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.
The input/output unit 11 is an interface that handles the input and output of various data. The input/output unit 11 accepts, for example, input of a dataset and sample data. The dataset is a dataset used to construct the first shadow model and the second shadow model. The sample data is candidate data (target sample) used to calculate the privacy risk of the machine learning model.
The storage unit 12 stores data, programs, etc. that are referenced when the control unit 13 executes various processes. The storage unit 12 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.
For example, the storage unit 12 stores a data set, sample data, etc. input via the input/output unit 11. Also, for example, the storage unit 12 stores parameters of a first shadow model and parameters of a second shadow model constructed by the control unit 13, etc.
The control unit 13 is responsible for controlling the entire risk calculation device 10. The functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the storage unit 12.
The control unit 13 includes, for example, a first model construction unit 131, a first selection unit 132, a second model construction unit 133, a distance calculation unit 134, a second selection unit 135, a risk calculation unit 136, and an output processing unit 137. Note that a label manipulation unit 138 and a noise addition unit 139 shown by dashed lines may or may not be provided, and the cases in which they are provided will be described later.
The first model construction unit 131 constructs multiple first shadow models using the above dataset as training data. Note that the settings used by the first model construction unit 131 when constructing the first shadow models are the same as the settings of the machine learning model for which privacy risks are to be calculated.
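As a concrete illustration of this step, the following sketch trains several shadow models on resamples of the dataset. The patent does not fix the model family or the resampling scheme; scikit-learn's LogisticRegression and bootstrap resampling are stand-ins used here only so the example runs, whereas in practice the shadow models would use the same architecture and training settings as the target model, as stated above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_shadow_models(dataset_x, dataset_y, n_models=16, seed=0):
    """Train multiple shadow models on bootstrap resamples of the dataset.

    Illustrative sketch: the model family (LogisticRegression) and the
    bootstrap resampling are assumptions, not prescribed by the text.
    """
    rng = np.random.default_rng(seed)
    models = []
    n = len(dataset_x)
    for _ in range(n_models):
        idx = rng.choice(n, size=n, replace=True)  # bootstrap resample
        model = LogisticRegression(max_iter=1000)
        model.fit(dataset_x[idx], dataset_y[idx])
        models.append(model)
    return models
```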
The first selection unit 132 inputs sample data to the multiple first shadow models constructed by the first model construction unit 131, and selects sample data for which the average loss of the multiple first shadow models is sufficiently large and the variance is small.
For example, the first selection unit 132 first calculates the loss of each of the first shadow models for the sample data. Next, the first selection unit 132 calculates the average and the variance of the calculated losses. Then, the first selection unit 132 selects sample data with a small score s_out,i given by formula (1).
In formula (1), μ_out,i is the average of the losses of the multiple first shadow models for the sample data (x_i, y_i), and σ²_out,i is the variance of those losses. k is a sufficiently large loss constant, for example, around k = 20. The value of k may be determined, for example, based on an analysis of samples that are vulnerable to attack.
Alternatively, the first selection unit 132 may select sample data for which the average loss of the multiple first shadow models is large and the variance is small without using the constant k. In this case, the first selection unit 132 selects sample data for which s_out,i given by formula (2) is small.
For example, the first selection unit 132 selects a predetermined number (for example, about 100 to 1000) of sample data in ascending order of s_out,i calculated by formula (1) or (2). The first selection unit 132 may also select sample data whose s_out,i is equal to or smaller than a predetermined value.
The first selection unit 132 may also select sample data for which the average loss of multiple first shadow models for the sample data is equal to or greater than a predetermined value and the variance of the loss is equal to or less than a predetermined value.
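The following sketch illustrates Selection 1. Formulas (1) and (2) appear as images in the published application and are not reproduced in this text, so the concrete score below, (k - mean loss)² + variance, is only an assumption consistent with the prose (average loss close to a large constant k, small variance); the shadow_losses helper and the use of scikit-learn's log_loss for the cross-entropy loss are likewise illustrative choices.

```python
import numpy as np
from sklearn.metrics import log_loss

def shadow_losses(models, x, y, labels):
    """Per-shadow-model cross-entropy loss for a single sample (x, y)."""
    return np.array([
        log_loss([y], model.predict_proba([x]), labels=labels)
        for model in models
    ])

def select_candidates(models, samples_x, samples_y, labels, k=20.0, n_select=100):
    """Selection 1: keep samples whose average loss over the first shadow
    models is sufficiently large and whose variance is small.

    The exact form of s_out,i is not given here; (k - mean)^2 + variance
    is one assumption that matches the description.
    """
    scores = []
    for x, y in zip(samples_x, samples_y):
        losses = shadow_losses(models, x, y, labels)
        s_out = (k - losses.mean()) ** 2 + losses.var()
        scores.append(s_out)
    order = np.argsort(scores)            # ascending: small s_out,i first
    return order[:n_select]
```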
The second model construction unit 133 constructs multiple second shadow models using the sample data selected by the first selection unit 132 and the dataset (the dataset used to construct the first shadow model) as training data. Note that the settings used by the second model construction unit 133 when constructing the second shadow model are the same as the settings of the machine learning model for which privacy risk is calculated.
The distance calculation unit 134 calculates the degree to which the distribution of losses of the multiple first shadow models differs from the distribution of losses of the multiple second shadow models for the sample data.
For example, the distance calculation unit 134 calculates, based on equation (3), the distance d_i between the distribution of losses of the multiple first shadow models and the distribution of losses of the multiple second shadow models for the sample data selected by the first selection unit 132.
In equation (3), μ_in,i is the average of the losses of the multiple second shadow models for the sample data (x_i, y_i), and σ²_in,i is the variance of those losses.
The second selection unit 135 selects sample data having a large distance d_i between the distribution of losses of the multiple first shadow models and the distribution of losses of the multiple second shadow models, as calculated by the distance calculation unit 134.
For example, the second selection unit 135 selects a predetermined number (for example, about 1 to 100) of sample data in descending order of the distance d_i. Alternatively, the second selection unit 135 may select sample data whose distance d_i is equal to or greater than a predetermined value.
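A sketch of the distance calculation and Selection 2, reusing the shadow_losses helper above. Equation (3) is not reproduced in this text, so the symmetrized KL divergence between Gaussian fits of the two loss distributions is only one plausible stand-in; the second shadow models are assumed to have been built with build_shadow_models on the dataset with the Selection 1 candidates appended.

```python
import numpy as np

def loss_distribution_distance(mu_out, var_out, mu_in, var_in, eps=1e-12):
    """Distance d_i between the two loss distributions for one sample.

    A symmetrized KL divergence between Gaussian fits N(mu_out, var_out)
    and N(mu_in, var_in) is used here as an assumed stand-in for equation (3).
    """
    var_out = var_out + eps
    var_in = var_in + eps
    kl_oi = 0.5 * (var_out / var_in + (mu_in - mu_out) ** 2 / var_in
                   + np.log(var_in / var_out) - 1.0)
    kl_io = 0.5 * (var_in / var_out + (mu_out - mu_in) ** 2 / var_out
                   + np.log(var_out / var_in) - 1.0)
    return kl_oi + kl_io

def select_target_samples(first_models, second_models, cand_x, cand_y, labels, n_select=10):
    """Selection 2: keep candidates whose loss distribution shifts most when
    the candidates are added to the shadow-model training data."""
    distances = []
    for x, y in zip(cand_x, cand_y):
        out_losses = shadow_losses(first_models, x, y, labels)
        in_losses = shadow_losses(second_models, x, y, labels)
        d = loss_distribution_distance(out_losses.mean(), out_losses.var(),
                                       in_losses.mean(), in_losses.var())
        distances.append(d)
    order = np.argsort(distances)[::-1]   # descending: largest shift first
    return order[:n_select]
```

For example, second_models could be obtained by calling build_shadow_models on the concatenation of the dataset and the Selection 1 candidates, mirroring the second model construction unit described above.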
The risk calculation unit 136 calculates the privacy risk of the machine learning model using the sample data (target sample) selected by the second selection unit 135.
For example, data (D_1) including the target sample and data (D_0) not including the target sample are prepared. The risk calculation unit 136 repeats, 1000 times, a game based on formula (4) of guessing whether the data (D_1) including the target sample was used to train the machine learning model. Then, the risk calculation unit 136 calculates the privacy risk of the machine learning model from the accuracy rate of the game.
The output processing unit 137 outputs the processing result by the control unit 13. For example, the output processing unit 137 outputs the calculation result of the privacy risk of the machine learning model by the risk calculation unit 136.
With this risk calculation device 10, the privacy risk of the machine learning model can be calculated without being underestimated (i.e., correctly).
Note that the calculation of the privacy risk using the target sample selected by the second selection unit 135 may be performed by a device (external device) other than the risk calculation device 10. In this case, the risk calculation device 10 outputs the target sample selected by the second selection unit 135 to the external device. Then, the external device calculates the privacy risk using the target sample.
[Example of processing procedure]
Next, an example of the processing procedure executed by the risk calculation device 10 will be described with reference to Fig. 3. First, the first model construction unit 131 of the risk calculation device 10 constructs a plurality of first shadow models using the dataset (S1).
After S1, the first selection unit 132 selects sample data based on the average and variance of the losses of the multiple first shadow models constructed in S1 (S2). For example, the first selection unit 132 selects, from the sample data, the sample data for which the average of the losses of the multiple first shadow models is sufficiently large and the variance is small.
After S2, the second model construction unit 133 constructs multiple second shadow models using the sample data selected in S2 and the dataset (the dataset used to construct the first shadow model) (S3).
After S3, the distance calculation unit 134 calculates the distance between the distribution of losses of the multiple first shadow models constructed in S1 and the distribution of losses of the multiple second shadow models constructed in S3 (S4).
After S4, the second selection unit 135 selects sample data (target samples) whose distance calculated in S4 is equal to or greater than a predetermined threshold (S5). The risk calculation unit 136 then uses the sample data (target samples) selected in S5 to calculate the privacy risk of the machine learning model (S6). The output processing unit 137 then outputs the calculation result of the privacy risk obtained in S6 (S7).
By performing the above process, the risk calculation device 10 can calculate the privacy risk of the machine learning model in a realistic setting without underestimating it (i.e., correctly).
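For reference, a minimal driver tying steps S1 to S6 together using the helpers sketched earlier. The fixed loss threshold and the choice to report the maximum risk over the selected target samples are assumptions, not part of the described procedure.

```python
import numpy as np

def calculate_privacy_risk(dataset_x, dataset_y, samples_x, samples_y, labels):
    """End-to-end sketch of S1-S6 using build_shadow_models, select_candidates,
    select_target_samples and membership_game defined above (note: the
    membership game trains many models and is expensive)."""
    first = build_shadow_models(dataset_x, dataset_y)                        # S1
    cand = select_candidates(first, samples_x, samples_y, labels)            # S2
    cand_x, cand_y = samples_x[cand], samples_y[cand]
    second = build_shadow_models(np.vstack([dataset_x, cand_x]),
                                 np.concatenate([dataset_y, cand_y]))        # S3
    targets = select_target_samples(first, second, cand_x, cand_y, labels)   # S4, S5
    risks = [membership_game(dataset_x, dataset_y,                           # S6
                             cand_x[t], cand_y[t], labels, threshold=1.0)
             for t in targets]
    return max(risks)   # report the highest risk found (an assumed convention)
```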
[Other embodiments]
The risk calculation device 10 may further include a label manipulation unit 138 (see FIG. 2). The label manipulation unit 138 creates new sample data by performing label manipulation on the sample data.
For example, the label manipulation unit 138 creates new sample data (x, y_p) from sample data (x, y), where y is the label of x and y_p ≠ y. When creating the new sample data (x, y_p), the label manipulation unit 138 may create sample data for all classes other than the correct label y. For example, if the shadow model classifies input data into 10 classes, the label manipulation unit 138 may create sample data to which each of the nine labels other than the correct label y is assigned as y_p.
Then, the first selection unit 132 selects, from the sample data to which the new sample data has been added by the label manipulation unit 138, the sample data for which the average loss of the multiple first shadow models is sufficiently large and the variance is small. After that, the second selection unit 135 selects target samples from the sample data selected by the first selection unit 132.
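A minimal sketch of this label manipulation: for an n_classes-way shadow model it returns every mislabeled variant (x, y_p) with y_p ≠ y.

```python
def label_manipulated_samples(x, y, n_classes):
    """Create new candidate samples (x, y_p) with every label y_p != y.

    For a 10-class shadow model this yields the nine mislabeled variants
    of x described above.
    """
    return [(x, y_p) for y_p in range(n_classes) if y_p != y]
```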
The risk calculation device 10 may further include a noise addition unit 139 (see FIG. 2). The noise addition unit 139 creates sample data that is vulnerable to attacks by adding noise to the sample data.
For example, the noise addition unit 139 adds noise to the sample data such that the average loss of the multiple first shadow models is sufficiently large and the variance is small. For example, the noise addition unit 139 optimizes the noise so that the losses given by formulas (5) and (6) are small. Then, the noise addition unit 139 adds the optimized noise to the sample data.
The noise addition unit 139 may also add a condition that the added noises are not similar to each other. For example, the noise addition unit 139 adds noise such that the loss L_sim given by formula (7) is small. Note that x_i^t in formula (7) is the sample data to which noise has been added at step t of the noise optimization.
In this way, the noise addition unit 139 creates sample data that is vulnerable to attacks, and the risk calculation device 10 can select target samples with a higher privacy risk. This allows the risk calculation device 10 to more accurately calculate the privacy risk of a machine learning model in a realistic setting.
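A sketch of the noise addition unit, reusing the shadow_losses helper above. Losses (5) to (7) are not reproduced in this text, so the random-search objective below (large mean loss, small variance, low cosine similarity to previously generated noises) and its equal weighting are assumptions; a gradient-based optimizer could be used instead.

```python
import numpy as np

def add_adversarial_noise(models, x, y, labels, existing_noises,
                          n_steps=200, scale=0.05, seed=0):
    """Random-search sketch of the noise addition unit.

    Keeps the candidate noise with the best ad-hoc score: mean loss minus
    loss variance minus maximum cosine similarity to earlier noises.
    Returns the noised sample and the noise (append it to existing_noises).
    """
    rng = np.random.default_rng(seed)
    best_noise = np.zeros_like(x)
    best_score = -np.inf
    for _ in range(n_steps):
        noise = rng.normal(0.0, scale, size=x.shape)
        losses = shadow_losses(models, x + noise, y, labels)
        sim = max((abs(np.dot(noise, n_prev))
                   / (np.linalg.norm(noise) * np.linalg.norm(n_prev) + 1e-12)
                   for n_prev in existing_noises), default=0.0)
        score = losses.mean() - losses.var() - sim   # ad-hoc equal weighting
        if score > best_score:
            best_score, best_noise = score, noise
    return x + best_noise, best_noise
```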
[Application example]
Next, an application example of the risk calculation device 10 will be described with reference to Fig. 4. For example, the administrator of the risk calculation device 10 designs the neural network (NN) whose privacy risk is to be calculated (S11). For example, the NN is trained to satisfy differential privacy using the optimization method DP-SGD (Differentially Private Stochastic Gradient Descent).
Next, the risk calculation device 10 selects target samples from the sample data (S12). The risk calculation device 10 then calculates the privacy risk using the target samples selected in S12 (S13). For example, the risk calculation device 10 repeats the conventional privacy-risk game 1,000 times using the target samples selected in S12 and calculates the privacy risk of the machine learning model from the accuracy rate.
[System configuration, etc.]
Each component of each device shown in the figures is functional and conceptual, and does not necessarily have to be physically configured as shown. In other words, the specific form of distribution and integration of each device is not limited to that shown in the figures, and all or part of it can be functionally or physically distributed or integrated in arbitrary units depending on various loads, usage conditions, and the like. Furthermore, all or any part of each processing function performed by each device can be realized by a CPU and a program executed by the CPU, or as hardware using wired logic.
Furthermore, among the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically using known methods. In addition, the information including the processing procedures, control procedures, specific names, various data and parameters shown in the above documents and drawings can be changed as desired unless otherwise specified.
[Program]
The risk calculation device 10 described above can be implemented by installing a program (risk calculation program) as packaged software or online software on a desired computer. For example, by causing an information processing device to execute the above program, the information processing device can function as the risk calculation device 10. The information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as terminals such as PDAs (Personal Digital Assistants).
FIG. 5 is a diagram showing an example of a computer that executes a risk calculation program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to a display 1130, for example.
The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process executed by the risk calculation device 10 is implemented as a program module 1093 in which computer-executable code is written. The program module 1093 is stored, for example, in the hard disk drive 1090; for example, a program module 1093 for executing the same processing as the functional configuration of the risk calculation device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
The data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 from the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes them.
The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; they may instead be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network) or a WAN (Wide Area Network)) and read by the CPU 1020 from that other computer via the network interface 1070.
REFERENCE SIGNS LIST
10 Risk calculation device
11 Input/output unit
12 Storage unit
13 Control unit
131 First model construction unit
132 First selection unit
133 Second model construction unit
134 Distance calculation unit
135 Second selection unit
136 Risk calculation unit
137 Output processing unit
138 Label operation unit
139 Noise addition unit
Claims (6)
- A risk calculation device comprising: a first model construction unit that constructs a plurality of first shadow models using a predetermined dataset; a first selection unit that selects, from among sample data, the sample data for which an average of losses of the plurality of first shadow models is equal to or greater than a predetermined value and a variance of the losses is equal to or less than a predetermined value; a second model construction unit that constructs a plurality of second shadow models using the selected sample data and the dataset; a distance calculation unit that calculates, for the sample data, a distance between a distribution of losses of the plurality of first shadow models and a distribution of losses of the plurality of second shadow models; a second selection unit that selects the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating a privacy risk of a machine learning model; and an output processing unit that outputs the selected sample data.
- The risk calculation device according to claim 1, further comprising a risk calculation unit that calculates the privacy risk of the machine learning model using the output sample data.
- The risk calculation device according to claim 1, further comprising a noise addition unit that adds, to the sample data selected by the first selection unit, noise such that the average of the losses of the plurality of first shadow models becomes larger and the variance of the losses becomes smaller.
- The risk calculation device according to claim 1, further comprising a label operation unit that adds new sample data by performing a label operation on the sample data, wherein the first selection unit selects, from among the sample data to which the new sample data has been added, the sample data for which the average of the losses of the plurality of first shadow models is equal to or greater than the predetermined value and the variance of the losses is equal to or less than the predetermined value.
- A risk calculation method executed by a risk calculation device, the method comprising: constructing a plurality of first shadow models using a predetermined dataset; selecting, from among sample data, the sample data for which an average of losses of the plurality of first shadow models is equal to or greater than a predetermined value and a variance of the losses is equal to or less than a predetermined value; constructing a plurality of second shadow models using the selected sample data and the dataset; calculating, for the sample data, a distance between a distribution of losses of the plurality of first shadow models and a distribution of losses of the plurality of second shadow models; selecting the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating a privacy risk of a machine learning model; and outputting the selected sample data.
- A risk calculation program for causing a computer to execute a process comprising: constructing a plurality of first shadow models using a predetermined dataset; selecting, from among sample data, the sample data for which an average of losses of the plurality of first shadow models is equal to or greater than a predetermined value and a variance of the losses is equal to or less than a predetermined value; constructing a plurality of second shadow models using the selected sample data and the dataset; calculating, for the sample data, a distance between a distribution of losses of the plurality of first shadow models and a distribution of losses of the plurality of second shadow models; selecting the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating a privacy risk of a machine learning model; and outputting the selected sample data.
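The device, method, and program claims above describe one selection pipeline: build "out" shadow models from a base dataset, keep the candidate samples whose losses across those models are consistently high, rebuild "in" shadow models with the candidates added, and keep the samples whose loss distributions shift the most. The Python sketch below is a minimal, non-authoritative illustration of that flow, not the patented implementation: the shadow-model type (logistic regression), the per-sample cross-entropy loss, the Gaussian-style distance, the thresholds, and helper names such as `train_shadow_models` and `gaussian_distance` are all assumptions added for illustration, since the claims leave the model, loss, and distance metric unspecified.

```python
# Hypothetical sketch of the claimed selection pipeline (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_loss(model, x, y):
    """Per-sample cross-entropy loss of one shadow model (assumed loss)."""
    p = model.predict_proba(x.reshape(1, -1))[0]
    return -np.log(max(p[int(y)], 1e-12))

def train_shadow_models(X, y, n_models=8):
    """Train shadow models on random halves of the dataset (assumed setup)."""
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def loss_matrix(models, X, y):
    """Losses of every shadow model for every sample: shape (n_models, n_samples)."""
    return np.array([[sample_loss(m, x, t) for x, t in zip(X, y)] for m in models])

def gaussian_distance(a, b):
    """Toy distance between two loss distributions fitted as Gaussians."""
    return (a.mean() - b.mean()) ** 2 + (a.std() - b.std()) ** 2

# Toy data standing in for the predetermined dataset and the candidate sample data.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
Xs = rng.normal(size=(30, 5))
ys = (Xs[:, 0] > 0).astype(int)

# 1) First shadow models, built from the dataset only ("out" models).
first = train_shadow_models(X, y)
L1 = loss_matrix(first, Xs, ys)

# 2) First selection: average loss above, and variance below, illustrative thresholds.
cand = np.where((L1.mean(axis=0) >= 1.0) & (L1.var(axis=0) <= 0.5))[0]

# 3) Second shadow models, built from the dataset plus the selected samples ("in" models).
second = train_shadow_models(np.vstack([X, Xs[cand]]), np.concatenate([y, ys[cand]]))
L2 = loss_matrix(second, Xs, ys)

# 4) Second selection: keep samples whose "out" and "in" loss distributions are far apart.
scores = np.array([gaussian_distance(L1[:, i], L2[:, i]) for i in cand])
selected = cand[scores >= 0.5]
print("samples selected for privacy-risk calculation:", selected)
```

Intuitively, samples on which every first shadow model incurs a similarly high loss are atypical for the base dataset, so adding them to the training data shifts their losses sharply; those are the samples on which a membership-inference-style attack, and therefore a privacy-risk calculation, is most informative.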
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/037926 WO2024079795A1 (en) | 2022-10-11 | 2022-10-11 | Risk calculation device, risk calculation method, and risk calculation program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/037926 WO2024079795A1 (en) | 2022-10-11 | 2022-10-11 | Risk calculation device, risk calculation method, and risk calculation program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024079795A1 (en) | 2024-04-18 |
Family
ID=90668972
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/037926 WO2024079795A1 (en) | 2022-10-11 | 2022-10-11 | Risk calculation device, risk calculation method, and risk calculation program |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024079795A1 (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113657762A (en) * | 2021-08-18 | 2021-11-16 | 成都卫士通信息安全技术有限公司 | Method, device, equipment and medium for evaluating confidentiality of training data |
Non-Patent Citations (1)
Title |
---|
CARLINI NICHOLAS, CHIEN STEVE, NASR MILAD, SONG SHUANG, TERZIS ANDREAS, TRAMER FLORIAN: "Membership Inference Attacks From First Principles", 2022 IEEE SYMPOSIUM ON SECURITY AND PRIVACY (SP), IEEE, 12 April 2022 (2022-04-12) - 26 May 2022 (2022-05-26), pages 1897 - 1914, XP093009044, ISBN: 978-1-6654-1316-9, DOI: 10.1109/SP46214.2022.9833649 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Barman et al. | A Boolean network inference from time-series gene expression data using a genetic algorithm | |
Datta et al. | External control in Markovian genetic regulatory networks | |
Wang et al. | Automatic inference of demographic parameters using generative adversarial networks | |
JP2021533474A (en) | Node classification method, model training method, and its equipment, equipment and computer program | |
CN106803039B (en) | A kind of homologous determination method and device of malicious file | |
Mu et al. | A hybrid genetic algorithm for software architecture re-modularization | |
Crowther et al. | A flexible parametric accelerated failure time model and the extension to time-dependent acceleration factors | |
Garwood et al. | RE voSim: Organism‐level simulation of macro and microevolution | |
Zenil‐Ferguson et al. | chromploid: An R package for chromosome number evolution across the plant tree of life | |
Cifuentes-Fontanals et al. | Control in Boolean networks with model checking | |
Arendsee et al. | Fagin: synteny-based phylostratigraphy and finer classification of young genes | |
Justison et al. | SiPhyNetwork: An R package for simulating phylogenetic networks | |
Fogg et al. | PhyloCoalSimulations: A simulator for network multispecies coalescent models, including a new extension for the inheritance of gene flow | |
WO2024079795A1 (en) | Risk calculation device, risk calculation method, and risk calculation program | |
Anderson et al. | Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information | |
Mayo | Learning Petri net models of non-linear gene interactions | |
CN113518086B (en) | Network attack prediction method, device and storage medium | |
Attar et al. | Automatic generation of adaptive network models based on similarity to the desired complex network | |
Dutheil | Hidden Markov models in population genomics | |
US11177018B2 (en) | Stable genes in comparative transcriptomics | |
WO2023067666A1 (en) | Calculation device, calculation method, and calculation program | |
CN113191527A (en) | Prediction method and device for population prediction based on prediction model | |
Mizera et al. | Fast simulation of probabilistic Boolean networks | |
WO2024079802A1 (en) | Evaluation device, evaluation method, and evaluation program | |
WO2024079805A1 (en) | Calculation device, calculation method, and calculation program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22962015; Country of ref document: EP; Kind code of ref document: A1 |