WO2024079795A1

WO2024079795A1 - Risk calculation device, risk calculation method, and risk calculation program

Info

Publication number: WO2024079795A1
Application number: PCT/JP2022/037926
Authority: WO
Inventors: 俊樹芝原; 尭之三浦; 真昇紀伊; 敦謙市川
Original assignee: 日本電信電話株式会社
Priority date: 2022-10-11
Filing date: 2022-10-11
Publication date: 2024-04-18

Abstract

This risk calculation device constructs a plurality of first shadow models using a predetermined dataset. Then, the risk calculation device selects, from a set of sample data, pieces of sample data for which the average loss of the plurality of first shadow models is sufficiently large and the variance thereof is small. Subsequently, the risk calculation device constructs a plurality of second shadow models using the selected pieces of sample data and the dataset. Next, the risk calculation device calculates, for the pieces of sample data, a distance between the distribution of losses of the plurality of first shadow models and the distribution of losses of the plurality of second shadow models. Then, the risk calculation device selects sample data for which the distance is equal to or larger than a predetermined threshold value as sample data (target sample) to be used for calculating a privacy risk of a machine learning model.

Description

RISK CALCULATION DEVICE, RISK CALCULATION METHOD, AND RISK CALCULATION PROGRAM

The present invention relates to a risk calculation device, a risk calculation method, and a risk calculation program for calculating the privacy risk of a machine learning model.

Machine learning technologies such as Deep Neural Networks (DNNs) have been pointed out as posing privacy risks due to their tendency to memorize training data. Specifically, it has been shown that it is possible to infer from the output of a trained model whether or not specific data was included in the training data. Therefore, consideration must be given to privacy risks when handling data that users do not want others to know, such as medical data or web browsing history.

　Conventionally, a method has been proposed in which an attack is performed on a trained model to determine whether or not certain data is included in the training data, and the privacy risk is calculated based on the degree to which the attack is successful (see Non-Patent Documents 1, 2, and 3).

However, conventional privacy risk calculation methods have only proposed methods based on weak attacks or unrealistic attacks, which means that they tend to underestimate privacy risks in realistic settings. Calculating privacy risks based on weak attacks involves, for example, calculating the average privacy risk using randomly selected target samples. Unrealistic attacks include, for example, attacks that assume that an attacker has access to a model that is in the middle of training, or attacks that assume that they can manipulate training data.

The present invention aims to solve the above problems and accurately calculate the privacy risk in realistic settings of machine learning models.

In order to solve the above-mentioned problems, the present invention is characterized by comprising a first model construction unit that constructs multiple first shadow models using a predetermined dataset, a first selection unit that selects, from sample data, the sample data for which the average loss of the multiple first shadow models is equal to or greater than a predetermined value and the variance is equal to or less than a predetermined value, a second model construction unit that constructs multiple second shadow models using the selected sample data and the dataset, a distance calculation unit that calculates the distance between the distribution of losses of the multiple first shadow models and the distribution of losses of the multiple second shadow models for the sample data, a second selection unit that selects the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating the privacy risk of a machine learning model, and an output processing unit that outputs the selected sample data.

The present invention makes it possible to highly (accurately) calculate privacy risks in realistic settings for machine learning models.

FIG. 1 is a diagram for explaining an overview of a risk calculation device. FIG. 2 is a diagram illustrating an example of the configuration of the risk calculation device. FIG. 3 is a flowchart illustrating an example of a processing procedure executed by the risk calculation device. FIG. 4 is a flowchart showing an example of calculating a privacy risk using the risk calculation device. FIG. 5 is a diagram illustrating a computer that executes a risk calculation program.

Below, a form (embodiment) for carrying out the present invention will be described with reference to the drawings. The present invention is not limited to this embodiment.

[overview]
First, an overview of the risk calculation device of this embodiment will be described with reference to Fig. 1. The risk calculation device selects sample data (target samples) used for calculating a privacy risk of a trained model (machine learning model) in the following manner.

First, the risk calculation device constructs multiple shadow models (first shadow models) using a previously prepared data set. Next, the risk calculation device inputs sample data (data separate from the data set) to each of the constructed multiple first shadow models. Then, the risk calculation device calculates the loss of each of the constructed multiple first shadow models, and selects sample data in which the average of the calculated losses is sufficiently large and the variance is small (selection 1).

Then, the risk calculation device constructs multiple shadow models (second shadow models) using the data obtained by adding the sample data selected in selection 1 to the above data set. The risk calculation device then calculates the loss of each of the multiple second shadow models for the sample data.

Then, the risk calculation device selects sample data in which the difference between the distribution of losses of each of the multiple first shadow models and the distribution of losses of each of the multiple second shadow models becomes large as a target sample (selection 2).

In other words, sample data in which the average loss of multiple shadow models is sufficiently large and the variance is small can be considered to be sample data with a high privacy risk. Furthermore, if adding sample data to the training data of a shadow model significantly changes the distribution of losses in that model, the sample data can be considered to be data that is easy to predict. In other words, the sample data can be considered to be data with a high privacy risk.

The risk calculation device therefore first selects sample data with a sufficiently large average loss for multiple shadow models and small variance as target sample candidates (Selection 1). Next, the risk calculation device selects, from among the target sample candidates, a sample that, when added to the teacher data for the shadow model, will significantly change the distribution of losses for the shadow model as the target sample (Selection 2).

In this way, the risk calculation device can select sample data with a high privacy risk as a target sample. Then, the risk calculation device can calculate the privacy risk of the machine learning model using the target sample, thereby calculating the privacy risk to be high (correct).

[Configuration example]
Next, a configuration example of the risk calculation device 10 will be described with reference to Fig. 2. The risk calculation device 10 includes, for example, an input/output unit 11, a storage unit 12, and a control unit 13.

The input/output unit 11 is an interface that handles the input and output of various data. The input/output unit 11 accepts, for example, input of a dataset and sample data. The dataset is a dataset used to construct the first shadow model and the second shadow model. The sample data is candidate data (target sample) used to calculate the privacy risk of the machine learning model.

The memory unit 12 stores data, programs, etc. that are referenced when the control unit 13 executes various processes. The memory unit 12 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk.

For example, the storage unit 12 stores a data set, sample data, etc. input via the input/output unit 11. Also, for example, the storage unit 12 stores parameters of a first shadow model and parameters of a second shadow model constructed by the control unit 13, etc.

The control unit 13 is responsible for controlling the entire risk calculation device 10. The functions of the control unit 13 are realized, for example, by a CPU (Central Processing Unit) executing a program stored in the memory unit 12.

The control unit 13 includes, for example, a first model construction unit 131, a first selection unit 132, a second model construction unit 133, a distance calculation unit 134, a second selection unit 135, a risk calculation unit 136, and an output processing unit 137. Note that a label manipulation unit 138 and a noise addition unit 139 shown by dashed lines may or may not be provided, and the cases in which they are provided will be described later.

The first model construction unit 131 constructs multiple first shadow models using the above dataset as training data. Note that the settings used by the first model construction unit 131 when constructing the first shadow models are the same as the settings of the machine learning model for which privacy risks are to be calculated.

The first selection unit 132 inputs sample data to the multiple first shadow models constructed by the first model construction unit 131, and selects sample data for which the average loss of the multiple first shadow models is sufficiently large and the variance is small.

For example, the first selection unit 132 first calculates the loss of each of the first shadow models for the sample data. Next, the first selection unit 132 calculates the average and variance of the calculated losses. Then, the first selection unit 132 selects sample data with a small s _outi shown in the following formula (1).

In addition, in formula (1), μ _outi is the average of the losses of the multiple first shadow models for the sample data (x _i , y _i ), and σ ² _outi is the variance of the losses of the multiple first shadow models for the sample data (x _i , y _i ). k is a sufficiently large loss constant, for example, k=20. The value of k may be determined based on the analysis results of samples that are vulnerable to attacks, for example.

Alternatively, the first selection unit 132 may select sample data in which the average loss of the multiple first shadow models is large and the variance is small, without using the constant k. In this case, the first selection unit 132 selects sample data in which s _outi shown in the following formula (2) is small.

For example, the first selection unit 132 selects a predetermined number (for example, about 100-1000) of sample data in ascending order of s _outi calculated by the above formula (1) or (2). The first selection unit 132 may also select sample data whose s _outi is equal to or smaller than a predetermined value.

The first selection unit 132 may also select sample data for which the average loss of multiple first shadow models for the sample data is equal to or greater than a predetermined value and the variance of the loss is equal to or less than a predetermined value.

The second model construction unit 133 constructs multiple second shadow models using the sample data selected by the first selection unit 132 and the dataset (the dataset used to construct the first shadow model) as training data. Note that the settings used by the second model construction unit 133 when constructing the second shadow model are the same as the settings of the machine learning model for which privacy risk is calculated.

The distance calculation unit 134 calculates the degree to which the distribution of losses of the multiple first shadow models differs from the distribution of losses of the multiple second shadow models for the sample data.

For example, the distance calculation unit 134 calculates the distance d i between the distribution of losses of multiple first shadow models and the distribution of losses of multiple second shadow models for the sample data selected by the first selection unit 132 based on the following equation ( ₃ ):

In addition, in equation (3), μ _ini is the average of the losses of multiple second shadow models for the sample data (x _i , y _i ), and σ ² _ini is the variance of the losses of multiple second shadow models for the sample data (x _i , y _i ).

The second selection unit 135 selects sample data having a large distance d _i between the distribution of losses of the plurality of first shadow models and the distribution of losses of the plurality of second shadow models calculated by the distance calculation unit 134 .

For example, the second selection unit 135 selects a predetermined number (for example, about 1-100) of sample data in descending order of the distance d _i . Alternatively, the second selection unit 135 may select sample data whose distance d _i is equal to or greater than a predetermined value.

The risk calculation unit 136 calculates the privacy risk of the machine learning model using the sample data (target sample) selected by the second selection unit 135.

For example, data (D ₁ ) including a target sample and data (D ₀ ) not including the target sample are prepared. Then, the risk calculation unit 136 repeats a game of guessing whether the data (D ₁ ) including the target sample was used for learning the machine learning model 1000 times based on the following formula (4). Then, the risk calculation unit 136 calculates the privacy risk of the machine learning model from the accuracy rate of the game.

The output processing unit 137 outputs the processing result by the control unit 13. For example, the output processing unit 137 outputs the calculation result of the privacy risk of the machine learning model by the risk calculation unit 136.

With this risk calculation device 10, the privacy risk of the machine learning model can be calculated highly (accurately).

Note that the calculation of the privacy risk using the target sample selected by the second selection unit 135 may be performed by a device (external device) other than the risk calculation device 10. In this case, the risk calculation device 10 outputs the target sample selected by the second selection unit 135 to the external device. Then, the external device calculates the privacy risk using the target sample.

[Example of processing procedure]
Next, an example of a processing procedure executed by the risk calculation device 10 will be described with reference to Fig. 3. First, the first model construction unit 131 of the risk calculation device 10 constructs a plurality of first shadow models using a data set (S1).

After S1, the first selection unit 132 selects sample data based on the average and variance of the losses of the multiple first shadow models constructed in S1 (S2). For example, the first selection unit 132 selects sample data from the sample data in which the average of the losses of the multiple first shadow models is sufficiently large and the variance is small.

After S2, the second model construction unit 133 constructs multiple second shadow models using the sample data selected in S2 and the dataset (the dataset used to construct the first shadow model) (S3).

After S3, the distance calculation unit 134 calculates the distance between the distribution of losses of the multiple first shadow models constructed in S1 and the distribution of losses of the multiple second shadow models constructed in S3 (S4).

After S4, the second selection unit 135 selects sample data (target samples) whose distance calculated in S4 is equal to or greater than a predetermined threshold (S5). The risk calculation unit 136 then uses the sample data (target samples) selected in S5 to calculate the privacy risk of the machine learning model (S6). The output processing unit 137 then outputs the calculation result of the privacy risk obtained in S6 (S7).

By performing the above process, the risk calculation device 10 can highly (accurately) calculate the privacy risk in a realistic setting of the machine learning model.

[Other embodiments]
The risk calculation device 10 may further include a label manipulation unit 138 (see FIG. 2). The label manipulation unit 138 creates new sample data by performing label manipulation on the sample data.

For example, the label operation unit 138 creates new sample data (x, y p ) from sample data (x, _y ) (y is the label of x). However, y _p ≠ y. Here, when creating new sample data (x, y _p ), the label operation unit 138 may create sample data of all classes other than the correct answer (y). For example, if the shadow model is a model that classifies input data into 10 classes, the label operation unit 138 may create sample data to which nine labels other than the correct answer (y) are assigned as y _p .

Then, the first selection unit 132 selects sample data in which the average loss of the multiple first shadow models is sufficiently large and the variance is small from the sample data to which new sample data has been added by the label operation unit 138. After that, the second selection unit 135 selects a target sample from the sample data selected by the first selection unit 132.

The risk calculation device 10 may further include a noise addition unit 139 (see FIG. 2). The noise addition unit 139 creates sample data that is vulnerable to attacks by adding noise to the sample data.

For example, the noise addition unit 139 adds noise to the sample data such that the average loss of the multiple first shadow models is sufficiently large and the variance is small. For example, the noise addition unit 139 optimizes the noise so that the losses shown in the following formulas (5) and (6) are small. Then, the noise addition unit 139 adds the optimized noise to the sample data.

Furthermore, the noise adding unit 139 may add a condition that the noises to be added are not similar to each other. For example, the noise adding unit 139 adds noise such that the loss _Lsim shown in the following formula (7) is small. Note that x _i ^t in formula (7) is sample data to which noise has been added at step t of the optimization of noise addition.

In this way, the noise addition unit 139 creates sample data that is vulnerable to attacks, and the risk calculation device 10 can select target samples with a higher privacy risk. This allows the risk calculation device 10 to more accurately calculate the privacy risk of a machine learning model in a realistic setting.

[Application example]
Next, an application example of the risk calculation device 10 will be described with reference to Fig. 4. For example, the administrator of the risk calculation device 10 designs a neural network (NN) for calculating privacy risks (S11). For example, training of a NN that satisfies Differential Privacy is performed by an optimization method DP-SGD (Differentially Private Stochastic Gradient Descent).

Next, the risk calculation device 10 selects a target sample from the sample data (S12). Then, the risk calculation device 10 calculates a privacy risk using the target sample selected in S12 (S13). For example, the risk calculation device 10 repeats the game for calculating the privacy risk of the conventional technology 1,000 times using the target sample selected in S12, and calculates the privacy risk of the machine learning model from the accuracy rate.

[System configuration, etc.]
In addition, each component of each part shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure. In other words, the specific form of distribution and integration of each device is not limited to that shown in the figure, and all or a part of it can be functionally or physically distributed and integrated in any unit depending on various loads, usage conditions, etc. Furthermore, each processing function performed by each device can be realized in whole or in any part by a CPU and a program executed by the CPU, or can be realized as hardware using wired logic.

Furthermore, among the processes described in the above embodiments, all or part of the processes described as being performed automatically can be performed manually, or all or part of the processes described as being performed manually can be performed automatically using known methods. In addition, the information including the processing procedures, control procedures, specific names, various data and parameters shown in the above documents and drawings can be changed as desired unless otherwise specified.

[program]
The risk calculation device 10 can be implemented by installing a program (risk calculation program) as package software or online software on a desired computer. For example, by executing the above program on an information processing device, the information processing device can function as the risk calculation device 10. The information processing device referred to here includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone Systems), as well as terminals such as PDAs (Personal Digital Assistants).

FIG. 5 is a diagram showing an example of a computer that executes a risk calculation program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these components is connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to a display 1130, for example.

The hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define each process executed by the risk calculation device 10 are implemented as program modules 1093 in which computer-executable code is written. The program modules 1093 are stored, for example, in the hard disk drive 1090. For example, a program module 1093 for executing processes similar to the functional configuration of the risk calculation device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.

The program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network (such as a LAN (Local Area Network), WAN (Wide Area Network)). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.

REFERENCE SIGNS LIST 10 Risk calculation device 11 Input/output unit 12 Storage unit 13 Control unit 131 First model construction unit 132 First selection unit 133 Second model construction unit 134 Distance calculation unit 135 Second selection unit 136 Risk calculation unit 137 Output processing unit 138 Label operation unit 139 Noise addition unit

Claims

a first model constructing unit that constructs a plurality of first shadow models using a predetermined data set;
a first selection unit that selects, from among the sample data, the sample data in which an average of losses of the first shadow models is equal to or greater than a predetermined value and a variance is equal to or less than a predetermined value;
a second model construction unit that constructs a plurality of second shadow models using the selected sample data and the selected data set;
a distance calculation unit that calculates a distance between a distribution of losses of the first shadow models and a distribution of losses of the second shadow models for the sample data;
A second selection unit that selects the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating a privacy risk of a machine learning model;
and an output processing unit that outputs the selected sample data.
The risk calculation device according to claim 1 , further comprising: a risk calculation unit that calculates a privacy risk of the machine learning model by using the output sample data.
2. The risk calculation device according to claim 1, further comprising: a noise addition unit that adds noise to the sample data selected by the first selection unit such that an average of losses of the plurality of first shadow models becomes large and a variance becomes small.
a label operation unit for adding new sample data by performing a label operation on the sample data,
The first selection unit is
The risk calculation device according to claim 1, further comprising: selecting, from among the sample data to which the new sample data has been added, the sample data for which an average of losses of the first shadow models is equal to or greater than a predetermined value and a variance is equal to or less than a predetermined value.
A risk calculation method executed by a risk calculation device,
constructing a plurality of first shadow models using a predetermined data set;
selecting, from among the sample data, the sample data for which an average of losses of the first shadow models is equal to or greater than a predetermined value and a variance is equal to or less than a predetermined value;
constructing a plurality of second shadow models using the selected sample data and the data set;
Calculating a distance between a distribution of losses of the first shadow models and a distribution of losses of the second shadow models for the sample data;
selecting the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating a privacy risk of a machine learning model;
and outputting the selected sample data.
constructing a plurality of first shadow models using a predetermined data set;
selecting, from among the sample data, the sample data for which an average of losses of the first shadow models is equal to or greater than a predetermined value and a variance is equal to or less than a predetermined value;
constructing a plurality of second shadow models using the selected sample data and the data set;
Calculating a distance between a distribution of losses of the first shadow models and a distribution of losses of the second shadow models for the sample data;
selecting the sample data for which the distance is equal to or greater than a predetermined threshold as sample data to be used in calculating a privacy risk of a machine learning model;
and outputting the selected sample data.