WO2021220404A1

WO2021220404A1 - Anonymized database generation device, anonymized database generation method, and program

Info

Publication number: WO2021220404A1
Application number: PCT/JP2020/018127
Authority: WO
Inventors: 聡長谷川
Original assignee: Nippon Telegraph and Telephone Corporation (日本電信電話株式会社)
Priority date: 2020-04-28
Filing date: 2020-04-28
Publication date: 2021-11-04
Also published as: JP7405248B2; JPWO2021220404A1

Abstract

The present invention provides a technique that enables anonymously processed information to be generated without knowledge of the law or of processing methods. The invention includes: an attribute type classification unit that assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute; and an anonymized database generation unit that generates an anonymized database by anonymizing, for each attribute constituting the database, the values of the attribute using a method corresponding to the type of the attribute.

Description

Anonymized database generation device, anonymized database generation method, and program

The present invention relates to a technique for anonymizing a database.

Anonymously processed information refers to information obtained by processing personal information so that a specific individual cannot be identified and so that the personal information cannot be restored. The requirements for anonymously processed information are stipulated by the laws of each country (for example, the Personal Information Protection Law in Japan), and processing methods (for example, deletion or replacement) such as those described in Non-Patent Document 1 and Non-Patent Document 2 are used. Personal information must therefore be processed so as to meet these requirements.

As mentioned above, properly creating anonymously processed information requires strictly interpreting the laws that stipulate the requirements and manually selecting and applying an appropriate processing method. This can be extremely difficult for those who do not have sufficient knowledge.

Therefore, an object of the present invention is to provide a technique capable of generating anonymously processed information without having knowledge of law or processing method.

One aspect of the present invention includes: an attribute type classification unit that assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute; and an anonymized database generation unit that, for each attribute constituting the database, anonymizes the values of the attribute using a method according to the type of the attribute and generates an anonymized database.

Another aspect of the present invention includes: an attribute type classification unit that assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute; an attribute type correction unit that, for each attribute constituting the database, corrects the type of the attribute to the type "no processing" when the user determines that the type of the attribute is not appropriate; and an anonymized database generation unit that, for each attribute constituting the database, anonymizes the values of the attribute using a method according to the type of the attribute and generates an anonymized database.

According to the present invention, it is possible to generate anonymously processed information without having knowledge of the law or processing method.

FIG. 1 is a block diagram showing the configuration of the anonymized database generation device 100.
FIG. 2 is a flowchart showing the operation of the anonymized database generation device 100.
FIG. 3 is a block diagram showing the configuration of the anonymized database generation device 200.
FIG. 4 is a flowchart showing the operation of the anonymized database generation device 200.
FIG. 5 is a diagram showing an example of a database.
FIG. 6 is a diagram showing an example of the functional configuration of a computer that realizes each device in the embodiments of the present invention.

Hereinafter, embodiments of the present invention will be described in detail. The components having the same function are given the same number, and duplicate explanations will be omitted.

Prior to the description of each embodiment, the notation method in this specification will be described.

^ (caret) represents a superscript. For example, x^{y^z} means that y^z is a superscript of x, and x_{y^z} means that y^z is a subscript of x. Likewise, _ (underscore) represents a subscript. For example, x^{y_z} means that y_z is a superscript of x, and x_{y_z} means that y_z is a subscript of x.

Also, the marks "^" and "~" in ^x and ~x for a certain character x should properly be written directly above "x", but due to restrictions on the notation available in this specification they are written as ^x or ~x.
<Technical background>
In each embodiment of the present invention, the target for generating anonymized processed information is a database, and an anonymized database in which the data of the database is anonymized is generated.

Hereinafter, the procedure for generating the anonymized database in each embodiment will be described.
(1) First, the attributes of the database are classified into direct identifiers, quasi-identifiers, and others. Direct identifiers, quasi-identifiers, and others are called types. A direct identifier is an attribute that can identify a specific individual by itself. A quasi-identifier is an attribute that can identify a specific individual in combination with other attributes. Others refer to attributes that do not correspond to either direct identifiers or quasi-identifiers.
(2) Next, the attribute data is anonymously processed by using an appropriate processing method according to the type.

Before executing process (2), the user may be allowed to correct the type of an attribute, in consideration of the possibility of misclassification. In this case, a new type, "no processing", is provided, and an attribute designated with this type is excluded from the processing of (2).
<First Embodiment>
The anonymized database generation device 100 takes the database as an input, generates an anonymized database, and outputs the anonymized database.

Hereinafter, the anonymized database generation device 100 will be described with reference to FIGS. 1 and 2. FIG. 1 is a block diagram showing the configuration of the anonymized database generation device 100. FIG. 2 is a flowchart showing the operation of the anonymized database generation device 100. As shown in FIG. 1, the anonymized database generation device 100 includes an attribute type classification unit 110, an anonymized database generation unit 120, and a recording unit 190. The recording unit 190 is a component that records, as appropriate, the information necessary for the processing of the anonymized database generation device 100. For example, the database to be anonymized is recorded in the recording unit 190.

The operation of the anonymized database generation device 100 will be described with reference to FIG. 2.

In S110, the attribute type classification unit 110 takes the database as input, assigns a direct identifier, a quasi-identifier, or other as the type of each attribute constituting the database, and outputs the classification result.

Hereinafter, examples of direct identifiers and quasi-identifiers, and examples of classification methods will be described.
(Examples of direct identifiers and quasi-identifiers)
Examples of direct identifiers include name, e-mail address, My Number, basic pension number, resident's card code, telephone number, passport number, and credit card number.

Examples of quasi-identifiers include age, address, gender, and date of birth. In addition, an attribute that correlates with a certain quasi-identifier is also treated as a quasi-identifier.
(Example of classification method)
The following classification methods are available.
(1) Method by pattern matching: The method by pattern matching determines whether the attribute to be classified is a direct identifier or a quasi-identifier by pattern matching the data of that attribute against a predetermined set of data.

The name, which is a direct identifier, can be discriminated by pattern matching using a list of names as a predetermined set of data. Further, the address and the gender, which are quasi-identifiers, can also be discriminated by pattern matching in which the list of addresses and the list of genders are a set of predetermined data.
(2) Regular expression method: The regular expression method determines whether the attribute to be classified is a direct identifier or a quasi-identifier by determining whether the data of that attribute matches a predetermined regular expression.

The e-mail address, telephone number, and passport number, which are direct identifiers, can be determined by whether or not they correspond to a predetermined regular expression.
(3) Method by check digit generation algorithm: The method by check digit generation algorithm determines whether the attribute to be classified is a direct identifier by determining whether the data of that attribute was generated using a predetermined check digit generation algorithm.

The My Number and resident's card code, which are direct identifiers, can be determined by whether or not they are data generated using a check digit generation algorithm called Modulus11Weight234567. Further, the credit card number, which is a direct identifier, can be determined by whether or not the data is generated by using a check digit generation algorithm called Luhn Algorithm.
(4) Method by range check: The method by range check determines whether the attribute to be classified is a quasi-identifier by determining whether the data of that attribute falls within a predetermined data range.

The age, which is a quasi-identifier, can be determined by whether or not it is included in the data range with {0, 1, ..., 119, 120} as a predetermined data range.
(5) Correlation method: The correlation method determines whether the attribute to be classified is a quasi-identifier by determining whether it is correlated with a certain quasi-identifier. When both the attribute to be classified and the quasi-identifier used for the judgment are quantitative attributes, the Pearson correlation is used. When one of them is a quantitative attribute and the other is a qualitative attribute, the correlation ratio is used. When both are qualitative attributes, Cramér's coefficient of association (Cramér's V) is used. Here, a qualitative attribute is an attribute that takes non-numerical values, such as gender, and a quantitative attribute is an attribute that takes numerical values, such as age. Age, address, gender, and date of birth can be used as the quasi-identifier for the judgment. A quasi-identifier whose data is uniformly distributed may be excluded from the judgment of the presence or absence of correlation. Doing so reduces errors in determining whether the attribute to be classified is a quasi-identifier.
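
The choice of correlation measure described in (5) can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment: the function names, the `is_numeric` test, and the threshold value 0.7 are assumptions introduced here, and the optional exclusion of uniformly distributed quasi-identifiers is not shown.

```python
import math
from collections import Counter, defaultdict

def pearson(xs, ys):
    """Pearson correlation for two quantitative attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a qualitative and a quantitative attribute."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    mean = sum(values) / len(values)
    between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    total = sum((v - mean) ** 2 for v in values)
    return math.sqrt(between / total) if total else 0.0

def cramers_v(a, b):
    """Cramer's V for two qualitative attributes."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    chi2 = 0.0
    for x in count_a:
        for y in count_b:
            expected = count_a[x] * count_b[y] / n
            chi2 += (count_ab[(x, y)] - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0

def is_numeric(column):
    # Assumption: a column is quantitative if every value is an int or float.
    return all(isinstance(v, (int, float)) for v in column)

def association(candidate, quasi):
    """Pick the measure according to the kinds of the two attributes."""
    cand_num, quasi_num = is_numeric(candidate), is_numeric(quasi)
    if cand_num and quasi_num:
        return abs(pearson(candidate, quasi))
    if cand_num:
        return correlation_ratio(quasi, candidate)
    if quasi_num:
        return correlation_ratio(candidate, quasi)
    return cramers_v(candidate, quasi)

def is_correlated_with_quasi(candidate, quasi_columns, threshold=0.7):
    # threshold is a hypothetical value; the source does not specify one.
    return any(association(candidate, q) >= threshold for q in quasi_columns)
```

All three measures are computed with the standard library only, so the sketch can be run without additional packages.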

Accordingly, the attribute type classification unit 110 can be configured to assign types using one or more of the pattern matching method, the regular expression method, the check digit generation algorithm method, the range check method, and the correlation method. That is, the attribute type classification unit 110 determines the type of each attribute constituting the database by sequentially applying to that attribute one or more methods selected from the pattern matching method, the regular expression method, the check digit generation algorithm method, the range check method, and the correlation method, and generates and outputs a classification result consisting of pairs of an attribute and the type assigned to it.
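
The sequential application of classification methods can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the name and gender lists, the simplified e-mail pattern, the 13 to 19 digit length limit for card numbers, and the rule that a method decides the type only when every value in the column passes are all hypothetical. The `my_number_valid` function assumes that Modulus11Weight234567 corresponds to the published My Number check digit rule (weights 2 to 7 applied from the lowest digit, with remainders of 0 or 1 giving check digit 0). The correlation method is omitted here.

```python
import re

NAMES = {"Taro Yamada", "Hanako Sato"}      # hypothetical list of names
GENDERS = {"male", "female", "other"}       # hypothetical list of genders
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # simplified pattern

def luhn_valid(s):
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    if not re.fullmatch(r"[0-9]+", s):
        return False
    total = 0
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def my_number_valid(s):
    """Modulus 11 check with weights 2..7 (assumed My Number rule)."""
    if not re.fullmatch(r"[0-9]{12}", s):
        return False
    body = s[:11][::-1]                     # lowest digit of the 11-digit body first
    weights = [2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6]
    remainder = sum(int(d) * w for d, w in zip(body, weights)) % 11
    check = 0 if remainder <= 1 else 11 - remainder
    return check == int(s[11])

def card_number_valid(s):
    # Assumption: card numbers are 13 to 19 digits long.
    return 13 <= len(s) <= 19 and luhn_valid(s)

def age_in_range(s):
    return re.fullmatch(r"[0-9]+", s) is not None and 0 <= int(s) <= 120

CHECKS = [
    ("direct identifier", lambda v: v in NAMES),                         # pattern matching
    ("direct identifier", lambda v: EMAIL_RE.fullmatch(v) is not None),  # regular expression
    ("direct identifier", my_number_valid),                              # check digit
    ("direct identifier", card_number_valid),                            # check digit
    ("quasi-identifier", lambda v: v in GENDERS),                        # pattern matching
    ("quasi-identifier", age_in_range),                                  # range check
]

def classify(values):
    """Apply the checks in order; the first one that every value passes decides."""
    if not values:
        return "other"
    for type_name, check in CHECKS:
        if all(check(str(v)) for v in values):
            return type_name
    return "other"
```

Calling `classify` once per column and collecting the results in a dictionary keyed by attribute name gives a classification result of the form described above.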

In S120, the anonymized database generation unit 120 takes the database and the classification result output in S110 as inputs, anonymizes the values of each attribute constituting the database using a method according to the type of the attribute, and generates and outputs an anonymized database.

An example of the anonymization method will be described below.
(When the attribute type is a direct identifier)
(1) Item deletion: Item deletion is a method of anonymizing by deleting all the values of the attribute to be anonymized (that is, deleting the attribute item itself).
(2) Temporary ID conversion: Temporary ID conversion is a method of anonymizing by converting the values of the attribute to be anonymized into IDs using a hash function or the like.
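
The two methods for direct identifiers can be sketched as follows. This is a minimal Python sketch: the source only says "a hash function or the like", so the use of a keyed hash (HMAC-SHA256), the secret key, and the 16-character truncation are assumptions made here for illustration, as are the function names.

```python
import hashlib
import hmac

def delete_item(records, attribute):
    """Item deletion: drop the attribute from every record."""
    return [{k: v for k, v in r.items() if k != attribute} for r in records]

def temporary_id(value, secret_key, length=16):
    """Temporary ID conversion: map a value to an ID with a keyed hash.

    secret_key (bytes) and length are hypothetical parameters.
    """
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]

def convert_to_temporary_ids(records, attribute, secret_key):
    """Replace the attribute's values with temporary IDs in every record."""
    return [{**r, attribute: temporary_id(str(r[attribute]), secret_key)} for r in records]
```

Because the same input always yields the same ID under a given key, records belonging to the same person remain linkable within the anonymized database while the original value is no longer present.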
(When the attribute type is a quasi-identifier)
For example, the method of satisfying k-anonymity described in Reference Non-Patent Document 1 can be used.
(Reference Non-Patent Document 1: Khaled El Emam, Fida Kamal Dankar, Romeo Issa, Elizabeth Jonker, Daniel Amyot, Elise Cogo, Jean-Pierre Corriveau, Mark Walker, Sadrul Chowdhury, Regis Vaillancourt, et al., "A globally optimal k-anonymity method for the de-identification of health data," Journal of the American Medical Informatics Association, Vol. 16, No. 5, pp. 670-682, 2009.)
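
To make the k-anonymity condition concrete, the sketch below measures the k of a table, that is, the size of the smallest group of records sharing the same combination of quasi-identifier values, and shows how generalizing a value into a range can raise it. This is an illustrative Python sketch only; it is not the globally optimal method of Reference Non-Patent Document 1, and the function names and the "age" attribute are assumptions.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

def generalize_age(records, width):
    """Replace each age with a range of the given width, e.g. 23 -> '20-29'."""
    out = []
    for r in records:
        low = r["age"] // width * width
        out.append({**r, "age": f"{low}-{low + width - 1}"})
    return out

records = [
    {"age": 23, "gender": "F"},
    {"age": 27, "gender": "F"},
    {"age": 31, "gender": "M"},
    {"age": 38, "gender": "M"},
]
k_before = k_of(records, ["age", "gender"])
k_after = k_of(generalize_age(records, 10), ["age", "gender"])
```

In this small table every record is unique before generalization (k = 1), and after grouping ages into ten-year ranges each combination is shared by two records (k = 2).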
(When the attribute type is other)
(1) Deletion: Deletion is a method of anonymizing by deleting some or all of the values of the attribute to be anonymized.
(2) Generalization: Generalization is a method of anonymizing by replacing the values of the attribute to be anonymized with a higher-level concept.
(3) Rounding: When the attribute to be anonymized is a quantitative attribute, rounding is a method of anonymizing by replacing each value of the attribute with a value obtained by rounding it off or rounding it down.
(4) Swapping: Swapping is a method of anonymizing by (probabilistically) exchanging the values of the attribute to be anonymized between records.
(5) Noise addition: When the attribute to be anonymized is a quantitative attribute, noise addition is a method of anonymizing by adding a random value generated according to a certain (probability) distribution to each value of the attribute.
(6) Microaggregation: Microaggregation is a method of anonymizing by grouping the values of the attribute to be anonymized and replacing the values in each group with a representative value.
(7) Top coding: When the attribute to be anonymized is a quantitative attribute, top coding is a method of anonymizing by grouping together particularly large values of the attribute.
(8) Bottom coding: When the attribute to be anonymized is a quantitative attribute, bottom coding is a method of anonymizing by grouping together particularly small values of the attribute.
(9) Outlier processing: Outlier processing is a method of anonymizing by deleting peculiar values (outliers) included in the attribute to be anonymized or by applying processing such as top coding or bottom coding to them.
(10) Randomization: Randomization is a method of anonymizing by (probabilistically) replacing the values of the attribute to be anonymized with other values.
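
Several of the methods listed above can be sketched in a few lines each. The Python below is illustrative only: the function names, the choice of a threshold as the replacement value for top and bottom coding, the mean as the representative value for microaggregation, and Gaussian noise are assumptions, since the source does not fix these details.

```python
import random
from statistics import mean

def round_down(values, unit):
    """Rounding: round each value down to a multiple of unit."""
    return [v // unit * unit for v in values]

def top_code(values, threshold):
    """Top coding: collapse every value above the threshold to the threshold."""
    return [min(v, threshold) for v in values]

def bottom_code(values, threshold):
    """Bottom coding: collapse every value below the threshold to the threshold."""
    return [max(v, threshold) for v in values]

def microaggregate(values, group_size):
    """Microaggregation: sort, group neighbours, replace each group by its mean.

    The last group may be smaller than group_size.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [None] * len(values)
    for start in range(0, len(order), group_size):
        idx = order[start:start + group_size]
        representative = mean(values[i] for i in idx)
        for i in idx:
            out[i] = representative
    return out

def add_noise(values, sigma, rng):
    """Noise addition: add zero-mean Gaussian noise to each value."""
    return [v + rng.gauss(0, sigma) for v in values]

def swap(values, rng):
    """Swapping: exchange values between records by shuffling the column."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled
```

Passing an explicit `random.Random` instance as `rng` keeps the probabilistic methods reproducible in tests; a production system would need to choose its source of randomness with more care.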

Accordingly, the anonymized database generation unit 120 can be configured as follows: when the type of an attribute constituting the database is a direct identifier, it anonymizes the attribute using item deletion or temporary ID conversion; when the type is a quasi-identifier, it anonymizes the attribute using a method that satisfies k-anonymity; and when the type is other, it anonymizes the attribute using one or more of deletion, generalization, rounding, swapping, noise addition, microaggregation, top coding, bottom coding, outlier processing, and randomization.

According to the embodiment of the present invention, it is possible to generate anonymously processed information without having knowledge of laws and processing methods. In particular, even a user without specialized knowledge can automatically generate anonymously processed information using an appropriate processing method.
<Second Embodiment>
The anonymized database generation device 200 takes a database as input, generates an anonymized database, and outputs the anonymized database.

Hereinafter, the anonymized database generation device 200 will be described with reference to FIGS. 3 and 4. FIG. 3 is a block diagram showing the configuration of the anonymized database generation device 200. FIG. 4 is a flowchart showing the operation of the anonymized database generation device 200. As shown in FIG. 3, the anonymized database generation device 200 includes an attribute type classification unit 110, an attribute type correction unit 210, an anonymized database generation unit 120, and a recording unit 190. The recording unit 190 is a component that records, as appropriate, the information necessary for the processing of the anonymized database generation device 200. For example, the database to be anonymized is recorded in the recording unit 190.

The operation of the anonymized database generation device 200 will be described with reference to FIG. 4.

In S210, the attribute type correction unit 210 takes the classification result output in S110 as input and, for each attribute constituting the database, corrects the type of the attribute to the no-processing type when the user determines that the type of the attribute is not appropriate, and outputs the classification result reflecting the correction. When the user determines that an attribute type is not appropriate, the user inputs a correction instruction to the attribute type correction unit 210 using, for example, an input unit (not shown).

In S120, the anonymized database generation unit 120 takes the database and the classification result output in S210 as inputs, anonymizes the values of each attribute constituting the database using a method according to the type of the attribute, and generates and outputs an anonymized database. Here, if the type of an attribute is no processing, the anonymization processing is not executed for that attribute.
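
The flow of S210 followed by S120 can be sketched as follows. This Python sketch is a simplification under stated assumptions: the function names, the representation of the classification result as a dictionary, and the `methods` mapping from a type to a per-column function are hypothetical, and quasi-identifiers are processed one column at a time here, whereas a method satisfying k-anonymity would normally treat them jointly.

```python
NO_PROCESSING = "no processing"

def correct_types(classification, rejected):
    """S210: set the type of every attribute the user rejected to 'no processing'."""
    return {
        attr: (NO_PROCESSING if attr in rejected else type_name)
        for attr, type_name in classification.items()
    }

def anonymize_database(records, classification, methods):
    """S120: anonymize each attribute with the method registered for its type.

    methods maps a type name to a function taking a column (list) and returning
    the processed column, or None to delete the item altogether.
    """
    columns = {attr: [r[attr] for r in records] for attr in classification}
    result = {}
    for attr, type_name in classification.items():
        if type_name == NO_PROCESSING:
            result[attr] = columns[attr]   # left untouched
            continue
        processed = methods[type_name](columns[attr])
        if processed is not None:
            result[attr] = processed
    return [{a: col[i] for a, col in result.items()} for i in range(len(records))]

records = [
    {"name": "A", "age": 34, "memo": "x"},
    {"name": "B", "age": 67, "memo": "y"},
]
classification = {"name": "direct identifier", "age": "quasi-identifier", "memo": "other"}
methods = {
    "direct identifier": lambda col: None,                    # item deletion
    "quasi-identifier": lambda col: [v // 10 * 10 for v in col],
    "other": lambda col: ["" for _ in col],                   # deletion of values
}
corrected = correct_types(classification, {"memo"})
anonymized = anonymize_database(records, corrected, methods)
```

In this example the user rejects the type assigned to "memo", so that column passes through unchanged while "name" is deleted and "age" is rounded down.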

According to the embodiment of the present invention, it is possible to generate anonymously processed information without having knowledge of laws and processing methods. In particular, even a user without specialized knowledge can automatically generate anonymously processed information using an appropriate processing method.
<Application example>
Here, the processing in each component will be described using the database shown in FIG. 5 as an example. The database has seven attributes identified as (a), (b), (c), (d), (e), (f), (g).
(1) Processing by the attribute type classification unit 110 and the attribute type correction unit 210: The attribute type classification unit 110 determines whether each of the above seven attributes corresponds to a direct identifier, a quasi-identifier, or other, and generates a classification result. The attribute type correction unit 210 generates a classification result corrected based on the correction instruction from the user.

Regarding attribute (a), it is determined that the attribute is a name by pattern matching using a list of names, and a direct identifier is assigned as the type of attribute (a).

Regarding attribute (b), since its data follows the check digit generation algorithm called Modulus11Weight234567, it is determined that the attribute is My Number, and a direct identifier is assigned as the type of attribute (b).

Regarding attribute (c), it is determined that the attribute is gender by pattern matching using a list of genders, and a quasi-identifier is assigned as the type of attribute (c).

Regarding the attribute (d), it is determined that the attribute is an address by pattern matching using a list of addresses, and a quasi-identifier is assigned as the type of the attribute (d).

Regarding attribute (e), a range check with {0, 1,…, 119, 120} as the data range determines that the attribute is age, and assigns a quasi-identifier as the type of attribute (e).

Regarding the attribute (f), it is judged that the correlation with the attribute (e) is high, and a quasi-identifier is assigned as the type of the attribute (f).

Regarding attribute (g), the user determines that the type obtained by the processing in the attribute type classification unit 110 is inappropriate, and corrects the type of attribute (g) to no processing.
(2) Processing by the anonymized database generation unit 120: The anonymized database generation unit 120 executes the anonymization processing using the method according to the type obtained in the processing of (1).

Attribute (a) and attribute (b) are direct identifiers, so anonymization processing is executed using item deletion.

For the attribute (c), attribute (d), attribute (e), and attribute (f), anonymization processing is executed using k-anonymization for the four attributes.
Since the type of attribute (g) is no processing, the anonymization processing is not executed for it.
<Supplement>
FIG. 6 is a diagram showing an example of a functional configuration of a computer that realizes each of the above-mentioned devices (that is, each node). The processing in each of the above-mentioned devices can be carried out by causing the recording unit 2020 to read a program for causing the computer to function as each of the above-mentioned devices, and operating the control unit 2010, the input unit 2030, the output unit 2040, and the like.

The device of the present invention has, for example, as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected; a CPU (Central Processing Unit, which may include a cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. Further, if necessary, the hardware entity may be provided with a device (drive) or the like capable of reading from and writing to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.

The external storage device of the hardware entity stores the program required to realize the above-mentioned functions and the data required for processing this program (the program is not limited to the external storage device and may, for example, be stored in a ROM, which is a read-only storage device). Further, the data obtained by the processing of these programs is stored, as appropriate, in a RAM, an external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for processing each program are read into the memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes predetermined functions (the components referred to above as ... unit, ... means, and so on).

The present invention is not limited to the above-described embodiments, and can be modified as appropriate without departing from the spirit of the present invention. Further, the processes described in the above embodiments need not be executed only in chronological order as described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes the processes or as otherwise required.

As described above, when the processing function in the hardware entity (device of the present invention) described in the above embodiment is realized by a computer, the processing content of the function that the hardware entity should have is described by a program. Then, by executing this program on the computer, the processing function in the above hardware entity is realized on the computer.

The program that describes this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be, for example, a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable) / RW (ReWritable), or the like as the optical disk; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electrically Erasable and Programmable Read Only Memory) or the like as the semiconductor memory.

The distribution of this program is carried out, for example, by selling, transferring, renting, etc., a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Further, the program may be stored in the storage device of the server computer, and the program may be distributed by transferring the program from the server computer to another computer via the network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. Then, when executing the processing, the computer reads the program stored in its own storage device and executes the processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the processing according to the program, or it may execute the processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing functions only through execution instructions and acquisition of results, without transferring the program from the server computer to this computer. The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

Further, in this form, the hardware entity is configured by executing a predetermined program on the computer, but at least a part of these processing contents may be realized in terms of hardware.

The above description of the embodiments of the present invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings. The embodiments were chosen and described to provide the best illustration of the principles of the invention and to enable those skilled in the art to use the invention in various embodiments and with various modifications suited to the practical use contemplated. All such modifications and variations are within the scope of the invention as defined by the appended claims when interpreted in accordance with the breadth to which they are fairly, legally, and equitably entitled.

Claims

An anonymized database generation device comprising:
an attribute type classification unit that assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute; and
an anonymized database generation unit that, for each attribute constituting the database, anonymizes the values of the attribute using a method according to the type of the attribute and generates an anonymized database.
An anonymized database generation device comprising:
an attribute type classification unit that assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute;
an attribute type correction unit that, for each attribute constituting the database, corrects the type of the attribute to the type of no processing when the user determines that the type of the attribute is not appropriate; and
an anonymized database generation unit that, for each attribute constituting the database, anonymizes the values of the attribute using a method according to the type of the attribute and generates an anonymized database.
The anonymized database generation device according to claim 1 or 2,
wherein the attribute type classification unit assigns the type using one or more of a pattern matching method, a regular expression method, a check digit generation algorithm method, a range check method, and a correlation method.
The anonymized database generation device according to claim 1 or 2,
wherein the anonymized database generation unit:
when the type of an attribute constituting the database is a direct identifier, anonymizes the attribute using item deletion or temporary ID conversion; and
when the type of an attribute constituting the database is a quasi-identifier, anonymizes the attribute using a method satisfying k-anonymity.
The anonymized database generation device according to claim 4,
wherein the anonymized database generation unit, when the type of an attribute constituting the database is other, anonymizes the attribute using one or more of deletion, generalization, rounding, swapping, noise addition, microaggregation, top coding, bottom coding, outlier processing, and randomization.
An anonymized database generation method comprising:
an attribute type classification step in which an anonymized database generation device assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute; and
an anonymized database generation step in which the anonymized database generation device, for each attribute constituting the database, anonymizes the values of the attribute using a method according to the type of the attribute and generates an anonymized database.
An anonymized database generation method comprising:
an attribute type classification step in which an anonymized database generation device assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute;
an attribute type correction step in which the anonymized database generation device, for each attribute constituting the database, corrects the type of the attribute to the type of no processing when the user determines that the type of the attribute is not appropriate; and
an anonymized database generation step in which the anonymized database generation device, for each attribute constituting the database, anonymizes the values of the attribute using a method according to the type of the attribute and generates an anonymized database.
A program for causing a computer to function as the anonymized database generation device according to any one of claims 1 to 5.