WO2021220404A1 - Anonymized database generation device, anonymized database generation method, and program - Google Patents

Anonymized database generation device, anonymized database generation method, and program

Info

Publication number
WO2021220404A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute
database
type
anonymized
identifier
Prior art date
Application number
PCT/JP2020/018127
Other languages
English (en)
Japanese (ja)
Inventor
聡 長谷川
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/018127 (WO2021220404A1)
Priority to JP2022518490A (JP7405248B2)
Publication of WO2021220404A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60: Protecting data
    • G06F21/62: Protecting access to data via a platform, e.g. using keys or access control rules

Definitions

  • The present invention relates to a technique for anonymizing a database.
  • Anonymously processed information is information that has been processed so that a specific individual cannot be identified and the original personal information cannot be restored.
  • The requirements for anonymously processed information are stipulated by the laws of each country (for example, the Personal Information Protection Law in Japan), and processing methods (for example, deletion or replacement) such as those described in Non-Patent Document 1 and Non-Patent Document 2 are used. Personal information must therefore be processed so as to meet these requirements.
  • An object of the present invention is to provide a technique capable of generating anonymously processed information without knowledge of the law or of processing methods.
  • One aspect of the present invention includes an attribute type classification unit that assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute, and an anonymized database generation unit that, for each attribute constituting the database, anonymizes the value of the attribute using a method according to the type of the attribute and generates an anonymized database.
  • Another aspect of the present invention includes an attribute type classification unit that assigns, to each attribute constituting a database, a direct identifier, a quasi-identifier, or other as the type of the attribute; an attribute type correction unit that, when the user determines that the type of an attribute is not appropriate, corrects the type of the attribute to the type of no processing; and an anonymized database generation unit that, for each attribute constituting the database, anonymizes the value of the attribute using a method according to the type of the attribute and generates an anonymized database.
  • ^ (caret) represents a superscript, and _ (underscore) represents a subscript. For example, x^y_z means that y_z is a superscript for x, and x_y^z means that y^z is a subscript for x.
  • The target for which anonymously processed information is generated is a database, and an anonymized database, in which the data of the database has been anonymized, is generated.
  • The procedure for generating the anonymized database in each embodiment is described below.
  • The attributes of the database are classified into direct identifiers, quasi-identifiers, and others.
  • Direct identifiers, quasi-identifiers, and others are called types.
  • A direct identifier is an attribute that can identify a specific individual by itself.
  • A quasi-identifier is an attribute that can identify a specific individual in combination with other attributes. Others are attributes that correspond to neither direct identifiers nor quasi-identifiers.
  • The attribute data is anonymized using an appropriate processing method according to the type.
  • The anonymized database generation device 100 takes a database as input, generates an anonymized database, and outputs the anonymized database.
  • FIG. 1 is a block diagram showing the configuration of the anonymized database generation device 100.
  • FIG. 2 is a flowchart showing the operation of the anonymized database generation device 100.
  • The anonymized database generation device 100 includes an attribute type classification unit 110, an anonymized database generation unit 120, and a recording unit 190.
  • The recording unit 190 is a component that records, as appropriate, the information necessary for the processing of the anonymized database generation device 100. The recording unit 190 records, for example, the database to be anonymized.
  • The attribute type classification unit 110 takes the database as input, assigns a direct identifier, a quasi-identifier, or other as the type of the attribute to each of the attributes constituting the database, and outputs the classification result.
  • Examples of direct identifiers include name, e-mail address, My Number, basic pension number, resident's card code, telephone number, passport number, and credit card number.
  • Examples of quasi-identifiers include age, address, gender, and date of birth.
  • An attribute that correlates with a certain quasi-identifier is also treated as a quasi-identifier.
  • For example, the name, which is a direct identifier, can be determined by pattern matching in which a list of names is the set of predetermined data. The address and the gender, which are quasi-identifiers, can likewise be determined by pattern matching in which a list of addresses and a list of genders are the sets of predetermined data.
  • The regular expression method determines whether the attribute to be classified is a direct identifier or a quasi-identifier by determining whether its data matches a predetermined regular expression.
  • For example, the e-mail address, telephone number, and passport number, which are direct identifiers, can be determined by whether they match a predetermined regular expression.
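  • As an illustration of the regular expression method, the following is a minimal sketch. The patterns, the function name `match_direct_identifier`, and the 90% match threshold are all assumptions made for this example; the source does not give the actual expressions or any threshold.

```python
import re

# Hypothetical, deliberately simplified patterns (not taken from the source).
PATTERNS = {
    "email address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "telephone number": re.compile(r"0\d{1,4}-\d{1,4}-\d{4}"),
    "passport number": re.compile(r"[A-Z]{2}\d{7}"),
}


def match_direct_identifier(values, threshold=0.9):
    """Return the name of the first pattern that at least `threshold`
    of the non-empty values fully match, or None if no pattern does."""
    values = [v for v in values if v]
    if not values:
        return None
    for name, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            return name
    return None


print(match_direct_identifier(["a@example.com", "b.c@example.org"]))
print(match_direct_identifier(["03-1234-5678", "090-1234-5678"]))
```

  • Using `fullmatch` rather than `search` keeps a free-text column that merely contains an e-mail address from being classified as an e-mail attribute.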
  • The check digit generation algorithm method determines whether the attribute to be classified is a direct identifier by determining whether its data was generated using a predetermined check digit generation algorithm.
  • For example, the My Number and the resident's card code, which are direct identifiers, can be determined by whether they are data generated using the check digit generation algorithm called Modulus11Weight234567. Likewise, the credit card number, which is a direct identifier, can be determined by whether it is data generated using the check digit generation algorithm called the Luhn algorithm.
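  • The check digit approach can be sketched for the Luhn algorithm as follows. The function name `luhn_is_valid` and the handling of spaces and hyphens are assumptions for this example. Modulus11Weight234567 is not shown because the source does not describe its details.

```python
def luhn_is_valid(number: str) -> bool:
    """Check whether a digit string passes the Luhn check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit;
    # a doubled value above 9 is reduced by 9 (same as summing its digits).
    for i, ch in enumerate(reversed(cleaned)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_is_valid("4111 1111 1111 1111"))
print(luhn_is_valid("4111 1111 1111 1112"))
```

  • An attribute could then be treated as a credit card number when all, or nearly all, of its values pass this check; the exact acceptance rule is not specified in the source.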
  • The range check method determines whether the attribute to be classified is a quasi-identifier by determining whether its data falls within a predetermined data range.
  • For example, the age, which is a quasi-identifier, can be determined by whether its data falls within the predetermined data range {0, 1, ..., 119, 120}.
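  • A range check of this kind might look like the sketch below. The function name `looks_like_age` and the choice to skip empty values are assumptions for this example, not details from the source.

```python
def looks_like_age(values, low=0, high=120):
    """Return True if there is at least one non-empty value and every
    non-empty value parses as an integer in the range [low, high]."""
    seen = False
    for v in values:
        text = str(v).strip()
        if not text:
            continue  # ignore missing values
        try:
            n = int(text)
        except ValueError:
            return False
        if not low <= n <= high:
            return False
        seen = True
    return seen


print(looks_like_age(["34", "0", "120"]))
print(looks_like_age(["34", "121"]))
```

  • A range check on its own accepts any column of small integers, which is presumably why it is described as one of several methods that can be combined.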
  • The correlation method determines whether the attribute to be classified is a quasi-identifier by determining whether it correlates with a certain quasi-identifier.
  • The Pearson correlation is used when both the attribute to be classified and the quasi-identifier used for the determination are quantitative attributes. If one of the two is a quantitative attribute and the other is a qualitative attribute, the correlation ratio is used.
  • If both the attribute to be classified and the quasi-identifier used for the determination are qualitative attributes, Cramér's coefficient of association (Cramér's V) is used.
  • A qualitative attribute is an attribute that takes a non-numerical value as its attribute value, such as gender.
  • A quantitative attribute is an attribute that takes a numerical value as its attribute value, such as age.
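  • The three correlation measures named above can be sketched in plain Python as follows. The function names are chosen for this example, and the source does not state what threshold decides that a correlation exists. `correlation_ratio` returns the squared form (eta squared).

```python
from collections import Counter, defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation for two quantitative attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / sqrt(sxx * syy)


def correlation_ratio(categories, values):
    """Eta squared: between-group sum of squares over total sum of squares."""
    groups = defaultdict(list)
    for c, v in zip(categories, values):
        groups[c].append(v)
    mean = sum(values) / len(values)
    total = sum((v - mean) ** 2 for v in values)
    if total == 0:
        return 0.0
    between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    return between / total


def cramers_v(xs, ys):
    """Cramér's V for two qualitative attributes."""
    n = len(xs)
    joint, row, col = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))
```

  • Each function returns a value between 0 and 1 in absolute terms, so a single cut-off could in principle be applied to all three; that cut-off would be a design choice outside the source.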
  • As the quasi-identifier used for the determination, age, address, gender, and date of birth can be used, for example. A quasi-identifier whose data distribution is uniform may be excluded from the determination of whether a correlation exists. Doing so reduces errors in determining whether the attribute to be classified is a quasi-identifier.
  • The attribute type classification unit 110 can be configured to assign a type by using one or more of the pattern matching method, the regular expression method, the check digit generation algorithm method, the range check method, and the correlation method. That is, the attribute type classification unit 110 determines the type of each attribute constituting the database by sequentially applying to it one or more methods selected from the pattern matching method, the regular expression method, the check digit generation algorithm method, the range check method, and the correlation method, and then generates and outputs the classification result, which includes each attribute paired with the type assigned to it.
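  • The sequential application of methods can be sketched as a list of detectors tried in order, with the first hit deciding the type and "other" as the fallback. The detectors, their order, the gender list, and the dictionary representation of the database are assumptions made for this example.

```python
import re

DIRECT, QUASI, OTHER = "direct identifier", "quasi-identifier", "other"

GENDERS = {"male", "female"}                     # hypothetical pattern-matching list
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")  # hypothetical regular expression


def is_email(values):
    return all(EMAIL.fullmatch(v) for v in values)


def is_gender(values):
    return all(v.lower() in GENDERS for v in values)


def is_age(values):
    return all(v.isascii() and v.isdigit() and int(v) <= 120 for v in values)


# Each detector is paired with the type it assigns; they are tried in order.
DETECTORS = [(is_email, DIRECT), (is_gender, QUASI), (is_age, QUASI)]


def classify(database):
    """database maps an attribute name to its list of string values.
    Returns a classification result mapping each attribute to a type."""
    result = {}
    for attribute, values in database.items():
        result[attribute] = OTHER
        for detector, attr_type in DETECTORS:
            if values and detector(values):
                result[attribute] = attr_type
                break
    return result


db = {"mail": ["a@b.jp"], "sex": ["male", "female"], "age": ["20", "35"], "note": ["hello"]}
print(classify(db))
```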
  • The anonymized database generation unit 120 takes the database and the classification result output in S110 as inputs, anonymizes, for each attribute constituting the database, the value of the attribute using a method according to the type of the attribute, generates an anonymized database, and outputs it.
  • Item deletion is a method of anonymizing by deleting all the values of the attributes to be anonymized (that is, deleting the attribute items themselves).
  • Temporary ID conversion is a method of anonymizing by converting the value of an attribute to be anonymized into an ID using a hash function or the like.
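  • Temporary ID conversion with a hash function might be sketched as follows. The use of SHA-256, the random salt, and the 16-character truncation are assumptions for this example; the source says only "a hash function or the like".

```python
import hashlib
import secrets


def temporary_ids(values, salt=None):
    """Replace each value with a temporary ID derived from a salted hash.
    Equal inputs map to equal IDs within one call, so records stay linkable."""
    if salt is None:
        salt = secrets.token_bytes(16)  # fresh salt per database (assumption)
    return [
        hashlib.sha256(salt + v.encode("utf-8")).hexdigest()[:16]
        for v in values
    ]


ids = temporary_ids(["Alice", "Bob", "Alice"])
print(ids[0] == ids[2], ids[0] == ids[1])
```

  • A salt is used here because hashing a small, guessable value space such as names without one allows the original values to be recovered by trying candidates.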
  • (1) Deletion: Deletion is a method of anonymizing by deleting some or all of the values of the attribute to be anonymized.
  • (2) Generalization: Generalization is a method of anonymizing by replacing the value of the attribute to be anonymized with a higher-level concept.
  • (3) Rounding: Rounding is a method of anonymizing, when the attribute to be anonymized is a quantitative attribute, by replacing the value of the attribute with a value obtained by rounding it off or rounding it down.
  • (4) Swapping: Swapping is a method of anonymizing by (probabilistically) exchanging the values of the attribute to be anonymized between records.
  • (5) Noise addition: Noise addition is a method of anonymizing, when the attribute to be anonymized is a quantitative attribute, by adding to the value of the attribute a random value generated according to a certain (probability) distribution.
  • (6) Microaggregation: Microaggregation is a method of anonymizing by grouping the values of the attribute to be anonymized and replacing the values in each group with a representative value.
  • (7) Top coding: Top coding is a method of anonymizing, when the attribute to be anonymized is a quantitative attribute, by consolidating particularly large values of the attribute into a single value.
  • (8) Bottom coding: Bottom coding is a method of anonymizing, when the attribute to be anonymized is a quantitative attribute, by consolidating particularly small values of the attribute into a single value.
  • (9) Outlier processing: Outlier processing is a method of anonymizing by deleting peculiar values (outliers) included in the attribute to be anonymized or by applying processing such as top coding and bottom coding to them.
  • (10) Randomization: Randomization is a method of anonymizing by (probabilistically) replacing the value of the attribute to be anonymized with another value.
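  • Several of the methods listed above can be sketched in a few lines each. The function names, the ten-year age bands, the Gaussian noise, and the use of a full shuffle for swapping are simplifying assumptions for this example, not the source's definitions.

```python
import random


def generalize_age(age, width=10):
    """Generalization: replace an age with its band, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def round_down(value, unit=1000):
    """Rounding: round a non-negative integer down to a multiple of `unit`."""
    return (value // unit) * unit


def top_code(values, cap):
    """Top coding: values above `cap` are replaced by `cap`."""
    return [min(v, cap) for v in values]


def bottom_code(values, floor):
    """Bottom coding: values below `floor` are replaced by `floor`."""
    return [max(v, floor) for v in values]


def add_noise(values, scale=1.0, seed=None):
    """Noise addition: add zero-mean Gaussian noise to each value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]


def swap(values, seed=None):
    """Swapping, simplified here to shuffling the column across records."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


print(generalize_age(34), round_down(123456), top_code([10, 95, 130], 90))
```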
  • The anonymized database generation unit 120 can be configured as follows. If the type of an attribute constituting the database is a direct identifier, the attribute is anonymized using either item deletion or temporary ID conversion. If the type is a quasi-identifier, the attribute is anonymized using a method that satisfies k-anonymity. If the type is other, the attribute is anonymized using one or more of deletion, generalization, rounding, swapping, noise addition, microaggregation, top coding, bottom coding, outlier processing, and randomization.
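  • For quasi-identifiers, k-anonymity means that every combination of quasi-identifier values appearing in the database is shared by at least k records. The sketch below only checks that property; it does not show how the source achieves it. The function name and the list-of-dictionaries record format are assumptions for this example.

```python
from collections import Counter


def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


records = [
    {"age": "30-39", "gender": "F", "purchase": "book"},
    {"age": "30-39", "gender": "F", "purchase": "pen"},
    {"age": "40-49", "gender": "M", "purchase": "ink"},
]
print(satisfies_k_anonymity(records, ["age", "gender"], 2))
print(satisfies_k_anonymity(records[:2], ["age", "gender"], 2))
```

  • A generator could apply generalization or deletion to the quasi-identifier columns repeatedly until a check like this one passes.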
  • The anonymized database generation device 200 takes a database as input, generates an anonymized database, and outputs the anonymized database.
  • FIG. 3 is a block diagram showing the configuration of the anonymized database generation device 200.
  • FIG. 4 is a flowchart showing the operation of the anonymized database generation device 200.
  • The anonymized database generation device 200 includes an attribute type classification unit 110, an attribute type correction unit 210, an anonymized database generation unit 120, and a recording unit 190.
  • The recording unit 190 is a component that records, as appropriate, the information necessary for the processing of the anonymized database generation device 200. The recording unit 190 records, for example, the database to be anonymized.
  • The attribute type classification unit 110 takes the database as input, assigns a direct identifier, a quasi-identifier, or other as the type of the attribute to each of the attributes constituting the database, and outputs the classification result.
  • The attribute type correction unit 210 takes the classification result output in S110 as input and, for each attribute constituting the database, when the user determines that the type of the attribute is not appropriate, corrects the type to the type of no processing and outputs the classification result reflecting the correction.
  • The user inputs a correction instruction to the attribute type correction unit 210 using, for example, an input unit (not shown).
  • The anonymized database generation unit 120 takes the database and the classification result output in S210 as inputs, anonymizes, for each attribute constituting the database, the value of the attribute using a method according to the type of the attribute, generates an anonymized database, and outputs it. If the type of an attribute is no processing, the anonymization process is not executed for that attribute.
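  • The correction step and the handling of the no-processing type can be sketched as follows. The function names, the dictionary representation, and the convention that a method returning None models item deletion are assumptions for this example.

```python
NO_PROCESSING = "no processing"


def apply_corrections(classification, rejected_attributes):
    """Return a copy of the classification result in which every attribute
    the user rejected is re-typed as 'no processing'."""
    corrected = dict(classification)
    for attribute in rejected_attributes:
        if attribute in corrected:
            corrected[attribute] = NO_PROCESSING
    return corrected


def generate_anonymized(database, classification, methods):
    """database: attribute -> list of values.
    methods: type -> function taking a list and returning a list, or None
    when the whole item is to be deleted."""
    anonymized = {}
    for attribute, values in database.items():
        attr_type = classification[attribute]
        if attr_type == NO_PROCESSING:
            anonymized[attribute] = list(values)  # passed through unchanged
            continue
        result = methods[attr_type](values)
        if result is not None:
            anonymized[attribute] = result
    return anonymized


db = {"name": ["A", "B"], "memo": ["x", "y"]}
cls = {"name": "direct identifier", "memo": "other"}
methods = {"direct identifier": lambda v: None, "other": lambda v: ["*"] * len(v)}
print(generate_anonymized(db, apply_corrections(cls, ["memo"]), methods))
```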
  • The processing in each component is described below using the database shown in FIG. 5 as an example.
  • The database has seven attributes, identified as (a), (b), (c), (d), (e), (f), and (g).
  • (1) Processing by the attribute type classification unit 110 and the attribute type correction unit 210: The attribute type classification unit 110 determines whether each of the above seven attributes corresponds to a direct identifier, a quasi-identifier, or other, and generates a classification result.
  • The attribute type correction unit 210 generates a classification result corrected based on the correction instruction from the user.
  • For attribute (a), pattern matching using a list of names determines that the attribute is a name, and a direct identifier is assigned as the type of attribute (a).
  • For attribute (b), because the data follows the check digit generation algorithm called Modulus11Weight234567, the attribute is determined to be My Number, and a direct identifier is assigned as the type of attribute (b).
  • For attribute (c), pattern matching using a list of genders determines that the attribute is gender, and a quasi-identifier is assigned as the type of attribute (c).
  • For attribute (d), pattern matching using a list of addresses determines that the attribute is an address, and a quasi-identifier is assigned as the type of attribute (d).
  • For attribute (e), a range check with {0, 1, ..., 119, 120} as the data range determines that the attribute is age, and a quasi-identifier is assigned as the type of attribute (e).
  • For attribute (g), the user determines that the type obtained by the processing in the attribute type classification unit 110 is inappropriate and corrects the type of attribute (g) to no processing.
  • The anonymized database generation unit 120 executes the anonymization processing using the method according to the type obtained in the processing of (1).
  • Attribute (a) and attribute (b) are direct identifiers, so anonymization processing is executed using item deletion.
  • FIG. 6 is a diagram showing an example of a functional configuration of a computer that realizes each of the above-mentioned devices (that is, each node).
  • The processing in each of the above-mentioned devices can be carried out by causing the recording unit 2020 to read a program for causing the computer to function as each of the above-mentioned devices, and operating the control unit 2010, the input unit 2030, the output unit 2040, and the like.
  • The device of the present invention has, for example, as a single hardware entity: an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them.
  • A device (drive) or the like capable of reading from and writing to a recording medium such as a CD-ROM may also be provided in the hardware entity.
  • Physical entities equipped with such hardware resources include general-purpose computers.
  • The external storage device of the hardware entity stores the program required to realize the above-mentioned functions and the data required for processing this program (storage is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). The data obtained by the processing of these programs is stored in the RAM, the external storage device, or the like as appropriate.
  • In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data necessary for processing each program are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate.
  • As a result, the CPU realizes the predetermined functions (the constituent units referred to above as ... unit, ... means, and so on).
  • The present invention is not limited to the above-described embodiments and can be modified as appropriate without departing from the spirit of the present invention. Further, the processes described in the above embodiments need not be executed only in chronological order as described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes them or as otherwise required.
  • When the processing function of the hardware entity (the device of the present invention) described in the above embodiments is realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. By executing this program on the computer, the processing function of the hardware entity is realized on the computer.
  • The program that describes this processing content can be recorded on a computer-readable recording medium.
  • The computer-readable recording medium may be, for example, a magnetic recording device, an optical disk, a magneto-optical recording medium, a semiconductor memory, or the like.
  • Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable) / RW (ReWritable), or the like as the optical disk; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electrically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.
  • The program is distributed, for example, by selling, transferring, or renting a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be stored in the storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.
  • A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. Then, when executing the process, the computer reads the program stored in its own storage device and executes the process according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute the process according to the program, or it may execute the process according to the received program each time the program is transferred from the server computer to this computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which realizes the processing function only through execution instructions and result acquisition, without transferring the program from the server computer to this computer.
  • The program in this embodiment includes information that is used for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).
  • In this embodiment, the hardware entity is configured by executing a predetermined program on the computer, but at least part of the processing content may be realized in hardware.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a technique that enables the generation of anonymously processed information without legal knowledge or knowledge of processing methods. The present invention comprises: an attribute type classification unit that assigns, to each attribute constituting a database, a type that is a direct identifier, a quasi-identifier, or other as the type of the attribute; and an anonymized database generation unit that generates an anonymized database by anonymizing, for each attribute constituting the database, the value of the attribute using a method corresponding to the type of the attribute.
PCT/JP2020/018127 2020-04-28 2020-04-28 Anonymized database generation device, anonymized database generation method, and program WO2021220404A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/018127 WO2021220404A1 (fr) 2020-04-28 2020-04-28 Anonymized database generation device, anonymized database generation method, and program
JP2022518490A JP7405248B2 (ja) 2020-04-28 2020-04-28 Anonymized database generation device, anonymized database generation method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/018127 WO2021220404A1 (fr) 2020-04-28 2020-04-28 Anonymized database generation device, anonymized database generation method, and program

Publications (1)

Publication Number Publication Date
WO2021220404A1 2021-11-04

Family

ID=78332317

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/018127 WO2021220404A1 (fr) Anonymized database generation device, anonymized database generation method, and program 2020-04-28 2020-04-28

Country Status (2)

Country Link
JP (1) JP7405248B2 (fr)
WO (1) WO2021220404A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010127216A2 (fr) * 2009-05-01 2010-11-04 Telcordia Technologies, Inc. Automated determination of quasi-identifiers using program analysis
JP2014013458A (ja) * 2012-07-03 2014-01-23 Hitachi Systems Ltd Service providing method and service providing system
JP2015114871A (ja) * 2013-12-12 2015-06-22 Kddi株式会社 Privacy protection device for public information, privacy protection method for public information, and program

Also Published As

Publication number Publication date
JPWO2021220404A1 (fr) 2021-11-04
JP7405248B2 (ja) 2023-12-26

Similar Documents

Publication Publication Date Title
US10108914B2 (en) Method and system for morphing object types in enterprise content management systems
Hentschel et al. Critical success factors for the implementation and adoption of cloud services in SMEs
US11375015B2 (en) Dynamic routing of file system objects
Mans et al. Business process mining success
TW201423447A Dynamic data masking method and database system
Garcia-Arce et al. Comparison of machine learning algorithms for the prediction of preventable hospital readmissions
US20220335156A1 (en) Dynamic Data Dissemination Under Declarative Data Subject Constraint
Yaghini et al. A hybrid simulated annealing and column generation approach for capacitated multicommodity network design
Zúñiga et al. Master data management maturity model for the microfinance sector in Peru
Cai et al. Improving the efficiency of clinical trial recruitment using an ensemble machine learning to assist with eligibility screening
US20110137987A1 (en) Automatically generating compliance questionnaires
US20180260432A1 (en) Identity management
WO2021220404A1 (fr) Anonymized database generation device, anonymized database generation method, and program
JP2017215868A (ja) Anonymization processing device, anonymization processing method, and program
US20220083604A1 (en) Mapping of personally-identifiable information to a person based on natural language coreference resolution
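JP2022182573A (ja) Accounting processing device, accounting processing method, and accounting processing program
JP7104520B2 (ja) Withholding tax-related business support device, withholding tax-related business support method, and withholding tax-related business support program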
Adkinson Orellana et al. A new approach for dynamic and risk-based data anonymization
JP6927771B2 (ja) Sales management device, sales management method, and sales management program
WO2021220402A1 (fr) Quasi-identifier determination device, quasi-identifier determination method, and program
Saruwatari et al. Estimation of business rules using associations analysis
JP5875535B2 (ja) Anonymization device, anonymization method, and program
US11972021B2 (en) Anonymization apparatus, anonymization method, and program
JP5875536B2 (ja) Anonymization device, anonymization method, and program
US20240005024A1 (en) Order preserving dataset obfuscation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20934147

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022518490

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20934147

Country of ref document: EP

Kind code of ref document: A1