KR20180120443A - Improved K-anonymity Model based Dataset De-identification Method and Apparatus
- Publication number
- KR20180120443A
- Authority
- KR
- South Korea
- Prior art keywords
- data set
- node
- anonymity
- record
- improved
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2246—Trees, e.g. B+trees
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6254—Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Bioethics (AREA)
- Medical Informatics (AREA)
- Computer Hardware Design (AREA)
- Computer Security & Cryptography (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a data set de-identification method and apparatus using an improved K-anonymity model. The method comprises: setting a value corresponding to the K-anonymity requirement; sequentially converting the data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the lattice, which is composed of a combination of category hierarchy trees; if a record set satisfying the K-anonymity requirement exists, extracting and separating that record set from the data set; and generating a de-identified data set including the extracted record sets. According to the present invention, a data set can be de-identified so as to satisfy the K-anonymity requirement while minimizing information loss. In addition, the data processing time for de-identification can be shortened, because the de-identified data set is generated by separating and extracting record sets that satisfy the K-anonymity requirement.
Description
Field of the Invention The present invention relates to a method and apparatus for data set de-identification, and more particularly to a method and apparatus for data set de-identification using an improved K-anonymity model.
Personal information de-identification refers to measures that delete or replace some or all of the personal information in data containing personal information, so that big data can be utilized while making it difficult to identify an individual even when the data is combined with other information. For example, deleting or replacing commonly used values that can directly identify an individual, such as a name or resident registration number, is called personal information de-identification.
However, if the de-identified data contains a unique value, it may become easy to infer who the data subject is by combining the data with other disclosed information. Identifying an individual by combining data with other public information in this way is called a link attack.
K-anonymity means that, in a published data set, at least k records exist that share the same values for quasi-identifier attributes such as age and area of residence. K-anonymity is required to prevent re-identification of de-identified personal information.
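The link attack described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `link_attack`, the field names, and all data values are hypothetical and do not come from the patent.

```python
def link_attack(released, public, quasi_identifiers):
    """Map the index of each released record to a name when its
    quasi-identifier values match exactly one record in the public data."""
    matches = {}
    for i, rec in enumerate(released):
        key = tuple(rec[q] for q in quasi_identifiers)
        candidates = [p for p in public
                      if tuple(p[q] for q in quasi_identifiers) == key]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches[i] = candidates[0]["name"]
    return matches


# Hypothetical released data: direct identifiers removed, quasi-identifiers kept.
released = [
    {"age": 34, "zip": "13053", "diagnosis": "flu"},
    {"age": 52, "zip": "14210", "diagnosis": "asthma"},
]
# Hypothetical public data that still carries names.
public = [
    {"name": "Resident A", "age": 34, "zip": "13053"},
    {"name": "Resident B", "age": 52, "zip": "14210"},
    {"name": "Resident C", "age": 52, "zip": "14210"},
]

reidentified = link_attack(released, public, ("age", "zip"))
```

In this toy data the first released record is linked to a single person, while the second matches two candidates and stays ambiguous, which is the situation K-anonymity aims to guarantee for every record.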
Figure 1 illustrates a data set that satisfies the K-anonymity requirement.
Referring to FIG. 1, the values of the quasi-identifier attributes age, sex, and zip code are the same for records t1 and t2, for records t3 and t4, and for records t5, t6, and t7. That is, for every record there are at least two records with the same quasi-identifier attribute values. Thus, the data set satisfies the K-anonymity requirement for a K value of 2.
In general, existing data set de-identification methods using the K-anonymity model define a hierarchical lattice consisting of a combination of category hierarchy trees for each quasi-identifier attribute, and de-identify the data set by repeating a data categorization step and a K-anonymity satisfaction check.
However, as the process moves to upper nodes of the hierarchical lattice, the quasi-identifier attribute values become more generalized, that is, the degree of de-identification increases. The risk of privacy invasion becomes lower, but more information is lost.
SUMMARY OF THE INVENTION Accordingly, it is an object of the present invention to provide a method and apparatus for data set de-identification using an improved K-anonymity model that satisfies the K-anonymity requirement while minimizing the loss of information.
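A K-anonymity check of the kind described for FIG. 1 can be sketched as follows. The function name `satisfies_k_anonymity` and the record values are hypothetical; only the grouping of t1 to t7 follows the description above.

```python
from collections import Counter


def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in the records occurs in at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


# Invented values; the groups {t1, t2}, {t3, t4}, {t5, t6, t7} mirror the text.
records = [
    {"id": "t1", "age": "20~29", "sex": "M", "zip": "130**"},
    {"id": "t2", "age": "20~29", "sex": "M", "zip": "130**"},
    {"id": "t3", "age": "30~39", "sex": "F", "zip": "131**"},
    {"id": "t4", "age": "30~39", "sex": "F", "zip": "131**"},
    {"id": "t5", "age": "40~49", "sex": "M", "zip": "140**"},
    {"id": "t6", "age": "40~49", "sex": "M", "zip": "140**"},
    {"id": "t7", "age": "40~49", "sex": "M", "zip": "140**"},
]

ok_for_2 = satisfies_k_anonymity(records, ("age", "sex", "zip"), 2)
ok_for_3 = satisfies_k_anonymity(records, ("age", "sex", "zip"), 3)
```

With these groups the check passes for k = 2 and fails for k = 3, because two of the groups contain only two records.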
According to an aspect of the present invention, there is provided a data set de-identification method using an improved K-anonymity model, comprising the steps of: (a) setting a value corresponding to the K-anonymity requirement; (b) sequentially converting the data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the lattice, which is composed of a combination of generalization hierarchy trees for each quasi-identifier attribute of the data set, and, if a record set satisfying the K-anonymity requirement exists, extracting and separating the record set from the converted data set; and (c) generating a de-identified data set including the extracted and separated record sets.
The step (b) may include the steps of: (b-1) converting the data set according to the generalization condition of the n-th node; (b-2) checking whether the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement; and (b-3) if a record set satisfying the K-anonymity requirement exists in the data set converted according to the generalization condition of the n-th node, extracting and separating that record set.
The method may stop step (b) if the data set converted at the n-th node satisfies the K-anonymity requirement.
In that case, the de-identified data set may comprise the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node.
The method may also stop step (b) if the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number.
In that case, the de-identified data set may comprise the records extracted from the first node to the n-th node.
The step (b) may be performed repeatedly, selecting the next node in a breadth-first manner.
The record set may be a set of records whose quasi-identifier attribute values are the same.
The method may further comprise generating a hierarchical lattice consisting of a combination of generalization hierarchy trees for each quasi-identifier attribute of the data set.
According to another aspect of the present invention, there is provided a data set de-identification apparatus using an improved K-anonymity model, comprising: an input unit configured to set a value corresponding to the K-anonymity requirement; and a data de-identification unit configured to convert the data set according to the generalization condition of each node of a hierarchical lattice sequentially from the first node of the lattice, which is composed of a combination of generalization hierarchy trees, to extract and separate a record set from the data set if a record set satisfying the K-anonymity requirement exists, and to generate a de-identified data set including the extracted and separated record sets.
The data de-identification unit may convert the data set according to the generalization condition of the n-th node, check whether the converted data set satisfies the K-anonymity requirement, and, if a record set satisfying the K-anonymity requirement exists in the converted data set, extract and separate that record set.
If the data set converted at the n-th node satisfies the K-anonymity requirement, the data de-identification unit may generate the de-identified data set from the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node.
If the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number, the data de-identification unit may generate the de-identified data set from the records extracted from the first node to the n-th node.
The data de-identification unit may repeat, selecting the next node in a breadth-first manner, the operation of converting the data set according to the generalization condition of the selected node and extracting and separating a record set that satisfies the K-anonymity requirement.
The apparatus may further include a hierarchical lattice generator for generating a hierarchical lattice consisting of a combination of generalization hierarchy trees for each quasi-identifier attribute of the data set.
According to the present invention, a data set can be de-identified so as to satisfy the K-anonymity requirement while minimizing information loss. In addition, the data processing time for de-identification can be shortened, because the de-identified data set is generated by separating and extracting record sets that satisfy the K-anonymity requirement.
Figure 1 illustrates a data set that satisfies the K-anonymity requirement.
FIG. 2 is a block diagram illustrating a configuration of a data set de-identification apparatus using an improved K-anonymity model in accordance with an embodiment of the present invention.
FIG. 3 is a diagram illustrating a generalization hierarchy tree for each quasi-identifier attribute according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a hierarchical lattice consisting of a combination of hierarchy trees according to an embodiment of the present invention.
FIG. 5 is a flow chart provided to illustrate the operation of the data set de-identification apparatus using an improved K-anonymity model in accordance with an embodiment of the present invention.
FIG. 6 is a diagram illustrating an original data set to be de-identified.
FIGS. 7 to 9 are diagrams illustrating data sets converted according to the generalization condition of each node of the hierarchical grid.
FIG. 10 is a diagram illustrating a finally generated non-identified data set.
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry out the present invention.
FIG. 2 is a block diagram illustrating a configuration of a data set de-identification apparatus using an improved K-anonymity model in accordance with an embodiment of the present invention.
Referring to FIG. 2, a data set non-identifying
The data set non-identifying
The
FIG. 3 is a diagram illustrating a generalization hierarchy tree for each quasi-identifier attribute according to an embodiment of the present invention.
In FIG. 3, 'gender' is layered into level 0 (S0) and level 1 (S1) in a generalization hierarchy tree structure: 'male' and 'female' are distinguished at level 0 (S0), and the value is masked with '*' at level 1 (S1). 'Age' is layered into level 0 (A0), level 1 (A1), and level 2 (A2): the actual age is used at level 0 (A0), the age is categorized into ranges such as '40~49' at level 1 (A1), and the value is masked with '*' at the next higher level 2 (A2). Finally, 'zip code' is layered into level 0 (Z0), level 1 (Z1), level 2 (Z2), and level 3 (Z3): the value is categorized in units of 500 at level 1 (Z1), categorized in units of 1,000 at level 2 (Z2), and masked with '*' at the last level 3 (Z3). Of course, the generalization hierarchy tree for each quasi-identifier attribute may also be given fewer or more levels than illustrated in FIG. 3.
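The three generalization hierarchies described for FIG. 3 can be sketched as plain functions. This is a sketch under assumptions: the function names, the use of integer ages and zip codes, the 10-year width of the age ranges, and the `lo~hi` label format are illustrative choices, not details confirmed by the patent.

```python
def generalize_sex(value, level):
    """Level 0 keeps 'M'/'F'; level 1 masks the value."""
    return value if level == 0 else "*"


def generalize_age(value, level):
    """Level 0 keeps the actual age; level 1 uses a 10-year range
    (assumed width); level 2 masks the value."""
    if level == 0:
        return str(value)
    if level == 1:
        lo = (value // 10) * 10
        return f"{lo}~{lo + 9}"
    return "*"


def generalize_zip(value, level):
    """Level 0 keeps the zip code; levels 1 and 2 bucket it in units of
    500 and 1,000; level 3 masks the value."""
    if level == 0:
        return f"{value:05d}"
    if level == 3:
        return "*"
    width = 500 if level == 1 else 1000
    lo = (value // width) * width
    return f"{lo:05d}~{lo + width - 1:05d}"


def generalize_record(record, node):
    """Apply a lattice node (sex_level, age_level, zip_level) to one record."""
    sex_level, age_level, zip_level = node
    return {
        **record,
        "sex": generalize_sex(record["sex"], sex_level),
        "age": generalize_age(record["age"], age_level),
        "zip": generalize_zip(record["zip"], zip_level),
    }


example = generalize_record({"sex": "M", "age": 43, "zip": 14250}, (1, 1, 1))
```

A node such as (1, 1, 1) then corresponds to the generalization condition written as <S1, A1, Z1> in the text.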
The
The
The
The hierarchical lattice according to the present invention can be defined as a graph representing all possible combinations of the generalization hierarchy tree levels of the quasi-identifier attributes. Each node in the hierarchical lattice represents one possible combination of generalization hierarchy tree levels, one level per quasi-identifier attribute.
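As a worked example, assuming the lattice contains every combination of levels and taking the top levels of the hierarchies in FIG. 3 as h_S = 1 for gender, h_A = 2 for age, and h_Z = 3 for zip code (the symbols h and L are introduced here only for illustration), the number of lattice nodes would be:

```latex
|\mathcal{L}| = \prod_{i \in \{S, A, Z\}} (h_i + 1) = (1 + 1)(2 + 1)(3 + 1) = 24
```

The total generalization level of a node ranges from 0 at the bottom node to h_S + h_A + h_Z = 6 at the top node.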
FIG. 4 is a diagram illustrating a hierarchical grid consisting of a combination of hierarchical trees according to an embodiment of the present invention.
The hierarchical lattice as illustrated in FIG. 4 can be generated for the generalized hierarchical tree according to the quasi identifier attributes of 'sex', 'age', and 'postal code' illustrated in FIG.
In FIG. 4, the lowest node <S0, A0, Z0>, in which every generalization level is '0', is located at the bottom. The generalization level increases by '1' toward the upper nodes, and the top node <S1, A2, Z3> is illustrated at the topmost position.
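One way to enumerate such a lattice from the bottom node upward in breadth-first order is sketched below. The names `lattice_bfs_order` and `label` are hypothetical, and the order in which nodes of the same total level are visited is an arbitrary choice of this sketch.

```python
from collections import deque


def lattice_bfs_order(max_levels):
    """Visit every node of the generalization lattice breadth-first,
    starting from the node in which every level is 0."""
    start = tuple(0 for _ in max_levels)
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        # A parent raises exactly one attribute by one level.
        for i, top in enumerate(max_levels):
            if node[i] < top:
                parent = node[:i] + (node[i] + 1,) + node[i + 1:]
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
    return order


def label(node):
    """Format a node in the <S, A, Z> notation used in the text."""
    return "<S{}, A{}, Z{}>".format(*node)


# Top levels of the gender, age, and zip code hierarchies described for FIG. 3.
order = lattice_bfs_order((1, 2, 3))
```

For the hierarchies of FIG. 3 this yields 24 nodes, beginning with <S0, A0, Z0>, followed by the three nodes of total level 1 in some order, and ending with <S1, A2, Z3>.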
The data non-identifying
FIG. 5 is a flow chart provided to illustrate the operation of the data set de-identification apparatus using an improved K-anonymity model in accordance with an embodiment of the present invention.
Referring to FIG. 5, the
Next, the
The data
If no record set satisfying the K-anonymity requirement exists in the converted data set (S560-N), or if the number of records remaining in the data set after extraction is not smaller than the K value (S580-N), steps S540 to S580 are repeated for the (n+1)-th node of the hierarchical lattice.
If the number of records remaining in the data set is smaller than the K value (S580-Y), a de-identified data set including the extracted records is generated (S590).
If the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement as a whole (S550-Y), the de-identified data set is generated from the records extracted from the first node to the (n-1)-th node and the records included in the data set converted according to the generalization condition of the n-th node (S590).
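The loop described by the flow chart can be sketched as follows. This is a minimal sketch under assumptions, not the patent's implementation: the function names, the two-attribute demo hierarchy, and the sample records are hypothetical, and the sketch assumes that each node's generalization is applied to the original values of the records that have not yet been extracted.

```python
from collections import defaultdict


def improved_k_anonymize(records, quasi_identifiers, node_order, generalize, k):
    """Walk the lattice nodes in order; at each node release the record sets
    that already satisfy k-anonymity and carry the rest to the next node."""
    remaining = list(records)   # original values of records not yet extracted
    released = []               # generalized records in the output data set
    for node in node_order:
        groups = defaultdict(list)
        for rec in remaining:   # convert the data set for this node
            gen = generalize(rec, node)
            groups[tuple(gen[q] for q in quasi_identifiers)].append((rec, gen))
        leftover = []
        for members in groups.values():
            if len(members) >= k:          # a qualifying record set: extract it
                released.extend(gen for _, gen in members)
            else:
                leftover.extend(rec for rec, _ in members)
        remaining = leftover
        # Empty leftover: the whole converted set qualified.
        # Fewer than k left over: no further record set can be formed.
        if len(remaining) < k:
            break
    return released, remaining


def demo_generalize(record, node):
    """Hypothetical two-attribute hierarchy: node = (sex_level, age_level)."""
    sex_level, age_level = node
    out = dict(record)
    if sex_level >= 1:
        out["sex"] = "*"
    if age_level >= 1:
        lo = record["age"] // 10 * 10
        out["age"] = f"{lo}~{lo + 9}"
    return out


records = [
    {"id": 1, "sex": "M", "age": 23},
    {"id": 2, "sex": "M", "age": 27},
    {"id": 3, "sex": "F", "age": 41},
    {"id": 4, "sex": "M", "age": 45},
    {"id": 5, "sex": "F", "age": 62},
]
node_order = [(0, 0), (0, 1), (1, 0), (1, 1)]
released, remaining = improved_k_anonymize(
    records, ("sex", "age"), node_order, demo_generalize, k=2)
```

In this toy run, records 1 and 2 are released at node (0, 1) with their sex values intact, records 3 and 4 are released at node (1, 1) with sex masked, and record 5 is left over because fewer than k records remain. The sketch returns the leftover records so that the caller can decide how to handle them.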
The following describes the process of generating the de-identified data set by setting the K value to 2 for the original data set illustrated in FIG. 6 and applying the improved K-anonymity algorithm.
FIG. 6 is a diagram illustrating an original data set to be de-identified; FIGS. 7 to 9 are diagrams illustrating data sets converted according to the generalization condition of each node of the hierarchical lattice; and FIG. 10 is a diagram illustrating the finally generated de-identified data set.
Referring again to FIG. 5, with n = 1 (S530), the generalization condition of the first node <S0, A0, Z0>, the lowest node of the hierarchical lattice, is applied. Under this condition the quasi-identifier attribute values of the original data set shown in FIG. 6 are kept without data conversion. That is, step S540 may be omitted for the first node.
Since the original data set illustrated in FIG. 6 does not satisfy the K-anonymity requirement (S550-N) and contains no record set satisfying the K-anonymity requirement (S560-N), n is changed to 2 (S565).
After the first node <S0, A0, Z0>, one of <S0, A0, Z1>, <S0, A1, Z0>, and <S1, A0, Z0> can be selected as the second node according to the breadth-first principle. Which of <S0, A0, Z1>, <S0, A1, Z0>, and <S1, A0, Z0> is selected can be determined differently depending on the embodiment. In this example, <S0, A1, Z0> is taken as the second node.
When the data is transformed by applying the generalization condition of the node < S0, A1, Z0 > to the original data set of Fig. 6, the quasi identifier attribute value can be transformed as illustrated in Fig.
The data set illustrated in FIG. 7 as a whole does not satisfy the K-anonymity requirement (S550-N). However, the three records (1, 4, 8) have the same semi-identifier attribute values as 'M', '20 to 30', and '13053'. That is, the first record set consisting of three records (1, 4, 8) satisfies the K-anonymity requirement when the K value is 2 (S560-Y). Accordingly, the data
When the data conversion is performed by applying the generalization condition of the node < S1, A1, Z1 > to the data set in which the three
The data set illustrated in FIG. 8 as a whole does not satisfy the K-anonymity requirement (S550-N). However, in the four records (2, 3, 9, 10), the value of the semi-identifier attribute is the same as '*', '41-50', and '14000 ~ 14500'. The attribute values of the two
When the second record set and the third record set are separated from the data set illustrated in FIG. 8, the number of records remaining in the data set is one, which is smaller than the K value (2) (S580-Y). Therefore, the data
Referring to FIG. 10, the de-identified data set satisfies the K-anonymity requirement because, for every record, there are at least two records having the same quasi-identifier attribute values. In addition, since records (1, 4, 8) retain their gender attribute values without loss, it can be seen that less information is lost than when de-identification is performed by the existing K-anonymity algorithm.
Embodiments of the present invention include a computer-readable medium containing program instructions for performing various computer-implemented operations. The medium records a program for executing the method described above. The medium may include program instructions, data files, data structures, and the like, alone or in combination. Examples of such media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CDs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM and RAM. The medium may also be a transmission medium, such as an optical or metal line or a waveguide, including a carrier wave that transmits a signal specifying program instructions, data structures, and the like. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments.
100: Data set de-identification apparatus
110: input unit
130: Database
150:
170:
190:
Claims (14)
(b) sequentially converting the data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the lattice, which is composed of a combination of category hierarchy trees for each quasi-identifier attribute of the data set, and, if a record set satisfying the K-anonymity requirement exists, extracting and separating the record set from the converted data set; and
(c) generating a de-identified data set including the extracted and separated record set,
in a data set de-identification method using an improved K-anonymity model.
Wherein the step (b) comprises:
(b-1) converting the data set according to the generalization condition of the n-th node;
(b-2) checking whether the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement; and
(b-3) if a record set satisfying the K-anonymity requirement exists in the data set converted according to the generalization condition of the n-th node, extracting and separating the record set,
in the data set de-identification method using an improved K-anonymity model.
Wherein the step (b) is stopped if the data set converted at the n-th node satisfies the K-anonymity requirement, and
wherein the de-identified data set comprises
the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node, in the data set de-identification method using an improved K-anonymity model.
Wherein the step (b) is stopped if the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number, and
wherein the de-identified data set comprises
the records extracted from the first node to the n-th node.
Wherein the step (b) is performed repeatedly by selecting the next node in a breadth-first manner, in the data set de-identification method using an improved K-anonymity model.
Wherein the record set is a set of records in which the values of the quasi-identifier attributes are the same.
Further comprising a step of generating a hierarchical lattice consisting of a combination of category hierarchy trees for each quasi-identifier attribute of the data set,
in the data set de-identification method using an improved K-anonymity model.
A data de-identification unit that sequentially converts the data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the lattice, which is composed of a combination of category hierarchy trees for each quasi-identifier attribute of the data set, extracts and separates a record set from the converted data set if a record set satisfying the K-anonymity requirement exists, and generates a de-identified data set including the extracted and separated record set,
in a data set de-identification apparatus using an improved K-anonymity model.
Wherein the data de-identification unit
converts the data set according to the generalization condition of the n-th node, checks whether the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement, and extracts and separates a record set satisfying the K-anonymity requirement if one exists in the converted data set, in the data set de-identification apparatus using an improved K-anonymity model.
Wherein the data de-identification unit,
if the data set converted at the n-th node satisfies the K-anonymity requirement, generates the de-identified data set from the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node, in the data set de-identification apparatus using an improved K-anonymity model.
Wherein the data de-identification unit,
if the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number, generates the de-identified data set from the records extracted from the first node to the n-th node, in the data set de-identification apparatus using an improved K-anonymity model.
Wherein the data de-identification unit
repeats, selecting the next node in a breadth-first manner, the operation of converting the data set according to the generalization condition of the selected node and extracting and separating a record set satisfying the K-anonymity requirement, in the data set de-identification apparatus using an improved K-anonymity model.
Wherein the set of records is a set of records in which the values of the quasi-identifier attributes are the same.
Further comprising a hierarchical lattice generator for generating a hierarchical lattice consisting of a combination of category hierarchy trees for each quasi-identifier attribute of the data set,
in the data set de-identification apparatus using an improved K-anonymity model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170054395A KR20180120443A (en) | 2017-04-27 | 2017-04-27 | Improved K-anonymity Model based Dataset De-identification Method and Apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170054395A KR20180120443A (en) | 2017-04-27 | 2017-04-27 | Improved K-anonymity Model based Dataset De-identification Method and Apparatus |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020190134678A Division KR20190124195A (en) | 2019-10-28 | 2019-10-28 | Improved K-anonymity Model based Dataset De-identification Method and Apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20180120443A (en) | 2018-11-06 |
Family
ID=64329538
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020170054395A KR20180120443A (en) | 2017-04-27 | 2017-04-27 | Improved K-anonymity Model based Dataset De-identification Method and Apparatus |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20180120443A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20210109299A (en) | 2020-02-27 | 2021-09-06 | 호서대학교 산학협력단 | Method for distributed de-identification of large graph data |
-
2017
- 2017-04-27 KR KR1020170054395A patent/KR20180120443A/en active Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E601 | Decision to refuse application | ||
A107 | Divisional application of patent |