KR20180120443A - Improved K-anonymity Model based Dataset De-identification Method and Apparatus

Info

Publication number: KR20180120443A
Application number: KR1020170054395A
Authority: KR
Inventors: 심기창; 김동례
Original assignee: (주)이지서티
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2018-11-06

Abstract

The present invention relates to a data set de-identification method and apparatus using an improved K-anonymity model. The method comprises: setting a value corresponding to the K-anonymity requirement; sequentially converting a data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the lattice, which is composed of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set, and, if there is a record set satisfying the K-anonymity requirement, extracting and separating that record set from the data set; and generating a de-identified data set including the extracted record set. According to the present invention, a data set can be de-identified so as to satisfy the K-anonymity requirement while minimizing information loss. In addition, because the de-identified data set is generated by extracting and separating record sets that satisfy the K-anonymity requirement, the data processing time for de-identification can be shortened.

Description

METHOD AND APPARATUS FOR DATA SET DE-IDENTIFICATION USING AN IMPROVED K-ANONYMITY MODEL

Field of the Invention The present invention relates to a method and apparatus for data set de-identification, and more particularly to a method and apparatus for data set de-identification using an improved K-anonymity model.

De-identification of personal information refers to measures that delete or replace some or all of the personal information in data containing personal information, so that big data can be utilized while making it difficult to identify an individual even when the data is combined with other information. For example, deleting or replacing data that can directly identify an individual, such as a name or resident registration number, is a commonly used de-identification measure.

However, if the de-identified data contains unique values, it may be easy to infer who the data subject is by combining the data with other disclosed information. Identifying an individual by combining data with other public information in this way is called a linkage attack.

K-anonymity requires that, in a published data set, at least K records exist with the same quasi-identifier attribute values, such as age and residence area. K-anonymity is required in order to prevent re-identification of de-identified personal information.

Figure 1 illustrates a data set that satisfies the K-anonymity requirement.

Referring to FIG. 1, the values of the quasi-identifier attributes age, sex, and zip code are the same for records t1 and t2, for records t3 and t4, and for records t5, t6, and t7. That is, for every record there are at least two records with the same quasi-identifier attribute values. Thus, the data set satisfies the K-anonymity requirement for a K value of 2.
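The check described for FIG. 1 can be sketched in a few lines of Python. This is only an illustration: the function name `satisfies_k_anonymity` and the attribute values are assumptions, since the text gives the grouping of records t1 to t7 but not their actual values.

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in counts.values())

# Hypothetical values; only the grouping (t1,t2), (t3,t4), (t5,t6,t7) comes from the text.
records = [
    {"id": "t1", "age": "20~29", "sex": "M", "zip": "13000~13499"},
    {"id": "t2", "age": "20~29", "sex": "M", "zip": "13000~13499"},
    {"id": "t3", "age": "30~39", "sex": "F", "zip": "14000~14499"},
    {"id": "t4", "age": "30~39", "sex": "F", "zip": "14000~14499"},
    {"id": "t5", "age": "40~49", "sex": "M", "zip": "14500~14999"},
    {"id": "t6", "age": "40~49", "sex": "M", "zip": "14500~14999"},
    {"id": "t7", "age": "40~49", "sex": "M", "zip": "14500~14999"},
]
QUASI_IDENTIFIERS = ("age", "sex", "zip")

print(satisfies_k_anonymity(records, QUASI_IDENTIFIERS, 2))
print(satisfies_k_anonymity(records, QUASI_IDENTIFIERS, 3))
```

With groups of size 2, 2, and 3, the data set passes for K = 2 but not for K = 3, because the smallest group has only two records.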

In general, an existing data set de-identification method using the K-anonymity model defines a hierarchical lattice consisting of a combination of the generalization hierarchy trees of the quasi-identifier attributes, and de-identifies the data set by repeating data generalization and a K-anonymity satisfaction check.

However, as one moves to upper nodes of the hierarchical lattice, the quasi-identifier attribute values become more generalized, that is, the degree of de-identification increases. The risk of privacy invasion therefore becomes lower, but the more generalized values carry less information.

SUMMARY OF THE INVENTION Accordingly, an object of the present invention is to provide a data set de-identification method and apparatus using an improved K-anonymity model that satisfy the K-anonymity requirement while minimizing information loss.

According to an aspect of the present invention, there is provided a data set de-identification method using an improved K-anonymity model, comprising the steps of: (a) setting a value corresponding to the K-anonymity requirement; (b) sequentially converting a data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the lattice, which is composed of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set, and, if there is a record set satisfying the K-anonymity requirement, extracting and separating the record set from the converted data set; and (c) generating a de-identified data set including the extracted and separated record set.

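Step (b) may include the steps of: (b-1) converting the data set according to the generalization condition of an n-th node; (b-2) checking whether the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement; and (b-3) extracting a record set satisfying the K-anonymity requirement if such a record set exists in the data set converted according to the generalization condition of the n-th node.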

The method may stop step (b) if the data set converted at the n-th node satisfies the K-anonymity requirement.

In this case, the de-identified data set may consist of the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node.

The method may stop step (b) if the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number.

In this case, the de-identified data set may consist of the records extracted from the first node to the n-th node.

Step (b) may be performed repeatedly by selecting the next node in a breadth-first manner.

The record set may be a set of records whose quasi-identifier attribute values are the same.

The method may further comprise generating a hierarchical lattice consisting of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set.

According to another aspect of the present invention, there is provided a data set de-identification apparatus using an improved K-anonymity model, comprising: an input unit for setting a value corresponding to the K-anonymity requirement; and a data de-identification unit that sequentially converts a data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the lattice, which is composed of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set, extracts and separates a record set from the converted data set if there is a record set satisfying the K-anonymity requirement, and generates a de-identified data set including the extracted and separated record set.

The data de-identification unit may convert the data set according to the generalization condition of the n-th node, check whether the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement, and extract and separate a record set satisfying the K-anonymity requirement if such a record set exists in the converted data set.

If the data set converted at the n-th node satisfies the K-anonymity requirement, the data de-identification unit may generate the de-identified data set from the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node.

If the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number, the data de-identification unit may generate the de-identified data set from the records extracted from the first node to the n-th node.

The data de-identification unit may repeat, while selecting the next node in a breadth-first manner, the operation of converting the data set according to the generalization condition of the selected node and extracting and separating a record set satisfying the K-anonymity requirement.

The apparatus may further include a hierarchical lattice generation unit for generating a hierarchical lattice consisting of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set.

According to the present invention, a data set can be de-identified so as to satisfy the K-anonymity requirement while minimizing information loss. In addition, because the de-identified data set is generated by extracting and separating record sets that satisfy the K-anonymity requirement, the data processing time for de-identification can be shortened.

FIG. 1 is a diagram illustrating a data set that satisfies the K-anonymity requirement.
FIG. 2 is a block diagram illustrating the configuration of a data set de-identification apparatus using an improved K-anonymity model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating generalization hierarchy trees for quasi-identifier attributes according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a hierarchical lattice consisting of a combination of generalization hierarchy trees according to an embodiment of the present invention.
FIG. 5 is a flow chart illustrating the operation of the data set de-identification apparatus using an improved K-anonymity model according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating an original data set to be de-identified.
FIGS. 7 to 9 are diagrams illustrating data sets converted according to the generalization condition of each node of the hierarchical lattice.
FIG. 10 is a diagram illustrating a finally generated de-identified data set.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily carry out the present invention.

FIG. 2 is a block diagram illustrating the configuration of a data set de-identification apparatus using an improved K-anonymity model according to an embodiment of the present invention.

Referring to FIG. 2, the data set de-identification apparatus 100 according to an embodiment of the present invention de-identifies a data set using an improved K-anonymity model.

The data set de-identification apparatus 100 may include an input unit 110, a database 130, a control unit 150, a hierarchical lattice generation unit 170, and a data de-identification unit 190.

The input unit 110 may receive from the user a K value corresponding to the K-anonymity requirement. Depending on the embodiment, the input unit 110 may also receive from the user the generalization hierarchy tree data for each quasi-identifier attribute to be applied when the original data set stored in the database 130 is de-identified. Here, a quasi-identifier is an attribute, such as a zip code or age, that cannot directly identify an individual but from which the individual can be inferred.

FIG. 3 is a diagram illustrating generalization hierarchy trees for quasi-identifier attributes according to an embodiment of the present invention.

In FIG. 3, 'gender' is layered into level 0 (S0) and level 1 (S1): 'male' and 'female' are distinguished at level 0 (S0), and the value is masked with '*' at level 1 (S1). 'Age' is layered into level 0 (A0), level 1 (A1), and level 2 (A2): the actual age is used at level 0 (A0), the age is categorized into ranges such as '30~39' and '40~49' at level 1 (A1), and the value is masked with '*' at the next higher level 2 (A2). Finally, 'zip code' is layered into level 0 (Z0), level 1 (Z1), level 2 (Z2), and level 3 (Z3): the actual zip code is used at level 0 (Z0), the zip code is categorized in units of 500 at level 1 (Z1) and in units of 1,000 at level 2 (Z2), and the value is masked with '*' at the last level 3 (Z3). Of course, the generalization hierarchy tree of each quasi-identifier attribute may have fewer or more levels than illustrated in FIG. 3.
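The generalization hierarchy trees of FIG. 3 can be thought of as one function per quasi-identifier attribute that maps a raw value and a level to a generalized value. The sketch below is an assumption-laden illustration, not the patented implementation: the function names, the integer encoding of age and zip code, and the exact range labels (non-overlapping bounds such as `13000~13499`) are all hypothetical choices.

```python
def generalize_sex(value, level):
    """S0 keeps the value ('M' or 'F'); S1 masks it with '*'."""
    if level not in (0, 1):
        raise ValueError("sex has levels 0..1")
    return value if level == 0 else "*"

def generalize_age(age, level):
    """A0 keeps the actual age; A1 uses 10-year ranges; A2 masks with '*'."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}~{low + 9}"
    if level == 2:
        return "*"
    raise ValueError("age has levels 0..2")

def generalize_zip(zip_code, level):
    """Z0 keeps the zip code; Z1 bins by 500; Z2 bins by 1,000; Z3 masks with '*'."""
    if level == 0:
        return f"{zip_code:05d}"
    if level in (1, 2):
        width = 500 if level == 1 else 1000
        low = (zip_code // width) * width
        return f"{low:05d}~{low + width - 1:05d}"
    if level == 3:
        return "*"
    raise ValueError("zip code has levels 0..3")

def generalize_record(record, node):
    """Apply a lattice node (sex_level, age_level, zip_level) to one record."""
    sex_level, age_level, zip_level = node
    return {
        "sex": generalize_sex(record["sex"], sex_level),
        "age": generalize_age(record["age"], age_level),
        "zip": generalize_zip(record["zip"], zip_level),
    }

# Node <S0, A1, Z0>: only the age is generalized.
print(generalize_record({"sex": "M", "age": 34, "zip": 13053}, (0, 1, 0)))
```

Representing a generalization condition as a tuple of levels keeps each node of the lattice self-describing, so the same function can be reused for every node.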

The database 130 builds and stores, as a database, the data processed or managed by the data set de-identification apparatus 100, and provides functions for collecting, storing, and managing data sets, so-called big data. In particular, the database 130 according to the present invention may store an original data set that has not been de-identified and a de-identified data set.

The control unit 150 controls the overall operation of the data set de-identification apparatus 100. Specifically, the control unit 150 may control the data de-identification unit 190 to de-identify the target data set by applying the improved K-anonymity algorithm according to the K value set by the user through the input unit 110. In addition, the control unit 150 may control the hierarchical lattice generation unit 170 to generate a hierarchical lattice based on the generalization hierarchy tree data set or input by the user for each quasi-identifier attribute. Of course, depending on the embodiment, an already generated hierarchical lattice may be input through the input unit 110 and used.

The hierarchical lattice generation unit 170 generates a hierarchical lattice consisting of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set.

The hierarchical lattice according to the present invention can be defined as a graph representing all possible combinations of the levels of the generalization hierarchy trees of the quasi-identifier attributes. Each node of the hierarchical lattice represents one possible combination of generalization hierarchy tree levels across the quasi-identifier attributes.

FIG. 4 is a diagram illustrating a hierarchical grid consisting of a combination of hierarchical trees according to an embodiment of the present invention.

The hierarchical lattice illustrated in FIG. 4 can be generated from the generalization hierarchy trees of the quasi-identifier attributes 'sex', 'age', and 'zip code' illustrated in FIG. 3.

In FIG. 4, the lowest node <S0, A0, Z0>, whose generalization levels are all '0', is located at the bottom, nodes whose generalization level is increased by '1' are placed at successively higher positions, and the top node <S1, A2, Z3> is located at the topmost position.
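As an illustration of how such a lattice could be enumerated and traversed, the following sketch builds the 2 x 3 x 4 = 24 nodes for the levels S0..S1, A0..A2, and Z0..Z3 and visits them breadth-first from <S0, A0, Z0>. The helper names and the tie-breaking order among nodes of the same height are assumptions; the text only states that the order among such nodes may differ by embodiment.

```python
from collections import deque
from itertools import product

MAX_LEVELS = (1, 2, 3)  # highest level of sex (S1), age (A2), zip code (Z3)

def lattice_nodes(max_levels):
    """All combinations of generalization levels, one node per combination."""
    return list(product(*(range(top + 1) for top in max_levels)))

def successors(node, max_levels):
    """Nodes reached by raising exactly one attribute by one level."""
    for i, (level, top) in enumerate(zip(node, max_levels)):
        if level < top:
            yield node[:i] + (level + 1,) + node[i + 1:]

def breadth_first_order(max_levels):
    """Visit the lattice breadth-first, starting from the lowest node."""
    start = tuple(0 for _ in max_levels)
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors(node, max_levels):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def label(node):
    """Format a node the way the text writes it, e.g. <S0, A1, Z0>."""
    return "<S{}, A{}, Z{}>".format(*node)

order = breadth_first_order(MAX_LEVELS)
print(len(order), label(order[0]), label(order[-1]))
```

Because every edge raises exactly one level by one, a breadth-first visit reaches nodes in order of their total generalization level, so less generalized nodes are always tried before more generalized ones.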

The data de-identification unit 190 can perform de-identification by applying the improved K-anonymity algorithm to the original data set, i.e., the data set to be de-identified. Specifically, the data de-identification unit 190 repeats the process of converting the data set according to the generalization condition of each node of the hierarchical lattice, checking whether the converted data set satisfies the K-anonymity requirement, and, if there is a record set satisfying the K-anonymity requirement, extracting and separating that record set from the data set. The data de-identification unit 190 may generate a de-identified data set from the extracted and separated record sets satisfying the K-anonymity requirement.

FIG. 5 is a flow chart illustrating the operation of the data set de-identification apparatus using an improved K-anonymity model according to an embodiment of the present invention.

Referring to FIG. 5, the hierarchical lattice generation unit 170 may generate a hierarchical lattice consisting of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set (S510). Step S510 may be performed in the data set de-identification apparatus 100; depending on the embodiment, however, the hierarchical lattice may be prepared in advance by the user and input to the data set de-identification apparatus 100.

Next, the input unit 110 may receive from the user a K value corresponding to the K-anonymity requirement (S520).

The data de-identification unit 190 converts the data set according to the generalization condition of the n-th node of the hierarchical lattice (S540) and checks whether the data set converted in step S540 satisfies the K-anonymity requirement (S550). If the converted data set does not satisfy the K-anonymity requirement (S550-N), the data de-identification unit 190 checks whether the converted data set contains a record set satisfying the K-anonymity requirement (S560). If the converted data set contains such a record set (S560-Y), the record set satisfying the K-anonymity requirement is extracted and separated from the converted data set (S570).

If there is no record set satisfying the K-anonymity requirement in the converted data set (S560-N), or if the number of records remaining in the data set after extraction is not smaller than the K value (S580-N), n is incremented by 1 (S565) and steps S540 to S580 are repeated for the (n+1)-th node of the hierarchical lattice.

If the number of records remaining in the data set is smaller than the K value (S580-Y), a de-identified data set including the extracted records is generated (S590).

If the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement (S550-Y), the process proceeds to step S590, in which the de-identified data set may be generated from the records extracted from the first node to the (n-1)-th node and the records included in the data set converted according to the generalization condition of the n-th node.
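The flow of steps S540 to S590 can be sketched as a single loop. The code below is one possible reading of the flow chart, not the patented implementation: the function names, the two-attribute toy data, and the hand-written node order are hypothetical, and each node's generalization is assumed to be applied to the original (ungeneralized) values of the records that have not yet been extracted.

```python
from collections import Counter

def generalize(record, node):
    """Toy generalization for (sex_level, age_level); levels as in FIG. 3."""
    sex_level, age_level = node
    out = dict(record)
    if sex_level == 1:
        out["sex"] = "*"
    if age_level == 1:
        low = record["age"] // 10 * 10
        out["age"] = f"{low}~{low + 9}"
    elif age_level == 2:
        out["age"] = "*"
    return out

def deidentify(records, nodes, qi, k):
    """Return (de-identified records, dropped records)."""
    remaining = list(records)  # original records not yet extracted
    extracted = []             # generalized records already separated
    for node in nodes:
        converted = [generalize(r, node) for r in remaining]           # S540
        counts = Counter(tuple(c[q] for q in qi) for c in converted)
        if all(n >= k for n in counts.values()):                       # S550-Y
            return extracted + converted, []                           # S590
        keep = []
        for original, conv in zip(remaining, converted):               # S560
            if counts[tuple(conv[q] for q in qi)] >= k:
                extracted.append(conv)                                 # S570
            else:
                keep.append(original)
        remaining = keep
        if len(remaining) < k:                                         # S580-Y
            return extracted, remaining                                # S590
    return extracted, remaining

# Hypothetical data set and breadth-first node order for two attributes.
DATA = [
    {"id": 1, "sex": "M", "age": 23}, {"id": 2, "sex": "M", "age": 27},
    {"id": 3, "sex": "F", "age": 41}, {"id": 4, "sex": "F", "age": 45},
    {"id": 5, "sex": "M", "age": 38}, {"id": 6, "sex": "F", "age": 33},
]
NODES = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (1, 2)]

result, dropped = deidentify(DATA, NODES, ("sex", "age"), k=2)
for row in result:
    print(row)
```

In this toy run, records 1 to 4 are extracted at node (0, 1) with their sex values intact, and only records 5 and 6 need the stronger generalization of node (1, 1). Generalizing the whole data set at once would have masked the sex of all six records, which is the information loss the method aims to avoid.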

The process of generating a de-identified data set by setting the K value to 2 and applying the improved K-anonymity algorithm to the original data set illustrated in FIG. 6 will now be described.

FIG. 6 is a diagram illustrating an original data set to be de-identified, FIGS. 7 to 9 are diagrams illustrating data sets converted according to the generalization condition of each node of the hierarchical lattice, and FIG. 10 is a diagram illustrating a finally generated de-identified data set.

Referring again to FIG. 5, with n = 1 (S530), the generalization condition of the first node <S0, A0, Z0>, the lowest node of the hierarchical lattice, leaves the quasi-identifier attribute values of the original data set illustrated in FIG. 6 unchanged, without data conversion. That is, step S540 may be omitted for the first node.

Since the original data set illustrated in FIG. 6 does not satisfy the K-anonymity requirement (S550-N) and contains no record set satisfying the K-anonymity requirement (S560-N), n is changed to 2 (S565).

Following the breadth-first principle, one of <S0, A0, Z1>, <S0, A1, Z0>, and <S1, A0, Z0> can be selected as the second node after the first node <S0, A0, Z0>. Which of <S0, A0, Z1>, <S0, A1, Z0>, and <S1, A0, Z0> is selected first can be determined differently depending on the embodiment. In this example, <S0, A1, Z0> is taken as the second node.

When the data is converted by applying the generalization condition of the node <S0, A1, Z0> to the original data set of FIG. 6, the quasi-identifier attribute values can be converted as illustrated in FIG. 7.

The data set illustrated in FIG. 7 as a whole does not satisfy the K-anonymity requirement (S550-N). However, the three records (1, 4, 8) have the same quasi-identifier attribute values: 'M', '20 to 30', and '13053'. That is, the first record set consisting of the three records (1, 4, 8) satisfies the K-anonymity requirement when the K value is 2 (S560-Y). Accordingly, the data de-identification unit 190 may extract and separate the first record set consisting of the three records (1, 4, 8) from the data set of FIG. 7 (S570). Next, since the number of records remaining after separating the three records (1, 4, 8) from the data set is 7, which is greater than the K value of 2 (S580-N), the data de-identification unit 190 selects the next node in a breadth-first manner (S565).

When the data is converted by applying the generalization condition of the node <S1, A1, Z1> to the data set from which the three records (1, 4, 8) have been extracted and separated, the quasi-identifier attribute values can be converted as illustrated in FIG. 8.

The data set illustrated in FIG. 8 as a whole does not satisfy the K-anonymity requirement (S550-N). However, the four records (2, 3, 9, 10) have the same quasi-identifier attribute values: '*', '41-50', and '14000 ~ 14500'. The two records (6, 7) also have the same quasi-identifier attribute values: '*', '41-50', and '14500 ~ 15000'. That is, the second record set consisting of the four records (2, 3, 9, 10) and the third record set consisting of the two records (6, 7) each satisfy the K-anonymity requirement when the K value is 2 (S560-Y). Therefore, the data de-identification unit 190 can extract and separate records 2, 3, 9, and 10 belonging to the second record set and records 6 and 7 belonging to the third record set from the data set of FIG. 8 (S570).
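The extraction at a single node (S560 to S570) amounts to grouping the converted records by their quasi-identifier values and separating every group with at least K members. The sketch below replays the FIG. 8 step with the values quoted in the text. The helper name `split_by_k` is hypothetical, and the values of the one remaining record are invented for illustration; it is labeled record 5 only under the assumption that the ten records are numbered 1 to 10.

```python
from collections import defaultdict

def split_by_k(records, k):
    """Split {record number: quasi-identifier tuple} into record sets and leftovers."""
    groups = defaultdict(list)
    for number, qi_values in records.items():
        groups[qi_values].append(number)
    record_sets = [nums for nums in groups.values() if len(nums) >= k]
    remaining = [n for nums in groups.values() if len(nums) < k for n in nums]
    return record_sets, remaining

# Values for records 2, 3, 6, 7, 9, 10 are quoted from the text;
# record 5 and its values are assumptions made for this sketch.
fig8 = {
    2: ("*", "41-50", "14000 ~ 14500"),
    3: ("*", "41-50", "14000 ~ 14500"),
    5: ("*", "31-40", "13000 ~ 13500"),
    6: ("*", "41-50", "14500 ~ 15000"),
    7: ("*", "41-50", "14500 ~ 15000"),
    9: ("*", "41-50", "14000 ~ 14500"),
    10: ("*", "41-50", "14000 ~ 14500"),
}

record_sets, remaining = split_by_k(fig8, k=2)
print(record_sets)
print(remaining)
```

With K = 2 this separates the record sets (2, 3, 9, 10) and (6, 7) and leaves a single record, which is fewer than K and therefore ends the iteration.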

When the second record set and the third record set are separated from the data set illustrated in FIG. 8, the number of records remaining in the data set is one, which is smaller than the K value of 2 (S580-Y). Therefore, the data de-identification unit 190 may delete the remaining record illustrated in FIG. 9 and generate the de-identified data set from the records belonging to the first, second, and third record sets (S590).

Referring to FIG. 10, the de-identified data set satisfies the K-anonymity requirement because, for every record, there are at least two records having the same quasi-identifier attribute values. In addition, since records (1, 4, 8) retain their gender attribute values without loss, it can be seen that less data is lost than when de-identification is performed by the existing K-anonymity algorithm.

Embodiments of the present invention include a computer-readable medium containing program instructions for performing various computer-implemented operations. The medium records a program for executing the above-described method. The medium may include program instructions, data files, data structures, and the like, alone or in combination. Examples of such media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CDs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM and RAM. The medium may also be a transmission medium, such as an optical or metal line or a waveguide, including a carrier wave that transmits a signal specifying program instructions, data structures, and the like. Examples of program instructions include machine language code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments, and that modifications and improvements of them also fall within the scope of the rights of the present invention.

100: Data set de-identification apparatus
110: Input unit
130: Database
150: Control unit
170: Hierarchical lattice generation unit
190: Data de-identification unit

Claims

A data set de-identification method using an improved K-anonymity model, comprising the steps of:
(a) setting a value corresponding to the K-anonymity requirement;
(b) sequentially converting a data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the hierarchical lattice, which is composed of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set, and, if there is a record set satisfying the K-anonymity requirement, extracting and separating the record set from the converted data set; and
(c) generating a de-identified data set including the extracted and separated record set.

The method of claim 1,
wherein step (b) comprises:
(b-1) converting the data set according to the generalization condition of the n-th node;
(b-2) checking whether the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement; and
(b-3) extracting a record set satisfying the K-anonymity requirement if such a record set exists in the data set converted according to the generalization condition of the n-th node.

The method of claim 2,
wherein step (b) is stopped if the data set converted at the n-th node satisfies the K-anonymity requirement, and
wherein the de-identified data set comprises the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node.

The method of claim 2,
wherein step (b) is stopped if the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number, and
wherein the de-identified data set comprises the records extracted from the first node to the n-th node.

The method according to claim 3 or 4,
wherein step (b) is performed repeatedly by selecting the next node in a breadth-first manner.

The method of claim 1,
wherein the record set is a set of records whose quasi-identifier attribute values are the same.

The method of claim 1,
further comprising generating a hierarchical lattice consisting of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set.

A data set de-identification apparatus using an improved K-anonymity model, comprising:
an input unit for setting a value corresponding to the K-anonymity requirement; and
a data de-identification unit that sequentially converts a data set according to the generalization condition of each node of a hierarchical lattice, starting from the first node of the hierarchical lattice, which is composed of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set, extracts and separates a record set from the converted data set if there is a record set satisfying the K-anonymity requirement, and generates a de-identified data set including the extracted and separated record set.

The apparatus of claim 8,
wherein the data de-identification unit
converts the data set according to the generalization condition of the n-th node, checks whether the data set converted according to the generalization condition of the n-th node satisfies the K-anonymity requirement, and extracts a record set satisfying the K-anonymity requirement if such a record set exists in the converted data set.

The apparatus of claim 9,
wherein the data de-identification unit,
if the data set converted at the n-th node satisfies the K-anonymity requirement, generates the de-identified data set from the records extracted from the first node to the (n-1)-th node and the records included in the data set converted at the n-th node.

The apparatus of claim 9,
wherein the data de-identification unit,
if the number of records remaining after extracting records from the data set converted according to the generalization condition of the n-th node is less than a predetermined number, generates the de-identified data set from the records extracted from the first node to the n-th node.

The apparatus according to claim 10 or 11,
wherein the data de-identification unit
repeats, while selecting the next node in a breadth-first manner, the operation of converting the data set according to the generalization condition of the selected node and extracting and separating a record set satisfying the K-anonymity requirement.

The apparatus of claim 8,
wherein the record set is a set of records whose quasi-identifier attribute values are the same.

The apparatus of claim 8,
further comprising a hierarchical lattice generation unit for generating a hierarchical lattice consisting of a combination of the generalization hierarchy trees of the quasi-identifier attributes of the data set.