WO2014030302A1

WO2014030302A1 - Information processing device for executing anonymization and anonymization processing method

Info

Publication number: WO2014030302A1
Application number: PCT/JP2013/004624
Authority: WO
Inventors: 由起豊田
Original assignee: 日本電気株式会社
Priority date: 2012-08-20
Filing date: 2013-07-31
Publication date: 2014-02-27
Also published as: JPWO2014030302A1

Abstract

This invention provides an information processing device for anonymizing a dataset so as to conform to a utilization purpose. The information processing device calculates the data loss amount corresponding to each attribute included in a first piece of personal data to be anonymized, decides on the attribute to be manipulated on the basis of priorities corresponding to the respective attributes and the data loss amount, and generates and outputs a second piece of personal data in which the attribute value of the decided attribute of the first piece of personal data has been manipulated.

Description

Information processing apparatus and anonymization processing method for performing anonymization

The present invention relates to an information processing apparatus, anonymization processing method, and program for anonymizing personal data.

The digitization of medical information is progressing, and medical information is being accumulated in medical institutions and health insurance associations. Wide use of such medical information is considered to lead to the development of epidemiological research, medical technology and new drug development. Therefore, it is expected that the medical information can be used in research institutions while ensuring the privacy of the stored medical information.

Anonymization is one of the methods for ensuring privacy in the use of information. Anonymization is a technique that processes data containing information that individuals do not want others to know, such as the above-described medical information, so that individuals cannot be identified. Hereinafter, a batch of data to be processed in this way is referred to as a data set. A unit of data corresponding to each individual in the data set is called a personal data record. Further, the minimum unit of information constituting a personal data record, such as the age of the individual or the name of the disease the individual has, is called an attribute.

Non-Patent Document 1 discloses k-anonymization, one of the representative anonymization techniques. k-anonymization processes each personal data record in a data set so that the probability of identifying an individual is 1/k or less (where k is the “k” of k-anonymization), thereby guaranteeing a certain level of anonymity. Processing in k-anonymization is, for example, processing that makes the value of a specific attribute ambiguous (also called generalization) across a plurality of personal data records constituting the data set.

k-anonymization by generalization has a top-down approach and a bottom-up approach. The top-down approach starts from the most generalized personal data records and makes the attribute values more specific within a range in which k-anonymity is maintained. The bottom-up approach generalizes the original values of unprocessed personal data records until k-anonymity is ensured.
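As a minimal sketch of the k-anonymity condition described above, the following Python function checks whether every combination of quasi-identifier values is shared by at least k records. The function name, the record layout, and the sample values are assumptions made for illustration; they are not taken from Non-Patent Document 1.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records."""
    group_sizes = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(size >= k for size in group_sizes.values())


# Illustrative generalized records (hypothetical values).
example_records = [
    {"birth_year": "1943-1956", "care_date": "200512-201107", "disease": "DiseaseA"},
    {"birth_year": "1943-1956", "care_date": "200512-201107", "disease": "DiseaseB"},
    {"birth_year": "1961-1977", "care_date": "200612-201107", "disease": "DiseaseC"},
    {"birth_year": "1961-1977", "care_date": "200612-201107", "disease": "DiseaseA"},
]

print(is_k_anonymous(example_records, ["birth_year", "care_date"], 2))
```

Each group of identical quasi-identifier values here contains two records, so an individual in the sample can be narrowed down to at most one of two candidates.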

Non-Patent Document 2 shows one of the typical top-down approaches. The method disclosed in Non-Patent Document 2 is a method for anonymizing a personal data record of a data set by processing it as follows in order to satisfy k-anonymity in a data set.

In the top-down approach, the initial state is a state in which the values of all attributes to be anonymized are generalized to the same value for each attribute among all personal data records in the data set to be anonymized.

In the first step, one attribute is selected from the attributes to be anonymized.

In the second step, the median value of the attribute values selected in the first step included in each of all the personal data records is obtained.

The third step divides the personal data records into two groups based on the attribute values of the personal data records with the calculated median as a boundary.

The process from the first step to the third step is repeated, and ends when the number of personal data records in a group would no longer satisfy k (the “k” of k-anonymization or k-anonymity; the same applies hereinafter). Grouping attribute values in this way, starting from the most generalized initial state and using a certain value as a boundary, is called division. Note that the groups output as the result are the groups immediately before the number of personal data records in a group fails to satisfy k.
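The three steps above can be sketched in Python as a recursive median split. This is only an illustration under stated assumptions: the attribute is chosen by trying the candidates in the given order (the text does not say how Non-Patent Document 2 selects it), records at or below the median go to the lower group, and the function name and sample values are hypothetical.

```python
from statistics import median


def split_top_down(records, attributes, k):
    """Recursively divide records at the median of one attribute.
    A division is kept only if both resulting groups hold at least k records."""
    if k < 1:
        raise ValueError("k must be at least 1")
    for attr in attributes:                                    # first step: pick an attribute
        boundary = median(r[attr] for r in records)            # second step: median value
        lower = [r for r in records if r[attr] <= boundary]    # third step: divide in two
        upper = [r for r in records if r[attr] > boundary]
        if len(lower) >= k and len(upper) >= k:
            return (split_top_down(lower, attributes, k)
                    + split_top_down(upper, attributes, k))
    # No attribute allows a division that keeps k records per group:
    # output the group as it was immediately before k would be violated.
    return [records]


# Hypothetical records with a single quasi-identifier.
example = [{"birth_year": y} for y in (1943, 1949, 1956, 1961, 1970, 1977)]
groups = split_top_down(example, ["birth_year"], 2)
for group in groups:
    years = [r["birth_year"] for r in group]
    print(f"{min(years)}-{max(years)}: {len(group)} records")
```

Each output group can then be generalized, for example by replacing the attribute values with the minimum-to-maximum range of the group.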

In addition, Non-Patent Document 3 shows one of the representative methods of the bottom-up approach. The technique disclosed in Non-Patent Document 3 is an anonymization technique in which a certain attribute value is generalized from an original value so that personal data satisfies k-anonymity in a certain data set.

Patent Document 1 discloses a data anonymization device incorporating k-anonymization. The data anonymization device of Patent Document 1 generates a complete graph in which all personal data records constituting a data set are connected by edges, divides the complete graph into clusters, and generalizes attributes in units of the divided clusters. Thus, the data anonymization device realizes k-anonymization by a top-down approach.

JP 2012-022315 A

However, the techniques described in the above-mentioned patent documents and non-patent documents have a problem that the data set cannot be anonymized so as to match the purpose of use.

The reason is that the anonymization techniques disclosed in the above-mentioned patent document and non-patent documents do not consider the purpose of use, and select and process the attributes to be processed in an order unrelated to the purpose of use.
[Object of the invention]
The object of the present invention is to provide an information processing apparatus, an anonymization processing method, and a program that solve the problem described above.

The information processing apparatus of the present invention includes: information loss amount calculation means for calculating and outputting an information loss amount corresponding to each attribute included in first personal data to be anonymized; and anonymization processing means for determining the attribute to be processed based on a priority corresponding to each of the attributes and the information loss amount, generating second personal data in which the attribute value of the determined attribute of the first personal data has been processed, and outputting the second personal data.

In the anonymization processing method of the present invention, a computer calculates and outputs an information loss amount corresponding to each of the attributes included in first personal data to be anonymized, determines the attribute to be processed based on a priority corresponding to each of the attributes and the information loss amount, and generates and outputs second personal data in which the attribute value of the determined attribute of the first personal data has been processed.

The non-volatile recording medium of the present invention records a program for causing a computer to execute: a process of calculating and outputting an information loss amount corresponding to each of the attributes included in first personal data to be anonymized; a process of determining the attribute to be processed based on a priority corresponding to each of the attributes and the information loss amount; and a process of generating and outputting second personal data in which the attribute value of the determined attribute of the first personal data has been processed.

The present invention has the effect that the data set can be anonymized so as to match the purpose of use.

FIG. 1 is a block diagram illustrating a configuration of the anonymization device according to the first embodiment.
FIG. 2 is a block diagram illustrating a configuration of a system including the anonymization device according to the first to third embodiments.
FIG. 3 is a diagram showing an example of personal data in the first and second embodiments.
FIG. 4 is a diagram illustrating an example of anonymized personal data in the first and second embodiments.
FIG. 5 is a block diagram illustrating a hardware configuration of a computer that realizes the anonymization device according to the first to third embodiments.
FIG. 6 is a flowchart illustrating the operation of the anonymization device according to the first embodiment.
FIG. 7 is a block diagram illustrating a configuration of the anonymization device according to the second embodiment.
FIG. 8 is a diagram illustrating an example of priority determination information according to the second embodiment.
FIG. 9A is a sequence diagram illustrating an operation of the anonymization device according to the second embodiment.
FIG. 9B is a sequence diagram illustrating an operation of the anonymization device according to the second embodiment.
FIG. 10A is a sequence diagram illustrating an operation of the anonymization device according to the second embodiment.
FIG. 10B is a sequence diagram illustrating an operation of the anonymization device according to the second embodiment.
FIG. 11 is a diagram illustrating an example of a data set in a generalized state according to the second embodiment.
FIG. 12 is a diagram illustrating an example of division value candidates in the second embodiment.
FIG. 13 is a diagram illustrating an example of division value candidates in the second embodiment.
FIG. 14 is a diagram illustrating an image in which a data set is divided.
FIG. 15 is a block diagram illustrating a configuration of the anonymization device according to the third embodiment.
FIG. 16 is a diagram illustrating an example of priority determination information according to the third embodiment.
FIG. 17 is a diagram illustrating an example of personal data according to the third embodiment.
FIG. 18A is a sequence diagram illustrating an operation of the anonymization device according to the third embodiment.
FIG. 18B is a sequence diagram illustrating an operation of the anonymization device according to the third embodiment.
FIG. 18C is a sequence diagram illustrating an operation of the anonymization device according to the third embodiment.
FIG. 19 is a diagram illustrating an example of personal data at an intermediate stage of anonymization processing by the anonymization device according to the third embodiment.
FIG. 20 is a diagram illustrating an example of personal data at an intermediate stage of anonymization processing by the anonymization device according to the third embodiment.
FIG. 21 is a diagram illustrating an example of personal data at an intermediate stage of anonymization processing by the anonymization device according to the third embodiment.

Embodiments for carrying out the present invention will be described in detail with reference to the drawings. In each drawing and in each embodiment described in the specification, the same reference signs are given to components having the same function.

<<< First Embodiment >>>
FIG. 1 is a block diagram showing a configuration of an anonymization device (also called an information processing device) 310 according to the first embodiment of the present invention.

As shown in FIG. 1, the anonymization device 310 of this embodiment includes an information loss amount calculation unit 312 and an anonymization processing unit 313. The constituent elements shown in FIG. 1 may be constituent elements in hardware units or constituent elements divided into functional units of a computer device. Here, the components shown in FIG. 1 will be described as components divided into functional units of the computer apparatus.

FIG. 2 is a block diagram showing a configuration of a system including the anonymization device 310 according to the first embodiment of the present invention.

As shown in FIG. 2, the system includes a personal data storage device 100, an anonymized personal data storage device 200, and an anonymization device 310.

The personal data storage device 100 stores a data set (hereinafter referred to as a data set sp) that is personal data to be anonymized (first personal data). The data set sp is a set of data records (hereinafter referred to as data records rp). The data record rp includes attribute values of a plurality of attributes corresponding to a specific individual.

Personal data is, for example, medical information held by medical institutions. In this case, the attribute values included in the data record rp are attribute values of attributes such as name, date of birth, date of medical care, and disease name.

FIG. 3 is a diagram illustrating an example of a data set sp110 that is personal data stored in the personal data storage device 100. The data set sp110 includes a plurality of data records rp111.

The data record rp111 includes attribute values of “name”, “birth year”, “medical care date”, and “disease name”. Here, the attribute of “name” is an identifier. The attributes of “birth year” and “medical care date” are quasi-identifiers that, when combined, may identify an individual. The attribute of “disease name” is sensitive information that is not desired to be known to others. The attributes used here as quasi-identifiers or sensitive information are examples. That is, in the anonymization device 310, which of the attributes included in the data set sp110 is treated as a quasi-identifier or as sensitive information is arbitrary.

As shown in FIG. 3, for example, the data record rp111 whose “name” attribute is “patientA” contains the attribute value “1949” for the “birth year” attribute, “201006” for the “medical care date” attribute, and “DiseaseA” for the “disease name” attribute. The data set sp110 shown in FIG. 3 is an example, and the data record rp111 may include attribute values of arbitrary attributes as quasi-identifiers and as sensitive information.

The anonymized personal data storage device 200 stores an anonymized data set (hereinafter referred to as anonymized data set sa) that is anonymized personal data (second personal data). The anonymized data set sa is a set of anonymized data records (hereinafter referred to as anonymized data records ra) in which the data record rp111 is anonymized.

FIG. 4 is a diagram illustrating an example of the anonymized data set sa210 that is anonymized personal data stored in the anonymized personal data storage device 200. The anonymized data set sa210 is a data set after the attribute value of the data set sp110 is processed (for example, generalized) by the anonymization device 310 to be anonymized. That is, the anonymized data set sa210 includes an anonymized data record ra211 obtained by processing the data record rp111 instead of the data record rp111.

In the anonymized data set sa210 shown in FIG. 4, the “birth year” and “medical care date” are processed (generalized), and the attribute value of the quasi-identifier is obscured compared to the data set sp110.

The information loss amount calculation unit 312 calculates and outputs an information loss amount (hereinafter referred to as an information loss amount ILA) corresponding to each attribute in the data set sp110.

Here, the information loss amount ILA is the amount by which the abstraction of attribute information (hereinafter referred to as information abstraction ia) increases when any attribute included in the data set sp110 is processed. The information abstraction ia is the degree of abstraction of the attribute information, that is, of the attribute value of the attribute.

The information loss amount calculation unit 312 may calculate the information loss amount ILA using various methods described below, as necessary.

For example, as a first method, the information loss amount calculation unit 312 calculates, for one attribute, the amount of information loss in one data record (hereinafter referred to as information loss amount ILR) by dividing the range of attribute values after generalization of the attribute by the range of attribute values before generalization of the same attribute. Next, the information loss amount calculation unit 312 adds up the information loss amount ILR for the number of data records to calculate the information loss amount ILA.

Specifically, when it is assumed that all the data records rp111 of the data set sp110 illustrated in FIG. 3 are generalized together without being divided, the information loss amount calculation unit 312 calculates the information loss amount ILA of each attribute to be anonymized as follows.

In this case, the attribute value ranges of the anonymization target attributes before and after generalization are the same. Therefore, the information loss amount calculation unit 312 always calculates the information loss amount ILR for one attribute value as “1”.

Next, the information loss amount calculation unit 312 adds the information loss amount ILR corresponding to the number (20) of the data records rp111, and calculates “20” as the information loss amount ILA. In this way, the information loss amount calculation unit 312 calculates “20” as the information loss amount ILA for any attribute of the data set sp110.

Further, when it is assumed that the data records rp111 of the data set sp110 illustrated in FIG. 3 are divided, by the attribute value of the “birth year” attribute, into data records of “1956” or less and data records of “1961” or more and are then generalized, the information loss amount calculation unit 312 calculates the information loss amount ILA of each attribute to be anonymized as follows.

In this case, the range of the attribute value of the “birth year” attribute before generalization has a minimum value of “1943” and a maximum value of “1977”. The range of the attribute value of the “birth year” attribute after generalization, for the data records rp111 whose “birth year” attribute value is “1956” or less, has a minimum value of “1943” and a maximum value of “1956”.

Therefore, the information loss amount calculation unit 312 calculates the information loss amount ILR-birth-ul1956 of the “birth year” attribute of the data records rp111 whose “birth year” attribute value is “1956” or less as follows.

“(1956 - 1943) ÷ (1977 - 1943) = 0.382”

Next, the information loss amount calculation unit 312 adds up the information loss amount ILR-birth-ul1956 for the number (nine) of data records rp111 whose “birth year” attribute value is “1956” or less, and calculates “3.438” as the information loss amount ILA-birth-ul1956 of the data records rp111 whose attribute value is “1956” or less.

Further, the attribute value range of the “birth year” attribute after generalization, for the data records rp111 whose “birth year” attribute value is “1961” or more, has a minimum value of “1961” and a maximum value of “1977”. Accordingly, the information loss amount calculation unit 312 calculates the information loss amount ILR-birth-ov1961 of the “birth year” attribute of the data records rp111 whose attribute value is “1961” or more as “(1977 - 1961) ÷ (1977 - 1943) = 0.471”.

Next, the information loss amount calculation unit 312 adds up the information loss amount ILR-birth-ov1961 for the number (11) of data records rp111 whose “birth year” attribute value is “1961” or more, and calculates “5.181” as the information loss amount ILA-birth-ov1961 of the data records rp111 whose attribute value is “1961” or more.

Next, the information loss amount calculation unit 312 adds the information loss amount ILA-birth-ul1956 and the information loss amount ILA-birth-ov1961, and calculates “8.619” as the information loss amount ILA-birth of the “birth year” attribute.

Similarly, the attribute value range of the “medical care date” attribute before generalization has a minimum value of “200512” and a maximum value of “201107”. The attribute value range of the “medical care date” attribute after generalization, for the data records rp111 whose “birth year” attribute value is “1956” or less, also has a minimum value of “200512” and a maximum value of “201107”. Therefore, the information loss amount calculation unit 312 calculates “1” as the information loss amount ILR-mc-ul1956 of the “medical care date” attribute of the data records rp111 whose “birth year” attribute value is “1956” or less.

Next, the information loss amount calculation unit 312 adds up the information loss amount ILR-mc-ul1956 for the number (nine) of data records rp111 whose “birth year” attribute value is “1956” or less, and calculates “9” as the information loss amount ILA-mc-ul1956 of the “medical care date” attribute of the data records rp111 whose “birth year” attribute value is “1956” or less.

Also, the attribute value range of the “medical care date” attribute after generalization, for the data records rp111 whose “birth year” attribute value is “1961” or more, has a minimum value of “200612” and a maximum value of “201107”. Therefore, the information loss amount calculation unit 312 calculates the information loss amount ILR-mc-ov1961 of the “medical care date” attribute of the data records rp111 whose “birth year” attribute value is “1961” or more as “(201107 - 200612) ÷ (201107 - 200512) = 0.832”.

Next, the information loss amount calculation unit 312 adds up the information loss amount ILR-mc-ov1961 for the number (11) of data records rp111 whose “birth year” attribute value is “1961” or more, and calculates “9.152” as the information loss amount ILA-mc-ov1961 of the “medical care date” attribute of the data records rp111 whose “birth year” attribute value is “1961” or more.

Next, the information loss amount calculation unit 312 adds the information loss amount ILA-mc-ul1956 and the information loss amount ILA-mc-ov1961, and calculates “18.152” as the information loss amount ILA-mc of the “medical care date” attribute.

The above is the description of the first method.
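The arithmetic of the first method can be reproduced with a short Python sketch. The function names are hypothetical. Two assumptions are needed to match the stated figures: the per-record value ILR is rounded to three decimal places before being summed, and the medical care dates are treated as plain integers. The minimum of the second “medical care date” group is taken as 200612, the value consistent with the stated result of 0.832.

```python
def information_loss_ilr(group_min, group_max, full_min, full_max):
    """Per-record loss: generalized range divided by the original range.
    Rounded to three decimals, as the worked figures in the text appear to be."""
    return round((group_max - group_min) / (full_max - full_min), 3)


def information_loss_ila(groups, full_min, full_max):
    """Sum ILR over all records. `groups` holds (min, max, record_count) tuples."""
    total = sum(
        count * information_loss_ilr(low, high, full_min, full_max)
        for low, high, count in groups
    )
    return round(total, 3)


# All 20 records generalized together: the range is unchanged, so ILR is 1.
ila_undivided = information_loss_ila([(1943, 1977, 20)], 1943, 1977)

# Records divided at birth year (9 records up to 1956, 11 records from 1961).
ila_birth = information_loss_ila([(1943, 1956, 9), (1961, 1977, 11)], 1943, 1977)
ila_mc = information_loss_ila(
    [(200512, 201107, 9), (200612, 201107, 11)], 200512, 201107
)

print(ila_undivided, ila_birth, ila_mc)
```

The sketch does not guard against an attribute whose original minimum and maximum are equal, which would cause a division by zero.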

As a second method, the information loss amount calculation unit 312 may calculate the information loss amount ILA as follows. First, the information loss amount calculation unit 312 calculates the ratio of the number of attribute value types of the attribute after generalization and before generalization as the information loss amount ILR of one data record. Next, the information loss amount calculation unit 312 adds the information loss amount ILR by the number of data records to calculate the information loss amount ILA.
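The text leaves the details of the second method open, so the following Python sketch shows only one possible reading: the information loss amount ILR of a record is the number of distinct original values covered by its generalized group divided by the number of distinct values in the whole data set. The function name, this interpretation, and the sample values are assumptions.

```python
def ila_by_value_types(groups, attribute):
    """Sum, over all records, of (distinct values in the record's group)
    divided by (distinct values in the whole data set)."""
    all_values = {r[attribute] for group in groups for r in group}
    total = 0.0
    for group in groups:
        group_values = {r[attribute] for r in group}
        # Every record in the group gets the same per-record loss (ILR).
        total += len(group) * len(group_values) / len(all_values)
    return total


# Hypothetical records, divided into two groups before generalization.
group_a = [{"birth_year": y} for y in (1943, 1949, 1949)]
group_b = [{"birth_year": y} for y in (1961, 1970, 1977)]

divided = ila_by_value_types([group_a, group_b], "birth_year")
undivided = ila_by_value_types([group_a + group_b], "birth_year")
print(divided, undivided)
```

Under this reading, dividing the records lowers the loss compared with generalizing all records together, which matches the behavior of the first method.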

The anonymization processing unit 313 determines the priority of each attribute (hereinafter referred to as priority p) based on priority determination information stored in a means not shown (for example, a storage means, not shown, in the anonymization processing unit 313). Further, the anonymization processing unit 313 determines the attribute to be processed based on the priority p and the information loss amount ILA calculated by the information loss amount calculation unit 312. In other words, the anonymization processing unit 313 determines the attribute to be processed by using the priority p to take the purpose of use into account, and by using the information loss amount ILA to reduce the loss of information in the anonymized data set sa210 as a whole.

Here, the priority determination information is information for determining the priority p. The priority p is information indicating the degree to which the information abstraction ia of each attribute included in the data set sp110 (data record rp111) is to be kept from increasing (that is, the degree to which loss of information is to be preferentially prevented). In other words, the priority p indicates which of the plurality of attributes should preferentially have its increase in information abstraction ia, from the data set sp110 to the anonymized data set sa210, kept small.

For example, the anonymization processing unit 313 calculates, for each attribute, an evaluation value by performing a calculation (for example, multiplication) on the priority p and the information loss amount ILA. The anonymization processing unit 313 may instead acquire an evaluation value corresponding to a combination of a specific priority p and a specific information loss amount ILA from a means not shown. The evaluation value is calculated so that, for a constant information loss amount ILA, it becomes larger as the priority p becomes higher, and, for a constant priority p, it becomes larger as the information loss amount ILA becomes larger. The same applies to the case where an evaluation value corresponding to a combination of a specific priority p and a specific information loss amount ILA is acquired.

Subsequently, for example, the anonymization processing unit 313 determines the attribute to be generalized so that attributes with smaller evaluation values are generalized and attributes with larger evaluation values are not.

Note that the anonymization processing unit 313 may instead determine the attribute to be generalized so that attributes with larger evaluation values are generalized and attributes with smaller evaluation values are not. In this case, the evaluation value is calculated so that, for a constant information loss amount ILA, it becomes smaller as the priority p becomes higher, and, for a constant priority p, it becomes smaller as the information loss amount ILA becomes larger. The same applies to the case where an evaluation value corresponding to a combination of a specific priority p and a specific information loss amount ILA is acquired.
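The first of the two variants described above (evaluation value obtained by multiplication, attribute with the smaller evaluation value generalized) can be sketched in Python as follows. The function name and all numeric values are hypothetical; the text gives multiplication only as one example of the calculation.

```python
def choose_attribute_to_generalize(priorities, information_loss):
    """Return the attribute with the smallest evaluation value
    (priority p multiplied by information loss amount ILA),
    together with all evaluation values."""
    evaluation = {
        attr: priorities[attr] * information_loss[attr]
        for attr in priorities
    }
    target = min(evaluation, key=evaluation.get)
    return target, evaluation


# Hypothetical priorities: the purpose of use values birth year most.
priorities = {"birth_year": 3.0, "care_date": 1.0}
# Hypothetical information loss amounts ILA for each attribute.
information_loss = {"birth_year": 5.0, "care_date": 9.0}

target, evaluation = choose_attribute_to_generalize(priorities, information_loss)
print(target, evaluation)
```

In this sample, generalizing the birth year would lose less information, but its higher priority raises its evaluation value, so the medical care date is generalized instead. This is how the priority lets the purpose of use override a purely loss-based choice.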

Next, the anonymization processing unit 313 generates and outputs an anonymized data set sa210 obtained by processing the determined attribute of the data set sp110. Note that the anonymization processing unit 313 may generate and output information on the difference of the anonymized data set sa210 with respect to the data set sp110.

The anonymization processing unit 313 may evaluate the anonymity of the processed data set. Here, the processed data set is either a part or the whole of the data set in which the attributes have been processed. Subsequently, when the result of evaluating the anonymity satisfies a predetermined condition, the anonymization processing unit 313 may record the processed data set in the anonymized personal data storage device 200 as either a part or the whole of the anonymized data set.

This completes the description of each component divided into functional units of the computer device of the anonymization device 310.

Next, components of the anonymization device 310 in units of hardware will be described.

FIG. 5 is a diagram illustrating a hardware configuration of a computer 700 that realizes the anonymization apparatus 310 according to the present embodiment.

As shown in FIG. 5, the computer 700 includes a CPU (Central Processing Unit) 701, a storage unit 702, a storage device 703, an input unit 704, an output unit 705, and a communication unit 706. Furthermore, the computer 700 includes a recording medium (or storage medium) 707 supplied from the outside. The recording medium 707 may be a non-volatile recording medium that stores information non-temporarily.

The CPU 701 controls the overall operation of the computer 700 by running an operating system (not shown). The CPU 701 reads a program and data from, for example, a recording medium 707 mounted on the storage device 703, and writes the read program and data to the storage unit 702. Here, the program is, for example, a program that causes the computer 700 to execute the operation of the flowchart shown in FIG. 6.

The CPU 701 executes various processes as the information loss amount calculation unit 312 and the anonymization processing unit 313 shown in FIG. 1 according to the read program and based on the read data.

Note that the CPU 701 may download a program or data to the storage unit 702 from an external computer (not shown) connected to a communication network (not shown).

The storage unit 702 stores programs and data. The storage unit 702 may include the personal data storage device 100 and the anonymized personal data storage device 200.

The storage device 703 is, for example, an optical disk, a flexible disk, a magneto-optical disk, an external hard disk, or a semiconductor memory, and includes a recording medium 707. The storage device 703 records the program so that it can be read by a computer. Further, the storage device 703 may record data so that it can be read by a computer. The storage device 703 may include the personal data storage device 100 and the anonymized personal data storage device 200.

The input unit 704 is realized by, for example, a mouse, a keyboard, a built-in key button, and the like, and is used for an input operation. The input unit 704 is not limited to a mouse, a keyboard, and a built-in key button, and may be a touch panel, an accelerometer, a gyro sensor, a camera, or the like.

The output unit 705 is realized by a display, for example, and is used for confirming the output.

The communication unit 706 implements an interface with the personal data storage device 100, the anonymized personal data storage device 200, and other external devices (not shown). The communication unit 706 is included as part of the anonymization processing unit 313.

As described above, the functional blocks of the anonymization device 310 shown in FIG. 1 are realized by the computer 700 having the hardware configuration shown in FIG. 5. However, the means for realizing each unit included in the computer 700 is not limited to the above. In other words, the computer 700 may be realized by one physically integrated device, or by two or more physically separated devices connected by wire or wirelessly.

Note that the recording medium 707 in which the above-described program code is recorded may be supplied to the computer 700, and the CPU 701 may read and execute the program code stored in the recording medium 707. Alternatively, the CPU 701 may store the code of the program stored in the recording medium 707 in the storage unit 702, the storage device 703, or both. That is, the present embodiment includes an embodiment of a recording medium 707 that stores a program (software) executed by the computer 700 (CPU 701) temporarily or non-temporarily.

This completes the description of each component in hardware units of the computer 700 that implements the anonymization device 310 in the present embodiment.

Next, the operation of this embodiment will be described in detail with reference to FIGS.

FIG. 6 is a flowchart showing the operation of the anonymization device 310 in this embodiment.

The information loss amount calculation unit 312 calculates the information loss amount ILA for each anonymization target attribute of the data set sp110 (step S601).

Next, the anonymization processing unit 313 determines the priority p of each attribute based on the information for determining the priority p (step S602).

Next, the anonymization processing unit 313 determines an attribute to be processed based on the information loss amount ILA and the priority p (step S603).

Next, the anonymization processing unit 313 processes the determined attribute of the data record rp111 (step S604).

Next, the anonymization processing unit 313 outputs the data record rp111 in which the attribute is processed (step S605).
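Steps S601 to S605 can be put together in a short Python sketch. It is an illustration under several assumptions, not the device's actual implementation: the priority determination information is taken to be a simple mapping from attribute to priority p, the evaluation value is the product of p and ILA, the attribute with the smallest evaluation value is processed, and processing means replacing every value with the minimum-to-maximum range. All names and sample values other than the “patientA” record are hypothetical.

```python
def anonymize_once(records, priority_info, loss_of):
    """Run one pass of steps S601 to S605 and return the processed
    attribute together with the processed records."""
    attributes = list(priority_info)
    # S601: information loss amount ILA for each anonymization target attribute.
    ila = {attr: loss_of(records, attr) for attr in attributes}
    # S602: priority p of each attribute from the priority determination information.
    priority = dict(priority_info)
    # S603: attribute to be processed, from ILA and p (smallest product here).
    target = min(attributes, key=lambda attr: priority[attr] * ila[attr])
    # S604: process (generalize) the determined attribute of every record.
    low = min(r[target] for r in records)
    high = max(r[target] for r in records)
    processed = [{**r, target: f"{low}-{high}"} for r in records]
    # S605: output the processed records.
    return target, processed


def full_range_loss(records, attribute):
    """First-method ILA when all records are generalized together:
    the range is unchanged, so ILR is 1 for every record."""
    return float(len(records))


records = [
    {"name": "patientA", "birth_year": 1949, "care_date": 201006, "disease": "DiseaseA"},
    {"name": "patientB", "birth_year": 1943, "care_date": 200512, "disease": "DiseaseB"},
    {"name": "patientC", "birth_year": 1977, "care_date": 201107, "disease": "DiseaseC"},
]
priority_info = {"birth_year": 2.0, "care_date": 1.0}

target, processed = anonymize_once(records, priority_info, full_range_loss)
print(target, processed[0])
```

Because the information loss amount is the same for both attributes in this sample, the priority alone decides the outcome, and the lower-priority medical care date is the one that is generalized.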

The first effect of the present embodiment described above is that the data set can be anonymized by controlling to match the purpose of use.

The reason is that the following configuration is included. That is, first, the information loss amount calculation unit 312 calculates and outputs an information loss amount ILA corresponding to each attribute. Second, the anonymization processing unit 313 determines an attribute to be processed based on the priority p and the information loss amount ILA, and processes the determined attribute.

The second effect of the present embodiment described above is that, in addition to the first effect, the loss of information in the anonymized data set can be reduced. In other words, the data set can be anonymized under control that matches the purpose of use while the loss of information in the anonymized data set is also reduced. If anonymization considered only the match with the purpose of use, attributes other than the attribute whose processing is to be suppressed could be generalized excessively, and the loss of information in the data set as a whole could become large. The present embodiment makes it possible to prevent this.

The reason is the same as the first effect. That is, the anonymization processing unit 313 determines the attribute to be processed based on both the priority p and the information loss amount ILA.
<<< Second Embodiment >>>
Next, a second embodiment of the present invention will be described in detail with reference to the drawings. Hereinafter, the description overlapping with the above description is omitted as long as the description of the present embodiment is not obscured.

FIG. 7 is a block diagram showing a configuration of the anonymization device 320 according to the present embodiment. The anonymization apparatus 320 of this embodiment performs anonymization by a top-down approach.

As shown in FIG. 7, the anonymization device 320 includes a priority determination information storage unit 321, an information loss amount calculation unit 322, and an anonymization processing unit 323.

It should be noted that the anonymization device 320 may be included in the system shown in FIG. 2 instead of the anonymization device 310.

The priority determination information storage unit 321 stores information for determining the priority p. The information for determining the priority p is preset by the user of the system. Alternatively, the information for determining the priority p may be received in advance from an external system by the division attribute determination unit 3233 via the communication unit 706 shown in FIG. 5.

FIG. 8 is a diagram illustrating an example of the priority determination information 3210 stored in the priority determination information storage unit 321. As shown in FIG. 8, the priority determination information 3210 includes a set of an index and a weight (also referred to as priority). Here, the index is a value that uniquely determines the weight. The weight corresponds to each of the indexes and is a number indicating the importance of the attribute. In FIG. 8, for example, the weight corresponding to “5” of the index is “16”.

Note that the number of indexes is not limited to the five of the example of FIG. 8 and may be any number of two or more. Further, the index is not limited to a numeral; it may be a letter of the alphabet or the like, or may be the name of an attribute (hereinafter also referred to as an attribute name).

Further, the weight may be an arbitrary numerical value that can be used for calculation of an evaluation value described later.

Also, the priority determination information storage unit 321 may store, as the priority determination information, a calculation formula (for example, “weight = 2^(index - 1)”) for calculating a weight from an input index.
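The example values given in this description (index 5 with weight 16, and indexes 4 and 1 with weights 8 and 1) are consistent with a weight that doubles with each index step. A minimal Python sketch under that assumption follows; the function name `weight_from_index` is hypothetical, and the weights for indexes 2 and 3 are inferred rather than stated in the description.

```python
def weight_from_index(index: int) -> int:
    """Return a weight that doubles with each index step (assumed rule).

    Matches the stated examples: 1 -> 1, 4 -> 8, 5 -> 16.
    """
    if index < 1:
        raise ValueError("index must be 1 or greater")
    return 2 ** (index - 1)


# Table form of the same rule for indexes 1 to 5 (values for 2 and 3 are inferred).
priority_table = {index: weight_from_index(index) for index in range(1, 6)}
```

Storing the rule as a formula rather than as a table allows any number of indexes to be handled without changing the stored information.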

The information loss amount calculation unit 322 calculates and outputs the information loss amount ILA of each attribute in the data set sp110.

The anonymization processing unit 323 includes a division attribute determination unit 3233, a division value determination unit 3234, an anonymity evaluation unit 3235, and a generalization execution unit 3236.

The division attribute determination unit 3233 uses the priority determination information 3210 stored in the priority determination information storage unit 321 to generate a weight for each attribute based on, for example, the index of each attribute input from the input unit 704 illustrated in FIG. 5.

Next, the division attribute determination unit 3233 determines the attribute of the division axis (also referred to as an attribute to be processed; hereinafter referred to as a division attribute) based on the generated weight and the information loss amount ILA.

The division attribute (also called the split attribute) is the attribute whose attribute values are used as the basis for dividing a data set (for example, the data set sp110). Here, dividing a data set means grouping the data records included in the data set. That is, when dividing the data set (for example, the data set sp110), the division attribute determination unit 3233 performs the division based on ranges of the attribute value of the division attribute. A range is, for example, the values larger than a certain value, or the values smaller than it. Alternatively, the range may be a geographic region, a type of thing, or an association with an event.

The division value determination unit 3234 determines the division value of the division attribute so as to satisfy the necessary anonymity. For example, when the attribute value is indicated by a numerical value, the division value is a numerical value within a possible range of the attribute value. Alternatively, the range may be a set of identification information (for example, prefecture names) indicating the area when the attribute value is a geographical area. Moreover, the range may be identification information (for example, what is performed outdoors) that classifies the type when the attribute value is a type of thing (for example, hobby). Moreover, the range may be the presence or absence of relevance when the attribute value is relevance with an event.

The anonymity evaluation unit 3235 determines whether each divided data set satisfies the required anonymity when a certain data set is divided. Specifically, for example, when a certain data set is to be divided into two groups, the anonymity evaluation unit 3235 determines whether the data set can be divided so that each of the two groups includes at least k data records rp111. Here, “k” is the “k” of k-anonymity or k-anonymization. The same applies to k in the following description.

The generalization execution unit 3236 generalizes (processes) the attribute value of the determined attribute based on the determined division value, and outputs it.

The anonymization device 320 described above may be realized by the computer 700 shown in FIG. 5, similarly to the anonymization device 310 shown in FIG. 1.

Next, the operation of this embodiment will be described in detail with reference to the drawings.

FIG. 9A, FIG. 9B, FIG. 10A and FIG. 10B are sequence diagrams showing the operation of this embodiment.

In FIG. 9A, the division attribute determination unit 3233 receives, for example, an input of a division attribute determination request by a system user from the input unit 704 shown in FIG. 5 (step S801).

Here, the division attribute determination request includes, for example, the value “5” of k for k-anonymity, and the attribute names and corresponding indexes “birth year: 4, medical year: 1”.

Note that the user who uses the anonymized data set specifies a larger index value for an attribute whose degree of generalization (processing) is desired to be suppressed.

Next, the division attribute determination unit 3233 stores, for example in the storage unit 702 illustrated in FIG. 5, the value “5” of k included in the received division attribute determination request, and the attribute names and corresponding indexes “birth year: 4, medical year: 1” (step S802).

Next, the division attribute determination unit 3233 uses the priority determination information 3210 to generate a weight based on the attribute name and the corresponding index “birth year: 4, medical year: 1” (step S803).

Here, the division attribute determination unit 3233 calculates the weight corresponding to the attribute “birth year” as “8” and the weight corresponding to the attribute “medical year” as “1”.

Next, the division attribute determination unit 3233 transmits an information loss amount ILA calculation request to the information loss amount calculation unit 322 (step S804).
Next, the information loss amount calculation unit 322 that has received the calculation request for the information loss amount ILA transmits a request for acquiring the data set sp110 (hereinafter also referred to as a personal data acquisition request) to the personal data storage device 100 (step S805).

Next, the information loss amount calculation unit 322 that has received the data set sp110 calculates the information loss amount ILA, and transmits the calculated information loss amount ILA to the division attribute determination unit 3233 (step S806).

Here, the operation of calculating the information loss amount ILA of the information loss amount calculation unit 322 will be described in detail.

The information loss amount calculation unit 322 calculates the information loss amount ILR of one data record rp111 using, for example, the following formula 1.

[Equation 1]
Information loss amount ILR = (pta-max - pta-min) ÷ (ptb-max - ptb-min) (Expression 1)

Here, pta-max is the maximum attribute value after generalization, and pta-min is the minimum attribute value after generalization. Also, ptb-max is the maximum attribute value before generalization, and ptb-min is the minimum attribute value before generalization.

Since this embodiment is an anonymization embodiment using a top-down approach, it is assumed that the attribute values of the attributes to be anonymized in the data set sp110 are generalized so that they all have the same value.

FIG. 11 is a diagram showing the data set st120 when the attribute values of the attributes to be anonymized in the data set sp110 shown in FIG. 3 are generalized to the same value. That is, the data set st120 shown in FIG. 11 is a data set in which the data set sp110 is generalized to the maximum.

In this case, pta-max is, for example, “1977” (“1977” of “1943 to 1977”), which is the maximum attribute value of the attribute whose attribute name is “birth year” in the data set st120 shown in FIG. 11. Further, pta-min is, for example, “1943” (“1943” of “1943 to 1977”), which is the minimum attribute value of the attribute whose attribute name is “birth year” in the data set st120. Also, ptb-max is, for example, “1977”, which is the maximum attribute value of the attribute whose attribute name is “birth year” in the data set sp110 shown in FIG. 3. Also, ptb-min is, for example, “1943”, which is the minimum attribute value of the attribute whose attribute name is “birth year” in the data set sp110.

Therefore, the information loss amount ILR-birth of one data record rp111 for the attribute whose attribute name is “birth year” is calculated as “1” as follows.

“Information loss amount ILR-birth” = (1977-1943) ÷ (1977-1943) = 1
In addition, the number of data records rp111 included in the data set sp110 is 20. Accordingly, the information loss amount ILA-birth of the attribute whose attribute name is “birth year” is calculated as “20” as follows.

“Information loss amount ILA-birth” = (“information loss amount ILR-birth”) × (number of data records rp111) = 1 × 20 = 20
Similarly, “20” is calculated as the information loss amount ILA-mc of the attribute “medical treatment date”.

The information loss amount calculation unit 322 may instead calculate, as the information loss amount ILR of one data record rp111, the ratio between the number of kinds of attribute values of the attribute after generalization and that before generalization.

The above is the detailed description of the operation of calculating the information loss amount ILA of the information loss amount calculation unit 322.
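As a minimal illustrative sketch, and not as part of the embodiment itself, the calculation described above can be written in Python as follows. The function names are hypothetical. The figures are those of the “birth year” example: a fully generalized range of 1943 to 1977, an original range of 1943 to 1977, and 20 data records.

```python
def information_loss_per_record(after_min: float, after_max: float,
                                before_min: float, before_max: float) -> float:
    """ILR: width of the generalized range divided by the width of the original range."""
    before_width = before_max - before_min
    if before_width <= 0:
        raise ValueError("the original range must have a positive width")
    return (after_max - after_min) / before_width


def information_loss_for_attribute(after_min: float, after_max: float,
                                   before_min: float, before_max: float,
                                   record_count: int) -> float:
    """ILA: the per-record loss ILR multiplied by the number of data records."""
    ilr = information_loss_per_record(after_min, after_max, before_min, before_max)
    return ilr * record_count


# "birth year" generalized to "1943 to 1977" for all 20 records.
ilr_birth = information_loss_per_record(1943, 1977, 1943, 1977)
ila_birth = information_loss_for_attribute(1943, 1977, 1943, 1977, 20)
```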

Returning to the description of FIG. 9A. In the above description, the processing order of step S803, step S804, step S805, and step S806 may be any order. That is, the order may be reversed or simultaneous.

Next, the division attribute determination unit 3233 determines a division attribute (step S807).

Here, the operation of determining the division attribute by the division attribute determination unit 3233 will be described in detail.

The division attribute determination unit 3233 calculates an evaluation value using an evaluation formula including the weight and the information loss amount ILA, and determines a division attribute. Formula 2 shown below is an example of an evaluation formula.

[Equation 2]
Evaluation value = weight × information loss amount ILA (Expression 2)
For example, the evaluation value of the attribute whose attribute name is “birth year” in the data set sp110 is “160” because the weight is “8” and the information loss amount ILA-birth is “20”. Similarly, the evaluation value of the attribute having the attribute name “medical care date” is “20” because the weight is “1” and the information loss amount ILA-mc is “20”.

Next, the division attribute determination unit 3233 determines the attribute having the largest calculated evaluation value as the division attribute. For example, in the case of the data set sp110, since the evaluation value of the attribute whose attribute name is “birth year” is larger than the evaluation value of the attribute whose attribute name is “medical care date”, the division attribute determination unit 3233 determines “birth year” as the division attribute.

The formula for calculating the evaluation value is not limited to Equation 2. Any evaluation formula may be used as long as the calculation result becomes larger as the priority p becomes higher (for example, a value that is larger for a higher priority, like the “weight” in Equation 2) and as the information loss amount ILA becomes larger.

The above is the description of the operation of determining the division attribute by the division attribute determination unit 3233.
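The determination of the division attribute using Equation 2 can be sketched as follows. This is an illustrative Python sketch only; the function name `choose_division_attribute` is hypothetical, and the weights and information loss amounts are the example values given above for the data set sp110.

```python
from typing import Dict, Tuple


def choose_division_attribute(weights: Dict[str, float],
                              losses: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Evaluate weight x ILA per attribute and return the attribute with the largest value."""
    scores = {name: weights[name] * losses[name] for name in weights}
    best = max(scores, key=scores.get)  # on a tie, the first attribute listed wins
    return best, scores


weights = {"birth year": 8, "medical care date": 1}
losses = {"birth year": 20.0, "medical care date": 20.0}
division_attribute, scores = choose_division_attribute(weights, losses)
```

How ties between evaluation values are resolved is not stated in the description; taking the first attribute listed is an assumption of this sketch.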

Next, in FIG. 9B, the division attribute determination unit 3233 transmits a division value determination request to the division value determination unit 3234 (step S808). The division value determination request includes the “birth year” of the attribute name of the division attribute determined by the division attribute determination unit 3233.

The division value determination unit 3234 that has received the division value determination request transmits a personal data acquisition request to the personal data storage device 100 (step S809).
The division value determining unit 3234 that has received the data set sp110 determines a division value (step S810).

Here, the operation of determining the division value by the division value determination unit 3234 will be described in detail.

The division value is a threshold value when dividing the data set with the specified attribute as the division axis. For example, the division value “birth year: 1956” indicates that the data set sp110 is divided into the data record rp111 whose attribute of “birth year” is “1956” or less and the data record rp111 exceeding “1956”.

FIG. 12 is a diagram illustrating an example of the division value candidates 1101 to 1111 of the data set sp110.

First, as illustrated in FIG. 12, the division value determination unit 3234 arranges the data records rp111 of the data set sp110 in ascending order of the attribute value of the attribute determined by the division attribute determination unit 3233.

Next, the division value determination unit 3234 extracts division value candidates 1101 to 1111. Each of the division value candidates 1101 to 1111 extracted by the division value determination unit 3234 is a candidate for a division value such that the number of data records rp111 in each of the first half (also called third personal data) and the second half (also called fourth personal data) of the divided data set sp110 is k or more. For example, if the attribute value “1951” of the “birth year” attribute is taken as the division value, the first half includes five data records rp111 whose attribute value is 1951 or less, and the second half includes 15 data records rp111 whose attribute value is 1952 or more. In this case, the number of data records rp111 in each of the first half and the second half is 5 or more.

In the data set sp110 shown in FIG. 12, the division value determining unit 3234 extracts division value candidates 1101 to 1111.

Next, the division value determination unit 3234 calculates an information loss amount ILA corresponding to each of the division value candidates 1101 to 1111. For example, the division value determining unit 3234 calculates the information loss amount ILA using Equation 1. Note that the division value determination unit 3234 may calculate the information loss amount ILA not only using the equation 1 but also using another calculation equation.

Specifically, for example, when the data set sp110 is divided by the division value candidate 1105, the division value determination unit 3234 calculates the information loss amount ILA as follows.

As shown in FIG. 12, the division value determination unit 3234 sorts the data set sp110 in ascending order of the attribute value of the division attribute “birth year”. When the data set sp110 is divided by the division value of the division value candidate 1105, the information loss amount ILR of one data record rp111 in the first half is (1956-1943) ÷ (1977-1943) = 0.382.

Therefore, the total of the information loss amount ILR in the first half is 0.382 × 9 = 3.438 because the number of data records rp111 is nine.

In addition, the information loss amount ILR of one data record rp111 in the second half is (1977-1961) ÷ (1977-1943) = 0.471.

Therefore, the total of the information loss amount ILR in the latter half is 0.471 × 11 = 5.181 because the number of data records rp111 is 11.

Therefore, the total information loss amount ILA when divided by the division value candidate 1105 is 3.438 + 5.181 = 8.619.

Similarly, the information loss amounts ILA of the attribute “birth year” when divided by the division value candidates 1101 to 1104 are “11.76”, “12.47”, “10.67”, and “10.23”, respectively. Likewise, the information loss amounts ILA of the attribute “birth year” when divided by the division value candidates 1106 to 1111 are “10.00”, “10.05”, “9.88”, “10.14”, “10.70”, and “10.73”, respectively.

The division value determining unit 3234 that has calculated the information loss amount ILA for each of the division value candidates 1101 to 1111 determines “birth year: 1956” of the division value candidate 1105 having the smallest information loss amount ILA as the division value.

The above is the description of the operation of determining the division value by the division value determination unit 3234.
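The determination of the division value described above can be sketched in Python as follows. This is a sketch under stated assumptions, not the implementation of the embodiment: the function names are hypothetical, the information loss is computed with the range ratio of Equation 1 using the range of the original data set as the denominator, and the birth years used are made-up sample values, not the data of FIG. 12.

```python
from typing import List, Optional, Tuple


def total_loss(ordered: List[int], pos: int, span: int) -> float:
    """Sum over both parts of (record count) x (range of the part) / (original range)."""
    parts = (ordered[:pos], ordered[pos:])
    return sum(len(part) * (part[-1] - part[0]) / span for part in parts)


def choose_division_value(values: List[int], k: int,
                          full_min: int, full_max: int) -> Optional[Tuple[int, float]]:
    """Return (division value, ILA) with the smallest loss, or None if no candidate exists.

    A candidate leaves at least k records on each side. The first half holds
    the records whose value is less than or equal to the division value.
    """
    span = full_max - full_min
    if k < 1 or span <= 0:
        raise ValueError("k must be at least 1 and the original range must be positive")
    ordered = sorted(values)
    best: Optional[Tuple[int, float]] = None
    for pos in range(k, len(ordered) - k + 1):
        if ordered[pos - 1] == ordered[pos]:
            continue  # a threshold cannot separate records with equal values
        loss = total_loss(ordered, pos, span)
        if best is None or loss < best[1]:
            best = (ordered[pos - 1], loss)
    return best


# Made-up sample (NOT the data set sp110), with k = 2 and an original range of 1943 to 1977.
birth_years = [1943, 1945, 1950, 1951, 1960, 1961, 1962, 1977]
division_value, loss = choose_division_value(birth_years, 2, 1943, 1977)
```

With 20 data records and k = 5, the loop visits positions 5 to 15, which corresponds to the eleven division value candidates 1101 to 1111 provided that no two adjacent records share the same attribute value.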

Returning to the description of FIG. 9B. Next, the division value determination unit 3234 that has determined the division value transmits the determined division value “birth year: 1956” to the anonymity evaluation unit 3235 (step S811). In other words, the divided value determination unit 3234 requests the anonymity evaluation unit 3235 to evaluate anonymity.

The anonymity evaluation unit 3235 that has received the division value “birth year: 1956” performs anonymity evaluation (step S812).

Here, the operation of anonymity evaluation by the anonymity evaluation unit 3235 will be described in detail.

Anonymity evaluation means evaluating whether or not an anonymity index is satisfied. For each of the first half (third personal data) and the second half (fourth personal data) of the divided data set sp110, the anonymity evaluation unit 3235 evaluates whether the parts obtained by dividing it further would satisfy the anonymity index. That is, it evaluates whether or not the number of data records rp111 is 2k or more for each of the first half and the second half.

The anonymity evaluation unit 3235 counts the number of data records rp111 in each of the first half and the second half divided by the received division value. For example, when the data set is divided by the division value “birth year: 1956”, the anonymity evaluation unit 3235 counts the number of data records rp111 in the first half as nine and the number of data records rp111 in the second half as eleven.

The above is the description of the anonymity evaluation operation by the anonymity evaluation unit 3235.

Returning to the description of FIG. 9B. Next, the anonymization processing unit 323 executes the processing of steps S813 to S815 for the portion that the anonymity evaluation unit 3235 has evaluated as not satisfying the anonymity index (for example, the first half divided by the division value “birth year: 1956”). In addition, the anonymization processing unit 323 executes the processing from step S821 onward for the portion that the anonymity evaluation unit 3235 has evaluated as satisfying the anonymity index (for example, the second half divided by the division value “birth year: 1956”).

The number of data records rp111 in the first half divided by the division value “birth year: 1956” was less than 2k. Therefore, the anonymity evaluation unit 3235 transmits a generalization execution request including “birth year: 1943 to 1956” to the generalization execution unit 3236 (step S813).
Upon receiving the generalization execution request, the generalization execution unit 3236 generalizes the data records rp111 having attribute values “1943” to “1956” of the “year of birth” attribute (step S814).

Specifically, in the data records rp111 whose “birth year” attribute value is from “1943” to “1956”, the generalization execution unit 3236 rewrites the attribute value of the “birth year” attribute to “1943 to 1956” and the attribute value of the “medical treatment date” attribute to “200512 to 201107”.

Next, the generalization execution unit 3236 records the rewritten data record rp111 in the anonymized personal data storage device 200 (step S815). In other words, the generalization execution unit 3236 registers the anonymized personal data in the anonymized personal data storage device 200.

The number of data records rp111 in the second half divided by the division value “birth year: 1956” was 2k or more. Therefore, the anonymization device 320 sets the divided second half (fourth personal data) of the data set sp110 as a new data set sp (new first personal data), and performs the processing from step S821 onward (second anonymization).

In FIG. 10A, the anonymity evaluation unit 3235 transmits a subdivision request including “birth year: 1961 to 1977” to the division attribute determination unit 3233 (step S821).

Next, the division attribute determination unit 3233 that has received the subdivision request uses the priority determination information 3210 to generate a weight based on the attribute names and corresponding indexes “birth year: 4, medical year: 1” (step S822).
Here, the division attribute determination unit 3233 calculates the weight corresponding to the attribute “birth year” as “8” and the weight corresponding to the attribute “medical year” as “1”.

Next, the division attribute determination unit 3233 requests the information loss amount calculation unit 322 to calculate the information loss amount ILA (step S823).
Next, the information loss amount calculation unit 322 that has received the calculation request for the information loss amount ILA transmits, to the personal data storage device 100, an acquisition request for the latter half of the data set sp110, that is, the data records rp111 whose “birth year” attribute values are “1961” to “1977” (step S824). In other words, the information loss amount calculation unit 322 requests the personal data storage device 100 to acquire personal data.

Next, the information loss amount calculation unit 322 that has received the latter half of the data set sp110 calculates the information loss amount ILA, and transmits the calculated information loss amount ILA to the division attribute determination unit 3233 (step S825).

Here, the information loss amount calculation unit 322 calculates the information loss amount ILA-birth-ov1961 of the attribute “birth year” for the latter half of the data set sp110 as follows.

(1977-1961) ÷ (1977-1943) × 11 = 5.181
Further, the information loss amount calculation unit 322 calculates the information loss amount ILA-mc-ov1961 of the attribute “medical care date” as follows.

(201107-200612) ÷ (201107-200512) × 11 = 9.152
Next, the division attribute determination unit 3233 determines a division attribute (step S826).
For example, using Equation 2, the division attribute determination unit 3233 calculates “41.448” as the evaluation value of the “birth year” attribute, because its weight is “8” and the information loss amount ILA-birth-ov1961 is “5.181”. Similarly, the division attribute determination unit 3233 calculates “9.152” as the evaluation value of the “medical care date” attribute, because its weight is “1” and the information loss amount ILA-mc-ov1961 is “9.152”.

Next, since the evaluation value of the attribute of “birth year” is larger than the evaluation value of the attribute of “medical care date”, the division attribute determination unit 3233 determines the attribute whose attribute name is “birth year” as the division attribute.

Next, in FIG. 10B, the division attribute determination unit 3233 transmits a division value determination request including the attribute name “birth year” to the division value determination unit 3234 (step S827).

The division value determination unit 3234 that has received the division value determination request transmits a personal data acquisition request to the personal data storage device 100 (step S828).

Here, the division value determination unit 3234 requests acquisition of the data records rp111 that are the target of the second anonymization (for example, the second half of the data set sp110).

The division value determining unit 3234 that has received the target data record rp111 determines a division value (step S829).

FIG. 13 is a diagram illustrating an example of a divided value candidate 1121 and a divided value candidate 1122 of the data set sp130 (new first personal data) that is a divided second half of the data set sp110.

First, as illustrated in FIG. 13, the division value determination unit 3234 arranges the data records rp111 of the data set sp130 in the order of the attribute values of the attributes determined by the division attribute determination unit 3233.

Next, the division value determination unit 3234 extracts division value candidates. In the data set sp130 illustrated in FIG. 13, the division value determining unit 3234 extracts the division value candidate 1121 and the division value candidate 1122 as division value candidates.

Next, the division value determination unit 3234 calculates an information loss amount ILA-birth for each of the division value candidate 1121 and the division value candidate 1122. In the case of the data set sp130 shown in FIG. 13, the division value determination unit 3234 calculates the information loss amounts ILA-birth for division by the division value candidate 1121 and the division value candidate 1122 as “5.565” and “4.820”, respectively. Subsequently, the division value determination unit 3234 determines “birth year: 1963” of the division value candidate 1122, which has the smallest information loss amount ILA-birth, as the division value.

Next, the division value determination unit 3234 that has determined the division value transmits the determined division value “birth year: 1963” to the anonymity evaluation unit 3235 (step S830). In other words, the divided value determination unit 3234 requests the anonymity evaluation unit 3235 to evaluate anonymity.

The anonymity evaluation unit 3235 that has received the division value “birth year: 1963” performs anonymity evaluation (step S831).

The anonymity evaluation unit 3235 counts the number of data records rp111 in each part divided by the division value “birth year: 1963”. FIG. 14 is a diagram illustrating how the data set sp130 illustrated in FIG. 13 is divided when the division value determination unit 3234 determines “birth year: 1963” of the division value candidate 1122 as the division value. In the example illustrated in FIG. 14, the anonymity evaluation unit 3235 counts six data records rp111 in the data set sp140, which is the first half after the division, and five data records rp111 in the data set sp150, which is the second half after the division.

The number of data records rp111 in each of the data set sp140 and the data set sp150 is less than 2k. Therefore, the anonymity evaluation unit 3235 transmits a generalization execution request including “birth year: 1961 to 1963” and a generalization execution request including “birth year: 1964 to 1977” to the generalization execution unit 3236 (step S813).

Upon receiving the generalization execution requests, the generalization execution unit 3236 generalizes the data records rp111 whose “birth year” attribute values are “1961” to “1963” and the data records rp111 whose “birth year” attribute values are “1964” to “1977” (step S814).
As shown in FIG. 14, the data set sp140 is generalized to have an attribute value of “1961 to 1963” for an attribute of “birth year” and an attribute value of an attribute of “medical year” to “20062 to 201105”. In addition, the attribute value of the “birth year” attribute is generalized to “1964-1977”, and the attribute value of the “medical care month” attribute is generalized to “200706-201104” in the data set sp150.

Next, the generalization execution unit 3236 records the generalized data record rp111 in the anonymized personal data storage device 200 (step S815).
As with the effects of the first embodiment, the effects of the embodiment described above are that the data set can be anonymized under control that matches the purpose of use and that the loss of information in the anonymized data set can be reduced.

The reason is that the division attribute determination unit 3233 generates an evaluation value based on the priority p and the information loss amount ILA, and determines an attribute to be generalized based on the generated evaluation value.
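The overall top-down procedure of this embodiment can be sketched as a short recursive Python function. This is a simplified illustration under several assumptions that the description does not state: the names are hypothetical, each data record holds only the numeric attributes to be anonymized, the information loss is the range ratio of Equation 1 relative to the range of the initial data set, a group that cannot be divided on the chosen attribute is generalized as it is, and the output is ordered by group rather than by the original record order.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, int]


def value_range(records: List[Record], attr: str) -> Tuple[int, int]:
    values = [r[attr] for r in records]
    return min(values), max(values)


def attribute_loss(records: List[Record], attr: str, span: int) -> float:
    """ILA of one group: record count x (range in the group / range in the initial data)."""
    low, high = value_range(records, attr)
    return len(records) * (high - low) / span if span else 0.0


def best_split(records: List[Record], attr: str, k: int,
               span: int) -> Optional[Tuple[List[Record], List[Record]]]:
    """Pick the split with the smallest total loss that leaves k or more records per side."""
    ordered = sorted(records, key=lambda r: r[attr])
    best = None
    for pos in range(k, len(ordered) - k + 1):
        if ordered[pos - 1][attr] == ordered[pos][attr]:
            continue  # equal values cannot be separated by a threshold
        first, second = ordered[:pos], ordered[pos:]
        loss = attribute_loss(first, attr, span) + attribute_loss(second, attr, span)
        if best is None or loss < best[0]:
            best = (loss, first, second)
    return None if best is None else (best[1], best[2])


def anonymize(records: List[Record], weights: Dict[str, float], k: int,
              spans: Optional[Dict[str, int]] = None) -> List[Dict[str, Tuple[int, int]]]:
    """Split while a group has 2k or more records, then generalize each group to ranges."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if spans is None:
        spans = {}
        for a in weights:
            low, high = value_range(records, a)
            spans[a] = high - low
    if len(records) >= 2 * k:
        # Division attribute: largest weight x ILA (Equation 2).
        attr = max(weights, key=lambda a: weights[a] * attribute_loss(records, a, spans[a]))
        split = best_split(records, attr, k, spans[attr])
        if split is not None:
            first, second = split
            return anonymize(first, weights, k, spans) + anonymize(second, weights, k, spans)
    ranges = {a: value_range(records, a) for a in weights}
    return [dict(ranges) for _ in records]


# Made-up sample data (not the data set sp110); k = 2.
sample = [{"birth": y, "month": m} for y, m in zip(
    [1943, 1945, 1950, 1951, 1960, 1961, 1962, 1977],
    [200512, 201107, 200601, 200703, 200610, 201105, 200706, 201104])]
result = anonymize(sample, {"birth": 8, "month": 1}, 2)
```

Because every division leaves at least k records on each side, each group of identical generalized records in `result` contains at least k records.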
<<< Third Embodiment >>>
Next, a third embodiment of the present invention will be described in detail with reference to the drawings. Hereinafter, the description overlapping with the above description is omitted as long as the description of the present embodiment is not obscured.

FIG. 15 is a block diagram showing a configuration of the anonymization device 330 according to the present embodiment. The anonymization apparatus 330 of this embodiment performs anonymization by a bottom-up approach.

As illustrated in FIG. 15, the anonymization device 330 includes a priority determination information storage unit 321, an information loss amount calculation unit 332, and an anonymization processing unit 333.

Note that the anonymization device 330 may be included in the system shown in FIG. 2 instead of the anonymization device 310.

FIG. 16 is a diagram illustrating an example of priority determination information 3310 stored in the priority determination information storage unit 321. As illustrated in FIG. 16, the priority determination information 3310 according to the present embodiment includes one or more sets of a priority order, an attribute name, and a threshold value. The priority order indicates, for example, the order in which the attribute specified by the corresponding attribute name is generalized. The threshold value indicates, for example, the value at which the lower-priority attribute is generalized first. That is, when the value obtained by subtracting the information loss amount ILA of the higher-priority attribute from the information loss amount ILA of the lower-priority attribute exceeds the threshold value, the lower-priority attribute is generalized first.

In FIG. 16, the priority indicates that the smaller the number, the higher the priority. That is, in FIG. 16, the higher priority attribute is an attribute whose attribute name is “age”, and the lower priority attribute is an attribute whose attribute name is “2011 medical care month”.

Note that the priority determination information may include a set of priority and attribute name. In this case, the anonymization processing unit 333 may hold the threshold value in an internal storage unit (not shown), for example.

The information loss amount calculation unit 332 calculates and outputs an information loss amount ILA due to generalization. For example, the information loss amount calculation unit 332 counts the number of different attribute values included in the attribute of the personal data, and sets this as the information loss amount ILA.

The anonymization processing unit 333 includes a generalization attribute determination unit 3333, a generalization execution unit 3336, and an anonymity evaluation unit 3335.

The generalization attribute determination unit 3333 determines an attribute to be generalized. For example, the generalization attribute determination unit 3333 determines an attribute to be generalized as follows. First, the generalization attribute determination unit 3333 calculates the information loss amount difference by subtracting the information loss amount ILA of the attribute having the higher priority from the information loss amount ILA of the attribute having the lower priority. Next, the generalization attribute determination unit 3333 compares the information loss amount difference with the threshold value of the attribute with the higher priority. When the information loss amount difference is equal to or greater than the threshold, the generalization attribute determination unit 3333 determines to generalize the attribute with the lower priority. When the information loss amount difference is less than the threshold, the generalization attribute determination unit 3333 determines to generalize the attribute with the higher priority.
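
The threshold comparison described above can be sketched as follows. This is only a minimal illustration in Python, not the implementation of the embodiment; the function name `choose_attribute_by_threshold` and its parameter names are assumptions introduced here.

```python
def choose_attribute_by_threshold(ila_high, ila_low, threshold,
                                  high_name="high", low_name="low"):
    """Return the name of the attribute to generalize (hypothetical helper).

    ila_high / ila_low: information loss amount ILA of the attribute with
    the higher / lower priority.
    threshold: threshold value associated with the higher-priority attribute.
    """
    # Information loss amount difference: lower-priority ILA minus higher-priority ILA.
    difference = ila_low - ila_high
    # Difference equal to or greater than the threshold: generalize the lower-priority attribute.
    if difference >= threshold:
        return low_name
    # Difference less than the threshold: generalize the higher-priority attribute.
    return high_name
```

For instance, with an ILA of 12 for the higher-priority attribute, 10 for the lower-priority attribute, and a threshold of 3, the difference is −2 and the function returns the higher-priority attribute.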

The generalization attribute determination unit 3333 may also determine the attribute to be generalized as follows. First, the generalization attribute determination unit 3333 sets the priority p of the attribute with the higher priority to the threshold value of that attribute, and sets the priority p of the attribute with the lower priority to “0”. The generalization attribute determination unit 3333 then calculates an evaluation value for each of these attributes using the evaluation formula: evaluation value = information loss amount ILA + priority p. Next, the generalization attribute determination unit 3333 determines the attribute having the larger evaluation value as the attribute to be generalized. If the evaluation values of the attribute with the higher priority and the attribute with the lower priority are the same, the generalization attribute determination unit 3333 may determine, for example, the attribute with the higher priority as the attribute to be generalized.
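
The evaluation-value variant can be sketched in the same way. The following Python fragment is an illustrative assumption, not the embodiment's code; `choose_attribute_by_evaluation` and its parameters are hypothetical names, and the tie rule shown is the one given above as an example.

```python
def choose_attribute_by_evaluation(ila_high, ila_low, threshold,
                                   high_name="high", low_name="low"):
    """Return the name of the attribute to generalize (hypothetical helper)."""
    # Priority p: the threshold for the higher-priority attribute, 0 for the lower one.
    p_high, p_low = threshold, 0
    # Evaluation value = information loss amount ILA + priority p.
    eval_high = ila_high + p_high
    eval_low = ila_low + p_low
    # The attribute with the larger evaluation value is generalized.
    if eval_low > eval_high:
        return low_name
    # Larger eval_high, or a tie resolved in favor of the higher-priority attribute.
    return high_name
```

With this tie rule, a difference exactly equal to the threshold selects the higher-priority attribute, whereas the threshold comparison described earlier selects the lower-priority attribute in that case.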

The generalization execution unit 3336 generalizes the attribute determined by the generalization attribute determination unit 3333.

The anonymity evaluation unit 3335 determines whether the data set generalized by the generalization execution unit 3336 satisfies the anonymity index.

FIG. 17 is a diagram showing an example of a data set sp160 stored in the personal data storage device 100 of the present embodiment. Each of the data records rp161 of the data set sp160 illustrated in FIG. 17 includes attribute values of the attributes “name”, “age”, “2011 medical care month”, and “disease name”. In the present embodiment, “age” and “2011 medical care month” are set as anonymization attributes (quasi-identifiers).

The anonymization device 330 described above may be realized by the computer 700 shown in FIG. 5 similarly to the anonymization device 310 shown in FIG.

FIGS. 18A, 18B, and 18C are sequence diagrams illustrating the operation of the anonymization device 330 according to the present embodiment.

In FIG. 18A, the generalization attribute determination unit 3333 receives an input of an anonymization execution request by a system user from the input unit 704 shown in FIG. 5, for example (step S841).

Here, the anonymization execution request includes, for example, the value of k for k-anonymity (for example, “3”).

Next, the generalization attribute determination unit 3333 that has received the anonymization execution request transmits a priority determination information acquisition request to the priority determination information storage unit 321 (step S842).

Next, the generalization attribute determination unit 3333 that has received the priority determination information 3310 as a response to the priority determination information acquisition request transmits an information loss amount calculation request to the information loss amount calculation unit 332 (step S843).

Next, the information loss amount calculation unit 332 that has received the information loss amount calculation request transmits a personal data acquisition request to the personal data storage device 100 (step S844).

Next, the information loss amount calculation unit 332 that has received the data set sp160 as a response to the personal data acquisition request calculates the information loss amount ILA, and transmits the calculated information loss amount ILA to the generalization attribute determination unit 3333 (step S845).

Here, the information loss amount calculation unit 332 calculates the information loss amount ILA as the number of types of attribute values. That is, the information loss amount calculation unit 332 calculates the information loss amount ILA-birth of the “age” attribute as “12” because the “age” attribute has 12 types of attribute values. Further, the information loss amount calculation unit 332 calculates the information loss amount ILA-mc2011 of the “2011 medical care month” attribute as “10” because the “2011 medical care month” attribute has ten types of attribute values.
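
Counting the types of attribute values can be sketched as below. The Python fragment is a minimal illustration under stated assumptions: the function name `information_loss_amounts` is hypothetical, and the sample records are invented for the example and are not the contents of FIG. 17.

```python
def information_loss_amounts(records, attributes):
    """Return {attribute name: ILA}, where ILA is the number of distinct values."""
    return {attr: len({rec[attr] for rec in records}) for attr in attributes}


# Hypothetical records for illustration only (not the contents of FIG. 17).
sample_records = [
    {"age": 21, "2011 medical care month": 1},
    {"age": 24, "2011 medical care month": 1},
    {"age": 31, "2011 medical care month": 3},
    {"age": 31, "2011 medical care month": 4},
]

ila = information_loss_amounts(sample_records, ["age", "2011 medical care month"])
print(ila)
```

In this invented sample, each of the two attributes has three distinct values, so both information loss amounts are 3.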

Next, the generalization attribute determination unit 3333 that has received the information loss amount ILA determines an attribute to be generalized (step S846).

For example, the generalization attribute determination unit 3333 uses the received priority determination information 3310 to determine the attribute to be generalized based on the received information loss amount ILA.

For example, the generalization attribute determination unit 3333 calculates the information loss amount difference by subtracting the information loss amount ILA of “age”, which is the attribute having the priority “1”, from the information loss amount ILA of “2011 medical care month”, which is the attribute having the priority “2”. That is, 10 − 12 = −2 is calculated. Next, the generalization attribute determination unit 3333 compares the information loss amount difference with the threshold value (“3”) of “age”, which is the attribute having the priority “1”. In this case, since −2 < 3, the generalization attribute determination unit 3333 determines to generalize “age”, which is the attribute having the priority “1”.

Note that the generalization attribute determination unit 3333 may determine the attribute to be generalized using the method described in the first embodiment.

Next, the generalization attribute determination unit 3333 transmits a generalization execution request including the attribute name of the attribute determined to be generalized (in this case, “age”) to the generalization execution unit 3336 (step S847).

Next, the generalization execution unit 3336 that has received the generalization execution request generalizes the data set sp160 shown in FIG. 17 as the data set sp162 shown in FIG. 19 (step S848).

FIG. 19 is a diagram illustrating an example of a data set in the middle of anonymization processing (partially generalized) by the anonymization device 330 of the present embodiment.

Next, the generalization execution unit 3336 transmits an anonymity evaluation request including the data set sp162 to the anonymity evaluation unit 3335 (step S849).

Note that the generalization execution unit 3336 may store the data set sp162 in the storage unit 702 illustrated in FIG. 5 and transmit an anonymity evaluation request including the stored address to the anonymity evaluation unit 3335. The same applies to the anonymity evaluation request below.

Next, the anonymity evaluation unit 3335 that has received the anonymity evaluation request evaluates the anonymity of the data set sp162. In the case of the data set sp162 of FIG. 19, the anonymity evaluation unit 3335 determines that the value of k for k-anonymity (“3”) is not satisfied for the attribute “2011 medical care month” (step S850).
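
The anonymity evaluation can be illustrated with a short Python sketch. It assumes the usual definition of k-anonymity, in which every combination of quasi-identifier values must be shared by at least k records; the function name `satisfies_k_anonymity` and the sample records are hypothetical and are not the contents of FIG. 19.

```python
from collections import Counter


def satisfies_k_anonymity(records, quasi_identifiers, k):
    """True if each combination of quasi-identifier values occurs in at least k records."""
    counts = Counter(tuple(rec[q] for q in quasi_identifiers) for rec in records)
    return all(count >= k for count in counts.values())


# Hypothetical partially generalized records (not the contents of FIG. 19).
partially_generalized = [
    {"age": "21 to 24", "2011 medical care month": 1},
    {"age": "21 to 24", "2011 medical care month": 1},
    {"age": "21 to 24", "2011 medical care month": 2},
]

# "age" alone forms one group of three records, so k = 3 holds for it.
age_only_ok = satisfies_k_anonymity(partially_generalized, ["age"], 3)
# Adding the ungeneralized month splits the group, so k = 3 no longer holds.
both_ok = satisfies_k_anonymity(
    partially_generalized, ["age", "2011 medical care month"], 3)
```

In this invented sample, the not-yet-generalized month attribute is what prevents the data from satisfying k = 3, which mirrors the situation described for the data set sp162.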

Next, in FIG. 18B, the anonymity evaluation unit 3335 transmits a generalization attribute determination request to the generalization attribute determination unit 3333 (step S851).
Next, the generalization attribute determination unit 3333 that has received the generalization attribute determination request transmits an information loss amount calculation request to the information loss amount calculation unit 332 (step S852).

Next, the information loss amount calculation unit 332 that has received the information loss amount calculation request calculates the information loss amount ILA, and transmits the calculated information loss amount ILA to the generalization attribute determination unit 3333 (step S853).

Here, in the case of the data set sp162 shown in FIG. 19, the attribute “age” has four types of attribute values: “21 to 24”, “31 to 40”, “41 to 51”, and “52 to 58”. In addition, the attribute “2011 medical care month” has ten types of attribute values. Therefore, the information loss amount calculation unit 332 calculates the information loss amount ILA-birth and the information loss amount ILA-mc2011, corresponding to the attributes “age” and “2011 medical care month”, as “4” and “10”, respectively.

Next, the generalization attribute determination unit 3333 that has received the information loss amount ILA determines an attribute to be generalized (step S854).

The information loss amount ILA-birth of the attribute “age” with the priority “1” is “4”, and the information loss amount ILA-mc2011 of the attribute “2011 medical care month” with the priority “2” is “10”. Therefore, the information loss amount difference is as follows.

10 − 4 = 6
The generalization attribute determination unit 3333 compares the information loss amount difference (“6”) with the threshold value (“3”) of “age”, which is the attribute having the priority “1”. In this case, since 6 > 3, the generalization attribute determination unit 3333 determines to generalize “2011 medical care month”, which is the attribute having the priority “2”.

Next, the generalization attribute determination unit 3333 transmits a generalization execution request including the attribute name of the attribute determined to be generalized (in this case, “2011 medical care month”) to the generalization execution unit 3336 (step S855).

Next, the generalization execution unit 3336 that received the generalization execution request generalizes the data set sp162 shown in FIG. 19 to the data set sp163 shown in FIG. 20 (step S856).

FIG. 20 is a diagram illustrating an example of a data set in the middle of the anonymization process (partially generalized) by the anonymization apparatus 330 of the present embodiment.

Next, the generalization execution unit 3336 transmits an anonymity evaluation request including the data set sp163 to the anonymity evaluation unit 3335 (step S857).
Next, the anonymity evaluation unit 3335 that has received the anonymity evaluation request evaluates the anonymity of the data set sp163. In the case of the data set sp163 shown in FIG. 20, the anonymity evaluation unit 3335 determines that the value of k for k-anonymity (“3”) is not satisfied for the combination of the “age” attribute and the “2011 medical care month” attribute (step S858).

Next, the anonymity evaluation unit 3335 transmits a generalization attribute determination request to the generalization attribute determination unit 3333 (step S859).
Next, in FIG. 18C, the generalization attribute determination unit 3333 that has received the generalization attribute determination request transmits an information loss amount calculation request to the information loss amount calculation unit 332 (step S860).

Next, the information loss amount calculation unit 332 that has received the information loss amount calculation request calculates the information loss amount ILA, and transmits the calculated information loss amount ILA to the generalization attribute determination unit 3333 (step S861).
Here, in the case of the data set sp163 illustrated in FIG. 20, the attribute “age” has four types of attribute values and the attribute “2011 medical care month” also has four types of attribute values. Therefore, the information loss amount calculation unit 332 calculates both the information loss amount ILA-birth and the information loss amount ILA-mc2011, corresponding to the attributes “age” and “2011 medical care month”, as “4”.

Next, the generalization attribute determination unit 3333 that has received the information loss amount ILA determines an attribute to be generalized (step S862).

The information loss amount ILA-birth of the attribute “age” with the priority “1” is “4”, and the information loss amount ILA-mc2011 of the attribute “2011 medical care month” with the priority “2” is “4”. Therefore, the information loss amount difference is as follows.

4 − 4 = 0
The generalization attribute determination unit 3333 compares the information loss amount difference (“0”) with the threshold value (“3”) of “age”, which is the attribute having the priority “1”. In this case, since 0 < 3, the generalization attribute determination unit 3333 determines to generalize “age”, which is the attribute having the priority “1”.
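
The three determinations made in steps S846, S854, and S862 can be replayed with the short Python sketch below. The information loss amounts and the threshold are the values given in the description; the function name `decide` and the variable names are assumptions introduced for illustration.

```python
THRESHOLD_AGE = 3  # threshold value of "age", the attribute with priority "1"


def decide(ila_birth, ila_mc2011, threshold=THRESHOLD_AGE):
    """Return the attribute to generalize for one round (hypothetical helper)."""
    # Difference = ILA of the priority "2" attribute minus ILA of the priority "1" attribute.
    difference = ila_mc2011 - ila_birth
    return "2011 medical care month" if difference >= threshold else "age"


# (ILA-birth, ILA-mc2011) at steps S846, S854, and S862.
rounds = [(12, 10), (4, 10), (4, 4)]
decisions = [decide(birth, mc) for birth, mc in rounds]
print(decisions)  # ['age', '2011 medical care month', 'age']
```

The differences are −2, 6, and 0, so the attributes chosen are “age”, “2011 medical care month”, and “age”, in that order.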

Next, the generalization attribute determination unit 3333 transmits a generalization execution request including the attribute name of the attribute determined to be generalized (in this case, “age”) to the generalization execution unit 3336 (step S863).

Next, the generalization execution unit 3336 that received the generalization execution request generalizes the data set sp163 shown in FIG. 20 to the data set sp164 shown in FIG. 21 (step S864).

FIG. 21 is a diagram illustrating an example of a data set that has been anonymized by the anonymization device 330 of the present embodiment.

Next, the generalization execution unit 3336 transmits an anonymity evaluation request including the data set sp164 to the anonymity evaluation unit 3335 (step S865).
Next, the anonymity evaluation unit 3335 that has received the anonymity evaluation request evaluates the anonymity of the data set sp164. In the case of the data set sp164 shown in FIG. 21, the anonymity evaluation unit 3335 determines that the data set sp164 satisfies k-anonymity (step S866).

Next, the anonymity evaluation unit 3335 transmits the data set sp164 that satisfies the anonymity to the anonymized personal data storage device 200 (step S867).

The anonymized personal data storage device 200 that has received the data set sp164 stores the data set sp164 as an anonymized data set st120 (anonymized personal data) (step S868).

As with the first embodiment, the effect of the above-described embodiment is that anonymizing the data set under control that matches the purpose of use can be made compatible with reducing the loss of information in the anonymized data set.

The reason is that the generalization attribute determination unit 3333 generates an evaluation value based on the priority order, the threshold value, and the information loss amount ILA, and determines an attribute to be generalized based on the generated evaluation value.
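
The overall bottom-up flow of this embodiment (determine an attribute, generalize it, evaluate anonymity, and repeat) can be sketched as follows. This is a simplified Python illustration under several assumptions: all function names are hypothetical, the `max_rounds` guard is added only to keep the sketch from looping forever, and `halve_codes` is a toy stand-in for generalization that merges neighboring integer codes, whereas the embodiment generalizes values such as ages into ranges.

```python
from collections import Counter


def count_distinct(records, attr):
    """Information loss amount ILA: number of distinct values of attr."""
    return len({rec[attr] for rec in records})


def is_k_anonymous(records, attrs, k):
    """True if every combination of values of attrs occurs at least k times."""
    counts = Counter(tuple(rec[a] for a in attrs) for rec in records)
    return all(c >= k for c in counts.values())


def anonymize_bottom_up(records, high, low, threshold, k, generalize, max_rounds=20):
    """Repeat: choose an attribute, generalize it, then evaluate k-anonymity."""
    for _ in range(max_rounds):
        difference = count_distinct(records, low) - count_distinct(records, high)
        target = low if difference >= threshold else high
        records = generalize(records, target)
        if is_k_anonymous(records, (high, low), k):
            return records
    raise RuntimeError("k-anonymity not reached within max_rounds")


def halve_codes(records, attr):
    """Toy generalizer (assumption): coarsen an integer code by halving it."""
    return [{**rec, attr: rec[attr] // 2} for rec in records]


# Hypothetical data: six records with integer-coded "age" and "month".
data = [{"age": a, "month": m}
        for a, m in [(20, 2), (21, 3), (21, 3), (40, 6), (41, 7), (41, 7)]]
result = anonymize_bottom_up(data, "age", "month", threshold=2, k=3,
                             generalize=halve_codes)
```

With this invented data, the first round generalizes “age” (difference 0) and the second round generalizes “month” (difference 2), after which the two resulting groups each contain three records.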

Each component described in each of the above embodiments does not necessarily need to be an independent entity. For example, a plurality of components may be realized as one module, and one component may be realized by a plurality of modules. A certain component may be a part of another component, and a part of a certain component may overlap a part of another component.

In the embodiments described above, each component and the module that realizes each component may be realized by hardware if necessary. Each component and the module that realizes each component may also be realized by a computer and a program, or by a mixture of hardware modules and a computer and a program.

The program is provided by being recorded on a non-volatile computer-readable recording medium such as a magnetic disk or a semiconductor memory, and is read by the computer when the computer is started up. The read program causes the computer to function as a component in each of the above-described embodiments by controlling the operation of the computer.

In each of the embodiments described above, a plurality of operations are described in order in the form of a flowchart. However, the order of description does not limit the order in which the plurality of operations are executed. For this reason, when each embodiment is implemented, the order of the plurality of operations can be changed within a range that does not hinder the contents.

Furthermore, in each embodiment described above, a plurality of operations are not limited to being executed at different timings. For example, another operation may occur during the execution of a certain operation, or the execution timing of a certain operation and another operation may partially or entirely overlap.

Furthermore, in each of the embodiments described above, a certain operation is described as a trigger for another operation, but that description does not limit all relationships between the certain operation and other operations. For this reason, when each embodiment is implemented, the relationship between the plurality of operations can be changed within a range that does not hinder the contents. The specific description of each operation of each component does not limit each operation of each component. For this reason, each specific operation of each component may be changed within a range that does not impair the functional, performance, and other characteristics in implementing each embodiment.

Although the present invention has been described above with reference to the embodiments and examples, the present invention is not limited to those embodiments and examples. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application No. 2012-181684 filed on August 20, 2012, the entire disclosure of which is incorporated herein.

100 Personal data storage device
110 Data set sp
111 Data record rp
130 Data set sp
140 Data set sp
150 Data set sp
160 Data set sp
161 Data record rp
162 Data set sp
163 Data set sp
164 Data set sp
200 Anonymized personal data storage device
210 Anonymized data set sa
211 Anonymized data record
310 Anonymization device
312 Information loss amount calculation unit
313 Anonymization processing unit
320 Anonymization device
321 Priority determination information storage unit
322 Information loss amount calculation unit
323 Anonymization processing unit
700 Computer
701 CPU
702 Storage unit
703 Storage device
704 Input unit
705 Output unit
706 Communication unit
707 Recording medium
1105 Division value candidate
1121 Division value candidate
1122 Division value candidate
3210 Priority determination information
3233 Division attribute determination unit
3234 Division value determination unit
3235 Anonymity evaluation unit
3236 Generalization execution unit
1101 to 1104 Division value candidates
1101 to 1111 Division value candidates
1106 to 1111 Division value candidates

Claims

An information processing apparatus comprising:
information loss amount calculation means for calculating and outputting an information loss amount corresponding to each attribute included in first personal data to be anonymized; and
anonymization processing means for determining the attribute to be processed based on a priority corresponding to each of the attributes and the information loss amount, and for generating and outputting second personal data obtained by processing the attribute value of the determined attribute of the first personal data.
The information processing apparatus according to claim 1, wherein the priority indicates which of the attributes causes less information loss in the processed second personal data.
The information processing apparatus according to claim 1, further comprising priority determination information storage means for storing information for determining the priority,
wherein the anonymization processing means determines the priority based on the information for determining the priority.
The information processing apparatus according to any one of claims 1 to 3, wherein the anonymization processing means calculates an evaluation value using an evaluation formula whose calculation result becomes larger as the priority becomes higher when the information loss amount is constant, and becomes larger as the information loss amount becomes larger when the priority is constant, and determines the attribute having the maximum calculated evaluation value as the attribute to be processed.
The information processing apparatus according to claim 4, wherein the evaluation formula includes an operation of multiplying an information loss amount and a priority.
The information processing apparatus according to claim 4, wherein the evaluation formula includes an operation of adding the information loss amount and the priority.
The information processing apparatus according to any one of claims 1 to 6, wherein the anonymization processing means includes:
split attribute determining means for calculating the information loss amount that results when the attribute values of each of the attributes to be anonymized in the first personal data are generalized to the same attribute value, and for determining a split attribute based on the calculated information loss amount;
split value determining means for determining a split value of the split attribute such that the information loss amount is minimized when the first personal data is divided with the determined split attribute as an axis and the attribute values of the split attribute are generalized;
anonymity evaluation means for determining whether or not further division is possible for each of third personal data and fourth personal data generated by dividing the first personal data at the determined split value; and
generalization execution means for generalizing and outputting the attribute values of the split attribute of the third personal data and the fourth personal data for which the anonymity evaluation means has determined that further division is not possible,
and wherein the split attribute determining means and the split value determining means process, as new first personal data, the third personal data and the fourth personal data for which the anonymity evaluation means has determined that further division is possible.
The information processing apparatus according to any one of claims 1 to 6, wherein the anonymization processing means includes:
generalization attribute determination means for calculating the information loss amount corresponding to each of the attributes to be anonymized of the first personal data, and for determining a generalization attribute to be generalized based on the calculated information loss amount and the priority;
generalization execution means for generating fifth personal data by generalizing the attribute value of the determined generalization attribute included in the first personal data; and
anonymity evaluation means for determining whether or not the fifth personal data has predetermined anonymity and, when it is determined that the fifth personal data has the predetermined anonymity, outputting the fifth personal data as the second personal data,
and wherein, when the anonymity evaluation means determines that the fifth personal data does not have the predetermined anonymity, the generalization attribute determination means and the generalization execution means process the fifth personal data as new first personal data.
An anonymization processing method in which a computer:
calculates and outputs an information loss amount corresponding to each attribute included in first personal data to be anonymized;
determines the attribute to be processed based on a priority corresponding to each of the attributes and the information loss amount; and
generates and outputs second personal data obtained by processing an attribute value of the determined attribute of the first personal data.
The anonymization processing method according to claim 9, wherein the computer:
calculates an evaluation value using an evaluation formula whose calculation result becomes larger as the priority becomes higher when the information loss amount is constant, and becomes larger as the information loss amount becomes larger when the priority is constant; and
determines the attribute having the maximum calculated evaluation value as the attribute to be processed.
A non-volatile recording medium storing a program for causing a computer to execute:
a process of calculating and outputting an information loss amount corresponding to each attribute included in first personal data to be anonymized;
a process of determining the attribute to be processed based on a priority corresponding to each of the attributes and the information loss amount; and
a process of generating and outputting second personal data obtained by processing the attribute value of the determined attribute of the first personal data.
The non-volatile recording medium according to claim 11, wherein the process of determining the attribute includes:
a process of calculating an evaluation value using an evaluation formula whose calculation result becomes larger as the priority becomes higher when the information loss amount is constant, and becomes larger as the information loss amount becomes larger when the priority is constant; and
a process of determining the attribute having the maximum calculated evaluation value as the attribute to be processed.