WO2016092830A1 - Information processing device, information processing method, and recording medium - Google Patents

Information processing device, information processing method, and recording medium

Info

Publication number
WO2016092830A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
resampling
anonymized
record
matching
Prior art date
Application number
PCT/JP2015/006113
Other languages
English (en)
Japanese (ja)
Inventor
翼 高橋
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 filed Critical 日本電気株式会社
Priority to JP2016563507A priority Critical patent/JPWO2016092830A1/ja
Publication of WO2016092830A1 publication Critical patent/WO2016092830A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Definitions

  • the present invention relates to information processing, and more particularly, to an information processing apparatus, information processing method, and recording medium that handle personal information (privacy information).
  • privacy information personal information
  • privacy information includes data relating to health care (health management) such as health management data, medical history, or a receipt (medical fee description).
  • health management data relating to health care
  • medical history such as health management data, medical history, or a receipt (medical fee description).
  • as another example of privacy information, information on the position of a person or terminal regarding movement or stay, such as a movement history or a use history of a wireless access point, can be cited.
  • a receipt which is privacy information
  • attribute data such as birth year, gender, injury and illness name, and drug name.
  • an attribute that characterizes an individual such as a year of birth or gender and may identify an individual based on a combination with other attributes is called a “quasi-identifier (QI)”.
  • QI quasi-identifier
  • an attribute that is not desired to be known to others, such as an injury or illness name or a drug name is called “sensitive attribute (sensitive information: SA (Sensitive Attribute) or SV (Sensitive Value))”.
  • SA Sensitive Attribute
  • SV Sensitive Value
  • Secondary utilization means, for example, providing privacy information to a third party different from the service provider that generates or accumulates the privacy information, so that the third party uses it in its own service.
  • secondary utilization is providing privacy information to a third party and requesting outsourcing such as analysis.
  • the secondary use of privacy information promotes the analysis or research of privacy information and leads to the enhancement of services using the analysis result or research result. Therefore, based on secondary utilization, a third party can enjoy the high benefits of privacy information.
  • the pharmaceutical company can analyze the co-occurrence relation or correlation of medicines based on the medical information. Using the medical information, the pharmaceutical company can know how the drug is used, and can analyze the usage state. However, it is generally difficult for pharmaceutical companies to obtain medical information.
  • data sets including privacy information are not actively used for secondary use due to concerns about privacy infringement.
  • a data set including a record including a user identifier (user ID) for uniquely identifying a service user and one or more pieces of sensitive information is stored in the information processing apparatus of the service provider.
  • the third party can specify the user of the service using the user identifier. Therefore, in such provision, a problem of privacy infringement may occur.
  • a third party may be able to identify an individual based on the combination of quasi-identifiers. That is, even a data set from which user identifiers are removed constitutes a privacy violation if a certain individual can be identified based on a combination of quasi-identifiers.
  • Anonymization technology (Anonymization) is known as such technology (see, for example, Patent Document 1, Non-Patent Document 1, and Non-Patent Document 2).
  • Non-Patent Document 1 describes “k-anonymity”, which is a widely used anonymity index.
  • k-anonymity is an index requiring that, for every record, there are at least k records having the same quasi-identifier (or quasi-identifier pair) in the data set to be anonymized.
  • a technique for satisfying k-anonymity for a data set to be anonymized is called “k-anonymization”.
  • k-anonymization converts the quasi-identifier of the target record so that there are at least k records having the same quasi-identifier (or quasi-identifier pair) in the data set to be anonymized.
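The k-anonymity condition described above can be checked mechanically. The following is a minimal sketch, assuming records are held as dictionaries; the function name `is_k_anonymous` and the sample values are hypothetical and are not taken from FIG. 4.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical generalized records (illustrative values only).
records = [
    {"Age": "21-30", "Zip-code": "130**", "Disease": "flu"},
    {"Age": "21-30", "Zip-code": "130**", "Disease": "cold"},
    {"Age": "21-30", "Zip-code": "130**", "Disease": "flu"},
    {"Age": "40-50", "Zip-code": "148**", "Disease": "asthma"},
    {"Age": "40-50", "Zip-code": "148**", "Disease": "flu"},
]

print(is_k_anonymous(records, ["Age", "Zip-code"], 2))
print(is_k_anonymous(records, ["Age", "Zip-code"], 3))
```

The second call fails the check because one quasi-identifier group in this sample holds only two records.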
  • a process such as generalization or cutoff is known.
  • the generalization processing is processing for converting original (original) detailed information into abstracted information.
  • the cut-off process is a process for deleting original detailed information.
  • Patent Document 1 describes a technique using k-anonymization.
  • the technique described in Patent Document 1 stores data received from a user terminal after conversion using encryption or the like, decrypts the stored data, processes the data so as to satisfy k-anonymity, and sends the processed data to the service provider's server.
  • k-anonymization is anonymization that guarantees that the number of records associated with the same quasi-identifier is k or more.
  • k-anonymization processes k or more records so that the records for an individual cannot be narrowed down to fewer than k candidates. Based on this behavior, k-anonymization makes it difficult to identify records. Therefore, it can be said that k-anonymity is an index representing the difficulty in identifying a record.
  • FIG. 3 is a diagram showing an example of privacy information.
  • the privacy information shown in FIG. 3 includes a name (Name), an age (Age), a postal code (Zip-code), and a disease (Disease) as attributes.
  • name name
  • Age age
  • Zip-code postal code
  • Disease disease
  • an attribute of name (Name) is a user identifier.
  • the attributes of age (Age) and zip code (Zip-code) are quasi-identifiers.
  • a disease attribute is a sensitive attribute.
  • the attribute name (Name) which is an explicit identifier (user identifier)
  • the quasi-identifier is processed so that anonymity is satisfied.
  • the quasi-identifier is processed so that the records of Alice, Bob, and Carol shown in FIG. 3 have the same quasi-identifier pair.
  • the quasi-identifiers are processed so that the David and Eve records have the same quasi-identifier pair.
  • the value of the attribute age (Age) is processed into a value representing the age range.
  • for the five-digit postal code (Zip-code) attribute, a digit whose value is the same across the records is left as it is, and a digit whose value differs is replaced with “*”.
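The two generalization operations described above (turning ages into a range and masking differing postal code digits) can be sketched as follows. The helper names and the postal codes are hypothetical; only the ages 21, 21 and 30 come from the description of FIG. 3.

```python
def generalize_age(ages):
    """Replace a group of ages with a string representing their range."""
    return f"{min(ages)}-{max(ages)}"


def mask_zip(zip_codes):
    """Keep digits shared by all codes; replace digits that differ with '*'."""
    return "".join(
        digits[0] if len(set(digits)) == 1 else "*"
        for digits in zip(*zip_codes)
    )


ages = [21, 21, 30]                        # ages of the top three records
zip_codes = ["13053", "13068", "13053"]    # hypothetical five-digit codes

print(generalize_age(ages))
print(mask_zip(zip_codes))
```

Both helpers discard detail, which is the information loss discussed next.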
  • Information loss occurs based on the action of processing the value of this attribute into an ambiguous value. This degree of information loss is called “information loss” or “information loss amount”.
  • as measures of information loss, indicators such as NCP (Normalized Certainty Penalty), EM (Entropy Measure), or DM (Distortion Metric) are widely known.
  • anonymity indices other than k-anonymity include (k, 1)-anonymity, (1, k)-anonymity, (k, k)-anonymity, and k-concealment (see, for example, Non-Patent Document 2).
  • these anonymity indices are indices of anonymity based on matching between the records (t) included in the original data set (T) and the records (t * ) included in the anonymized data set (T * ).
  • given a certain anonymizing operator A, when the relationship “A (t) ⁇ t * ” or “A (t * ) ⁇ t” holds, t and t * are said to be “matching”, or “matching exists” between t and t * .
  • the “data set” may be simply referred to as “data”.
  • the “data set after anonymization” is simply referred to as “anonymized data set” or “anonymized data”.
  • original data the privacy information that is the source of anonymization
  • original data set The original data is subject to anonymization processing. Therefore, in the following description, data in a state where the user identifier is deleted from the original data is also referred to as “original data”.
  • Alice's record matches the first and third records of anonymized data.
  • FIG. 6 is a diagram showing an extracted graph (anonymized matching graph) showing matching in FIG.
  • the matching between the original data and the anonymized data records can be expressed using a bipartite graph with each record of the original data and the anonymized data as a vertex and the matching as an edge.
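One way to picture such a bipartite graph is as an adjacency mapping from anonymized records to the original records they match. The sketch below assumes, purely for illustration, that a record matches when its age lies inside the generalized range and its postal code agrees with every unmasked digit; this consistency test, the function names, and the sample values are assumptions rather than the definition used in the cited documents.

```python
def matches(original, anonymized):
    """Assumed consistency test between an original and a generalized record."""
    low, high = anonymized["Age"]
    age_ok = low <= original["Age"] <= high
    zip_ok = all(
        pattern in ("*", digit)
        for pattern, digit in zip(anonymized["Zip-code"], original["Zip-code"])
    )
    return age_ok and zip_ok


def build_matching_graph(originals, anonymized_records):
    """Map each anonymized record index to the indices of matching originals."""
    return {
        j: [i for i, t in enumerate(originals) if matches(t, a)]
        for j, a in enumerate(anonymized_records)
    }


# Hypothetical data: vertices on one side are originals, on the other anonymized.
originals = [
    {"Age": 21, "Zip-code": "13053"},
    {"Age": 21, "Zip-code": "13068"},
    {"Age": 30, "Zip-code": "13053"},
    {"Age": 45, "Zip-code": "14850"},
]
anonymized_records = [
    {"Age": (21, 30), "Zip-code": "130**"},
    {"Age": (40, 50), "Zip-code": "148**"},
]

graph = build_matching_graph(originals, anonymized_records)
print(graph)
```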
  • this bipartite graph is referred to as an anonymized matching graph.
  • (k, k)-anonymity is an index that guarantees that both (k, 1)-anonymity and (1, k)-anonymity are satisfied at the same time.
  • k-concealment is an index that guarantees that each edge of the anonymized matching graph constituting a (k, 1)-anonymity matching is part of a perfect matching.
  • a perfect matching is a set of edges of the bipartite graph showing matching that covers all vertices and in which no two edges share an end point.
  • in the bipartite graph, one side is the original data and the other side is the anonymized data. With each record as a vertex, the bipartite graph is constructed by adding an edge between the vertices of records for which matching exists.
  • one perfect matching is a bijection between the records of the original data and the records of the anonymized data. In other words, each original data record is paired with a unique anonymized data record.
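To make the notion of a perfect matching concrete, the following sketch searches a small bipartite graph for one using a standard augmenting-path search. This is a generic textbook technique swapped in for illustration, not the procedure of Non-Patent Document 2; the function name and the sample graph are hypothetical, and both sides are assumed to have the same number of vertices.

```python
def find_perfect_matching(edges, n):
    """Return {original_index: anonymized_index} or None if no perfect matching exists.

    edges[i] lists the anonymized-record indices adjacent to original record i;
    both sides of the bipartite graph are assumed to contain n vertices.
    """
    match_of_anon = {}  # anonymized index -> original index currently assigned

    def try_assign(i, visited):
        for j in edges[i]:
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current partner can be moved elsewhere.
            if j not in match_of_anon or try_assign(match_of_anon[j], visited):
                match_of_anon[j] = i
                return True
        return False

    for i in range(n):
        if not try_assign(i, set()):
            return None
    return {i: j for j, i in match_of_anon.items()}


# Hypothetical graph with three records on each side.
edges = {0: [0, 1], 1: [0], 2: [2]}
print(find_perfect_matching(edges, 3))
```

When a matching is returned it is a bijection between the two sides, so every original record is paired with exactly one anonymized record.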
  • the technique described in Non-Patent Document 2 is an anonymization technique that satisfies k-concealment. And the anonymization described in Non-Patent Document 2 is anonymization based on matching.
  • the technique described in Non-Patent Document 2 first generates anonymized data having the same set of quasi-identifiers as the original data from the original data.
  • the technique described in Non-Patent Document 2 generates a complete matching bipartite graph based on matching between the original data record and the anonymized data record.
  • the addition of an edge to the bipartite graph corresponds to an operation of anonymizing the record of anonymized data that is the other end point so that it matches the original data record that is the end point of the added edge. That is, the anonymized data is processed based on the addition of edges. As a result, information loss occurs.
  • the anonymization technique described in Non-Patent Document 2 uses the three steps of adding edges as follows to satisfy k-concealment so as to reduce information loss. That is, the technique described in Non-Patent Document 2 satisfies (k, 1) -anonymity with low information loss by using edge addition in the first step. Then, the technique described in Non-Patent Document 2 satisfies (k, k) -anonymity with a small information loss by using edge addition in the second step. Then, the technique described in Non-Patent Document 2 uses the addition of edges in the third step to satisfy k-concealment with low information loss.
  • the values of anonymized data may have a format different from the format of the values of the original data that is the anonymization target. When the value format is different, an application designed for the value format in the original data cannot perform an operation using the anonymized data, or causes an inconvenience such as calculating an invalid value.
  • the format of the age value in the attribute of the original data shown in FIG. 3 is a format representing one numerical value.
  • the format of the attribute age (Age) value in the anonymized data shown in FIG. 4 is a format representing a range of numerical values.
  • the format of the value of the anonymized data may be different from the format of the value of the original data.
  • Patent Document 1 Non-Patent Document 1
  • Non-Patent Document 2 the techniques described in Patent Document 1, Non-Patent Document 1, and Non-Patent Document 2 have a problem that anonymized data cannot be applied to an application designed for original data.
  • the original data is T
  • the anonymized data is T *
  • the attribute A of the original data T
  • the attribute of the anonymized data T * obtained by anonymizing the attribute A of the original data T is A *
  • a set of attribute A values (hereinafter, this set is referred to as a “domain”) is D (A).
  • the device that performs anonymization sets the domain D (A * ) of the attribute A * of the anonymized data T * to the same value as the domain D (A) of the attribute A of the original data.
  • Resampling is known as such a rework technique. Resampling is a technique in which data created (projected) based on certain data is re-created (projected) based on another viewpoint.
  • Resampling can be realized using setting as an attribute value.
  • resampled data not only random selection but also anonymized data resampled using a value selected based on a predetermined selection method from a range of attributes in anonymized data.
  • the attribute value in the general resampling data is a value selected at random from the attribute value range of the anonymized data. Therefore, the property (for example, bias or distribution) of the attribute value in the resampling data is different from the property of the attribute in the original data. As a result, when using resampling data based on general resampling, it is difficult for a user of resampling data to guess the nature of the attribute of the original data. This will be further described with reference to the drawings.
  • FIG. 9 is a diagram illustrating an example of resampled data resampled based on the data illustrated in FIG. 4.
  • the top three records shown in FIG. 9 include values of attributes (age and zip code) resampled based on the range represented by the top three records shown in FIG.
  • the bias of the age attribute in the record shown in FIG. 9 and the record shown in FIG. 3 is confirmed. Specifically, as an example of bias, explanation will be given using average and variance.
  • the age values of the three records from the top of the original data shown in FIG. 3, which correspond to the three records from the top shown in FIG. 9, are two “21” and one “30”. Therefore, the average of the three records from the top of the original data is “24”, and the variance is “18”.
  • the age values of the attributes of the three records shown in FIG. 9 are “23”, “25”, and “29”.
  • the average of the records generated based on resampling is “25.67 (rounded to two decimal places)”, and the variance is “6.22 (rounded to two decimal places)”.
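The averages and variances quoted above can be reproduced with Python's standard `statistics` module. This sketch assumes the variance is the population variance (dividing by the number of records), which is the reading consistent with the quoted figures.

```python
import statistics

original_ages = [21, 21, 30]     # top three records of the original data
resampled_ages = [23, 25, 29]    # top three records after general resampling

for label, ages in (("original", original_ages), ("resampled", resampled_ages)):
    mean = statistics.mean(ages)
    variance = statistics.pvariance(ages)  # population variance
    print(label, round(mean, 2), round(variance, 2))
```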
  • it is unknown to what extent the property (the variance in this case) of the original data shown in FIG. 3 is retained in the resampling data shown in FIG. This is because information on properties (e.g., bias or distribution) in the original data is missing from the resampling data.
  • the resampling data is data that maintains the properties of the original data as much as possible.
  • anonymized data it is possible to express the nature (for example, bias or distribution) of the original data.
  • the anonymized data needs to include the property of the original data. Therefore, anonymization data becomes a complicated format.
  • the anonymized data is data that is not easy to use. Even if the distribution of the original data is known, the property of the original data cannot be restored using the resampling data unless appropriate resampling is performed.
  • Patent Document 1 Non-Patent Document 1
  • Non-Patent Document 2 are not techniques related to resampling, and thus cannot solve the above problems.
  • An object of the present invention is to provide an information processing apparatus, an information processing method, and a recording medium that solve the above-described problems and generate resampling data in which a predetermined property of the original data is preserved.
  • An information processing apparatus includes: matching record extraction means that, based on matching information indicating matching (the relationship between original data, which is a set of records including an attribute that is an anonymization target, and anonymized data, which is data obtained by anonymizing the attribute included in the records of the original data), extracts the record group of the original data corresponding to a record of the anonymized data; resampling source generating means that generates, based on the record group, a resampling source used to generate resampling data, which is a value to be set to the attribute included in the record of the anonymized data; and resampling means that generates, based on the resampling source, the resampling data to be set to the attribute included in the record of the anonymized data.
  • a data processing method includes: extracting, based on matching information indicating matching (the relationship between original data, which is a set of records including an attribute that is an anonymization target, and anonymized data, which is data obtained by anonymizing the attribute included in the records of the original data), the record group of the original data corresponding to a record of the anonymized data; generating, based on the record group, a resampling source used to generate resampling data, which is a value to be set to the attribute included in the record of the anonymized data; and generating, based on the resampling source, the resampling data to be set to the attribute included in the record of the anonymized data.
  • a recording medium records, in a computer-readable manner, a program for causing a computer to execute: a process of extracting, based on matching information indicating matching (the relationship between original data, which is a set of records including an attribute to be anonymized, and anonymized data, which is data obtained by anonymizing the attribute included in the records of the original data), the record group of the original data corresponding to a record of the anonymized data; a process of generating, based on the record group, a resampling source used to generate resampling data, which is a value to be set to the attribute included in the record of the anonymized data; and a process of generating, based on the resampling source, the resampling data to be set to the attribute included in the record of the anonymized data.
  • FIG. 1 is a block diagram showing an example of the configuration of the information processing apparatus according to the first embodiment of the present invention.
  • FIG. 2 is a block diagram illustrating an example of a modification of the configuration of the information processing apparatus according to the first embodiment.
  • FIG. 3 is a diagram illustrating an example of privacy information.
  • FIG. 4 is a diagram illustrating an example of information after the anonymization of the privacy information illustrated in FIG.
  • FIG. 5 is a diagram showing an example of matching between the privacy information shown in FIG. 3 and anonymized data anonymized so as to satisfy (1, k) -anonymity.
  • FIG. 6 is a diagram showing an anonymization matching graph in FIG.
  • FIG. 7 is a diagram showing an example of matching between the privacy information shown in FIG.
  • FIG. 8 is a diagram showing an example of matching between the privacy information shown in FIG. 3 and anonymized data that is anonymized so as to satisfy (k, k) -anonymity.
  • FIG. 9 is a diagram illustrating an example of data obtained by resampling the data illustrated in FIG.
  • FIG. 10 is a flowchart illustrating an example of the operation of the information processing apparatus according to the first embodiment.
  • FIG. 11 is an anonymized matching graph used for explanation.
  • FIG. 12 is a diagram illustrating an example of a continuous probability distribution.
  • FIG. 13 is a diagram illustrating an example of a probability distribution generated based on the age and the zip code.
  • FIG. 14 is a diagram illustrating an example of resampling data in the middle of processing.
  • FIG. 15 is a diagram illustrating an example of resampling data in the next process of FIG.
  • FIG. 16 is a diagram illustrating an example of a result after resampling with respect to all anonymized data records.
  • FIG. 17 is a diagram in which the resampling data of this embodiment and the resampling data shown in FIG. 9 are arranged in the probability distribution shown in FIG.
  • the anonymity targeted by the present embodiment is not particularly limited as long as it is anonymity that can be expressed using a technique for expressing a predetermined matching (correspondence relationship) between the original data that is the anonymization target and the anonymized data. Examples of such anonymity include k-anonymity, l-diversity, and t-closeness.
  • the matching expression means (for example, a graph) is not particularly limited as long as the above matching can be expressed.
  • this embodiment may use an anonymized matching graph satisfying arbitrary anonymity. That is, the anonymization matching graph is an example of matching information indicating matching.
  • the present embodiment may use a graph or a data expression technique (for example, a table or a database) using a data structure or information format that is equivalent to a matching graph.
  • an anonymized matching graph using a bipartite graph expression is used as an example of a graph handled by the embodiment of the present invention.
  • the anonymization matching graph may be a graph of a different format instead of the graph format as shown in FIG. 6, for example.
  • the anonymization matching graph may be a graph in which information representing the matching relationship between the records of the original data and the anonymization data (for example, a relational expression representing the relationship or character data) is stored.
  • information representing the matching relationship between the records of the original data and the anonymization data for example, a relational expression representing the relationship or character data
  • the embodiment of the present invention uses a data set including five records as a data set to be handled. However, this does not limit the data set handled by the embodiment of the present invention. Embodiments of the invention may use a data set that includes less than 5 records or more than 5 records.
  • FIG. 1 is a block diagram showing an example of the configuration of the information processing apparatus 10 according to the first embodiment of the present invention.
  • the information processing apparatus 10 includes a matching record extraction unit 11, a resampling source generation unit 12, and a resampling unit 13.
  • the matching record extraction unit 11 extracts a record group (M (t * )) that is a set of records (t) of original data corresponding to (matching) the record (t * ) of anonymized data.
  • the resampling source generation unit 12 generates, based on the record group (M (t * )) of the original data, a resampling source (hereinafter, a probability distribution is used as an example) that the resampling unit 13 uses for data extraction.
  • the resampling source generation unit 12 is not limited to the probability distribution, and may generate other information, for example, distribution information or probability.
  • information generated by the resampling source generation unit 12 including distribution information or probability is collectively referred to as probability distribution.
  • the resampling unit 13 extracts (generates) resampling data based on the probability distribution (resampling source).
  • the data set (original data) to be anonymized and resampled in the information processing apparatus 10 is a data set that includes data (sensitive attributes (SA)) whose disclosure or use is not desirable. This data set is a collection of records including one or more attributes. Each record includes at least one sensitive attribute.
  • the information processing apparatus 10 anonymizes the quasi-identifier (QI) of the data set (original data) so that an individual related to the sensitive information included in a record cannot be identified. As already explained, the data after anonymization is the anonymized data. Further, the information processing apparatus 10 generates (extracts) resampled data (resampling data) based on the anonymized data.
  • the original data is personal information shown in FIG.
  • age Age
  • zip code Zip-code
  • QI quasi-identifiers
  • SA sensitive attribute
  • Name an identifier. Therefore, the name is not included in the anonymized data.
  • the bipartite graph shown in FIG. 11 is an anonymized matching graph used for explanation. It is assumed that the anonymization matching graph shown in FIG. 11 already satisfies predetermined anonymity with respect to the original data. For reference, FIG. 11 displays original data and anonymized data.
  • FIG. 10 is a flowchart showing an example of the operation of the information processing apparatus 10 according to the first embodiment.
  • the matching record extraction unit 11 targets a single anonymized data record (hereinafter also referred to as a node) in the anonymized matching graph, and extracts the record group (M (t * )) of original data that matches the record (step S101).
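Step S101 can be pictured as a lookup over the edges of the anonymized matching graph. The sketch below assumes the graph is stored as a list of (original index, anonymized index) pairs; this representation, the function name, and the sample values are hypothetical.

```python
def extract_record_group(matching_edges, originals, anon_index):
    """Return M(t*): the original records joined by an edge to anonymized record t*."""
    return [originals[o] for o, a in matching_edges if a == anon_index]


# Hypothetical original data and matching edges (original index, anonymized index).
originals = [
    {"Age": 21, "Zip-code": "13053"},
    {"Age": 21, "Zip-code": "13068"},
    {"Age": 30, "Zip-code": "13053"},
]
matching_edges = [(0, 0), (1, 0), (1, 1), (2, 1), (0, 2), (2, 2)]

group = extract_record_group(matching_edges, originals, 1)
print(group)
```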
  • the resampling source generation unit 12 generates (constructs) a probability distribution used for resampling in the resampling unit 13 based on the record group M (t * ) (step S102).
  • the resampling source generation unit 12 may generate a probability distribution for each attribute. Alternatively, the resampling source generation unit 12 may generate a probability distribution (multi-dimensional probability distribution) for all attribute combinations. Alternatively, the resampling source generation unit 12 may generate a probability distribution by applying a general distribution estimation technique to the records included in the record group M (t * ).
  • the resampling source generation unit 12 generates (constructs) a probability distribution for each attribute.
  • a case where the resampling source generation unit 12 generates a probability distribution using the age (Age) of the attribute of the record t * 3 shown in FIG. 11 will be described.
  • the probability distribution generated by the resampling source generation unit 12 is a discrete probability distribution.
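A discrete probability distribution for one attribute can be sketched by counting how often each value appears in the record group. The ages used here are hypothetical stand-ins for the record group of t * 3 in FIG. 11, whose actual values are not reproduced in this text.

```python
import random
from collections import Counter


def discrete_distribution(values):
    """Map each distinct value to its relative frequency in the record group."""
    counts = Counter(values)
    return {value: count / len(values) for value, count in counts.items()}


group_ages = [21, 21, 30]  # hypothetical ages in M(t*)
distribution = discrete_distribution(group_ages)

# Draw one resampled value according to the distribution.
rng = random.Random(0)  # seeded only so the sketch is repeatable
drawn = rng.choices(list(distribution), weights=list(distribution.values()), k=1)[0]
print(distribution, drawn)
```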
  • the resampling source generation unit 12 may generate a continuous probability distribution. For example, based on the assumption that each value (original data value) occurs in a normal distribution, the resampling source generation unit 12 sets the distribution peak height for each value according to the appearance ratio of each value.
  • FIG. 12 is a diagram showing an example of the probability distribution in this case.
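A continuous distribution of this kind can be sketched as a mixture of normal distributions, one centered on each observed value and weighted by that value's appearance ratio. The standard deviation `sigma`, the function names, and the sample ages are assumptions made for illustration; the source does not specify them.

```python
import math
import random
from collections import Counter


def mixture_density(values, sigma=1.0):
    """Return a density function: normal bumps weighted by appearance ratio."""
    counts = Counter(values)
    n = len(values)

    def density(x):
        return sum(
            (count / n)
            * math.exp(-((x - value) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi))
            for value, count in counts.items()
        )

    return density


def sample_mixture(values, sigma=1.0, rng=random):
    """Pick a center in proportion to its frequency, then add normal noise."""
    center = rng.choice(values)
    return rng.gauss(center, sigma)


density = mixture_density([21, 21, 30])
print(density(21), density(30), density(25))
```

With these sample ages the peak at 21 is about twice as high as the peak at 30, reflecting that 21 appears twice as often.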
  • the resampling source generation unit 12 may generate a probability distribution based on a combination of attributes. Also in this case, the resampling source generation unit 12 can generate a probability distribution as in the case of generating a probability distribution for each attribute.
  • FIG. 13 is a diagram showing an example of a probability distribution generated based on two attributes of an attribute age (Age) and an attribute zip code (Zip-code).
  • FIG. 13 is a plane of age and zip code, and the direction perpendicular to the plane is the probability.
  • FIG. 13 shows the probability using lines that connect points of equal probability, like contour lines on a map.
  • “x” indicates a peak, that is, the place having the highest probability (a local maximum) in its vicinity. FIG. 13 thus shows a probability distribution that includes three local maxima as regions of high probability.
  • the resampling unit 13 generates resampling data for an attribute that is anonymized in the anonymized data (in this case, a quasi-identifier (QI)) based on the probability distribution generated by the resampling source generation unit 12.
  • the method for generating resampling data from the probability distribution in the resampling unit 13 is not particularly limited.
  • the resampling unit 13 may use, as resampling data, a value generated using a function that generates random numbers according to a probability distribution.
  • the probability distribution is a normal distribution.
  • functions that generate random numbers with a uniform distribution are widely supported. Therefore, the resampling unit 13 generates random numbers that follow a uniform distribution. Then, the resampling unit 13 may generate a random number according to the normal distribution from the generated uniform distribution random number using the Box-Muller's method.
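The Box-Muller transform mentioned above converts two uniform random numbers into a normally distributed one. The following is a minimal sketch of the basic (cosine) form; the function name and the parameter values are chosen only for illustration.

```python
import math
import random


def box_muller(mu, sigma, rng=random):
    """Generate one normal random number from two uniform random numbers."""
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z


rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [box_muller(21.0, 2.0, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(round(sample_mean, 1))
```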
  • the resampling unit 13 may generate a random number according to a normal distribution using a software product such as statistical software.
  • the resampling unit 13 may generate a random number according to a normal distribution by using a programming language, such as Python, that supports a function that generates a random number according to a normal distribution (for example, the function random.gauss ()).
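As a brief illustration of the `random.gauss()` route, the sketch below draws an age from a normal distribution and rounds it so that it keeps the single-number format of the original data. The mean of 21 follows the example in this description; the standard deviation of 2 and the helper name are assumptions.

```python
import random


def resample_age(mean, sigma, rng):
    """Draw an age from a normal distribution and round it to a whole number."""
    return round(rng.gauss(mean, sigma))


rng = random.Random(42)  # seeded only so the sketch is repeatable
ages = [resample_age(21, 2, rng) for _ in range(5)]
print(ages)
```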
  • the resampling unit 13 may extract (select) resampling data from a data set generated based on a probability distribution by a data generation unit (not shown).
  • the resampling unit 13 generates or extracts resampling data based on the probability distribution.
  • FIG. 14 is a diagram showing an example of resampling data in this state.
  • the resampling unit 13 uses the probability distribution generated by the resampling source generation unit 12 to generate a value to be set in the attribute of the anonymized data record. Then, the resampling unit 13 sets the generated value to the attribute of the anonymized data, and generates (extracts) the resampling data.
  • this operation is also simply referred to as resampling.
  • the probability distribution used for data extraction by the resampling unit 13 is a distribution generated based on the record group (M (t * )) of the original data corresponding to the record of the anonymized data. Therefore, the data extracted by the resampling unit 13 is data extracted based on the distribution of the record group M (t * ) of the original data. As a result, the data extracted by the resampling unit 13 is highly likely to have properties close to the properties (for example, distribution or bias) of the original data.
  • The value of an attribute (for example, a quasi-identifier) of the resampling data is generated based on the resampling source, that is, based on a probability distribution reflecting the nature of the original data.
  • The value of the sensitive attribute may be the value of the sensitive attribute in the original data record.
  • Alternatively, as with the anonymized quasi-identifier, the information processing apparatus 10 may generate a probability distribution (resampling source) of the sensitive attribute in the record group M(t*) and generate (or select) the value of the sensitive attribute based on that probability distribution.
  • The information processing apparatus 10 executes the resampling process described above for all anonymized data records.
  • The information processing apparatus 10 ends the resampling operation when the resampling process has been completed for all the anonymized data records.
  • The order in which the anonymized data records are processed is not particularly limited.
  • The information processing apparatus 10 may determine the order using an arbitrary method (for example, random or round robin).
  • The information processing apparatus 10 may process a plurality of records simultaneously, that is, in parallel.
  • FIG. 15 illustrates an example of a result of the resampling performed by the information processing apparatus 10 on the record t * 1 of the anonymized data after the resampling illustrated in FIG.
  • The resampling unit 13 generates (extracts) “21” as the age of the record t * based on a normal distribution whose mean age is 21.
  • FIG. 16 is a diagram illustrating an example of a result after the information processing apparatus 10 performs resampling on all anonymized data records.
  • The ages of the three records from the top of the resampling data are “21”, “29”, and “22”. In this case, the mean is 24 and the variance is ((21 − 24)² + (29 − 24)² + (22 − 24)²) / 3 = 38 / 3 ≈ 12.67 (rounded to two decimal places).
  • The property of the resampling data shown in FIG. 16 (in this case, the variance indicating the bias) is closer to the property of the original data than that of the resampling data shown in FIG.
  • FIG. 17 is a diagram in which the resampling data of this embodiment and the resampling data shown in FIG. 9 are plotted on the probability distribution shown in FIG.
  • FIG. 17 is a diagram showing the ages of the three records from the top in FIG.
  • The black circles are the resampling data of this embodiment.
  • The white circles are the resampling data shown in FIG.
  • The resampling data of this embodiment is data with a high probability, that is, data having properties close to those of the original data.
  • The resampling data has the same value format as the original data.
  • The information processing apparatus 10 can achieve the effect of generating resampling data that preserves a predetermined property (for example, bias or distribution) of the original data.
  • The reason is as follows. The matching record extraction unit 11 extracts the record group of the original data that matches the anonymized data, based on the matching relationship between the original data and the anonymized data (for example, using an anonymized matching graph). The resampling source generation unit 12 then generates, based on that record group, a resampling source (probability distribution) used for generating resampling data. The resampling unit 13 then resamples the value of the anonymized attribute (quasi-identifier) based on this resampling source. As a result, the information processing apparatus 10 can generate resampling data that has the properties of the original data (for example, bias and distribution). The resampling data also has the same value format as the original data.
  • The resampling data generated by the information processing apparatus 10 according to the first embodiment can be used in applications and analyzers designed for the original data.
  • Such applications and analyzers can use the resampling data as data in which the properties of the original data are maintained.
  • Because it generates resampling data, the information processing apparatus 10 of the present embodiment protects the privacy of the data set. Furthermore, the information processing apparatus 10 can generate resampling data from which applications and analyzers obtain results close to those obtained from the original data.
  • Each component of the information processing apparatus 10 may be configured with a hardware circuit.
  • Each component may be configured using a plurality of devices connected via a network.
  • A plurality of components may be configured with a single piece of hardware.
  • The information processing apparatus 10 may be realized as a computer device including a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory).
  • The information processing apparatus 10 may be realized as a computer device that further includes an input/output connection circuit (IOC: Input/Output Circuit) and a network interface circuit (NIC: Network Interface Circuit).
  • FIG. 2 is a block diagram showing an example of the configuration of the information processing apparatus 600 according to this modification.
  • The information processing apparatus 600 includes a CPU 610, a ROM 620, a RAM 630, an internal storage device 640, an IOC 650, and a NIC 680, and constitutes a computer device.
  • The CPU 610 reads a program from the ROM 620.
  • The CPU 610 controls the RAM 630, the internal storage device 640, the IOC 650, and the NIC 680 based on the read program.
  • The computer including the CPU 610 controls these components and implements the functions of the matching record extraction unit 11, the resampling source generation unit 12, and the resampling unit 13 shown in FIG
  • The CPU 610 may use the RAM 630 or the internal storage device 640 as temporary storage for a program when realizing each function.
  • The CPU 610 may use a storage medium reading device (not shown) to read the program from the storage medium 700, which stores the program in a computer-readable manner. Alternatively, the CPU 610 may receive a program from an external device (not shown) via the NIC 680, store the program in the RAM 630, and operate based on the stored program.
  • The ROM 620 stores programs executed by the CPU 610 and fixed data.
  • The ROM 620 is, for example, a P-ROM (Programmable-ROM) or a flash ROM.
  • The RAM 630 temporarily stores programs executed by the CPU 610 and data.
  • The RAM 630 is, for example, a D-RAM (Dynamic-RAM).
  • The internal storage device 640 stores data and programs that the information processing apparatus 600 retains over the long term. The internal storage device 640 may also operate as a temporary storage device for the CPU 610.
  • The internal storage device 640 is, for example, a hard disk device, a magneto-optical disk device, an SSD (Solid State Drive), or a disk array device.
  • The ROM 620 and the internal storage device 640 are non-transitory storage media.
  • The RAM 630 is a volatile storage medium.
  • The CPU 610 can operate based on a program stored in the ROM 620, the internal storage device 640, or the RAM 630. That is, the CPU 610 can operate using a nonvolatile storage medium or a volatile storage medium.
  • The IOC 650 mediates data between the CPU 610, the input device 660, and the display device 670.
  • The IOC 650 is, for example, an IO interface card or a USB (Universal Serial Bus) card.
  • The input device 660 is a device that receives input instructions from an operator of the information processing apparatus 600.
  • The input device 660 is, for example, a keyboard, a mouse, or a touch panel.
  • The display device 670 is a device that displays information to the operator of the information processing apparatus 600.
  • The display device 670 is, for example, a liquid crystal display.
  • The NIC 680 relays data exchanged with an external device (not shown) via the network.
  • The NIC 680 is, for example, a LAN (Local Area Network) card.
  • The information processing apparatus 600 configured in this way can obtain the same effects as the information processing apparatus 10.
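
The normal-distribution resampling described earlier in this section (uniform random numbers converted with the Box-Muller method, then one value drawn per anonymized record) could be sketched in Python roughly as follows. This is only an illustrative sketch under stated assumptions: the function names, the choice of a normal distribution fitted by mean and standard deviation as the resampling source, and the ages in matched_groups are hypothetical and are not taken from the embodiment.

```python
import math
import random


def box_muller(rng):
    """Convert two uniform random numbers into one standard normal sample."""
    u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so that log(u1) is defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def mean_and_variance(values):
    """Return the mean and the population variance of a list of numbers."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def make_resampling_source(matched_values):
    """Hypothetical resampling source: a normal distribution (mean, std dev)
    fitted to the original-data records matched to one anonymized record."""
    mean, variance = mean_and_variance(matched_values)
    return mean, math.sqrt(variance)


def resample_value(source, rng):
    """Draw one integer value (such as an age) from the resampling source."""
    mean, std_dev = source
    return round(mean + std_dev * box_muller(rng))


if __name__ == "__main__":
    rng = random.Random(0)
    # Made-up ages of the original records matched to three anonymized records.
    matched_groups = [[20, 21, 22], [27, 29, 31], [21, 22, 23]]
    resampled = [resample_value(make_resampling_source(g), rng)
                 for g in matched_groups]
    print(resampled)
    # Mean and population variance of the ages quoted in the text (21, 29, 22).
    print(mean_and_variance([21, 29, 22]))
```

Where the language already offers a normal-distribution generator, as the description notes for Python's random.gauss(), the box_muller() helper could be replaced by a call such as rng.gauss(mean, std_dev).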

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In order to generate resampling data in which a predetermined property of the original data is preserved, the information processing device of the invention includes: matching record extraction means that, based on matching information indicating a matching that is the relationship between original data (a set of records containing an attribute subject to anonymization) and anonymized data (data in which the attribute contained in the records of the original data has been anonymized), extracts a group of original data records corresponding to records of the anonymized data; resampling source generation means that, based on the record group, generates a resampling source used to generate resampling data, which are values set for the attribute contained in the records of the anonymized data; and resampling means that, based on the resampling source, generates resampling data set for the attribute contained in the records of the anonymized data.
PCT/JP2015/006113 2014-12-09 2015-12-08 Dispositif de traitement d'informations, procédé de traitement d'informations et support d'enregistrement WO2016092830A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2016563507A JPWO2016092830A1 (ja) 2014-12-09 2015-12-08 情報処理装置、情報処理方法、及び、記録媒体

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2014-248691 2014-12-09
JP2014248691 2014-12-09

Publications (1)

Publication Number Publication Date
WO2016092830A1 true WO2016092830A1 (fr) 2016-06-16

Family

ID=56107042

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/006113 WO2016092830A1 (fr) 2014-12-09 2015-12-08 Dispositif de traitement d'informations, procédé de traitement d'informations et support d'enregistrement

Country Status (2)

Country Link
JP (1) JPWO2016092830A1 (fr)
WO (1) WO2016092830A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018055612A (ja) * 2016-09-30 2018-04-05 日本電信電話株式会社 匿名化テーブル生成装置、匿名化テーブル生成方法、プログラム
JP2020501254A (ja) * 2016-11-28 2020-01-16 シーメンス アクチエンゲゼルシヤフトSiemens Aktiengesellschaft データストックを匿名化するための方法およびシステム

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013125374A (ja) * 2011-12-14 2013-06-24 Fujitsu Ltd 情報処理方法、装置及びプログラム

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KELSUKE MURAKAMI: "K-anonymization for large- scale data with less information loss", IPSJ SIG NOTES DATABASE SYSTEM (DBS) 2013-DBS-157, 15 July 2013 (2013-07-15) *
TSUBASA TAKAHASHI: "Tokumeika Data no Riyo ni Kansuru Ichikento", THE 7TH FORUM ON DATA ENGINEERING AND INFORMATION MANAGEMENT (DAI 13 KAI THE DATABASE SOCIETY OF JAPAN NENJI TAIKAI, 4 March 2015 (2015-03-04) *
YUKI TOYODA ET AL.: "A Rezept Data Anonymization considering Priority and Restrictions", THE 32RD JOINT CONFERENCE ON MEDICAL INFORMATION RONBUNSHU (DAI 13 KAI JAPAN ASSOCIATION FOR MEDICAL INFORMATICS GAKUJUTSU TAIKAI) JAPAN JOURNAL OF MEDICAL INFORMATION DAI 32 KAI SUPPL. JAPAN JOURNAL OF MEDICAL INFORMATICS, 14 November 2012 (2012-11-14), pages 774 - 777 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018055612A (ja) * 2016-09-30 2018-04-05 日本電信電話株式会社 匿名化テーブル生成装置、匿名化テーブル生成方法、プログラム
JP2020501254A (ja) * 2016-11-28 2020-01-16 シーメンス アクチエンゲゼルシヤフトSiemens Aktiengesellschaft データストックを匿名化するための方法およびシステム
US11244073B2 (en) 2016-11-28 2022-02-08 Siemens Aktiengesellschaft Method and system for anonymising data stocks

Also Published As

Publication number Publication date
JPWO2016092830A1 (ja) 2017-09-14

Similar Documents

Publication Publication Date Title
US12100491B2 (en) Transaction validation via blockchain, systems and methods
Antwi et al. The case of HyperLedger Fabric as a blockchain solution for healthcare applications
Anjum et al. An efficient privacy mechanism for electronic health records
Sonkamble et al. Survey of interoperability in electronic health records management and proposed blockchain based framework: MyBlockEHR
US20140317756A1 (en) Anonymization apparatus, anonymization method, and computer program
US20170161519A1 (en) Information processing device, information processing method and recording medium
Pita et al. A Spark-based Workflow for Probabilistic Record Linkage of Healthcare Data.
WO2014181541A1 (fr) Dispositif de traitement d'informations vérifiant l'anonymat et procédé de vérification d'anonymat
CN111046237A (zh) 用户行为数据处理方法、装置、电子设备及可读介质
Swaminathan et al. A review on genomics APIs
US9148410B2 (en) Recording medium storing data processing program, data processing apparatus and data processing system
Urovi et al. Luce: A blockchain-based data sharing platform for monitoring data license accountability and compliance
CN113806350B (zh) 一种提高大数据交易平台安全性的管理方法及系统
WO2016092830A1 (fr) Dispositif de traitement d'informations, procédé de traitement d'informations et support d'enregistrement
JP6747438B2 (ja) 情報処理装置、情報処理方法、及び、プログラム
Zhang et al. Impact of primary to secondary care data sharing on care quality in NHS England hospitals
JP2019036249A (ja) 医療情報管理装置、医療情報管理方法及びプログラム
US20220343021A1 (en) Referential data grouping and tokenization for longitudinal use of de-identified data
US12039076B2 (en) Data management method, non-transitory computer readable medium, and data management system
Peng et al. Towards privacy preserving in 6G networks: Verifiable searchable symmetric encryption based on blockchain
CN115168752A (zh) 大数据查询方法、装置、电子设备及存储介质
US20220051343A1 (en) Life insurance policy application process and system
JP2016110472A (ja) 情報処理装置、情報処理法、及び、プログラム
JP2016115116A (ja) 情報処理装置、情報処理方法、及びプログラム
WO2016021039A1 (fr) SYSTÈME DE TRAITEMENT DE k-ANONYMISATION ET PROCÉDÉ DE TRAITEMENT DE k-ANONYMISATION

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15868295

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016563507

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15868295

Country of ref document: EP

Kind code of ref document: A1