WO2017213281A1 - Method for de-identifying big data - Google Patents

Method for de-identifying big data

Publication number
WO2017213281A1
Authority
WO
WIPO (PCT)
Prior art keywords
abstraction
field
value
data
record
Prior art date
Application number
PCT/KR2016/006206
Other languages
English (en)
Korean (ko)
Inventor
이원석
Original Assignee
주식회사 그리즐리
Application filed by 주식회사 그리즐리
Priority to JP2019517743A (patent JP6829762B2)
Publication of WO2017213281A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification

Definitions

  • The present invention relates to a method for de-identifying big data, and more particularly to a de-identification method that allows big data to be distributed freely to external systems without fear of leaking personal information.
  • Big data refers to data such as electronic commerce data, metadata, web logs, radio frequency identification (RFID) data, sensor network data, social network data, Internet text and documents, and Internet search indexes, including unstructured and semi-structured data that were not previously used. Such data is generally of a volume that ordinary software tools and computer systems cannot handle, hence the name big data.
  • Big data is typically analyzed and used within the organization that collects it. However, the attributes of the collected data differ from one organization to another, so an organization often needs to use the data of another organization. Organizations that lack the ability or the systems to collect data also need to analyze the big data of other organizations, or combinations of it, to obtain the information they need for decision making.
  • Because of its nature, big data is not only enormous in volume but almost inevitably contains personal information, and the leakage of personally identifiable information may give rise to legal disputes. As a result, the exchange and distribution of big data among organizations has been limited. To avoid such disputes, the organization that collects the big data has had to analyze the information required for a specific purpose itself and reduce it to the level of statistical information through clustering or statistical analysis. Consequently, it has been difficult for other organizations to obtain the analysis data needed for their own business environments.
  • Conventional de-identification techniques include masking, replacement, and semi-identification. Masking hides or deletes the target information (e.g., 670101-10491910 → ************). Replacement substitutes information generated to correspond to the target information (e.g., 670101-10491910 → ID2311331). Semi-identification shows only part of the target information (e.g., 670101-10491910 → 67-1), and a value may likewise be reduced to a category (e.g., 10491910 → man).
  • The present invention has been made to solve this problem, and its object is to provide a method that prevents a specific individual from being re-identified when big data is distributed.
  • To this end, the present invention provides a method for de-identifying big data so that the data can be used safely for distribution without obtaining permission from each individual.
  • The invention rests on two observations: big data used for distribution serves statistical analysis rather than the lookup of information about specific individuals, and statistics computed first over parts of the data and then over the whole do not differ greatly from statistics computed directly over the entire data.
  • To achieve this object, a method for de-identifying big data comprises: storing, in a storage unit of a data server, data collected through a communication unit from terminals connected through a network; and a data abstraction step in which the processing unit combines at least two of the original records constituting the data to generate a record different from the original records. The data abstraction step comprises: setting at least one field of the record as an abstraction reference field, and setting at least one of the remaining fields as an abstraction target field; selecting at least two original records that have the same value in the abstraction reference field; and abstracting the selected records into one abstraction record, in which the abstraction reference field is assigned the field value common to the selected records and the abstraction target field is assigned a representative value that represents the corresponding field values of the selected records.
  • According to the present invention, big data for distribution is generated by selecting, from the fields constituting the big data, a field that can serve as the reference of a statistical analysis and a field that can serve as its target, and by abstracting the original records into abstraction records whose field values differ from those of the original records while preserving their statistical meaning. New information with statistical value can thus be obtained, and the resulting big data fundamentally prevents individuals from being traced back through their specific information or combinations of it.
  • The abstraction reference field is the field used as the basis for data abstraction.
  • The abstraction reference field is preferably preprocessed, for example by a histogram or binning, and converted into categorical data before it is selected.
  • The abstraction target field is the field over which statistical values are to be calculated.
  • For numeric data, an aggregate function such as the average or the maximum value is applied to obtain the representative value of the corresponding field of an abstraction record.
  • For non-numeric data, the representative value can be calculated by applying an aggregation function such as union, intersection, sampling, frequent elements, clustering, or a histogram.
  • A fixed number (N) of original records is normally combined into each abstraction record, but the number of original records may also differ from one abstraction record to another.
  • Before the data abstraction step, the processing unit may sort the original records by the value of the abstraction reference field and, based on the sorted records, exclude from abstraction any record whose abstraction target field value deviates by more than a predetermined reference value from the abstraction target field values of the other records that have the same abstraction reference field value.
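  • The exclusion rule above can be sketched as follows. This is a minimal illustration rather than the method as claimed: the field names `age_group` and `income`, the helper `exclude_outliers`, and the choice of the distance from the group median as the measure of "deviation" are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median

def exclude_outliers(records, ref_field, target_field, max_deviation):
    """Drop records whose target value deviates from their group's median
    by more than max_deviation (the deviation measure is an assumption)."""
    groups = defaultdict(list)
    # Sorting by the reference field mirrors the sort step in the text.
    for rec in sorted(records, key=lambda r: r[ref_field]):
        groups[rec[ref_field]].append(rec)

    kept = []
    for group in groups.values():
        center = median(r[target_field] for r in group)
        kept.extend(r for r in group
                    if abs(r[target_field] - center) <= max_deviation)
    return kept

# Hypothetical records; 5000 is far from the other incomes in its group.
records = [
    {"age_group": 40, "income": 300},
    {"age_group": 40, "income": 320},
    {"age_group": 40, "income": 5000},
    {"age_group": 30, "income": 250},
    {"age_group": 30, "income": 270},
]
kept = exclude_outliers(records, "age_group", "income", max_deviation=100)
```

Only the record with income 5000 is dropped here; a different deviation measure (for example, distance from the mean) would be an equally valid reading of the text.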
  • Before the representative value is assigned to the abstraction target field, it is determined whether any of the selected original records has an abstraction target field value equal to the representative value. If such a record exists, the representative value is corrected to another value that does not appear among the abstraction target field values of the selected original records, and the corrected value is assigned.
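  • A minimal sketch of this correction, assuming the representative value is the mean and that a colliding value is nudged upward in fixed steps. The function name `corrected_representative`, the `step` parameter, and the direction of the adjustment are hypothetical; the text only requires that the result differ from every selected original value.

```python
def corrected_representative(values, step=1):
    """Return the mean of values, adjusted so it equals no original value."""
    originals = set(values)
    rep = sum(values) / len(values)
    # Move upward in small steps until the value is not an original one.
    while rep in originals:
        rep += step
    return rep

collides = corrected_representative([100, 200, 300])  # mean 200 is an original value
safe = corrected_representative([100, 250])           # mean 175 needs no correction
```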
  • When the value of a field to be selected as the abstraction reference field or the abstraction target field is related to the identification of an individual, the value is converted into a group value that contains it as an element, and the converted field is selected as the abstraction reference field or the abstraction target field.
  • A "field having content related to the identification of an individual" is a field that can identify an individual by itself, such as a resident registration number, age, or home address, or one that can easily identify an individual when combined with other data. A group value that contains the field value as an element means information such as the age group extracted from the resident registration number or the age, or the province, city, or street extracted from the home address.
  • Converting a field value into a group value in this way is a technique commonly applied to the de-identification of data. According to this feature of the invention, the conversion into group values and data abstraction are performed together, so tracing back through an individual's specific information and combinations of it can be prevented more reliably.
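  • The conversion into group values can be illustrated with a short sketch. The ten-year age bands, the comma-separated address format, the sample record, and the helper names `age_group` and `city_of` are assumptions made for the example; the source does not specify them.

```python
def age_group(age):
    """Map an exact age to the lower bound of its 10-year band (43 -> 40)."""
    return (age // 10) * 10

def city_of(address):
    """Keep only the first component of a comma-separated address.
    The address format is an assumption made for this sketch."""
    return address.split(",")[0].strip()

# Hypothetical original record.
record = {"age": 43, "address": "Seoul, Gangnam-gu, Yeoksam-dong", "income": 300}
generalized = {
    "age_group": age_group(record["age"]),
    "city": city_of(record["address"]),
    "income": record["income"],
}
```

The generalized fields `age_group` and `city` could then serve as abstraction reference fields, with `income` as the abstraction target field.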
  • In a further aspect of the present invention, the abstraction record additionally includes a distribution value field whose field value is the distribution information of the abstraction reference field values of the original records included in the abstraction record, and a distribution value field whose field value is the distribution information of their abstraction target field values.
  • The field values of the distribution value field can be calculated by an ordinary distribution function. Typical types are the average, standard deviation, median, interquartile range (Q3-Q1), maximum value, or the number of distinct attribute values.
  • For example, assume that the abstraction reference field of an abstraction record has the value 40, that the ages in the original records are 43, 47, and 42, and that the distribution value field of the abstraction record is set to the median. The field value of the distribution value field then becomes 43.
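  • The distribution statistics listed above can be computed with the Python standard library, as in the following sketch. The sample incomes, the helper name `distribution_values`, the use of the population standard deviation, and the "inclusive" quartile method are assumptions; the text does not say which variants are intended.

```python
from statistics import mean, median, pstdev, quantiles

def distribution_values(values):
    """Summarize the original values that were merged into one record."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "stdev": pstdev(values),      # population standard deviation
        "median": median(values),
        "iqr": q3 - q1,               # interquartile range (Q3-Q1)
        "max": max(values),
        "distinct": len(set(values)), # number of distinct attribute values
    }

incomes = [200, 300, 300, 400, 800]  # hypothetical original field values
summary = distribution_values(incomes)
```

Any of these entries could be stored in a distribution value field alongside the representative value.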
  • Through the distribution value field, two or more big data sets generated independently in separate environments can be linked and used for various analyses as needed, and the reliability of the statistical data can be further improved.
  • When the value of a field to be selected as the abstraction reference field or the abstraction target field is related to the identification of an individual, the value may instead be converted by a hash function, and the converted field selected as the abstraction reference field or the abstraction target field.
  • The hash function is an irreversible one-way function; the property that the original data value cannot be reproduced from the hash value is exploited here.
  • The field value can be converted by a defined hash function g(x), and the result selected as the field value of the abstraction reference field or the abstraction target field.
  • After the abstraction reference field and the abstraction target field are set, the original records are sorted on the abstraction reference field, a plurality of records having the same abstraction reference field value are selected, and the data abstraction step is performed to generate abstraction data. Once abstraction data has been generated under one sorting method, another sorting method is applied to sort the original records, a plurality of records having the same abstraction reference field value are selected according to the new sort order, and the data abstraction step is performed again.
  • In this way, any one original record is abstracted so that it is included in a plurality of abstraction records.
  • Abstraction records that include the same original record can have different values in the distribution value fields corresponding to the abstraction reference field and the abstraction target field, so the distribution value fields can be linked in various ways and the reliability of the statistical data can be further improved.
  • In one embodiment, the original data is composed of a personal table and a log table that records the actions of each person in the personal table, and the abstraction data is composed of an abstraction personal table and an abstraction log table.
  • The abstraction step comprises a personal abstraction step, in which a plurality of personal records of the personal table are abstracted into one abstraction personal record through the data abstraction step to generate the abstraction personal table, and a log abstraction step, in which a plurality of log records of the log table are abstracted into one abstraction log record through the data abstraction step to generate the abstraction log table. The personal abstraction step comprises: adding an identification field to the abstraction personal table; assigning an identification value to the identification field of each abstraction personal record; and generating an abstraction target list that associates the identification value with the values of a field capable of specifying the individuals included in that abstraction personal record. The log abstraction step comprises: adding an identification field to the abstraction log table; abstracting, with reference to the abstraction target list, the log records of the individuals included in an abstraction personal record into one abstraction log record; and assigning to the abstraction log record an identification value that includes the assigned identification value.
  • In another embodiment, the original data is likewise composed of a personal table and a log table that records the actions of each person in the personal table, and the abstraction data is composed of an abstraction personal table and an abstraction log table.
  • The abstraction step includes a log abstraction step that abstracts a plurality of log records of the log table into one abstraction log record. The log abstraction step comprises: adding an identification field to the abstraction log table; assigning an identification value to the identification field of each abstraction log record; and generating an abstraction target list that associates the identification value with the values of a field capable of specifying the individuals included in that abstraction log record.
  • The abstraction step further includes a personal abstraction step that abstracts a plurality of personal records of the personal table into one abstraction personal record through the data abstraction step. This step comprises: adding an identification field to the abstraction personal table; abstracting, with reference to the abstraction target list, the personal records of the individuals included in an abstraction log record into one abstraction personal record; and assigning to it an identification value that includes the assigned identification value.
  • The present invention also provides a method for de-identifying big data performed in a data server having a communication unit, a processing unit, and a storage unit, comprising: storing, in the storage unit of the data server, data collected through the communication unit from terminals connected via a wired/wireless network; and a data abstraction step in which the processing unit generates records different from the original records constituting the data. The data abstraction step comprises: setting at least one field of the record as an abstraction reference field, and setting at least one field having a numeric data type among the remaining fields as an abstraction target field; generating a correction list composed of the abstraction target field values of the original records; removing duplicate values from the correction list and sorting it by magnitude; and, for each field value in the sorted correction list, calculating the average of that field value and at least one field value adjacent to it, and mapping the average to the field value as its abstraction value.
  • According to this feature, a field value of a specific field of an original record is converted into the average of that value and other field values close to it, so that it is abstracted to a value different from the original field value while the statistical results do not differ greatly from those obtained with the original field values.
  • Big data for distribution is thus generated by selecting, from the fields constituting the big data, a field that can serve as the reference of a statistical analysis and a field that can serve as its target, and by associating each numeric field value of the original records with an abstraction value that differs from the original value while preserving its statistical meaning. New information with statistical value can be obtained, and the resulting big data fundamentally prevents individuals from being traced back through their specific information or combinations of it.
  • In addition, for each field value in the sorted correction list, the gap between it and the adjacent field value is calculated to generate a gap value list, and a gap that exceeds a predetermined threshold is replaced by the threshold. When the average is calculated for each field value of the sorted correction list, the adjacent field value is taken to be the field value plus or minus the corresponding gap from the gap value list.
  • The accuracy of the statistical analysis can thereby be improved, because a field value that would adversely affect the overall statistics is limited by the threshold.
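  • The correction list and gap clamping described above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: it uses exactly one neighbour on each side, and the names `build_correction_map` and `max_gap` as well as the sample incomes are hypothetical.

```python
def build_correction_map(values, max_gap):
    """Map each distinct value to the mean of itself and its neighbours,
    with each neighbour gap capped at max_gap."""
    ordered = sorted(set(values))          # de-duplicated, ascending
    mapping = {}
    for i, v in enumerate(ordered):
        window = [v]
        if i > 0:                          # lower neighbour
            gap = min(v - ordered[i - 1], max_gap)
            window.append(v - gap)
        if i < len(ordered) - 1:           # upper neighbour
            gap = min(ordered[i + 1] - v, max_gap)
            window.append(v + gap)
        mapping[v] = sum(window) / len(window)
    return mapping

incomes = [100, 120, 120, 150, 900]        # hypothetical target field values
mapping = build_correction_map(incomes, max_gap=50)
abstracted = [mapping[v] for v in incomes]
```

With `max_gap=50`, the isolated value 900 is averaged with 850 rather than with its true neighbour 150, so a single extreme value does not pull its neighbours' abstraction values far from the originals.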
  • The original data may be composed of a personal table and a log table of each individual's actions; in that case the personal table and the log table are joined and converted into a single table, and the data abstraction step is performed on the data of that table.
  • According to the present invention, big data for distribution is generated by selecting, from the fields constituting the big data, a field that serves as the reference of a statistical analysis and a field that serves as its target, and by abstracting the original records into single abstraction records whose field values differ from those of the original records while preserving their statistical meaning. New information with statistical value can thus be obtained, and big data can be provided that fundamentally prevents individuals from being traced back through the data.
  • Through the distribution value field, two or more big data sets generated independently in separate environments can be linked for various analyses as needed, and the reliability of the statistical data can be improved.
  • FIG. 1 is an exemplary diagram illustrating a data-centric computing environment that forms the big data processing system of the present invention.
  • FIG. 2 is a block diagram showing the main configuration of the data server shown in FIG. 1.
  • FIG. 3 is a block diagram illustrating the basic steps of data abstraction in accordance with one embodiment of the present invention.
  • FIG. 4 is a block diagram illustrating the basic steps of data abstraction according to another embodiment of the present invention.
  • A data-centric computing environment forming the big data processing system of the present invention may be built from a data server 110 and a plurality of user terminals 120 connected to the data server through a wired or wireless network.
  • The data-centric computing environment refers to a technology based on big data processing that uses data generated in real time by the plurality of user terminals 120 to provide various applications, such as social network services (SNS), smart grids, intelligent home appliances, real-time streaming, and real-time decision making.
  • The big data processing system and method according to the present invention are implemented by a data server 110 connected to a plurality of user terminals 120. The data server collects the data generated by the user terminals 120, processes and stores it, and provides the stored data to the user terminal 120 that requests it, thereby establishing an environment in which data-centric computing applications can run.
  • The user terminal 120 is a device that has a communication device for connecting to the data server 110 and an information processing function for generating data as it operates. Examples include computers, mobile communication terminals such as smartphones, tablet PCs, and PDAs (personal digital assistants), smart home appliances, radio frequency identification (RFID) devices, and vehicles such as cars equipped with black boxes or navigation systems, trains, and airplanes, but the user terminal is not limited thereto.
  • The data server 110 is connected to the plurality of user terminals 120 through a communication unit 113 over a wired/wireless network such as short-range wireless communication, Wi-Fi, or 3G (3rd Generation). It may be a cloud server or a web server that collects the data generated in the user terminals 120, stores it in the storage unit 112, and processes the collected data in the processing unit 111, but it is not limited thereto.
  • Original data collected through the communication unit 113 from the terminals 120 connected via the wired/wireless network is stored in the storage unit 112 of the data server 110.
  • The processing unit 111 processes the big data stored in the storage unit 112, abstracting and selecting the data needed for analysis, which both reduces its volume and de-identifies it. The resulting de-identified big data for distribution is stored in the storage unit 112 and, being relatively small, is transmitted through the communication unit 113 of the server and the communication network to the destination where it is analyzed and used.
  • FIG. 3 is a block diagram illustrating the basic steps of data abstraction according to an embodiment of the present invention. Referring to FIG. 3, the method for de-identifying big data performed in the processing unit of the data server is described in detail below.
  • First, at least one field of the record is set as an abstraction reference field (S10), and at least one of the fields other than the abstraction reference field is set as an abstraction target field (S20).
  • The abstraction reference field is the field used as the basis for data abstraction. It is preferably preprocessed, for example by a histogram or binning, and converted into categorical data before it is selected.
  • The abstraction target field is the field over which statistical values are to be calculated. For numeric data, an aggregate function such as the average or the maximum value is applied to obtain the representative value of the corresponding field of an abstraction record. For non-numeric data, the representative value can be calculated by applying an aggregation function such as union, intersection, sampling, frequent elements, clustering, or a histogram.
  • The processing unit 111 selects at least two records having the same abstraction reference field value from among the original records (S30) and abstracts the selected records into one abstraction record (S40).
  • A fixed number (N) of original records is normally combined into each abstraction record, but the number of original records may also differ from one abstraction record to another.
  • The abstraction record includes the abstraction reference field and the abstraction target field, and the abstraction reference field is assigned the field value common to the selected records (S41).
  • The abstraction target field values of the selected records are converted into a representative value that can represent them, and this value is assigned to the abstraction target field of the abstraction record (S42).
  • The representative value is generally calculated by applying an aggregate function such as the mean, median, maximum value, or sampling, according to the contents of the field value.
  • The processing unit 111 stores the generated abstraction record in the storage unit 112 (S50) and repeats the selection and abstraction steps over the original data; when data abstraction has been completed for the entire original data (S60), the process ends (S70).
  • Table 1 is a simple example of the original data before data abstraction; each original record contains the fields resident registration number, age, name, address, and income.
  • When the value of a field to be selected as the abstraction reference field or the abstraction target field is related to the identification of an individual, the value is preferably converted into a group value that contains it as an element before the field is selected.
  • Table 2 shows an example in which each field is transformed into a new field for data abstraction.
  • The abstraction reference fields (age group, sex, and city) are assigned the value common to the selected records, and the income field, which is the abstraction target field, is assigned the average of the income values of the selected records as its representative value.
  • An identifier (ID) value is generated to uniquely distinguish each abstraction record.
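  • The basic abstraction flow described above (setting the fields, selecting records with equal reference values, and assigning a representative value and an identifier) can be sketched as follows. The function name `abstract_records`, the sample data, the fixed group size `n`, and the decision to skip a leftover single record are assumptions for illustration only.

```python
from itertools import groupby
from statistics import mean

def abstract_records(records, ref_fields, target_field, n=2):
    """Merge up to n original records that share the same reference-field
    values into one abstraction record holding the mean target value."""
    key = lambda r: tuple(r[f] for f in ref_fields)
    abstracted = []
    for ref_values, group in groupby(sorted(records, key=key), key=key):
        group = list(group)
        # Split each group into chunks of n records.
        for start in range(0, len(group), n):
            chunk = group[start:start + n]
            if len(chunk) < 2:
                continue  # at least two originals are required per record
            rec = dict(zip(ref_fields, ref_values))
            rec[target_field] = mean(r[target_field] for r in chunk)
            rec["id"] = len(abstracted) + 1  # identifier of the new record
            abstracted.append(rec)
    return abstracted

# Hypothetical original records, already converted to group values.
originals = [
    {"age_group": 40, "sex": "M", "city": "Seoul", "income": 300},
    {"age_group": 40, "sex": "M", "city": "Seoul", "income": 500},
    {"age_group": 30, "sex": "F", "city": "Busan", "income": 200},
    {"age_group": 30, "sex": "F", "city": "Busan", "income": 400},
    {"age_group": 30, "sex": "F", "city": "Busan", "income": 900},
]
result = abstract_records(originals, ["age_group", "sex", "city"], "income")
```

Five original records yield two abstraction records here; the third Busan record has no partner and is left out, which is one possible way to honor the "at least two records" requirement.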
  • When, among the records having the same abstraction reference field value, the abstraction target field value of a record deviates from those of the other records by more than a predetermined standard, the record is preferably excluded from abstraction.
  • The accuracy of the statistical analysis can be further improved by excluding from abstraction the records that would adversely affect the accuracy of the statistics.
  • As the corrected value, a value obtained by changing the attribute value of the abstraction record to a random value within the maximum allowable noise threshold is assigned.
  • When the value of the abstraction reference field or the abstraction target field is related to the identification of an individual, the value is preferably converted into a group value that contains it as an element, and the converted field is selected as the abstraction reference field or the abstraction target field.
  • A "field having content related to the identification of an individual" is a field that can identify an individual by itself, such as a resident registration number, age, or home address, or one that can easily identify an individual when combined with other data. A group value that contains the field value as an element means information such as the age group extracted from the resident registration number or the age, or the province, city, or street extracted from the home address.
  • Converting a field value into a group value in this way is a technique generally applied to the de-identification of data, but according to the present invention this conversion and data abstraction are performed together, so tracing back through an individual's specific information and combinations of it can be prevented more reliably.
  • When the value of the abstraction reference field or the abstraction target field is related to the identification of an individual, the value may also be converted by a hash function before the field is selected as the abstraction reference field or the abstraction target field.
  • The value of the corresponding field is converted by a hash function g(x), and the result can be selected as the field value of the abstraction reference field or the abstraction target field.
  • The transform function g(x) is defined as a hash function.
  • Using another random function f, the hash function g(x) is defined so that its value is limited to the hash domain (0..m-1).
  • The personal signature value therefore ranges from 0 to m-1.
  • Different individuals with different field values may receive the same transform value, but the larger the value of m, the less likely this becomes.
  • For example, the hash function can be defined as follows; the resulting transform values are shown in Table 5.
  • g(resident registration number) = (first two digits of the resident registration number) mod 1000
  • In the data abstraction step, any one original record may be abstracted so that it is included in only one abstraction record, but it is also possible to abstract one original record so that it is included in a plurality of abstraction records.
  • First, the original records are sorted on the abstraction reference field, a plurality of records having the same value of the abstraction reference field are selected in sort order, and the data abstraction step is performed to generate abstraction data.
  • Next, the original records are sorted again by applying a different sorting method to the abstraction reference field. When a plurality of records having the same value of the abstraction reference field are selected according to the new sort order and the data abstraction step is performed again, any one original record is abstracted so that it is included in a plurality of abstraction records.
  • Table 9 shows an example of abstraction data in which the data abstraction step is performed twice with different sort orders, as in Tables 7 and 8, on the original records of Table 6.
  • In this case, the plurality of abstraction records that include the same original record can have different values in the distribution value fields corresponding to the abstraction reference field and the abstraction target field. The distribution value fields can therefore be used for analysis in more varied ways, and the reliability of the statistical data can be further improved.
  • Table 10 shows an example of the log table.
  • The log table consists of each individual's history of service requests, provision, and use, generated as the individual uses the service.
  • A semi-structured log record has a personal identification attribute, a time attribute, and a spatial attribute, and holds, as a semi-structured field value, the actions that the individual performed in that space at that point in time.
  • A log record is extracted for every individual in the abstraction target list obtained for each abstraction personal record, and the extracted records form the log record set of that abstraction personal record.
  • The abstraction target list is generated by associating, with each abstraction personal record, an attribute (e.g., resident registration number) that can specify each individual included in that record.
  • For example, the abstraction target list of the abstraction personal record with id 321 is as shown in Table 11, and the set of log records for the individuals on that list can be generated as shown in Table 12.
  • The selected log records are abstracted into a single abstraction log record by applying various integration functions.
  • Integration functions include union, intersection, sampling, frequent elements, clustering, and histogram.
  • The abstraction log records generated when various integration functions are applied to the log record set (Table 12) of the abstraction personal record id 321 of Table 11 are as follows.
  • An example of selective abstraction that constrains time or space conditions is the union of individual behaviors within 7 minutes.
  • The abstraction personal records and abstraction log records thus generated are stored sequentially, each in table form, in the storage unit 112 to form the big data for distribution.
  • The abstraction personal records and abstraction log records of each table may also be matched and integrated to form individual abstraction records.
  • This matching and integration into abstraction records may be performed in the server that provides the big data for distribution, or in the server where the big data is used.
  • The abstraction reference field of the abstraction data can be used for linkage analysis by combining the data with other abstraction personal data abstracted on the same abstraction reference field.
  • In data linkage analysis, the distribution value of the abstraction reference field or the abstraction target field can be used to link similar abstraction records and thereby improve linking accuracy.
  • The distribution value of the abstraction reference field or the abstraction target field means distribution information about the field values of that field across the plurality of original records included in an abstraction record. A distribution value field holding this distribution information as its field value is stored in the abstraction record.
  • The field values of the distribution value field can be calculated with common distribution functions. Typical types are the average, standard deviation, median, interquartile range (Q3 - Q1), maximum value, or the number of different attribute values.
  • For example, assuming that the field value of the abstraction reference field of an abstraction record is 40, that the age values of its original records are 43, 47, and 42, and that the field value of the distribution value field included in the abstraction record is set to an intermediate value, the field value becomes 47.
  • For example, personal records A containing average income information and personal records B containing average liquid-asset information are each converted separately by the abstraction method of the present invention, using the age field and the gender field as the same abstraction reference fields.
  • The age distribution value field, which is the distribution value field of the abstraction reference field, is defined identically in both as the median age.
  • A distribution value field is additionally added to each abstraction record of A and B, as illustrated in Table 13.
  • In this way, two or more sets of big data generated independently in separate environments can be used in conjunction with each other.
  • Big data A: abstraction record set A_S, abstraction log record set A_L
  • Big data B: abstraction record set B_S, abstraction log record set B_L
  • The abstraction record set A_S of A and the abstraction record set B_S of B are linked as illustrated in Tables 13 and 14 above.
  • If an abstraction personal record x ∈ A_S and an abstraction personal record y ∈ B_S are linked as illustrated above, the abstraction log record v ∈ A_L of x and the abstraction log record w ∈ B_L of y are regarded as the behavior history of the same individual.
  • The two abstraction log records ⟨v, w⟩ are therefore semantically linked, and behavioral analysis is performed on the combined big data (A_L ∪ B_L).
  • FIG. 4 is a block diagram showing the basic steps of data abstraction according to another embodiment of the present invention. Referring to FIG. 4, the de-identification processing method for big data performed in the processing unit of the data server is described in detail below.
  • The processing unit sets at least one field as an abstraction reference field (B10) and sets at least one field having a numeric data type, among the fields other than the abstraction reference field, as an abstraction target field (B20).
  • The field values are sorted in order of magnitude (B40).
  • For each field value in the sorted correction list, the average of that field value and at least one field value adjacent to it is calculated and associated with the field value as its abstraction value (B50).
  • The processing unit 111 stores the generated abstraction record in the storage unit 112 (B70) and repeats steps B50 and B60 over the entire original data.
  • When the data abstraction operation has been completed over the entire original data (B70), the operation ends.
  • For each field value, the gap value between that field value and its adjacent field value is calculated, and a gap value list corresponding to the field values is generated. If a calculated gap value is equal to or greater than a predetermined threshold value, the gap value is replaced with the threshold value.
  • When the average value is calculated for each field value in the sorted correction list, the field value adjacent to the corresponding field value is taken to be the corresponding field value plus or minus the gap value from the gap value list.
  • In this example, the abstraction reference fields are the age field, the gender field, and the address field, which are generated by conversion from the resident registration number and the address, and the income field is selected as the abstraction target field.
  • The correction list is generated by extracting the values of the income field, which is the abstraction target field (left side of Table 16), eliminating duplicate values, and then sorting the values in order of magnitude (right side of Table 16).
  • The threshold can be set in several ways depending on the nature of the data.
  • For example, the threshold can be set to the mean of all gap values plus 1.5 times their standard deviation (mean + 1.5 × standard deviation).
  • Table 17 shows an example in which the threshold value is set to 870 to create the gap value list, and the gap value of 900, which exceeds the threshold value, is replaced with the threshold value of 870.
  • When the average is calculated, the field value adjacent to the corresponding field value is obtained by adding the gap value from the gap value list to, or subtracting it from, the corresponding field value.
  • The average value calculated for each field value is set as the abstraction value of the abstraction target field and is assigned in place of the corresponding field value of the original record.
  • If the field value of the abstraction reference field is related to the identification of an individual, the field value is converted into a group value that contains it as one element, and the group value is assigned.
  • Table 19 shows the original records including the converted fields.
  • Table 20 shows the records in which the abstraction is completed.
  • In this way, de-identified big data is generated by selecting a field that can serve as a reference for statistical analysis and a field that can be a target of statistical analysis, and by replacing the value of the numeric data type field with an abstraction value, namely an average with adjacent values, that differs from the value of the original record while maintaining its intrinsic meaning for statistical analysis. New information of value for statistical analysis can thus be obtained, and big data can be provided that fundamentally prevents backtracking through an individual's specific information and combinations thereof.
  • The present embodiment can also be applied to a case where the original data is composed of a personal-information table and a log table.
  • In that case, the personal-information table and the log table are combined and converted into a single table, and then the data abstraction step proceeds.
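
The hash conversion g(x) described above, which limits the converted value to the hash domain 0..m-1, can be sketched as follows. This is only an illustration under stated assumptions: the description does not give the formula, so the form g(x) = f(x) mod m is assumed here, and SHA-256 is used as a hypothetical stand-in for the function f. The helper count_collisions is likewise hypothetical and only illustrates the remark that a larger m makes identical converted values less likely.

```python
import hashlib


def f(value: str) -> int:
    # Hypothetical stand-in for the function f: the SHA-256 digest read as an integer.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def g(value: str, m: int = 1000) -> int:
    # Assumed form g(x) = f(x) mod m, which limits the result to 0..m-1.
    if m <= 0:
        raise ValueError("m must be a positive integer")
    return f(value) % m


def count_collisions(values, m: int) -> int:
    # Number of distinct inputs that land on a converted value
    # already taken by another distinct input.
    distinct = set(values)
    return len(distinct) - len({g(v, m) for v in distinct})
```

Calling g on an identifying field value returns the same converted value every time, so the converted field can still be used for grouping, while the original value is not stored in the abstraction data.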
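
The record-level abstraction described earlier (sorting the original records on the abstraction reference field, collecting records with the same reference value, and producing one abstraction record per group) can be sketched as below. The field names, the use of the mean for the abstraction target field, and the use of the median for the distribution value field are assumptions chosen for illustration; the function names are hypothetical.

```python
from itertools import groupby
from statistics import mean, median


def abstract_records(records, reference_key, target_field, distribution_field):
    # Sort on the abstraction reference value, then merge each run of
    # records sharing that value into a single abstraction record.
    ordered = sorted(records, key=reference_key)
    abstracted = []
    for ref_value, group in groupby(ordered, key=reference_key):
        members = list(group)
        abstracted.append({
            "reference": ref_value,
            "count": len(members),
            # Assumption: the target field is summarized by its mean.
            "target_mean": mean(r[target_field] for r in members),
            # Assumption: the distribution value field stores a median.
            "distribution": median(r[distribution_field] for r in members),
        })
    return abstracted


def abstract_with_sort_orders(records, reference_keys, target_field, distribution_field):
    # Repeat the abstraction once per sort order, so that each original
    # record is included in several abstraction records.
    result = []
    for key in reference_keys:
        result.extend(abstract_records(records, key, target_field, distribution_field))
    return result
```

For instance, passing `lambda r: r["age"] // 10 * 10` as the reference key groups records by age band, and passing a second key such as `lambda r: r["gender"]` to abstract_with_sort_orders places every original record in two abstraction records.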
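
Three of the integration functions listed for abstraction log records (union, intersection, and frequent elements), together with the time-constrained union, can be sketched as follows. The representation of a log record as a dictionary with "time" and "actions" keys, the min_support parameter, and all function names are assumptions made for this illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def union(action_sets):
    # All actions performed by at least one individual in the log record set.
    return set().union(*action_sets)


def intersection(action_sets):
    # Actions performed by every individual in the log record set.
    sets = [set(s) for s in action_sets]
    return set.intersection(*sets) if sets else set()


def frequent_elements(action_sets, min_support):
    # Actions that appear in at least min_support of the action sets.
    counts = Counter(action for s in action_sets for action in set(s))
    return {action for action, n in counts.items() if n >= min_support}


def union_within(log_records, start, window):
    # Union restricted to records whose time lies in [start, start + window].
    end = start + window
    return union(r["actions"] for r in log_records if start <= r["time"] <= end)
```

A 7-minute constraint, as in the example above, would correspond to calling union_within with `timedelta(minutes=7)` as the window.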
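
The numeric abstraction of this embodiment (correction list, gap value list with a threshold, and averaging with adjacent values) can be sketched as follows. Several details are assumptions rather than statements of the description: one neighbor on each side is used, the first and last values are averaged with their single neighbor, and the population standard deviation is used for the example threshold. All function names are hypothetical.

```python
from statistics import mean, pstdev


def build_correction_list(values):
    # Remove duplicate values and sort in order of magnitude.
    return sorted(set(values))


def default_threshold(values):
    # Example threshold: mean of all gaps + 1.5 x standard deviation.
    ordered = build_correction_list(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("at least two distinct values are required")
    return mean(gaps) + 1.5 * pstdev(gaps)


def abstraction_values(values, threshold):
    # Map each distinct field value to the average of itself and its
    # adjacent values, where a gap larger than the threshold is
    # replaced by the threshold before the neighbor is derived.
    ordered = build_correction_list(values)
    result = {}
    for i, v in enumerate(ordered):
        neighbors = []
        if i > 0:
            neighbors.append(v - min(v - ordered[i - 1], threshold))
        if i < len(ordered) - 1:
            neighbors.append(v + min(ordered[i + 1] - v, threshold))
        result[v] = mean([v, *neighbors])
    return result


def apply_abstraction(values, threshold):
    # Replace every original field value with its abstraction value.
    mapping = abstraction_values(values, threshold)
    return [mapping[v] for v in values]
```

Clamping the gap keeps an isolated outlier from pulling the average of its neighbor far away from the neighbor's own value, while the outlier itself is still replaced by a value different from the original.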

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Storage Device Security (AREA)

Abstract

The invention relates to a method for de-identifying big data that allows the big data to be distributed freely to an external system without concern about leakage of personal information, and allows the big data to be used for various purposes by associating data generated in a separate environment with the big data. According to the invention, the method generates de-identified big data for distribution by selecting, from among the various fields included in the big data, a field to be used as a reference for statistical analysis and a field to be statistically analyzed, wherein a plurality of original records are abstracted into a single record, or a numeric field value is abstracted as the average of neighboring numeric values. Because the resulting big data has field values that differ from the original record values, it can fundamentally prevent tracing through an individual's specific personal information and combinations thereof, while retaining its original purpose as statistical analysis data.
PCT/KR2016/006206 2016-06-09 2016-06-10 Method for de-identifying big data WO2017213281A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2019517743A JP6829762B2 (ja) 2016-06-09 2016-06-10 De-identification processing method for big data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020160071747A KR101784265B1 (ko) 2016-06-09 2016-06-09 De-identification processing method for big data
KR10-2016-0071747 2016-06-09

Publications (1)

Publication Number Publication Date
WO2017213281A1 (fr) 2017-12-14

Family

ID=60141322

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2016/006206 WO2017213281A1 (fr) 2016-06-09 2016-06-10 Method for de-identifying big data

Country Status (3)

Country Link
JP (1) JP6829762B2 (fr)
KR (1) KR101784265B1 (fr)
WO (1) WO2017213281A1 (fr)

Also Published As

Publication number Publication date
KR101784265B1 (ko) 2017-10-12
JP6829762B2 (ja) 2021-02-10
JP2019523958A (ja) 2019-08-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16904714

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019517743

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16904714

Country of ref document: EP

Kind code of ref document: A1