WO2024019205A1 - Electronic device and method for data processing

Info

Publication number: WO2024019205A1
Application number: PCT/KR2022/011120
Authority: WO
Inventors: 정운기; 임이랑; 손회연
Original assignee: 쿠팡 주식회사
Priority date: 2022-07-22
Filing date: 2022-07-28
Publication date: 2024-01-25
Also published as: KR20240013440A

Abstract

An electronic device for data processing and a method therefor are disclosed. The method for data processing according to the present disclosure comprises the steps of: identifying user identification data capable of identifying a service user and feature data corresponding to the user identification data; generating first non-identified data corresponding to the user identification data through a hashing technique; calculating a processing criterion specific to the feature data on the basis of meta information that defines a data processing scheme; generating second non-identified data corresponding to the feature data by processing at least a portion of the feature data on the basis of the processing criterion; and generating final non-identified data including one or more tables having the first non-identified data as a key and at least some of the second non-identified data as values.

Description

Electronic devices and methods for data processing

This disclosure relates to an electronic device and method for de-identifying sensitive data containing personal information.

As various services, including e-commerce, are operated, sensitive data containing personal information of numerous users is accumulated. In order to provide a more enhanced user experience and improve services, such user data must be actively utilized, such as using it to learn artificial intelligence models or processing it statistically.

However, there is a risk that the user's sensitive data may be exposed during this process, and it is true that many users are reluctant to have their sensitive data used by service providers.

Therefore, there is a need to de-identify (anonymize or pseudonymize) users' sensitive data in a way that does not excessively dilute the characteristics of the data and does not reveal whose data it is.

In relation to this, prior literature such as KR2417044B1 may be referred to.

The present disclosure is proposed to solve the above-described problems, and its purpose is to provide an electronic device and method for processing data so that it is impossible to identify which user each piece of data belongs to while appropriately maintaining the characteristics of the user's data.

The technical problems to be achieved by the present disclosure are not limited to the technical problems described above, and other technical problems can be inferred from the following embodiments.

A data processing method according to a disclosed embodiment includes the steps of: checking user identification data capable of identifying a service user and feature data of the service user corresponding to the user identification data; generating first non-identified data corresponding to the user identification data through a hashing technique; calculating a processing standard specialized for the feature data based on meta information defining a data processing method; generating second non-identified data corresponding to the feature data by processing at least a portion of the feature data based on the processing standard; and generating final non-identified data including one or more tables having the first non-identified data as a key and at least some of the second non-identified data as values.

The meta information may include data field information defining the type of each data included in the feature data and data de-identification method information defining a data processing method corresponding to each type of data.

The data field information may be information that defines at least some of the service user's gender, age, number of visits to the service page, number of orders within the service, and number of membership months as the type of data.

The meta information may additionally include data usage information defining the usage of at least some of the data included in the feature data, and the data de-identification method information may be information defining a data processing method corresponding to the type of each data and the data usage information.

The data de-identification method information may be information that defines, as the data processing method, at least one of a first method of binary converting the data value, a second method of converting the data value into a representative value for each section, a third method of converting the data value into an approximate value, and a fourth method of keeping the data value as is.

The first non-identifiable data may include data generated through a hash function based on the user identification data and additional data with a unique value.

The additional data may include a Universally Unique Identifier (UUID), and the hash function may include a Secure Hash Algorithm (SHA)-based hash function.

The UUID may correspond to any one of: a first version UUID generated based on a timestamp and a MAC (Media Access Control) address; a second version UUID in which some data constituting the first version UUID has been replaced; a third version UUID generated through an MD5 (Message-Digest algorithm 5)-based hash function based on the name and namespace of the service user; a randomly generated fourth version UUID; and a fifth version UUID generated through a SHA-based hash function based on the name and namespace of the service user.

If the user identification data is the same but the time at which the final de-identified data is generated is different, the additional data used to generate each final de-identified data may be different.

The second non-identified data may be generated according to the processing standard when feature data belonging to a data field for which the processing standard exists is the processing target, and may be generated by binary converting the value of the feature data when feature data belonging to a data field for which the processing standard does not exist is the processing target.

The final de-identified data is snapshot data distinguished by the time of creation, and may include one or more tables in which one or more user identification data confirmed at each preset data read cycle and one or more characteristic data corresponding to each user identification data are matched.

The characteristic data may include second non-identifying data corresponding to characteristic data of a service user currently enrolled and characteristic data of a service user currently unsubscribed.

According to one embodiment, the data processing method may further include: checking characteristic data of a service user who is in a withdrawal state; and generating second non-identifying data of the service user in the withdrawal state by processing at least a portion of the characteristic data of the service user in the withdrawal state based on processing standards pre-calculated for the characteristic data of service users in the currently subscribed state.

According to another embodiment, the data processing method may further include: checking characteristic data of a service user who is in a withdrawal state; calculating processing standards specialized for the characteristic data of the service user in the withdrawal state based on meta information defining a data processing method; and generating second non-identifying data of the service user in the withdrawal state by processing at least a portion of the characteristic data of the service user in the withdrawal state based on the processing standards specialized for that characteristic data.

The step of generating the first non-identification data may include, when generating the first non-identification data of a service user in the withdrawal state: primarily applying a hash function to the user identification data of the service user in the withdrawal state; and generating the first non-identification data of the service user in the withdrawal state by secondarily applying a hash function based on the result of the primary application and the additional data.
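The two-stage hashing described above can be sketched as follows. This is a minimal illustration only: the function names, the choice of SHA-256 for both stages, and the `:` separator used to merge the primary result with the additional data are assumptions, not details stated in this disclosure.

```python
import hashlib
import uuid


def primary_hash(user_id: str) -> str:
    """Primary application: hash the raw identifier of a withdrawn user."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def secondary_hash(primary: str, additional: uuid.UUID) -> str:
    """Secondary application: hash the primary result merged with additional data."""
    merged = f"{primary}:{additional}"  # separator is a hypothetical choice
    return hashlib.sha256(merged.encode("utf-8")).hexdigest()


# The primary result can be stored in place of the raw identifier,
# and the secondary step applied later when a data set is generated.
stored = primary_hash("member-0001")
key = secondary_hash(stored, uuid.uuid4())
```

Under these assumptions, the raw identifier is no longer needed once the primary result has been stored.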

An electronic device for data processing according to an embodiment of the disclosure includes a transceiver, a memory for storing instructions, and a processor, wherein the processor is connected to the transceiver and the memory and is configured to: identify user identification data capable of identifying a service user and characteristic data of the service user corresponding to the user identification data; generate first de-identified data corresponding to the user identification data through a hashing technique; calculate a processing standard specialized for the feature data based on meta information defining a data processing method; process at least a portion of the feature data based on the processing standard to generate second de-identified data corresponding to the feature data; and generate final de-identified data including one or more tables having the first de-identified data as a key and at least some of the second de-identified data as values.

Specific details of other embodiments are included in the detailed description and drawings.

According to the present disclosure, data that can identify a user is de-identified through a hashing technique, feature data representing the user's characteristics is de-identified according to processing standards, and the two sets of de-identified data are matched in a key-value table, so that a data set in which data representing the user and the user's characteristics are matched can be created.

In addition, according to the present disclosure, by de-identifying and storing the data of withdrawn users in advance, data on withdrawn members can be stored in a form that can be used later while satisfying the legal regulations applicable to withdrawn members.

The effect of the invention is not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description of the claims.

FIG. 1 is a schematic configuration diagram showing an environment in which an electronic device operates, according to an embodiment.

Figure 2 is a flowchart to explain a data processing method according to an embodiment.

Figure 3 is an exemplary diagram of a pipeline through which data is processed according to an embodiment.

FIGS. 4 and 5 are flowcharts to explain a data processing method according to an additional embodiment.

Figure 6 is an exemplary diagram of a pipeline through which data is processed according to an additional embodiment.

FIGS. 7A to 7E are exemplary diagrams showing the structure of data processed according to an embodiment.

FIG. 8 is a block diagram illustrating an electronic device that processes data according to an embodiment.

The terms used in the embodiments are general terms that are currently widely used as much as possible while considering the functions in the present disclosure, but this may vary depending on the intention or precedent of a person working in the art, the emergence of new technology, etc. In addition, in certain cases, there are terms arbitrarily selected by the applicant, and in this case, the meaning will be described in detail in the relevant description. Therefore, the terms used in this disclosure should be defined based on the meaning of the term and the overall content of this disclosure, rather than simply the name of the term.

When it is said that a part "includes" a certain element throughout the specification, this means that, unless specifically stated to the contrary, it does not exclude other elements but may further include other elements. In addition, terms such as "...unit" and "...module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented as hardware or software, or as a combination of hardware and software, and which, unlike the example shown, may not be clearly distinguished in specific operations.

The expression “at least one of a, b, and c” used throughout the specification means ‘a alone’, ‘b alone’, ‘c alone’, ‘a and b’, ‘a and c’, ‘b and c’, or ‘all of a, b, and c’.

The “terminal” or “user terminal” mentioned below may be implemented as a computer or portable terminal that can connect to a server or another terminal through a network. Here, the computer includes, for example, a laptop or desktop equipped with a web browser, and the portable terminal is, for example, a wireless communication device that guarantees portability and mobility, and may include all types of handheld wireless communication devices, such as IMT (International Mobile Telecommunication), CDMA (Code Division Multiple Access), W-CDMA (W-Code Division Multiple Access), and LTE (Long Term Evolution) communication-based terminals, smartphones, and tablet PCs.

In the following description, “transmission,” “communication,” “reception,” or similar terms applied to a signal, message, or information include not only the direct transfer of the signal, message, or information from one component to another, but also transfer through other components.

In particular, “transmitting” a signal, message, or information to a component indicates the final destination of the signal, message, or information and does not mean the direct destination. The same applies to “receiving” a signal, message, or information. Additionally, in the present disclosure, two or more pieces of data or information being “related” means that if one piece of data (or information) is acquired, at least part of the other data (or information) can be obtained based on it.

Additionally, terms such as first and second may be used to describe various components, but the components should not be limited by the terms. The above terms may be used for the purpose of distinguishing one component from another component.

For example, a first component may be referred to as a second component, and similarly, the second component may also be referred to as the first component without departing from the scope of the present disclosure.

Below, with reference to the attached drawings, embodiments of the present disclosure will be described in detail so that those skilled in the art can easily practice them. However, the present disclosure may be implemented in many different forms and is not limited to the embodiments described herein.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings.

In describing the embodiments, description of technical content that is well known in the technical field to which the present invention belongs and that is not directly related to the present invention will be omitted. This is to convey the gist of the present invention more clearly without obscuring it by omitting unnecessary explanation.

For the same reason, some components are exaggerated, omitted, or schematically shown in the accompanying drawings. Additionally, the size of each component does not entirely reflect its actual size. In each drawing, identical or corresponding components are assigned the same reference numbers.

The advantages and features of the present invention and methods for achieving them will become clear by referring to the embodiments described in detail below along with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided only to ensure that the disclosure of the present invention is complete and to fully inform those with ordinary knowledge in the technical field to which the present invention pertains of the scope of the invention, and the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

It will be understood that each block of the processing flow diagrams and combinations of blocks in the flow diagrams may be performed by computer program instructions. These computer program instructions can be loaded onto a processor of a general-purpose computer, special-purpose computer, or other programmable data processing equipment, so that the instructions executed by the processor of the computer or other programmable data processing equipment create means for performing the functions described in the flowchart block(s). These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct a computer or other programmable data processing equipment to implement a function in a particular manner, so that the instructions stored in the computer-usable or computer-readable memory can produce an article of manufacture containing instruction means that perform the functions described in the flowchart block(s). The computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions that operate the computer or other programmable data processing equipment can thereby provide steps for executing the functions described in the flowchart block(s).

Additionally, each block may represent a module, segment, or portion of code that includes one or more executable instructions for executing specified logical function(s). Additionally, it should be noted that in some alternative execution examples it is possible for the functions mentioned in the blocks to occur out of order. For example, it is possible for two blocks shown in succession to be performed substantially at the same time, or it is possible for the blocks to be performed in reverse order depending on the corresponding function.

FIG. 1 is a schematic configuration diagram showing an environment in which an electronic device operates, according to an embodiment. Referring to FIG. 1, the electronic device 110 and the database 120 may communicate with each other or with another external device through the connected network 130. The network 130 is a comprehensive data communication network that includes a local area network (LAN), a wide area network (WAN), a value added network (VAN), a mobile radio communication network, a satellite communication network, and combinations thereof, and that allows each component shown in FIG. 1 to communicate smoothly with the others; it may include the wired Internet, the wireless Internet, and a mobile wireless communication network. Wireless communication may include, for example, wireless LAN (Wi-Fi), Bluetooth, Bluetooth Low Energy, ZigBee, WFD (Wi-Fi Direct), UWB (ultra-wideband), infrared communication (IrDA, Infrared Data Association), and NFC (Near Field Communication), but is not limited thereto.

The electronic device 110 is a device for de-identifying data stored in the database 120 or a copy thereof. In this disclosure, 'de-identification' is a general term for methods and mechanisms that change sensitive data, including personal information, so that it cannot be determined whose data it is, and 'non-identification' may mean that the data has been de-identified according to preset standards or has reached a state in which it can receive approval from an authorized agency, committee, or the like in relation to de-identification. Such de-identified data can also be used for machine learning of artificial intelligence models.

The electronic device 110 may be a single server that performs operations for de-identification, but depending on the embodiment, it may be composed of a plurality of servers, or may be configured such that one or more servers are connected to an external cloud server so that they can communicate electronically.

The database 120 is a medium in which personal information of users using the service is stored. Specifically, the database 120 may store sensitive data including the user's gender, age, number of times of service use, service subscription date, service use date, ID, password, or member serial number. Relatedly, the way data is stored in the database 120 may be designed in various ways, and the database 120 may be a relational database (RDB) that stores and provides access to data points related to each other. Depending on the embodiment, it may be a non-relational database (NoSQL).

The electronic device 110 can receive data from the database 120, read data stored in the database 120, and perform operations using the data. Additionally, the electronic device 110 may transmit the calculation result back to the database 120, store it in a storage space within the electronic device 110, or transmit it to a separate storage medium. Although FIG. 1 shows that there is only one database 120, this is for convenience of understanding, and depending on the embodiment, the number of databases 120 may be plural.

In relation to the above, it will be described in more detail through the drawings below. The method shown in FIGS. 2 to 7E may be performed, for example, by the electronic device 110 described above.

In step S210, the electronic device 110 checks user identification data capable of identifying the service user and characteristic data of the service user corresponding to the user identification data. In this disclosure, 'user identification data' may mean a member serial number, member ID, or other identifier uniquely assigned to each service user. In addition, 'characteristic data' refers to the remaining information excluding 'user identification data' among the personal information of service users, and may include, for example, the user's gender, age, number of times of service use, number of membership months, etc.

According to one embodiment, the electronic device 110 may read user identification data and characteristic data from the database 120, or temporarily store a copy of the user identification data and characteristic data in the electronic device 110.

According to one embodiment, the electronic device 110 may check user identification data and characteristic data corresponding to a batch of a specific size at every preset data read cycle or whenever a preset read condition is met. At this time, the batch size may be set in advance or may be determined according to the data read cycle. Accordingly, the electronic device 110 can check different user identification data for two or more users at one time, and can also check characteristic data for each user.
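As a rough illustration of batch-wise reading, the sketch below splits a stream of (user identification data, characteristic data) pairs into fixed-size batches. The function name `read_batches` and the in-memory row list are hypothetical; how the device actually schedules read cycles or queries the database 120 is not specified here.

```python
from itertools import islice


def read_batches(rows, batch_size):
    """Yield successive lists of at most `batch_size` rows from an iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical rows: (user identification data, characteristic data)
rows = [
    ("member-0001", {"age": 37, "order_count": 12}),
    ("member-0002", {"age": 24, "order_count": 3}),
    ("member-0003", {"age": 51, "order_count": 40}),
]

for batch in read_batches(rows, batch_size=2):
    user_ids = [user_id for user_id, _ in batch]  # several users checked at once
```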

In step S220, the electronic device 110 generates first non-identification data corresponding to user identification data through a hashing technique.

According to one embodiment, the first non-identifiable data may include data generated through a hash function based on (1) user identification data and (2) additional data with a unique value. For example, the electronic device 110 may generate first non-identified data by merging user identification data and additional data and inputting them into a hash function.

As an example of the above embodiment, the additional data may include a Universally Unique Identifier (UUID), and the hash function may include a hash function based on the Secure Hash Algorithm (SHA). According to this, the electronic device 110 can generate first non-identification data by merging user identification data and UUID and inputting it into a SHA-based hash function. Typically, the hash function may include a SHA-256-based hash function, and the UUID may include the following five versions of UUID. However, it should be noted that the types of hash functions and UUIDs are not limited to those described in this disclosure.

(1) First version UUID: UUID generated based on a timestamp and a MAC (Media Access Control) address.

(2) Second version UUID: UUID generated by replacing some data constituting the first version UUID.

(3) Third version UUID: UUID generated through MD5 (Message-Digest algorithm 5)-based hash function based on the service user's name and namespace.

(4) Fourth version UUID: randomly generated UUID.

(5) Fifth version UUID: UUID generated through a SHA-based hash function based on the service user's name and namespace.
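The merge-and-hash step with a UUID as additional data might look like the following sketch, which uses Python's standard `uuid` and `hashlib` modules. The helper names, the `:` separator, and the default namespace are assumptions for illustration; Python's `uuid` module provides no generator for version 2 UUIDs, so that version is omitted.

```python
import hashlib
import uuid


def make_additional_data(version=4, name="", namespace=uuid.NAMESPACE_DNS):
    """Return a UUID of the requested version to serve as additional data."""
    if version == 1:
        return uuid.uuid1()                 # timestamp and host node ID
    if version == 3:
        return uuid.uuid3(namespace, name)  # MD5 of namespace and name
    if version == 4:
        return uuid.uuid4()                 # random
    if version == 5:
        return uuid.uuid5(namespace, name)  # SHA-1 of namespace and name
    raise ValueError("unsupported UUID version in this sketch")


def first_non_identified(user_id: str, additional: uuid.UUID) -> str:
    """Merge the identifier with the additional data and apply SHA-256."""
    merged = f"{user_id}:{additional}"  # separator is a hypothetical choice
    return hashlib.sha256(merged.encode("utf-8")).hexdigest()


key = first_non_identified("member-0001", make_additional_data(4))
```

The result is a 64-character hexadecimal string that can serve as a table key without exposing the original identifier.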

Meanwhile, according to one embodiment, regarding the additional data used to generate the first de-identified data, if the user identification data is the same but the time at which the final de-identified data is generated is different, the additional data used to generate each final de-identified data may be different. In other words, even when the electronic device 110 de-identifies the personal information of the same service user, it can use different additional data depending on the time of de-identification (the time when the final de-identified data is generated), so that the final de-identified data differs by time of de-identification. This adds a de-identification effect based on the time of de-identification to the de-identification effect based on the service user.
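To make the time-dependent behavior concrete, the sketch below draws fresh additional data for each generation run, so the same user receives unrelated keys in different runs. The function name `snapshot_keys`, the use of a random (version 4) UUID per run, and the `:` separator are assumptions.

```python
import hashlib
import uuid


def snapshot_keys(user_ids, additional):
    """Hash every identifier with the additional data chosen for one run."""
    return {
        user_id: hashlib.sha256(f"{user_id}:{additional}".encode("utf-8")).hexdigest()
        for user_id in user_ids
    }


users = ["member-0001", "member-0002"]

# Each generation time gets its own additional data.
run_a = snapshot_keys(users, uuid.uuid4())
run_b = snapshot_keys(users, uuid.uuid4())

# Same user, different runs: the keys no longer match, so records from
# different generation times cannot be joined through the key.
same_user_linked = run_a["member-0001"] == run_b["member-0001"]
```

With independently drawn random UUIDs, `same_user_linked` is false except in the case of a UUID collision, which is negligibly unlikely.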

In step S230, the electronic device 110 calculates a processing standard specialized for feature data based on meta information defining a data processing method.

According to one embodiment, the meta information may include (1) data field information defining the type of each data included in the feature data, and (2) data de-identification method information defining the data processing method corresponding to each type of data. At this time, the meta information including the data field information and the data de-identification method information may be defined based on setting information generated by the consumer of the final de-identified data or setting information generated by the electronic device 110 in consideration of the consumer.

In relation to this, the data field information may include information defining at least some of the service user's gender, age, number of visits to the service page, number of orders within the service, and number of membership months as the type of data. Specifically, data field information can be defined for each column to which each data belongs.

Meanwhile, the meta information may additionally include data usage information that defines the usage of at least some of the data included in the feature data, and in this case, the data de-identification method information may include information defining the data processing method corresponding to the type of each data and the data usage information.
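One possible in-memory shape for such meta information is sketched below. The class names, field names, usage labels, and the fallback rule in `find_rule` are all hypothetical; the four method labels mirror the processing methods this disclosure names (binary conversion, representative value per section, approximate value, and keeping the value as is).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Method(Enum):
    BINARY = "binary"
    REPRESENTATIVE = "representative_value"
    APPROXIMATE = "approximate_value"
    BYPASS = "bypass"


@dataclass(frozen=True)
class FieldRule:
    field: str                   # data field information (column name)
    method: Method               # data de-identification method information
    usage: Optional[str] = None  # optional data usage information


META_INFO = [
    FieldRule("gender", Method.BINARY),
    FieldRule("age", Method.REPRESENTATIVE, usage="statistics"),
    FieldRule("order_count", Method.APPROXIMATE, usage="model_training"),
    FieldRule("membership_months", Method.BYPASS),
]


def find_rule(field: str, usage: Optional[str] = None) -> Optional[FieldRule]:
    """Prefer a rule matching both field and usage, then a usage-agnostic rule."""
    for rule in META_INFO:
        if rule.field == field and rule.usage == usage:
            return rule
    for rule in META_INFO:
        if rule.field == field and rule.usage is None:
            return rule
    return None
```

For example, `find_rule("age", "statistics")` returns the representative-value rule, while a field with no matching entry yields `None`.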

Meanwhile, data de-identification method information may include information defining the data processing method as follows. However, the data processing method is not limited to the methods listed below. Additionally, the data de-identification method information may be defined by the administrator of the electronic device 110, or may be defined by the electronic device 110 itself according to the type of each data.

(1) First method: Binary convert the data value

(2) Second method: Convert data values into representative values for each section

(3) Third method: Converting data values to approximate values (e.g., truncation)

(4) Fourth method: Keep data values as is (bypass)

To describe each method in more detail, the first method may convert data having two possible values to 0 or 1 depending on the value, or convert data having three or more possible values to 0 or 1 based on a threshold. The second method may divide the range of data values into a plurality of quantile intervals and then convert each data value into a representative value, such as the median, average, minimum, or maximum, of the quantile interval to which it belongs. For example, according to the 50-quantile median method, which is an example of the second method, the range of data values is divided into 50 quantile intervals and each data value is replaced with the median of the interval to which it belongs. Meanwhile, the third method may obtain an approximate value by rounding or truncating the data value below a certain digit, or by applying a specific function to the data value.
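The four methods can be sketched in a few lines of Python. This is an illustrative reading rather than the disclosed implementation: the function names, the `>=` comparison against the threshold, the use of `statistics.quantiles` to find interval boundaries, and the empty-interval fallback are assumptions.

```python
import statistics
from bisect import bisect_right


def binarize(value, positive=None, threshold=None):
    """First method: map a value to 0 or 1."""
    if threshold is not None:
        return 1 if value >= threshold else 0  # three or more possible values
    return 1 if value == positive else 0       # two possible values


def quantile_representative(value, population, n=4):
    """Second method: replace a value with the median of its quantile interval."""
    edges = statistics.quantiles(population, n=n)  # n - 1 cut points
    index = bisect_right(edges, value)             # interval containing the value
    interval = [v for v in population if bisect_right(edges, v) == index]
    # Hypothetical fallback: keep the value if no population member shares its interval.
    return statistics.median(interval) if interval else value


def truncate(value, digits=1):
    """Third method: zero out the lowest `digits` decimal digits of a non-negative integer."""
    factor = 10 ** digits
    return (value // factor) * factor


def bypass(value):
    """Fourth method: keep the value unchanged."""
    return value
```

Calling `quantile_representative(value, population, n=50)` corresponds to the 50-quantile median example, provided the population is large enough for 50 intervals to be meaningful.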

In step S240, the electronic device 110 generates second de-identified data corresponding to the feature data by processing at least a portion of the feature data based on the calculated processing standard. In other words, the electronic device 110 may target feature data belonging to a data field to which the processing standard is applied and convert the feature data according to the processing standard to generate second non-identified data.

However, since processing standards may not be calculated for some of the data fields to which feature data belongs, the electronic device 110 may vary the method of generating the second non-identified data depending on whether a processing standard exists. For example, when feature data belonging to a data field for which a processing standard exists is to be processed, the electronic device 110 may generate the second de-identified data according to the processing standard, and when feature data belonging to a data field for which no processing standard exists is to be processed, it may generate the second non-identified data by binary converting the value of the feature data.
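The fallback behavior can be illustrated as follows. The sketch assumes that processing standards are represented as per-field callables and that the default binary conversion maps any non-zero or non-empty value to 1; neither detail is specified in this disclosure.

```python
def process_record(record, criteria):
    """Apply a field's processing standard if one exists, else binary convert."""
    processed = {}
    for field, value in record.items():
        rule = criteria.get(field)
        if rule is not None:
            processed[field] = rule(value)       # processing standard exists
        else:
            processed[field] = 1 if value else 0  # assumed default binary conversion
    return processed


# Hypothetical processing standard: truncate age to the decade.
criteria = {"age": lambda v: (v // 10) * 10}

record = {"age": 37, "coupon_count": 3, "newsletter": 0}
second_non_identified = process_record(record, criteria)
```

Here only `age` has a processing standard, so `coupon_count` and `newsletter` are reduced to a 0/1 flag.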

In step S250, the electronic device 110 generates final de-identified data including one or more tables having the first de-identified data as a key and at least some of the second de-identified data as values. For example, if the final de-identified data includes tables A, B, C, D, and E, the five tables may have the first de-identified data corresponding to user X as a common key, while the combination of second de-identified data held as values may differ from table to table. As another example, among tables A, B, C, D, and E, some tables (for example, A and B) may have the first non-identifiable data corresponding to user X as a common key and the remaining tables may have the first non-identifiable data corresponding to user Y as a common key, with the second non-identifiable data combined differently in each table. As a result, the manager of the electronic device 110 can use the first non-identifiable data corresponding to a specific user as a key and collectively search the final non-identified data for the values (second non-identifiable data) associated with the key.
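A minimal sketch of such key-value tables is shown below, using nested dictionaries. The table names, field names, placeholder keys, and helper functions are hypothetical; an actual deployment would more likely use database tables.

```python
def build_tables(records, table_fields):
    """Build one table per name, each keyed by the first non-identified data.

    records:      {key: {field: second non-identified value}}
    table_fields: {table name: list of fields that table should hold}
    """
    return {
        name: {
            key: {f: values[f] for f in fields if f in values}
            for key, values in records.items()
        }
        for name, fields in table_fields.items()
    }


def lookup_all(tables, key):
    """Collect the values associated with one key across every table."""
    return {name: table[key] for name, table in tables.items() if key in table}


# Placeholder strings stand in for hashed keys.
records = {
    "hashed-key-x": {"age": 30, "gender": 1, "order_count": 10},
    "hashed-key-y": {"age": 50, "gender": 0, "order_count": 40},
}
tables = build_tables(records, {"A": ["age", "gender"], "B": ["order_count"]})
values_for_x = lookup_all(tables, "hashed-key-x")
```

Both tables share the same key for a given user while holding different combinations of values, which is what allows the collective search by key.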

According to one embodiment, the final non-identified data generated by the electronic device 110 may be snapshot data distinguished by the time of creation. Specifically, the final de-identified data may include one or more tables in which one or more user identification data confirmed at each preset data read cycle and one or more feature data corresponding to each user identification data are matched. In this case, snapshot data created at different times are not associated with a common key.

Hereinafter, the data processing method of FIG. 2 will be described more intuitively with reference to FIG. 3.

Figure 3 is an exemplary diagram of a pipeline through which data is processed according to an embodiment. The process shown in a square represents the operation of the electronic device 110, and the information shown in a cylindrical shape represents data to be processed or data used to process the data. However, the fact that each data is shown as a separate cylindrical figure is for convenience of explanation and does not mean that the database in which each data is stored is separate.

According to one embodiment, the electronic device 110 may check user identification data and feature data that can identify a service user, and calculate processing standards specialized for feature data based on meta information. At this time, the electronic device 110 may calculate a processing standard specialized for feature data by additionally considering the distribution of feature data. However, the distribution of such feature data may be information confirmed by the electronic device 110 while checking the feature data, or may be information included in meta information in advance.

According to one embodiment, the electronic device 110 may process at least a portion of the feature data according to the calculated processing criteria to generate second de-identified data, and may generate first de-identified data corresponding to the user identification data through a hashing technique. In FIG. 3, the first de-identified data generation process is shown as following the processing standard calculation process and the second de-identified data generation process. However, this is for convenience of explanation; depending on the embodiment, the first de-identified data generation process may be performed in any order relative to the other processes once the electronic device 110 has confirmed the user identification data.

FIGS. 4 and 5 are flowcharts for explaining a data processing method according to an additional embodiment. The data processing method according to this disclosure de-identifies personal information, or copies thereof, stored in the database 120. However, because legally regulated terms and conditions make it difficult to store the personal information of withdrawn users for a long period of time, such information must be stored after a primary de-identification process performed in advance.

In this regard, Figure 4 shows a method of processing user identification data of a service user who has withdrawn from service. Specifically, it shows the process of generating first de-identified data by de-identifying user identification data. Step S410 may be performed before step S210 and step S420 may be performed after step S210, but the order of performing each step is not limited to this.

In step S410, the electronic device 110 first applies a hash function to the user identification data of the service user who has withdrawn. There are no special restrictions on the type of hash function applied at this time, but a representative hash function based on SHA can be applied.

In step S420, the electronic device 110 generates first de-identified data of the withdrawn service user by applying a hash function a second time, based on the result of the first application in step S410 and additional data. Specifically, the electronic device 110 may merge the first application result and the additional data and input the merged data into a hash function to generate the first de-identified data of the withdrawn service user. As in step S410, there are no special restrictions on the type of hash function applied, and an SHA-based hash function is a representative example.

According to one embodiment, the additional data may include a UUID, and the UUID may include the five versions of UUID described above with reference to FIG. 2. However, it should be noted that the types of hash functions and UUIDs are not limited to those described in this disclosure.
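Steps S410 and S420 can be sketched as two chained hash calls. The example below assumes SHA-256 as the SHA-based hash function, a version 4 UUID as the additional data, and plain string concatenation as the merge; the disclosure does not fix any of these, and the function names first_pass and second_pass are hypothetical.

```python
import hashlib
import uuid


def first_pass(user_id: str) -> str:
    """Step S410 (sketch): apply a hash function to the user identification data."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()


def second_pass(first_result: str, additional: uuid.UUID) -> str:
    """Step S420 (sketch): merge the first result with additional data and hash again."""
    merged = first_result + additional.hex  # concatenation is an assumed merge rule
    return hashlib.sha256(merged.encode("utf-8")).hexdigest()


# Hypothetical usage for one withdrawn user.
additional_data = uuid.uuid4()
withdrawn_key = second_pass(first_pass("user-0001"), additional_data)

if __name__ == "__main__":
    print(withdrawn_key)
```

Because the second input includes a value that is not derived from the identifier, repeating the first hash on a guessed identifier is not enough to reproduce the final key.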

Meanwhile, FIG. 5 shows a method of processing characteristic data of a service user who is in a withdrawal state. Since the characteristic data of a withdrawn user must be de-identified in advance before going through step S210 of FIG. 2, FIG. 5 shows the pre-processing of the characteristic data of a withdrawn service user. In this case, the 'feature data' in step S210 of FIG. 2 may include (1) characteristic data of service users currently in the subscription state and (2) second de-identified data corresponding to the characteristic data of service users in the withdrawal state.

In step S510, the electronic device 110 checks characteristic data of the service user who has withdrawn.

According to one embodiment, the electronic device 110 may determine whether the user is in an unsubscribed or subscribed state by referring to user status information merged with the user's characteristic data. According to another embodiment, a database storing the characteristic data of currently registered users and a database storing the characteristic data of unsubscribed users may exist separately, or the two kinds of characteristic data may be stored at different locations within one database; in either case, the electronic device 110 may determine whether the user is in the unsubscribed or subscribed state based on the storage medium or storage location of the characteristic data. However, the method by which the electronic device 110 determines the user's status is not limited to this.
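The two ways of determining user status described above can be sketched as a single resolver that prefers merged status information and otherwise falls back to the storage location. The record layout, the status strings, and the storage label "withdrawn_users_db" are hypothetical names chosen only for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class UserStatus(Enum):
    SUBSCRIBED = "subscribed"
    WITHDRAWN = "withdrawn"


@dataclass
class FeatureRecord:
    user_id: str
    features: dict = field(default_factory=dict)
    status: Optional[str] = None   # status information merged with the data, if any
    source: Optional[str] = None   # hypothetical label of the storage location


# Assumed set of storage locations that hold withdrawn users' data.
WITHDRAWN_SOURCES = {"withdrawn_users_db"}


def resolve_status(record: FeatureRecord) -> UserStatus:
    """Use merged status information first, then the storage location."""
    if record.status is not None:
        return UserStatus(record.status)
    if record.source in WITHDRAWN_SOURCES:
        return UserStatus.WITHDRAWN
    return UserStatus.SUBSCRIBED
```

Treating records with no status and an unrecognized location as subscribed is a design choice of this sketch, not something the disclosure specifies.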

In step S520, the electronic device 110 calculates a processing standard specialized for characteristic data of a service user who has withdrawn from service based on meta information defining a data processing method.

In step S530, the electronic device 110 generates second de-identified data of the service user in the withdrawal state by processing at least a portion of that user's characteristic data based on the processing standard specialized for the characteristic data of the service user in the withdrawal state.

Meanwhile, instead of using a processing standard specialized for the characteristic data of service users in the withdrawal state, the electronic device 110 may reuse the processing standard that was pre-calculated for processing the characteristic data of service users currently in the subscription state. In this case, the above-described step S520 may be omitted, and in step S530, the electronic device 110 may generate second de-identified data of the withdrawn service user based on the pre-calculated processing standard.
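The choice between calculating a dedicated processing standard (step S520) and reusing a pre-calculated one can be sketched as an optional argument. In this illustration the processing standard is reduced to a single threshold (the median) and the processing to a binary conversion against it; both are stand-ins assumed for brevity, and the function names are hypothetical.

```python
import statistics
from typing import Optional


def compute_criterion(values: list[float]) -> float:
    """Stand-in for step S520: derive a criterion from the data (here, its median)."""
    return statistics.median(values)


def process_withdrawn(values: list[float], precomputed: Optional[float] = None) -> list[int]:
    """Stand-in for step S530; S520 is skipped when a criterion is supplied."""
    criterion = precomputed if precomputed is not None else compute_criterion(values)
    # Assumed processing: 1 if the value reaches the criterion, otherwise 0.
    return [1 if v >= criterion else 0 for v in values]
```

Reusing the subscribed users' criterion keeps withdrawn and subscribed users on the same scale, at the cost of ignoring any difference in how the withdrawn users' data is distributed.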

In the above flow diagrams, the method is divided into a plurality of steps, but at least some of the steps may be performed in a different order, combined with other steps, omitted, divided into detailed sub-steps, or supplemented with one or more steps not shown.

Figure 6 is an exemplary diagram of a pipeline through which data is processed according to an additional embodiment. Specifically, the upper part of FIG. 6 shows the process of initially de-identifying the data of a user who has withdrawn.

According to one embodiment, the electronic device 110 may check the user identification data and characteristic data of the user in the withdrawn state, calculate processing standards specialized for the characteristic data of the user in the withdrawn state, and generate, according to those processing standards, second de-identified data corresponding to the characteristic data of the user in the withdrawn state.

According to another embodiment, the electronic device 110 checks the user identification data and characteristic data of the user in the unsubscribed state but, unlike what is shown in FIG. 6, does not calculate a separate processing standard. Instead, it may generate second de-identified data corresponding to the characteristic data of the withdrawn user using the processing standards pre-calculated for processing the characteristic data of currently enrolled users.

Meanwhile, as can be seen in FIG. 6, the data of a user in a withdrawal state goes through a primary de-identification process and can then be used again as 'feature data' at the entry stage of the pipeline of FIG. 3.

In addition, as shown at the bottom of FIG. 6, the electronic device 110 may, in response to an external command to remove withdrawal-related data or at its own discretion, remove data about withdrawn users from the subscribed users' data, the user identification data, the characteristic data, or the second de-identified data. Specifically, when some of the registered users withdraw from the service, the electronic device 110 may remove the data of the newly withdrawn users from the data of the registered users. In addition, based on the user identification data of a newly withdrawn user, that user's data can be removed from the user identification data, characteristic data, or second de-identified data that is already being processed.
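The removal of newly withdrawn users' data can be sketched as a filter keyed on user identification data. The record layout and the function name remove_withdrawn are hypothetical; a real deployment would presumably apply the same filter to every dataset still in the pipeline.

```python
def remove_withdrawn(records: list[dict], withdrawn_ids: set[str]) -> list[dict]:
    """Drop every record whose user_id belongs to a newly withdrawn user."""
    return [r for r in records if r["user_id"] not in withdrawn_ids]


# Hypothetical data of subscribed users that is already being processed.
subscribed = [
    {"user_id": "u1", "age": 31},
    {"user_id": "u2", "age": 45},
    {"user_id": "u3", "age": 27},
]

# User u2 has just withdrawn, so that user's rows are removed.
remaining = remove_withdrawn(subscribed, {"u2"})

if __name__ == "__main__":
    print(remaining)
```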

FIGS. 7A to 7E are exemplary diagrams showing data structures processed according to one embodiment.

First, FIG. 7A shows the structure of data that has not undergone de-identification processing. It shows the year/month in which each record was created, user identification data in the form of a serial number that can identify the service user, a binary number indicating the user's gender (for convenience of explanation, 0 is assumed to correspond to female and 1 to male), the user's age, and the number of months that have passed since the user signed up. As shown, sensitive personal information such as the serial number, each user's age, and the number of membership months is exposed.

Meanwhile, FIG. 7B shows the structure of data in which some of the data has undergone de-identification processing. That is, a hash function was applied to the serial number among the personal information, converting it into first de-identified data of a certain number of bits. Note that the first de-identified data in the bottom two rows have the same value, so it can be inferred that they were converted from the user identification data of the same service user, whereas the first de-identified data in the top row was converted from the user identification data of a different user.

Meanwhile, FIG. 7C shows a data structure indicating, for each table, whether each group of users' data has been de-identified. As shown, in each table the data of users currently in the subscription state has been de-identified, and the data of users in the unsubscribed state has not. That is, according to one embodiment, the electronic device 110 may refer to information indicating whether de-identification processing is required for subscribed users and for unsubscribed users, and perform the de-identification processing according to the present disclosure only on the data of users subject to it. Additionally, according to another embodiment, after performing a series of de-identification processing steps, the electronic device 110 may store a data structure indicating whether de-identification processing has been performed, as shown in FIG. 7C, and may send a report or a notification to the administrator, or request a de-identification processing command for data that has not been de-identified (corresponding to a false state).
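A status structure of the kind shown in FIG. 7C can be sketched as a nested mapping of boolean flags, from which the entries still awaiting de-identification are listed for a report or notification. The table names, group labels, and the function name pending are hypothetical.

```python
# Hypothetical status structure: per table, whether each user group's data
# has been de-identified (True) or not (False).
deid_status: dict[str, dict[str, bool]] = {
    "table_a": {"subscribed": True, "withdrawn": False},
    "table_b": {"subscribed": True, "withdrawn": True},
}


def pending(status: dict[str, dict[str, bool]]) -> list[tuple[str, str]]:
    """List (table, user group) pairs whose data is still in the false state."""
    return [
        (table, group)
        for table, groups in status.items()
        for group, done in groups.items()
        if not done
    ]


if __name__ == "__main__":
    for table, group in pending(deid_status):
        print(f"de-identification pending: {table} / {group}")
```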

Meanwhile, FIG. 7D shows the data field information and data de-identification method information used to calculate the processing standards applied to de-identification processing of each table. Here, quantile200 denotes a 200-quantile median value method. The figure indicates that this method is applied to data belonging to the order_cate2_fashion_r3m field in the table in row 1, data belonging to the order_cate1_fashion_r12m field in the table in row 2, and data belonging to the gmv_payltr_r12m field in the table in row 3.

Meanwhile, FIG. 7E shows processing standards specialized for feature data, calculated based on the data field information and data de-identification method information. Specifically, the 200-quantile median value method (quantile200) is applied to data belonging to the order_cate2_fashion_kids_r3m field, the order_cate1_special1_r12m field, and the order_cate2_daily_supply_r12m field of the same sb_rocketpay.csm_AA_category2_sample table, and the calculated quantile values are listed in column 5. That is, the electronic device 110 can calculate processing standards specialized for each field by considering meta information (data field information, data de-identification method information) even within the same table.
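The disclosure does not spell out how the quantile median value method works, so the following is only one plausible reading: the field's values are split into n quantile bins, and each value is replaced by the median of the values in its bin. The function name quantile_median and the use of Python's statistics.quantiles (with its default exclusive method) are assumptions of this sketch; quantile200 would correspond to n=200.

```python
import bisect
import statistics
from collections import defaultdict


def quantile_median(values: list[float], n: int = 200) -> list[float]:
    """Replace each value with the median of its n-quantile bin (assumed reading)."""
    cuts = statistics.quantiles(values, n=n)  # n - 1 cut points derived from the data
    bins: dict[int, list[float]] = defaultdict(list)
    for v in values:
        bins[bisect.bisect_right(cuts, v)].append(v)
    medians = {index: statistics.median(members) for index, members in bins.items()}
    return [medians[bisect.bisect_right(cuts, v)] for v in values]


if __name__ == "__main__":
    # A small n keeps the toy example readable.
    print(quantile_median([1, 2, 3, 4, 5, 6, 7, 8], n=4))
```

In this reading the cut points play the role of the per-field processing standard: they depend on the distribution of that field, so two fields of the same table processed with quantile200 still get different standards.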

According to one embodiment, the electronic device 110 may include a transceiver 111, a processor 113, and a memory 115. In one embodiment, the electronic device 110 is connected to the database 120 through the transceiver 111 and can exchange data.

The processor 113 may perform at least one method described above through FIGS. 1 to 7E. The memory 115 may store information for performing at least one method described above with reference to FIGS. 1 to 7E. Memory 115 may be volatile memory or non-volatile memory.

The processor 113 can control the electronic device 110 to execute programs and provide information. The code of the program executed by the processor 113 may be stored in the memory 115.

The processor 113 is connected to the transceiver 111 and the memory 115, and may: check user identification data capable of identifying the service user and characteristic data of the service user corresponding to the user identification data; generate first de-identified data corresponding to the user identification data through a hashing technique; calculate processing standards specialized for the feature data based on meta information defining the data processing method; generate second de-identified data corresponding to the feature data by processing at least a portion of the feature data based on the processing standards; and generate final de-identified data including one or more tables with the first de-identified data as a key and at least some of the second de-identified data as values.

The electronic device 110 shown in FIG. 8 shows only components related to this embodiment. Accordingly, those skilled in the art can understand that other general-purpose components may be included in addition to the components shown in FIG. 8.

Devices according to the above-described embodiments may include a processor, memory for storing and executing program data, permanent storage such as a disk drive, a communication port for communicating with an external device, and user interface devices such as a touch panel, keys, and buttons. Methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable codes or program instructions executable on the processor. Here, computer-readable recording media include magnetic storage media (e.g., ROM (read-only memory), RAM (random-access memory), floppy disk, hard disk, etc.) and optical reading media (e.g., CD-ROM, DVD (Digital Versatile Disc)). The computer-readable recording medium may be distributed among computer systems connected to a network, so that computer-readable code can be stored and executed in a distributed manner. The media may be readable by a computer, stored in memory, and executed by a processor.

This embodiment can be represented by functional block configurations and various processing steps. These functional blocks may be implemented as any number of hardware and/or software configurations that execute specific functions. For example, embodiments may employ integrated circuit configurations such as memory, processing, logic, and look-up tables that can execute various functions under the control of one or more microprocessors or other control devices. Similar to how the components can be implemented as software programming or software elements, this embodiment includes various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs, and can be implemented in a programming or scripting language such as C, C++, Java, or assembler. Functional aspects may be implemented as algorithms running on one or more processors. Additionally, this embodiment may employ conventional technologies for electronic environment setting, signal processing, message processing, and/or data processing. Terms such as “mechanism,” “element,” “means,” and “composition” can be used broadly and are not limited to mechanical and physical components. These terms may include the meaning of a series of software routines in connection with a processor, etc.

The above-described embodiments are merely examples and other embodiments may be implemented within the scope of the claims described below.

Claims

In a processing method for de-identifying data in an electronic device,

Confirming user identification data capable of identifying a service user and characteristic data of the service user corresponding to the user identification data;

Generating first non-identifiable data corresponding to the user identification data through a hashing technique;

calculating a processing standard specialized for the feature data based on meta information defining a data processing method;

generating second de-identified data corresponding to the feature data by processing at least a portion of the feature data based on the processing criteria; and

A data processing method comprising generating final de-identified data including one or more tables having the first de-identified data as a key and at least some of the second de-identified data as values.
According to paragraph 1,

The meta information is,

A data processing method comprising data field information defining the type of each data included in the feature data and data de-identification method information defining a data processing method corresponding to each type of data.
According to paragraph 2,

The data field information is,

A data processing method that defines at least some of the service user's gender, age, number of visits to the service page, number of orders within the service, and number of membership months as types of data.
According to paragraph 2,

The meta information is,

Additionally comprising data usage information defining the usage of at least some of the data included in the feature data,

The data de-identification method information is,

A data processing method that defines a data processing method corresponding to the type of each data and the data usage information.
According to paragraph 2,

The data de-identification method information is,

A data processing method that defines, as a data processing method, at least one of a first method of converting the data value to binary, a second method of converting the data value into a representative value for each section, a third method of converting the data value into an approximate value, and a fourth method of maintaining the data value as is.
According to paragraph 1,

The first non-identifying data is,

A data processing method comprising data generated through a hash function based on the user identification data and additional data having a unique value.
According to clause 6,

The additional data is,

Contains a Universally Unique Identifier (UUID),

The hash function is,

A data processing method including a hash function based on SHA (Secure Hash Algorithm).
In clause 7,

The UUID is,

A data processing method in which the UUID is one of: a first version UUID generated based on a timestamp and a MAC (Media Access Control) address; a second version UUID in which some data constituting the first version UUID has been replaced; a third version UUID generated through an MD5 (Message-Digest algorithm 5)-based hash function based on the name and namespace of the service user; a randomly generated fourth version UUID; and a fifth version UUID generated through an SHA-based hash function based on the name and namespace of the service user.
According to clause 6,

The additional data is,

A data processing method in which, when the user identification data is the same but the final de-identified data is generated at different times, the additional data used to generate each final de-identified data is different.
According to paragraph 1,

The second non-identifying data is,

A data processing method in which the second de-identified data is generated according to the processing standard when the feature data to be processed belongs to a data field for which the processing standard exists, and is generated by converting the value of the feature data to binary when the feature data to be processed belongs to a data field for which no processing standard exists.
According to paragraph 1,

The final de-identified data is,

A data processing method comprising snapshot data distinguished by the time of creation, and including one or more tables in which one or more user identification data confirmed at each preset data reading cycle and one or more characteristic data corresponding to each user identification data are matched.
According to paragraph 1,

The characteristic data is,

A data processing method comprising second non-identifying data corresponding to characteristic data of a service user currently enrolled and characteristic data of a service user currently unsubscribed.
According to clause 12,

Confirming characteristic data of a service user who is in a withdrawal state; and

Generating second de-identified data of the service user in the withdrawal state by processing at least a portion of the characteristic data of the service user in the withdrawal state based on processing standards pre-calculated for the characteristic data of the currently subscribed service user; the data processing method further comprising the foregoing steps.
According to clause 12,

Confirming characteristic data of a service user who is in a withdrawal state;

Calculating processing standards specialized for characteristic data of the service user who has withdrawn from the service based on meta information defining a data processing method; and

Generating second de-identified data of the unsubscribed service user by processing at least a portion of the characteristic data of the unsubscribed service user based on the processing standards specialized for the unsubscribed service user's characteristic data; the data processing method further comprising the foregoing steps.
According to paragraph 1,

The step of generating the first non-identified data includes:

In generating first non-identifying data of a service user who is in withdrawal status,

First applying a hash function to user identification data of a service user who has withdrawn from the service; and

A data processing method comprising generating first non-identifiable data of the service user in the withdrawn state by secondary application of a hash function based on the primary application result and the additional data.
A non-transitory computer-readable recording medium that records a program for executing the method of claim 1 on a computer.
An electronic device for data de-identification, comprising:

Includes a transceiver, memory for storing instructions, and a processor;

The processor is connected to the transceiver and the memory,

Confirm user identification data that can identify a service user and characteristic data of the service user corresponding to the user identification data,

Generating first non-identifying data corresponding to the user identification data through a hashing technique,

Calculate processing standards specialized for the feature data based on meta information that defines the data processing method,

generating second non-identified data corresponding to the feature data by processing at least a portion of the feature data based on the processing criteria;

An electronic device that generates final de-identified data including one or more tables having the first de-identified data as a key and at least some of the second de-identified data as values.