CN111382459A - Privacy data integration method and server

Info

Publication number: CN111382459A
Application number: CN201910485170.6A
Authority: CN
Inventors: 高铭智; 王邦杰; 游家牧; 吕品慧; 刘凯诚
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Industrial Technology Research Institute ITRI
Priority date: 2018-12-27
Filing date: 2019-06-05
Publication date: 2020-07-07
Anticipated expiration: 2039-06-05
Also published as: CN111382459B

Abstract

The present disclosure provides a privacy data integration method and a server. The privacy data integration method includes the following steps. A first processing device and a second processing device obtain a first generative model and a second generative model according to first privacy data and second privacy data, respectively. A server generates first generated data and second generated data through the first generative model and the second generative model, respectively. The server integrates the first generated data and the second generated data to generate synthetic data.

Description

Privacy data integration method and server

Technical Field

The present disclosure relates to a privacy data integration method and a server.

Background

For certain business purposes, companies may need to share customer data with one another. However, the fields of different customer data sets may differ, which makes data integration a difficult task. A data integration method is therefore needed.

In addition, customer data may contain private information, and there may be concerns about leaking customers' private data during data integration. Developing data integration methods with privacy protection has therefore become an important direction in big data technology.

Summary of the Invention

According to an embodiment of the present disclosure, a privacy data integration method is provided. The privacy data integration method includes the following steps. A first processing device and a second processing device obtain a first generative model and a second generative model according to first privacy data and second privacy data, respectively. A server generates first generated data and second generated data through the first generative model and the second generative model, respectively. The server integrates the first generated data and the second generated data to generate synthetic data.

According to an embodiment of the present disclosure, a server is provided. The server is configured to execute a privacy data integration method. The privacy data integration method includes the following steps. First generated data and second generated data are generated through a first generative model and a second generative model, respectively. The first generative model and the second generative model are obtained according to first privacy data and second privacy data, respectively. The first generated data and the second generated data are integrated to generate synthetic data.

For a better understanding of the above and other aspects of the present disclosure, embodiments are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

FIG. 1 shows an architecture of horizontal data integration according to an embodiment;

FIG. 2A illustrates an embodiment in which a database join algorithm performs horizontal data integration;

FIG. 2B illustrates an embodiment in which a record linkage algorithm performs horizontal data integration;

FIG. 2C illustrates an embodiment in which a statistical match algorithm performs horizontal data integration;

FIG. 3 shows a first processing device, a second processing device and a server according to an embodiment;

FIG. 4 is a flowchart of a privacy data integration method according to an embodiment;

FIG. 5 is a flowchart for selecting the database join algorithm, the record linkage algorithm or the statistical match algorithm according to an embodiment;

FIG. 6 illustrates a process of obtaining a joint probability distribution according to an embodiment;

FIG. 7 shows a joint probability distribution according to another embodiment.

Description of Reference Symbols:

100: first processing device

200: second processing device

300: server

900: network

A, B, C, D, DA, EC, IC, ID, NR, X, Y, Z: fields

CT53: contingency table

GD51: first generated data

GD52: second generated data

GM51: first generative model

GM52: second generative model

HV1: first hash value

HV2: second hash value

LK: linkage score

JPD53, JPD53': joint probability distribution

NCT53: noisy contingency table

ND: noise data

PD11, PD21, PD31, PD41, PD51: first privacy data

PD12, PD22, PD32, PD42, PD52: second privacy data

RV1, RV2: random vectors

S110, S120, S130, S131, S132, S133, S134, S135, S136, S140, S150: steps

SD13, SD23, SD33, SD43, SD53: synthetic data

SP53: sampled data

Detailed Description

Please refer to FIG. 1, which shows an architecture of horizontal data integration according to an embodiment. First privacy data PD11 has a field Y and a field X and contains "(y11, x11), (y12, x12), (y13, x13)". Second privacy data PD12 has a field Z and a field X and contains "(z21, x21), (z22, x22), (z23, x23)".

The first privacy data PD11 and the second privacy data PD12 can be integrated into synthetic data SD13 having the fields Y, X and Z. For example, the synthetic data SD13 contains "(y31, x31, z31), (y32, x32, z32), (y33, x33, z33)". The fields Y and X of the synthetic data SD13 have a joint probability distribution similar to that of the fields Y and X of the first privacy data PD11, and the fields Z and X of the synthetic data SD13 have a joint probability distribution similar to that of the fields Z and X of the second privacy data PD12. The synthetic data SD13 can therefore represent both the first privacy data PD11 and the second privacy data PD12.

In addition, the records "(y11, x11), (y12, x12), (y13, x13)" of the first privacy data PD11 and the records "(z21, x21), (z22, x22), (z23, x23)" of the second privacy data PD12 do not appear directly in the synthetic data SD13. The result of the data integration therefore provides privacy protection.

Please refer to FIG. 2A, which illustrates an embodiment in which a database join algorithm performs horizontal data integration. First privacy data PD21 has fields EC, ID and IC. The field EC is the energy consumption level, the field ID is the user identification, and the field IC is the income level. The user identification is a direct identifier; the energy consumption level and the income level are indirect identifiers. A direct identifier points directly to a specific person, whereas an indirect identifier does not. Second privacy data PD22 has fields NR, ID and IC. The field NR is the total number of rooms, which is an indirect identifier. In the database join algorithm, the fields ID and IC serve as joint keys. For example, the content of the field ID of the first privacy data PD21 (or the second privacy data PD22) is filled into the field ID of synthetic data SD23. According to the content of the field ID of the first privacy data PD21, the content of the field EC of the first privacy data PD21 is filled into the corresponding entries of the field EC of the synthetic data SD23. According to the content of the field ID of the second privacy data PD22, the content of the field NR of the second privacy data PD22 is filled into the corresponding entries of the field NR of the synthetic data SD23. According to the content of the field ID of the first privacy data PD21 (or the second privacy data PD22), the content of the field IC of the first privacy data PD21 (or the second privacy data PD22) is filled into the corresponding entries of the field IC of the synthetic data SD23.
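The join described for FIG. 2A can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of the disclosure: the helper name `database_join` and all record values are hypothetical; only the field names EC, ID, IC and NR come from the description.

```python
def database_join(first, second, keys):
    """Inner-join two lists of dict records on the given joint-key fields."""
    # Index the second data set by the values of its joint-key fields.
    index = {}
    for row in second:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    joined = []
    for row in first:
        for match in index.get(tuple(row[k] for k in keys), []):
            merged = dict(row)
            merged.update(match)  # the shared key fields carry identical values
            joined.append(merged)
    return joined


# Hypothetical records; real data would come from the two parties.
pd21 = [
    {"EC": 3, "ID": "u01", "IC": 2},
    {"EC": 1, "ID": "u02", "IC": 4},
]
pd22 = [
    {"NR": 5, "ID": "u02", "IC": 4},
    {"NR": 2, "ID": "u01", "IC": 2},
]
sd23 = database_join(pd21, pd22, keys=("ID", "IC"))
```

Each resulting record carries the fields EC, ID, IC and NR, matching the layout of the synthetic data SD23.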

Please refer to FIG. 2B, which illustrates an embodiment in which a record linkage algorithm performs horizontal data integration. First privacy data PD31 has fields EC, IC and DA. The field EC is the energy consumption level, the field IC is the income level, and the field DA is the total debt. The energy consumption level, the income level and the total debt are indirect identifiers. Second privacy data PD32 has fields NR, IC and DA. The field NR is the total number of rooms, which is an indirect identifier. In the record linkage algorithm, the fields IC and DA are used to calculate a linkage score LK. For example, the linkage score LK between the first row of the first privacy data PD31 and the first row of the second privacy data PD32 is 1.8, and the linkage score between the first row of the first privacy data PD31 and the seventh row of the second privacy data PD32 is 0.8. The first privacy data PD31 and the second privacy data PD32 are linked according to the linkage score LK to obtain synthetic data SD33 having the fields EC, IC and NR.
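A possible form of such a linkage is sketched below in Python. The description does not state how the linkage score LK is computed, so the scoring rule here (a sum of per-field similarities, each between 0 and 1) is an assumption, as are the helper names, the `scales` normalization, the threshold and all record values.

```python
def linkage_score(a, b, fields, scales):
    """Sum of per-field similarities, each in [0, 1] (assumed scoring rule)."""
    score = 0.0
    for f in fields:
        diff = abs(a[f] - b[f]) / scales[f]
        score += max(0.0, 1.0 - diff)
    return score


def link_records(first, second, fields, scales, threshold):
    """Pair each record of `first` with its best-scoring record in `second`."""
    linked = []
    for a in first:
        best = max(second, key=lambda b: linkage_score(a, b, fields, scales))
        # Keep the pair only if the best score reaches the threshold.
        if linkage_score(a, best, fields, scales) >= threshold:
            linked.append({"EC": a["EC"], "IC": a["IC"], "NR": best["NR"]})
    return linked


# Hypothetical records and value ranges.
pd31 = [{"EC": 3, "IC": 2, "DA": 300}, {"EC": 1, "IC": 5, "DA": 900}]
pd32 = [{"NR": 4, "IC": 5, "DA": 880}, {"NR": 2, "IC": 2, "DA": 310}]
scales = {"IC": 5, "DA": 1000}
sd33 = link_records(pd31, pd32, ("IC", "DA"), scales, threshold=1.5)
```

With two linking fields, this assumed score ranges from 0 to 2, and a higher score indicates a closer match.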

Please refer to FIG. 2C, which illustrates an embodiment in which a statistical match algorithm performs horizontal data integration. First privacy data PD41 has fields EC, IC and DA. The field EC is the energy consumption level, the field IC is the income level, and the field DA is the total debt. The energy consumption level, the income level and the total debt are indirect identifiers. Second privacy data PD42 has fields NR, IC and DA. The field NR is the total number of rooms, which is an indirect identifier. In the statistical match algorithm, the common field DA is used to calculate an absolute value of error. For example, for the field DA of the second privacy data PD42, the absolute values of error relative to 302 (the first row of the field DA of the first privacy data PD41) are "2, 13, 189, 77, 49, 4, 142". For the field DA of the second privacy data PD42, the absolute values of error relative to 310 (the second row of the field DA of the first privacy data PD41) are "189, 204, 2, 114, 240, 177, 49". The first privacy data PD41 and the second privacy data PD42 are linked according to these absolute values of error to obtain synthetic data SD43 having the fields EC, IC and NR.
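The matching on absolute error can be sketched as a nearest-neighbor search over the common field. The Python below is a minimal illustration under assumptions: the helper name `statistical_match`, the rule of taking the record with the smallest absolute error, and all record values are hypothetical rather than taken from FIG. 2C.

```python
def statistical_match(first, second, common):
    """For each record of `first`, take NR from the `second` record whose
    value in the common field has the smallest absolute error."""
    matched = []
    for a in first:
        errors = [abs(b[common] - a[common]) for b in second]
        donor = second[errors.index(min(errors))]
        matched.append({"EC": a["EC"], "IC": a["IC"], "NR": donor["NR"]})
    return matched


# Hypothetical records sharing the field DA.
pd41 = [{"EC": 2, "IC": 3, "DA": 302}, {"EC": 4, "IC": 1, "DA": 120}]
pd42 = [
    {"NR": 3, "IC": 3, "DA": 300},
    {"NR": 6, "IC": 2, "DA": 115},
    {"NR": 1, "IC": 4, "DA": 491},
]
sd43 = statistical_match(pd41, pd42, "DA")
```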

Please refer to FIGS. 3 and 4. FIG. 3 shows a first processing device 100, a second processing device 200 and a server 300 according to an embodiment. FIG. 4 is a flowchart of a privacy data integration method according to an embodiment. The first processing device 100 and the second processing device 200 are each, for example (but not limited to), a computer, a chip or a circuit board. The first processing device 100 is located at one company, and the second processing device 200 is located at another company. The server 300 is, for example (but not limited to), a computer, a cloud computing center, a computing cluster system or an edge computing system. The server 300 is located at a third party. The first processing device 100 and the server 300 can communicate through a network 900, and the second processing device 200 and the server 300 can also communicate through the network 900. The privacy data integration method is described below with reference to the first processing device 100, the second processing device 200 and the server 300.

In step S110, the first processing device 100 and the second processing device 200 obtain a first generative model GM51 and a second generative model GM52 according to first privacy data PD51 and second privacy data PD52, respectively. For example, the first privacy data PD51 has fields A, B and C, and the second privacy data PD52 has fields D, B and C. A generative model describes the conditional probability of a variable X given that a target variable Y takes the value y (i.e., X|Y=y). Category content of the first privacy data PD51 or the second privacy data PD52 is converted into numerical content. The first privacy data PD51 and the second privacy data PD52 are not transmitted directly to the server 300; only the parameters of the first generative model GM51 and the parameters of the second generative model GM52 are transmitted to the server 300.
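The conversion of category content into numerical content can be illustrated with a simple integer encoding. The description does not specify the encoding scheme, so the Python sketch below is an assumption; the helper names and the sample values are hypothetical.

```python
def fit_encoder(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    return {cat: code for code, cat in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace each category by its integer code."""
    return [mapping[v] for v in values]


def decode(codes, mapping):
    """Recover the original categories from their integer codes."""
    inverse = {code: cat for cat, code in mapping.items()}
    return [inverse[c] for c in codes]


field_b = ["high", "low", "medium", "low"]  # hypothetical category content
mapping = fit_encoder(field_b)
numeric_b = encode(field_b, mapping)
restored_b = decode(numeric_b, mapping)
```

Keeping the mapping allows the numerical content to be converted back into category content after integration.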

Next, in step S120, the server 300 generates first generated data GD51 and second generated data GD52 through the first generative model GM51 and the second generative model GM52, respectively. The first generative model GM51 or the second generative model GM52 is obtained through a generative algorithm, such as a variational auto-encoder (VAE) algorithm, a generative adversarial network (GAN) algorithm, an information generative adversarial network (Info-GAN) algorithm, an AAE algorithm or an ALI algorithm. In this step, a random vector RV1 is input to the first generative model GM51, and the first generative model GM51 outputs the first generated data GD51. The first generated data GD51 is not identical to the first privacy data PD51 but has a similar joint probability distribution. Likewise, another random vector RV2 is input to the second generative model GM52, and the second generative model GM52 outputs the second generated data GD52. The second generated data GD52 is not identical to the second privacy data PD52 but has a similar joint probability distribution.
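The data flow of step S120, in which the server holds only model parameters and feeds random vectors through the model, can be sketched as follows. A trained VAE decoder or GAN generator would be a neural network; the linear mapping used here is a deliberately simplified stand-in, and the parameter layout, the helper name `generate` and all numbers are hypothetical.

```python
import random


def generate(params, n_rows, latent_dim, seed=None):
    """Draw random vectors and pass them through a linear stand-in 'decoder'."""
    rng = random.Random(seed)
    weights, bias = params["weights"], params["bias"]
    rows = []
    for _ in range(n_rows):
        # Random vector (RV1 or RV2 in the description).
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        # One output value per field: weighted sum of z plus a bias.
        row = [sum(w * zi for w, zi in zip(w_row, z)) + b
               for w_row, b in zip(weights, bias)]
        rows.append(row)
    return rows


# Hypothetical parameters received from a processing device:
# three output fields, two latent dimensions.
gm51_params = {
    "weights": [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]],
    "bias": [10.0, 0.0, -1.0],
}
gd51 = generate(gm51_params, n_rows=4, latent_dim=2, seed=0)
```

The server never sees the privacy data in this flow; it only receives `gm51_params` and produces new rows from random input.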

Then, in step S130, the server 300 integrates the first generated data GD51 and the second generated data GD52 to obtain synthetic data SD53. In step S130, the first generated data GD51 and the second generated data GD52 may be integrated through the database join algorithm (for example, as described for FIG. 2A), the record linkage algorithm (for example, as described for FIG. 2B) or the statistical match algorithm (for example, as described for FIG. 2C). The final synthetic data SD53 contains the category content that was previously converted into numerical content.

Please refer to FIG. 5, which is a flowchart for selecting the database join algorithm, the record linkage algorithm or the statistical match algorithm according to an embodiment. In step S131, the server 300 obtains, from the first processing device 100 and the second processing device 200, a first hash value HV1 (shown in FIG. 3) of the first privacy data PD51 and a second hash value HV2 (shown in FIG. 3) of the second privacy data PD52. The first hash value HV1 is obtained by encoding the content of a direct identifier or a representative indirect identifier of the first privacy data PD51. The second hash value HV2 is obtained by encoding the content of a direct identifier or a representative indirect identifier of the second privacy data PD52.

In step S132, the server 300 compares the first hash value HV1 with the second hash value HV2 to determine whether the overlap ratio of the first generated data GD51 and the second generated data GD52 is higher than a predetermined value. The overlap ratio of the first generated data GD51 and the second generated data GD52 is the ratio of duplicated content. If the overlap ratio is not higher than the predetermined value, the process proceeds to step S136; if the overlap ratio is higher than the predetermined value, the process proceeds to step S133.
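Steps S131 and S132 can be sketched with the Python standard library. The description does not name a hash function or give the exact formula for the overlap ratio, so the use of SHA-256 and the choice of denominator (the size of the smaller hash set) are assumptions, as are the helper names and the identifier values.

```python
import hashlib


def hash_values(records, field):
    """Return the set of SHA-256 digests of one identifier field."""
    return {
        hashlib.sha256(str(r[field]).encode("utf-8")).hexdigest()
        for r in records
    }


def overlap_ratio(hv1, hv2):
    """Share of the smaller hash set that also appears in the other set
    (assumed definition of the overlap ratio)."""
    if not hv1 or not hv2:
        return 0.0
    return len(hv1 & hv2) / min(len(hv1), len(hv2))


# Hypothetical identifiers held by the two processing devices.
hv1 = hash_values([{"ID": "u01"}, {"ID": "u02"}, {"ID": "u03"}, {"ID": "u04"}], "ID")
hv2 = hash_values([{"ID": "u03"}, {"ID": "u04"}, {"ID": "u05"}, {"ID": "u06"}], "ID")
ratio = overlap_ratio(hv1, hv2)
```

Only the digests leave each processing device, so the server can count shared records without receiving the identifiers themselves. Plain hashes of low-entropy identifiers can be guessed by brute force, so a keyed hash may be preferable in practice.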

In step S133, the server 300 determines whether the first generated data GD51 and the second generated data GD52 have at least one joint key. If the first generated data GD51 and the second generated data GD52 have a joint key, the process proceeds to step S134; if they do not have a joint key, the process proceeds to step S135.

In step S134, the server 300 integrates the first generated data GD51 and the second generated data GD52 using the database join algorithm (the method described for FIG. 2A).

In step S135, the server 300 integrates the first generated data GD51 and the second generated data GD52 using the record linkage algorithm (the method described for FIG. 2B). In this step, the record linkage algorithm integrates the first generated data GD51 and the second generated data GD52 without using a joint key.

In step S136, the server 300 integrates the first generated data GD51 and the second generated data GD52 using the statistical match algorithm (the method described for FIG. 2C).
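The selection logic of steps S132 to S136 reduces to two conditions and can be written as a small Python function. The function name and the returned labels are hypothetical; the branching follows the flow described for FIG. 5.

```python
def select_algorithm(overlap_ratio, predetermined_value, has_joint_key):
    """Choose the integration algorithm following the flow of FIG. 5."""
    if overlap_ratio <= predetermined_value:
        return "statistical_match"  # step S136: overlap not higher than the value
    if has_joint_key:
        return "database_join"      # step S134: high overlap and a joint key
    return "record_linkage"         # step S135: high overlap, no joint key


choice = select_algorithm(0.8, 0.5, has_joint_key=False)
```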

In step S140 of FIG. 4, the server 300 obtains a joint probability distribution JPD53 of the synthetic data SD53. In step S140, noise data ND is added to the joint probability distribution JPD53. Please refer to FIG. 6, which illustrates the process of obtaining the joint probability distribution JPD53 according to an embodiment. First, the synthetic data SD53 is converted into a contingency table CT53, in which the counts of the various combinations of the fields EC, IC and NR are recorded. Next, the noise data ND is added to the contingency table CT53 to obtain a noisy contingency table NCT53. Then, the counts of the noisy contingency table NCT53 are converted into probability values to obtain the joint probability distribution JPD53.
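The three stages of FIG. 6 (contingency table, added noise, normalization) can be sketched as follows. The description does not state the distribution of the noise data ND; Laplace noise, which is common in differential privacy, is assumed here, and negative noisy counts are clamped to zero. The helper name, the noise scale and the records are hypothetical.

```python
import itertools
import random
from collections import Counter


def noisy_joint_distribution(records, fields, levels, scale, seed=None):
    """Build a contingency table, add Laplace noise, normalize to probabilities."""
    rng = random.Random(seed)
    # Contingency table: count of each observed combination of field values.
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    noisy = {}
    for cell in itertools.product(*(levels[f] for f in fields)):
        # The difference of two exponential draws is Laplace(0, scale) noise.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy[cell] = max(0.0, counts.get(cell, 0) + noise)  # clamp negatives
    total = sum(noisy.values())
    if total == 0:
        return {cell: 1.0 / len(noisy) for cell in noisy}
    return {cell: value / total for cell, value in noisy.items()}


# Hypothetical synthetic data with two levels per field.
sd53 = [
    {"EC": 1, "IC": 1, "NR": 2},
    {"EC": 1, "IC": 1, "NR": 2},
    {"EC": 2, "IC": 1, "NR": 1},
    {"EC": 2, "IC": 2, "NR": 1},
]
levels = {"EC": [1, 2], "IC": [1, 2], "NR": [1, 2]}
jpd53 = noisy_joint_distribution(sd53, ("EC", "IC", "NR"), levels, scale=0.5, seed=0)
```

Noise is added to every cell, including combinations that never occur in the data, so that an empty cell does not reveal the absence of a record.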

In another embodiment, the dimensionality of the joint probability distribution JPD53 may be reduced. Please refer to FIG. 7, which shows a joint probability distribution JPD53' according to another embodiment. The three-dimensional joint probability distribution JPD53 is converted into the two-dimensional joint probability distribution JPD53'. The complexity is thereby reduced from 5^3 to 5^2 + 5^2, that is, from 125 to 50, which effectively lowers the computational load and computation time.
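The reduction in table size can be checked with a short Python sketch. The description does not say which pairs of fields the two-dimensional tables cover; the pairs (EC, IC) and (IC, NR) are assumed here, and the uniform joint distribution and the helper name `marginal` are hypothetical.

```python
from collections import defaultdict


def marginal(joint, keep):
    """Sum a joint distribution over all axes except those listed in `keep`."""
    out = defaultdict(float)
    for cell, p in joint.items():
        out[tuple(cell[i] for i in keep)] += p
    return dict(out)


levels = range(5)
# Hypothetical uniform joint distribution over (EC, IC, NR), five levels each.
jpd53 = {(e, i, n): 1.0 / 125 for e in levels for i in levels for n in levels}

jpd_ec_ic = marginal(jpd53, keep=(0, 1))  # two-dimensional table over (EC, IC)
jpd_ic_nr = marginal(jpd53, keep=(1, 2))  # two-dimensional table over (IC, NR)

cells_before = len(jpd53)                       # 5**3 cells
cells_after = len(jpd_ec_ic) + len(jpd_ic_nr)   # 5**2 + 5**2 cells
```

Storing two pairwise tables instead of the full table trades some dependence information between the outer fields for a smaller table.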

Next, in step S150 of FIG. 4, the server 300 samples the synthetic data SD53 according to the joint probability distribution JPD53 (or the joint probability distribution JPD53') to obtain sampled data SP53. The content of the sampled data SP53 approximates the content of the first privacy data PD51 and the second privacy data PD52.
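Sampling according to a joint probability distribution can be sketched as weighted random selection over the cells of the table. This Python example is an illustration under assumptions: the helper name `sample_records`, the probabilities and the field values are hypothetical.

```python
import random


def sample_records(joint, fields, n_rows, seed=None):
    """Draw records whose field combinations follow the joint distribution."""
    rng = random.Random(seed)
    cells = list(joint)
    weights = [joint[c] for c in cells]
    drawn = rng.choices(cells, weights=weights, k=n_rows)
    return [dict(zip(fields, cell)) for cell in drawn]


# Hypothetical joint distribution over (EC, IC, NR).
jpd53 = {(1, 1, 2): 0.5, (2, 1, 3): 0.3, (2, 2, 3): 0.2}
sp53 = sample_records(jpd53, ("EC", "IC", "NR"), n_rows=5, seed=1)
```

Every sampled record is drawn from the distribution rather than copied from an input record, which is what lets the sampled data approximate the privacy data without reproducing it.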

According to the privacy data integration method described above, the sampled data SP53 is obtained by integrating the first privacy data PD51 and the second privacy data PD52. The sampled data SP53 can represent the approximate content of the first privacy data PD51 and the second privacy data PD52 without leaking any customer's private data, which is useful for big data technology. Moreover, the number of privacy data sets does not limit the present disclosure. For example, three or more privacy data sets may also be integrated by the privacy data integration method described above.

In summary, although the present disclosure has been described above by way of embodiments, the embodiments are not intended to limit the present disclosure. Those of ordinary skill in the art to which the present disclosure pertains may make various changes and modifications without departing from the spirit and scope of the present disclosure. The scope of protection of the present disclosure is therefore defined by the appended claims.

Claims

1. A privacy data integration method, characterized in that the privacy data integration method comprises:

a first processing device and a second processing device obtaining a first generative model and a second generative model according to first privacy data and second privacy data, respectively;

a server generating first generated data and second generated data through the first generative model and the second generative model, respectively; and

the server integrating the first generated data and the second generated data to generate synthetic data.

2. The privacy data integration method according to claim 1, wherein in the step of obtaining the first generative model and the second generative model, the first generative model or the second generative model is obtained through a variational auto-encoder (VAE) algorithm, a generative adversarial network (GAN) algorithm, an information generative adversarial network (Info-GAN) algorithm, an AAE algorithm or an ALI algorithm.

3. The privacy data integration method according to claim 1, wherein in the step of obtaining the first generative model and the second generative model, category content of the first privacy data or the second privacy data is converted into numerical content, and the synthetic data contains the category content.

4. The privacy data integration method according to claim 1, wherein the step of integrating the first generated data and the second generated data comprises:

If the first generated data and the second generated data have at least one joint key, a database join algorithm (Database Join Algorithm) is used to integrate the first generated data and the second generated data.

5. The privacy data integration method according to claim 4, wherein the step of integrating the first generated data and the second generated data comprises:

If the overlap ratio of the first generated data and the second generated data is not higher than a predetermined value, a statistical matching algorithm (Statistical Match Algorithm) is used to integrate the first generated data and the second generated data.

6. The privacy data integration method according to claim 5, wherein the step of integrating the first generated data and the second generated data comprises:

If the overlap ratio of the first generated data and the second generated data is higher than the predetermined value, and the first generated data and the second generated data do not have the joint key, a record linkage algorithm is used to integrate the first generated data and the second generated data.

7. The privacy data integration method according to claim 1, further comprising:

A Joint Probability Distribution is obtained for the synthetic data.

8. The privacy data integration method of claim 7, wherein the joint probability distribution is transformed to reduce dimensionality.

9. The privacy data integration method according to claim 7, wherein in the step of obtaining the joint probability distribution, noise data is added to the joint probability distribution.

10. The privacy data integration method according to claim 7, further comprising:

The synthetic data is sampled to obtain a sampled data.

11. The privacy data integration method according to claim 10, wherein the content of the sample data is similar to the content of the first privacy data and the content of the second privacy data.

12. A server for executing a privacy data integration method, wherein the privacy data integration method comprises:

First generated data and second generated data are generated through a first generative model and a second generative model, respectively, the first generative model and the second generative model being obtained according to first privacy data and second privacy data, respectively; and

The first generated data and the second generated data are integrated to generate synthetic data.

13. The server according to claim 12, wherein the first generative model or the second generative model is obtained through a variational auto-encoder (VAE) algorithm, a generative adversarial network (GAN) algorithm, an information generative adversarial network (Info-GAN) algorithm, an AAE algorithm or an ALI algorithm.

14. The server of claim 12, wherein the category content of the first privacy data or the second privacy data is converted into numerical content, and the synthetic data includes the category content.

15. The server of claim 12, wherein the step of integrating the first generated data and the second generated data comprises:

If the first generated data and the second generated data have at least one joint key, a database join algorithm is used to integrate the first generated data and the second generated data.

16. The server of claim 15, wherein the step of integrating the first generated data and the second generated data comprises:

17. The server of claim 16, wherein the step of integrating the first generated data and the second generated data comprises:

18. The server according to claim 12, wherein the privacy data integration method further comprises:

A Joint Probability Distribution is obtained for the synthetic data.

19. The server of claim 18, wherein the joint probability distribution is transformed to reduce dimensionality.

20. The server of claim 18, wherein in the step of obtaining the joint probability distribution, noise data is added to the joint probability distribution.

21. The server according to claim 18, wherein the privacy data integration method further comprises:

The synthetic data is sampled to obtain sampled data.

22. The server of claim 18, wherein the content of the sample data is similar to the content of the first privacy data and the content of the second privacy data.