CN115454949A - Shared data determination method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN115454949A
Authority
CN
China
Prior art keywords
data
record data
network
discriminator
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210892219.1A
Other languages
Chinese (zh)
Inventor
苏森
程祥
王振亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202210892219.1A
Publication of CN115454949A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/176 Support for shared access to files; File sharing support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present application provides a shared data determination method and apparatus, an electronic device, and a storage medium. The method includes: receiving generated record data and sensitive record data of a current batch; updating a local discriminator network according to the generated record data and sensitive record data of the current batch; constructing a local discriminator response using the updated local discriminator network, and using a data sharing platform to train a relational discriminator according to pre-acquired real integrated-record training data, synthetic integrated-record training data, and the discriminator response, so as to update a generator network; inputting pre-collected random vectors into the updated generator network to obtain a generated record data set that includes a plurality of generated record data; and constructing target shared data according to the weight of each generated record data. The method enables sharing of vertically partitioned data while avoiding privacy leakage, and ensures that the shared data has high usability.

Figure 202210892219

Description

Shared data determination method and apparatus, electronic device, and storage medium

Technical field

The present application relates to the technical field of data processing, and in particular to a shared data determination method and apparatus, an electronic device, and a storage medium.

Background

In the related art, vertically partitioned data is typically shared as follows: each data owner builds a generative model from its own local data set, generates data with the learned model, and the data generated by all parties is then merged into a shared data set. However, because the IDs of the local data sets do not match, the resulting shared data has poor usability.

Summary of the invention

In view of this, the purpose of the present application is to propose a shared data determination method and apparatus, an electronic device, and a storage medium.

To this end, in a first aspect, the present application provides a shared data determination method, including:

S1: receiving generated record data and sensitive record data of a current batch;

S2: updating a local discriminator network according to the generated record data and sensitive record data of the current batch;

S3: constructing a local discriminator response using the updated local discriminator network, and using a data sharing platform to train a relational discriminator according to pre-acquired real integrated-record training data, synthetic integrated-record training data, and the discriminator response, so as to update a generator network;

S4: inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of generated record data;

S5: constructing target shared data according to the weight of each generated record data.

In a possible implementation, receiving the generated record data and sensitive record data of the current batch includes:

receiving the generated record data of the current batch from the generator network; and

sampling a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.

In a possible implementation, updating the local discriminator network according to the generated record data and sensitive record data of the current batch includes:

determining a current-batch loss function using the local discriminator network according to the generated record data and sensitive record data of the current batch;

determining gradient information of the local discriminator network according to the current-batch loss function, and pruning (clipping) the gradient information;

according to an adaptive noise generation technique, perturbing the pruned gradient information with Gaussian noise sampled from a Gaussian distribution to determine update parameters; and

updating the local discriminator network according to the update parameters.
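The clip-then-perturb step described above can be sketched as follows. This is a minimal illustration and not the patent's implementation: it reads "pruning" as per-example L2-norm clipping, and it uses a fixed `noise_multiplier` in place of the adaptive noise generation technique, whose details are not given here. The function name and parameters are hypothetical.

```python
import numpy as np

def clip_and_perturb(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to L2 norm <= clip_norm, sum them,
    add Gaussian noise, and return the noisy average gradient.

    Assumption: a fixed noise_multiplier stands in for the adaptive
    noise schedule mentioned in the text.
    """
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale factor is 1 for small gradients and clip_norm / norm for large ones.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]

rng = np.random.default_rng(0)
batch_grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
noisy_grad = clip_and_perturb(batch_grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single record can change the summed gradient, which is what lets the Gaussian noise scale be tied to `clip_norm`.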

In a possible implementation, before determining the current-batch loss function using the local discriminator network according to the generated record data and sensitive record data of the current batch, the method further includes:

initializing the parameters of the discriminator network, the parameters including a first-order momentum estimate [Figure BDA0003768040520000021] and a second-order momentum estimate [Figure BDA0003768040520000022].

In a possible implementation, perturbing the pruned gradient information with Gaussian noise sampled from a Gaussian distribution to determine the update parameters includes:

updating the first-order momentum estimate [Figure BDA0003768040520000023] and the second-order momentum estimate [Figure BDA0003768040520000024] according to update formulas to determine the update parameters, where the update formulas are expressed as

[Figure BDA0003768040520000025]

[Figure BDA0003768040520000026]

where [Figure BDA0003768040520000027] denotes the updated first-order momentum estimate, [Figure BDA0003768040520000028] denotes the updated second-order momentum estimate, β1 denotes the first decay rate, β2 denotes the second decay rate, [Figure BDA0003768040520000029] denotes the first-order momentum estimate of round t-1, [Figure BDA00037680405200000210] denotes the second-order momentum estimate of round t-1, and [Figure BDA00037680405200000211] denotes the gradient vector of the discriminator in round t.
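The formula images themselves are not recoverable from this text. As a point of reference only, the quantities defined above (two momentum estimates, two decay rates, and a round-t gradient) match the standard Adam moment updates, which take the following form. The symbols m, v, and g̃ are assumptions introduced here, with g̃_t standing for the perturbed gradient of round t; the patent's own notation may differ.

```latex
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, \tilde{g}_t \\
v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, \tilde{g}_t \odot \tilde{g}_t
\end{aligned}
```

Here ⊙ denotes the element-wise product, so v_t tracks a running average of the squared gradient components.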

In a possible implementation, inputting the pre-collected random vectors into the updated generator network to obtain the generated record data set includes:

inputting the pre-collected random vectors into the updated generator network, and determining generated record data using the updated generator network; and

repeating steps S1-S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.

In a possible implementation, before constructing the target shared data according to the weight of each generated record data, the method further includes:

saving the weight of the generated record data determined by the generator network in each iteration;

inputting latent vectors drawn from a prior distribution into the generator network of each iteration to determine a plurality of synthetic record data; and

assigning a weight to each synthetic record data, where the formula for updating the weight in each iteration is expressed as

[Figure BDA0003768040520000031]

where w_ri denotes the weight, [Figure BDA0003768040520000032] denotes the synthetic record data, [Figure BDA0003768040520000033] denotes the generated record of the r-th generator, R denotes the number of selected generator networks, d_j denotes the distance function, and M denotes the number of features.

The formula for updating the synthetic record data in each iteration is expressed as

[Figure BDA0003768040520000034]

where [Figure BDA0003768040520000035] denotes the generated record of the r-th generator.

In a second aspect, the present application provides a shared data determination apparatus, including:

a receiving module, configured to receive generated record data and sensitive record data of a current batch;

a first update module, configured to update a local discriminator network according to the generated record data and sensitive record data of the current batch;

a second update module, configured to construct a local discriminator response using the updated local discriminator network, and to use a data sharing platform to train a relational discriminator according to pre-acquired real integrated-record training data, synthetic integrated-record training data, and the discriminator response, so as to update a generator network;

a determination module, configured to input pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of generated record data; and

a construction module, configured to construct target shared data according to the weight of each generated record data.

In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the shared data determination method of the first aspect when executing the program.

In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to execute the shared data determination method of the first aspect.

As can be seen from the above, the shared data determination method and apparatus, electronic device, and storage medium provided by the present application receive generated record data and sensitive record data of a current batch; update a local discriminator network according to the generated record data and sensitive record data of the current batch; construct a local discriminator response using the updated local discriminator network, and use a data sharing platform to train a relational discriminator according to pre-acquired real integrated-record training data, synthetic integrated-record training data, and the discriminator response, so as to update a generator network; input pre-collected random vectors into the updated generator network to obtain a generated record data set that includes a plurality of generated record data; and construct target shared data according to the weight of each generated record data. This enables sharing of vertically partitioned data while avoiding privacy leakage, and thereby ensures that the resulting shared data has high usability.

Description of drawings

To explain the technical solutions of the present application or the related art more clearly, the drawings used in the description of the embodiments or the related art are briefly introduced below. The drawings described below are merely embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 shows a schematic diagram of a multi-party data sharing scenario in the related art.

Fig. 2 shows an exemplary flowchart of a shared data determination method provided by an embodiment of the present application.

Fig. 3 shows a schematic diagram of a multi-party data sharing algorithm satisfying differential privacy according to an embodiment of the present application.

Fig. 4(a) shows a comparison of the IS scores of the algorithms matched against the first MNIST data set according to an embodiment of the present application.

Fig. 4(b) shows a comparison of the FID scores of the algorithms matched against the first MNIST data set according to an embodiment of the present application.

Fig. 4(c) shows a comparison of the IS scores of the algorithms matched against the second MNIST data set according to an embodiment of the present application.

Fig. 4(d) shows a comparison of the FID scores of the algorithms matched against the second MNIST data set according to an embodiment of the present application.

Fig. 5(a) shows a comparison of the results of the first experiment for the algorithms under different privacy budgets according to an embodiment of the present application.

Fig. 5(b) shows a comparison of the results of the second experiment for the algorithms under different privacy budgets according to an embodiment of the present application.

Fig. 6(a) shows a comparison of the results of the first experiment for the algorithms with different numbers of participants according to an embodiment of the present application.

Fig. 6(b) shows a comparison of the results of the second experiment for the algorithms with different numbers of participants according to an embodiment of the present application.

Fig. 7 shows an exemplary structural diagram of a shared data determination apparatus provided by an embodiment of the present application.

Fig. 8 shows an exemplary structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

To make the purpose, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to specific embodiments and the accompanying drawings.

It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of the present application have the ordinary meaning understood by a person of ordinary skill in the field to which the present application belongs. "First", "second", and similar words used in the embodiments do not indicate any order, quantity, or importance; they only distinguish different components. "Including", "comprising", and similar words mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. "Connected", "coupled", and similar words are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like only indicate relative positions; when the absolute position of the described object changes, the relative position may change accordingly.

As mentioned in the background section, data sharing helps unlock the economic value hidden in data. A McKinsey report noted that if the data of seven major industries (commerce, finance, healthcare, education, transportation, electricity, and oil and gas) were opened to one another, substantial economic benefits would follow. Data sharing also helps mine the knowledge contained in data. For example, shared data from hospitals and disease control centers can be used to analyze disease transmission patterns and improve public healthcare; shared data from multiple shopping platforms can be used for personalized product recommendation and a better shopping experience; and shared data from multiple banks can support better customer credit assessment and the monitoring of multi-party lending and financial fraud. In practice, however, data is usually distributed among multiple data owners (for example, hospitals hold residents' medical records and banks hold customers' account records) and contains a large amount of sensitive information. Sharing the data of different owners directly would cause serious privacy leakage.

Referring to Fig. 1, the scenario involves three roles: the data sharing platform, data owners, and data users. Each data owner holds a local sensitive data set covering different attributes of the same group of users. The sharing platform assists in sharing the local sensitive data sets and builds a data sharing model. The new shared data generated by the model has the same statistical distribution as the integrated data set, yet the local sensitive data sets are never shared directly, which protects each data owner's privacy to some extent. Data users can then use the shared data for a variety of data analysis and mining tasks.

As the above shows, multi-party data sharing can prevent the data sharing model and the final data users from obtaining the data owners' private information, but before the multi-party data sharing model is formed, each data owner's private data may still be leaked. Specifically, for each local sensitive data set, the following three roles may threaten privacy: 1) the data sharing platform; 2) the data owners participating in data sharing; 3) data users or other potential attackers who may obtain the final shared data.

Privacy-preserving multi-party data sharing offers a feasible way to address the privacy leakage caused by multi-party data sharing, and the differential privacy (DP) technique proposed in recent years makes such an approach practical. Unlike traditional anonymity-based privacy models (for example, k-anonymity [1] and l-diversity), differential privacy provides a rigorous, quantifiable privacy guarantee for sensitive data. It adds an appropriate amount of noise to statistical results so that modifying a single record in the data set does not significantly affect those results, thereby meeting the privacy protection requirement.
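To make the "add noise to a statistical result" idea concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is a textbook illustration of differential privacy, not part of the patent's method (which perturbs gradients with Gaussian noise); the function name, the sample data, and the choice of ε are assumptions.

```python
import numpy as np

def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query under epsilon-differential privacy.

    Modifying one record changes the true count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
ages = [25, 40, 61, 70]  # hypothetical sensitive attribute
answer = noisy_count(ages, lambda a: a >= 60, epsilon=1.0, rng=rng)
```

A smaller ε means a larger noise scale and therefore stronger privacy at the cost of a less accurate answer.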

Through research, the applicant found that in the related art an intuitive way to share vertically partitioned data is for each data owner to build a generative model, such as a GAN or a Bayesian network, from its own local data set, generate data with the learned model, and then merge the data generated by all parties into a shared data set. However, because the IDs of the local data sets do not match, the usability of the resulting data is hard to guarantee.
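The ID-mismatch problem can be illustrated with a small numerical sketch. Under the assumption that each owner's local generative model reproduces its own attribute's distribution perfectly, independent generation is equivalent to shuffling each column separately: the marginals survive, but the cross-party correlation is lost. The attribute names and numbers below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two attributes of the same users, held by different owners.
income = rng.normal(50.0, 10.0, size=n)                  # owner A
spending = 0.8 * income + rng.normal(0.0, 2.0, size=n)   # owner B

def corr(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

true_corr = corr(income, spending)

# Stand-in for two independently trained local generators: each keeps
# its own marginal distribution but row alignment across owners is lost.
synthetic_a = rng.permutation(income)
synthetic_b = rng.permutation(spending)
naive_corr = corr(synthetic_a, synthetic_b)

print(f"correlation in the integrated data: {true_corr:.3f}")
print(f"correlation after naive merging:    {naive_corr:.3f}")
```

The merged table still looks plausible column by column, which is why the loss of joint structure is easy to miss without a check of this kind.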

For this reason, the shared data determination method and apparatus, electronic device, and storage medium provided by the present application receive generated record data and sensitive record data of a current batch; update a local discriminator network according to the generated record data and sensitive record data of the current batch; construct a local discriminator response using the updated local discriminator network, and use a data sharing platform to train a relational discriminator according to pre-acquired real integrated-record training data, synthetic integrated-record training data, and the discriminator response, so as to update a generator network; input pre-collected random vectors into the updated generator network to obtain a generated record data set that includes a plurality of generated record data; and construct target shared data according to the weight of each generated record data. This enables sharing of vertically partitioned data while avoiding privacy leakage, and thereby ensures that the resulting shared data has high usability.

The shared data determination method provided by the embodiments of the present application is described in detail below through specific embodiments.

Referring to Fig. 2, a shared data determination method provided by an embodiment of the present application includes the following steps:

S1: receiving generated record data and sensitive record data of a current batch;

S2: updating a local discriminator network according to the generated record data and sensitive record data of the current batch;

S3: constructing a local discriminator response using the updated local discriminator network, and using a data sharing platform to train a relational discriminator according to pre-acquired real integrated-record training data, synthetic integrated-record training data, and the discriminator response, so as to update a generator network;

S4: inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of generated record data;

S5: constructing target shared data according to the weight of each generated record data.

In some embodiments, the present application can be applied to a scenario with multiple data owners and a data sharing platform. In this example, each data owner k holds a discriminator network, while the data sharing platform holds a generator network G and two relational discriminators.

In some embodiments, a latent variable z (usually random noise following a Gaussian distribution) is passed through the generator network G to produce generated samples. For the discriminator D this is a binary classification problem, and V(D,G) is the cross-entropy loss commonly used in binary classification. To ensure that V(D,G) reaches its maximum, the discriminator can be trained for k iterations before the generator is trained for one iteration. The specific training process is as follows:

Initialize the parameters of the generator network G and the discriminator network D. Draw n samples from the training set, and have the generator produce n samples from the defined noise distribution. With the generator G fixed, train the discriminator D to distinguish real samples from generated ones as well as possible. After updating the discriminator D k times, update the generator G once so that the discriminator can no longer tell real from generated. After many such iterations, in the ideal case the discriminator D cannot tell whether a sample comes from the real training set or from the generator G; its output probability is then 0.5, and training is complete.
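The alternating schedule described above (k discriminator updates with the generator fixed, then one generator update) can be sketched as a small driver function. The update steps are passed in as callbacks because the network definitions are not specified here; `train_alternating`, `d_step`, and `g_step` are hypothetical names, and the returned history exists only to make the schedule visible.

```python
def train_alternating(d_step, g_step, num_rounds, k):
    """Run num_rounds rounds; each round performs k discriminator updates
    (generator fixed) followed by one generator update.

    Returns the sequence of updates as a list of "D" / "G" markers.
    """
    history = []
    for _ in range(num_rounds):
        for _ in range(k):
            d_step()          # update D on a batch of real and generated samples
            history.append("D")
        g_step()              # update G so that D is fooled more often
        history.append("G")
    return history

# Placeholder steps; a real implementation would update network weights here.
schedule = train_alternating(lambda: None, lambda: None, num_rounds=2, k=3)
print(schedule)
```

In practice the loop would stop on a convergence criterion or an iteration threshold rather than a fixed `num_rounds`.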

在一些实施例中,本申请中提出的DPGDAN算法涉及两个阶段,在阶段1,数据共享平台与K个数据拥有者进行1对K的训练。特别地,对判别器和生成器进行交替训练。其中,每个数据拥有者使用自适应梯度扰动方法对其所持有的判别器参数进行更新;而数据共享平台利用收到的满足差分隐私的判别器反馈对自身所持有的生成器网络的参数进行更新。阶段2,从某个先验分布中采样得到一个随机向量,并将该随机向量输送到训练过程中得到的生成器网络,构建合成记录,并将合成记录聚合得到最终的共享数据记录。In some embodiments, the DPGDAN algorithm proposed in this application involves two stages. In stage 1, the data sharing platform conducts 1-to-K training with K data owners. In particular, the discriminator and generator are trained alternately. Among them, each data owner uses the adaptive gradient perturbation method to update the parameters of the discriminator held by it; and the data sharing platform uses the discriminator feedback that satisfies differential privacy to update the parameters of the generator network held by itself. The parameters are updated. In stage 2, a random vector is sampled from a certain prior distribution, and the random vector is sent to the generator network obtained during the training process to construct a synthetic record, and the synthetic record is aggregated to obtain the final shared data record.

In some embodiments, the DPGDAN algorithm described in this application trains four neural networks: the local discriminator network held by each data owner, the generator network held by the data sharing platform, and the relational discriminator network. Each local discriminator network distinguishes local sensitive data records from records produced by the generator, while the generator network aims to produce synthetic records that these K discriminator networks regard as "real". Given these different goals, the objective function of the generator can be written as follows:

Figure BDA0003768040520000081

where m denotes the batch size,
Figure BDA0003768040520000082
denotes the real-record relational discriminator,
Figure BDA0003768040520000083
denotes the synthetic-record relational discriminator,
Figure BDA0003768040520000084
and
Figure BDA0003768040520000085
denote the relational discriminators,
Figure BDA0003768040520000086
denotes the integrated record produced by the generator,
Figure BDA0003768040520000087
denotes the generator's local record for A, and
Figure BDA0003768040520000088
denotes the generator's local record for B.

The objective function of the synthetic-record relational discriminator is as follows:

Figure BDA0003768040520000089

where m denotes the batch size,
Figure BDA00037680405200000810
denotes the real-record relational discriminator,
Figure BDA00037680405200000811
denotes the synthetic-record relational discriminator,
Figure BDA00037680405200000812
and
Figure BDA00037680405200000813
denote the relational discriminators,
Figure BDA00037680405200000814
denotes the integrated record after the obfuscation operation,
Figure BDA0003768040520000091
denotes the gradient with respect to the integrated record, and
Figure BDA0003768040520000092
denotes the integrated record produced by the generator.

The objective function of the real-record relational discriminator is as follows:

Figure BDA0003768040520000093

where m denotes the batch size,
Figure BDA0003768040520000094
denotes the real-record relational discriminator,
Figure BDA0003768040520000095
denotes the synthetic-record relational discriminator,
Figure BDA0003768040520000096
and
Figure BDA0003768040520000097
denote the relational discriminators,
Figure BDA0003768040520000098
denotes the integrated record data,
Figure BDA0003768040520000099
denotes the integrated record after obfuscation, and x_i denotes an integrated record.

The two local discriminators
Figure BDA00037680405200000910
and
Figure BDA00037680405200000911
distinguish real local record data from generated local record data. Their objective functions are, respectively:

Figure BDA00037680405200000912

Figure BDA00037680405200000913

where x_iA and x_iB denote local record data.

In some embodiments, updating the generator network requires gradient descent on its objective function, but that gradient contains sensitive information in the form of discriminator feedback. To update the generator without leaking any data owner's private information, the generator's gradient computation is first decomposed, the sensitive information related to the discriminator networks is then identified, and that information is finally desensitized.

By the chain rule of calculus, the gradient computation of the generator can be decomposed as follows:

Figure BDA00037680405200000914

where the differentiated quantity is the objective function, K denotes the number of local discriminators,
Figure BDA00037680405200000915
denotes the response of the discriminator network, which contains sensitive information, and
Figure BDA00037680405200000916
denotes the non-sensitive computation factor.
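To make the decomposition concrete, the sketch below separates the two factors for a toy linear generator G(z) = W z. Everything in it is an assumption made for illustration: the linear generator, the names `generate` and `generator_gradient`, and the choice to sum the K responses without further scaling are not taken from this application. Each response stands for a discriminator's gradient with respect to the generated record (the sensitive factor), and the latent vector z supplies the non-sensitive Jacobian factor.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def generate(W: Matrix, z: Vector) -> Vector:
    """Toy linear generator G(z) = W z (hypothetical, for illustration only)."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def generator_gradient(responses: List[Vector], z: Vector) -> Matrix:
    """Chain rule for G(z) = W z.

    responses[k][i] is dL_k / dG(z)_i, the (already desensitized) response of
    discriminator k. Since dG(z)_i / dW_ij = z_j, the gradient with respect to
    W_ij is (sum over k of responses[k][i]) * z_j.
    """
    total = [sum(r[i] for r in responses) for i in range(len(responses[0]))]
    return [[t * zj for zj in z] for t in total]


# Two discriminators (K = 2), a 2-dimensional record, a 2-dimensional latent vector.
responses = [[1.0, 0.0], [0.5, -1.0]]
z = [2.0, 3.0]
grad_W = generator_gradient(responses, z)
```

Only the responses need protection in this arrangement, because the Jacobian factor depends solely on the generator and the latent vector held by the platform.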

In some embodiments, before the local discriminator network determines the current-batch loss function from the generated record data and sensitive record data of the current batch, the method further includes initializing the parameters of the discriminator network. These parameters include the first-order momentum estimate
Figure BDA0003768040520000101
and the second-order momentum estimate
Figure BDA0003768040520000102

In some embodiments, receiving the generated record data and sensitive record data of the current batch includes: receiving the generated record data of the current batch from the generator network; and sampling the sensitive record data of the current batch from a pre-acquired sensitive dataset.

In some embodiments, updating the local discriminator network according to the generated record data and sensitive record data of the current batch includes: using the local discriminator network to determine the current-batch loss function from the generated record data and sensitive record data of the current batch; determining the gradient of the local discriminator network from the current-batch loss function and clipping that gradient; following the adaptive noise generation technique, perturbing the clipped gradient with Gaussian noise sampled from a Gaussian distribution to determine the update parameters; and updating the local discriminator network according to the update parameters.
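The clip-then-perturb step can be sketched as below. This is a minimal illustration under stated assumptions rather than the procedure of this application: the L2 clipping rule and the noise standard deviation `sigma * clip_norm` follow common differentially private gradient descent practice, the function names are hypothetical, and the adaptive rule that chooses `sigma` in each round is not reproduced, so the caller passes it in.

```python
import math
import random
from typing import List


def clip_l2(grad: List[float], clip_norm: float) -> List[float]:
    """Scale the gradient so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def perturb(grad: List[float], clip_norm: float, sigma: float,
            rng: random.Random) -> List[float]:
    """Clip the gradient, then add independent Gaussian noise per coordinate.

    Assumption: the noise standard deviation is sigma * clip_norm; the
    adaptive choice of sigma is left to the caller.
    """
    clipped = clip_l2(grad, clip_norm)
    return [g + rng.gauss(0.0, sigma * clip_norm) for g in clipped]


rng = random.Random(0)                       # seeded for repeatability
clipped = clip_l2([3.0, 4.0], clip_norm=1.0)  # norm 5 is scaled down to 1
noisy = perturb([3.0, 4.0], clip_norm=1.0, sigma=1.0, rng=rng)
```

Clipping bounds how much any single batch can move the gradient, which is what lets the added noise be calibrated to a fixed sensitivity.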

After the gradient computation has been decomposed as above, the following steps are iterated to desensitize the parameters of the generator network and to produce privacy-preserving discriminator network feedback.

Step 1: Initialize the discriminator network parameters, the first-order momentum estimate
Figure BDA0003768040520000103
the second-order momentum estimate
Figure BDA0003768040520000104
and the noise scale σ0.

Step 2: Each data owner receives a batch of synthetic record data from the generator
Figure BDA0003768040520000105

Step 3: Each data owner draws a batch of sensitive record data from its local data
Figure BDA0003768040520000106

Step 4: Feed the data records obtained in steps 2 and 3 into the discriminator network and compute its loss function:

Figure BDA0003768040520000107

where
Figure BDA0003768040520000111
denotes the mean, D is the discriminator network, G is the generator network, z is the latent vector, and x_k is the real record data.

Step 5: Compute the gradient of the discriminator network from the loss function.

Step 6: Following the adaptive noise generation method, perturb the gradient with Gaussian noise sampled from a Gaussian distribution:

Figure BDA0003768040520000112

Step 7: Update the first-order momentum estimate
Figure BDA0003768040520000113
and the second-order momentum estimate
Figure BDA0003768040520000114
as follows:

Figure BDA0003768040520000115

where
Figure BDA0003768040520000116
denotes the updated first-order momentum estimate,
Figure BDA0003768040520000117
denotes the updated second-order momentum estimate, β1 denotes the first decay rate, β2 denotes the second decay rate,
Figure BDA0003768040520000118
denotes the first-order momentum estimate at round t-1,
Figure BDA0003768040520000119
denotes the second-order momentum estimate at round t-1, and
Figure BDA00037680405200001110
denotes the discriminator's gradient vector at round t.

Step 8: Update the weights of the discriminator network.
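Steps 7 and 8 read like an Adam-style update: two exponentially decayed moment estimates with rates β1 and β2, followed by a weight update. The exact formulas of this application are given only in the figures, so the sketch below assumes the standard Adam rule, including bias correction; the function name `adam_step` and the default values for `lr`, `beta1`, `beta2`, and `eps` are illustrative assumptions. In the procedure above, `grad` would be the clipped and noise-perturbed gradient.

```python
import math
from typing import List, Tuple


def adam_step(theta: List[float], grad: List[float],
              m: List[float], v: List[float], t: int,
              lr: float = 1e-3, beta1: float = 0.9, beta2: float = 0.999,
              eps: float = 1e-8) -> Tuple[List[float], List[float], List[float]]:
    """One standard Adam update (assumed form, round t starts at 1).

    m and v are the first- and second-order moment estimates from round t-1.
    Returns the updated parameters and the updated moment estimates.
    """
    m_new = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v_new = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction compensates for the zero initialisation of m and v.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m_new]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v_new]
    theta_new = [p - lr * mh / (math.sqrt(vh) + eps)
                 for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta_new, m_new, v_new


# First round with zero-initialised moments and a single parameter.
theta1, m1, v1 = adam_step(theta=[1.0], grad=[2.0], m=[0.0], v=[0.0], t=1)
```

On the first round the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate.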

In some embodiments, inputting the pre-collected random vector into the updated generator network to obtain the generated record data set includes: inputting the pre-collected random vector into the updated generator network and using it to determine generated record data; and repeating steps S1 to S3 until the number of iterations reaches a threshold, collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.

In some embodiments, snapshot aggregation includes the following steps:

saving the weight information of the generator network obtained at different iteration rounds;

feeding latent vectors drawn from the prior distribution into the generator network to obtain the corresponding synthetic record data;

assigning a weight to each synthetic record data;

updating the weights and the shared data according to the update formulas.

Specifically, before the target shared data is constructed according to the weight of each generated record data, the method further includes: saving the weight of the generated record data determined by the generator network in each iteration; inputting latent vectors drawn from the prior distribution into the generator network of each iteration to determine a plurality of synthetic record data; and assigning a weight to each synthetic record data. The formula for updating the weights in each iteration is

Figure BDA00037680405200001111

where w_ri denotes the weight,
Figure BDA0003768040520000121
denotes the synthetic record data,
Figure BDA0003768040520000122
denotes the record generated by the r-th generator, R denotes the number of selected generator networks, d_j denotes the distance function, and M denotes the number of features.

The formula for updating the synthetic record data in each iteration is

Figure BDA0003768040520000123

where
Figure BDA0003768040520000124
denotes the record generated by the r-th generator.
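The aggregation idea can be sketched as a weighted combination of the records that R saved generator snapshots produce from the same latent vector. This is only an illustration: the weight formula of this application (which involves the distance function d_j over M features) is given in the figure and is not reproduced here, so the weights are supplied by the caller. The toy snapshot functions and the name `aggregate_snapshots` are hypothetical.

```python
from typing import Callable, List, Sequence

Record = List[float]


def aggregate_snapshots(records: Sequence[Record],
                        weights: Sequence[float]) -> Record:
    """Weight-normalised average of the records from R generator snapshots.

    records[r][j] is feature j of the record produced by snapshot r;
    weights[r] is the non-negative weight assigned to that record.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    n_features = len(records[0])
    return [sum(w * rec[j] for w, rec in zip(weights, records)) / total
            for j in range(n_features)]


# Two hypothetical snapshots saved at different training rounds.
snapshots: List[Callable[[Record], Record]] = [
    lambda z: [z[0], 2.0 * z[0]],
    lambda z: [3.0 * z[0], 6.0 * z[0]],
]
z = [1.0]                                   # one latent vector, shared by all snapshots
records = [g(z) for g in snapshots]         # [[1.0, 2.0], [3.0, 6.0]]
shared = aggregate_snapshots(records, weights=[1.0, 3.0])
```

In an iterative scheme, the weights would be recomputed from the distances between each snapshot's record and the current aggregate, and the aggregate recomputed in turn.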

The snapshot aggregation method of this application rests on the following idea. Ideally, a trained generator reproduces the real data distribution, and shared records are synthesized by following standard GAN practice: latent vectors z_i sampled from the prior distribution are fed to the trained generator network, and the generator's output is taken as the shared data. However, because the privacy budget is limited, the generator and discriminator networks may not have been trained for long enough. Using only the final trained generator network to produce shared records would therefore discard useful information held by the generators obtained during training. For this reason, this application proposes a snapshot aggregation method that uses the generator networks obtained during training to improve the utility of the final shared data.

The comparison uses two widely used real datasets, Adult and Bank, for experimental validation. The Adult dataset contains 48,842 US census records, and the Bank dataset contains 45,211 account records from a Portuguese banking institution. Table 1 summarizes the statistics of the two datasets, and Table 2 lists the hyperparameter values used in the experiments.

Table 1. Statistics of the Adult and Bank datasets

Figure BDA0003768040520000125

Table 2. Hyperparameter settings

Figure BDA0003768040520000126

Figure BDA0003768040520000131

The method is evaluated following standard practice for data sharing tasks, that is, the utility of the shared data is measured by machine learning efficacy. Specifically, a prediction model is first trained on the shared data and then tested on a real test set. The higher the accuracy of the prediction model, the better the utility of the data.
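The evaluation protocol (train on shared data, test on real data) can be sketched as follows. The nearest-centroid classifier is a stand-in chosen to keep the example self-contained, not one of the prediction models used in the experiments, and the tiny datasets are invented purely for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]   # (feature vector, class label)


def fit_centroids(data: Sequence[Sample]) -> Dict[int, List[float]]:
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, xj in enumerate(x):
            acc[j] += xj
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(centroids: Dict[int, List[float]], x: Sequence[float]) -> int:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def accuracy(centroids: Dict[int, List[float]], data: Sequence[Sample]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(centroids, x) == y for x, y in data) / len(data)


# Train on (illustrative) shared synthetic records, test on (illustrative) real records.
synthetic_train = [([0.0, 0.1], 0), ([0.2, 0.0], 0), ([1.0, 0.9], 1), ([0.8, 1.0], 1)]
real_test = [([0.1, 0.0], 0), ([0.9, 1.1], 1), ([0.3, 0.2], 0)]
model = fit_centroids(synthetic_train)
score = accuracy(model, real_test)
```

The accuracy obtained this way is compared with the accuracy of the same model trained on the real training data, which serves as the reference point.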

The effect and performance of the method of this application are illustrated below by comparing its accuracy with that of existing methods. Fig. 3 shows the effectiveness of the shared data determination algorithm provided by this application. The experimental results are shown in Fig. 4(a) to Fig. 4(d), Fig. 5(a) to Fig. 5(b), and Fig. 6(a) to Fig. 6(b), where the DPGDAN algorithm is the algorithm provided by the embodiments of this application and the Nonprivat and Nosplit algorithms are algorithms from the related art. These figures show that, across different prediction models, privacy budgets, and numbers of participants, this application achieves performance close to that of the comparison algorithms, and that it better preserves the distribution characteristics of the original data under a given privacy budget. Note that the comparison algorithms are used only to indicate an upper bound on performance and cannot be applied directly to the privacy-preserving sharing of vertically partitioned data.

Fig. 4(a) to Fig. 4(d) show that the shared data generated by this application better supports a variety of data prediction and machine learning tasks. Fig. 5(a) to Fig. 5(b) show that this application maintains good performance under different privacy budgets. This is because, in addition to the snapshot aggregation method ensuring a better trade-off between privacy and data utility, the learning paradigm of this application can use the data of all parties to guide model updates. Furthermore, the 1-versus-K design makes the generator network more likely to receive informative signals from the discriminator networks, which better guides the generator toward the real data distribution.

Fig. 6(a) to Fig. 6(b) show that, compared with the baseline, the drop in accuracy of this application is small. The reason is that increasing the number of discriminators reduces the variance of the min-max objective function over the whole run and speeds up convergence.

As can be seen from the above, the shared data determination method, apparatus, electronic device, and storage medium provided by this application: receive the generated record data and sensitive record data of the current batch; update the local discriminator network according to the generated record data and sensitive record data of the current batch; construct a local discriminator response using the updated local discriminator network, and use the data sharing platform to train the relational discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, so as to update the generator network; input pre-collected random vectors into the updated generator network to obtain a generated record data set comprising a plurality of generated record data; and construct the target shared data according to the weight of each generated record data. This allows vertically partitioned data to be shared while avoiding privacy leakage, and ensures that the resulting shared data has high utility.

To address the problem noted in the background that single-party data publishing methods satisfying differential privacy cannot be applied directly to the sharing of vertically partitioned data, the applicant proposes a multi-party data sharing method that satisfies differential privacy (the DPGDAN algorithm). The main idea is that the data owners and the data sharing platform jointly train a customized generative adversarial network (GAN) to extract the distribution information of each local sensitive dataset. Specifically, each data owner holds a discriminator and trains it on its local sensitive dataset. The discriminator's feedback is then desensitized and uploaded to the data sharing platform, which updates the generator's parameters using the collected discriminator feedback together with the feedback of the relational discriminator. After training, the generator produces the shared data with the assistance of the discriminators. Because the generator updates use feedback from every data owner, the final shared data can have high utility.

It should be noted that the method of the embodiments of this application may be executed by a single device, such as a computer or a server. The method may also be applied in a distributed scenario and completed by multiple devices working together. In that case, one of the devices may execute only one or more steps of the method, and the devices interact with each other to complete the method.

It should be noted that the above describes some embodiments of this application. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that of the described embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

Fig. 7 shows an exemplary structural diagram of a shared data determination apparatus provided by an embodiment of this application.

Based on the same inventive concept, and corresponding to the method of any of the above embodiments, this application further provides a shared data determination apparatus.

Referring to Fig. 7, the shared data determination apparatus includes a receiving module, a first update module, a second update module, a determination module, and a construction module, where:

the receiving module is configured to receive the generated record data and sensitive record data of the current batch;

the first update module is configured to update the local discriminator network according to the generated record data and sensitive record data of the current batch;

the second update module is configured to construct a local discriminator response using the updated local discriminator network, and to use the data sharing platform to train the relational discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, so as to update the generator network;

the determination module is configured to input pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set including a plurality of generated record data;

the construction module is configured to construct the target shared data according to the weight of each generated record data.

In a possible implementation, the receiving module is further configured to:

receive the generated record data of the current batch from the generator network;

sample the sensitive record data of the current batch from a pre-acquired sensitive dataset.

In a possible implementation, the first update module is further configured to:

use the local discriminator network to determine the current-batch loss function from the generated record data and sensitive record data of the current batch;

determine the gradient of the local discriminator network from the current-batch loss function, and clip the gradient;

following the adaptive noise generation technique, perturb the clipped gradient with Gaussian noise sampled from a Gaussian distribution to determine the update parameters;

update the local discriminator network according to the update parameters.

In a possible implementation, the apparatus further includes an initialization module.

The initialization module is further configured to:

initialize the parameters of the discriminator network, the parameters including the first-order momentum estimate
Figure BDA0003768040520000161
and the second-order momentum estimate
Figure BDA0003768040520000162

In a possible implementation, the second update module is further configured to:

update the first-order momentum estimate
Figure BDA0003768040520000163
and the second-order momentum estimate
Figure BDA0003768040520000164
according to the update formulas to determine the update parameters, where the update formulas are

Figure BDA0003768040520000165

Figure BDA0003768040520000166

where
Figure BDA0003768040520000167
denotes the updated first-order momentum estimate,
Figure BDA0003768040520000168
denotes the updated second-order momentum estimate, β1 denotes the first decay rate, β2 denotes the second decay rate,
Figure BDA0003768040520000169
denotes the first-order momentum estimate at round t-1,
Figure BDA00037680405200001610
denotes the second-order momentum estimate at round t-1, and
Figure BDA00037680405200001611
denotes the discriminator's gradient vector at round t.

In a possible implementation, the determination module is further configured to:

input the pre-collected random vector into the updated generator network, and use the updated generator network to determine generated record data;

repeat steps S1 to S3 until the number of iterations reaches a threshold, collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.

In a possible implementation, the apparatus further includes a third update module.

The third update module is further configured to:

save the weight of the generated record data determined by the generator network in each iteration;

input latent vectors drawn from the prior distribution into the generator network of each iteration to determine a plurality of synthetic record data;

assign a weight to each synthetic record data, where the formula for updating the weights in each iteration is

Figure BDA00037680405200001612

where w_ri denotes the weight,
Figure BDA00037680405200001613
denotes the synthetic record data,
Figure BDA00037680405200001614
denotes the record generated by the r-th generator, R denotes the number of selected generator networks, d_j denotes the distance function, and M denotes the number of features.

The formula for updating the synthetic record data in each iteration is

Figure BDA0003768040520000171

where
Figure BDA0003768040520000172
denotes the record generated by the r-th generator.

For convenience of description, the above apparatus is described in terms of separate functional modules. When implementing this application, the functions of the modules may of course be implemented in one or more pieces of software and/or hardware.

The apparatus of the above embodiment is used to implement the corresponding shared data determination method of any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.

Fig. 8 shows an exemplary structural diagram of an electronic device provided by an embodiment of this application.

Based on the same inventive concept, and corresponding to the method of any of the above embodiments, this application further provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the shared data determination method of any of the above embodiments when executing the program. Fig. 8 shows a more specific hardware structure of the electronic device provided by this embodiment. The device may include a processor 810, a memory 820, an input/output interface 830, a communication interface 840, and a bus 850. The processor 810, the memory 820, the input/output interface 830, and the communication interface 840 are communicatively connected to one another inside the device through the bus 850.

The processor 810 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided by the embodiments of this specification.

The memory 820 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 820 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 820 and is invoked and executed by the processor 810.

The input/output interface 830 connects to an input/output module for information input and output. The input/output module may be built into the device as a component (not shown in the figure) or attached externally to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone, and various sensors; output devices may include a display, speaker, vibrator, and indicator lights.

The communication interface 840 connects to a communication module (not shown in the figure) so that the device can communicate with other devices. The communication module may communicate over a wired connection (for example USB or a network cable) or a wireless connection (for example a mobile network, Wi-Fi, or Bluetooth).

The bus 850 includes a path that carries information between the components of the device (for example the processor 810, the memory 820, the input/output interface 830, and the communication interface 840).

It should be noted that although only the processor 810, the memory 820, the input/output interface 830, the communication interface 840, and the bus 850 are shown for the device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figure.

The electronic device of the above embodiment is used to implement the corresponding shared data determination method of any of the foregoing embodiments and has the beneficial effects of the corresponding method embodiment, which are not repeated here.

Based on the same inventive concept, and corresponding to the method of any of the above embodiments, the present application also provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the shared data determination method of any of the above embodiments.

The computer-readable media of this embodiment include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.

The computer instructions stored in the storage medium of the above embodiment cause the computer to execute the shared data determination method of any of the above embodiments, and have the beneficial effects of the corresponding method embodiment, which are not repeated here.

Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present application (including the claims) is limited to these examples. Within the spirit of the present application, technical features of the above embodiment or of different embodiments may also be combined, steps may be carried out in any order, and many other variations of the different aspects of the embodiments described above exist, which are not described in detail for the sake of brevity.

In addition, to simplify the description and discussion, and so as not to obscure the embodiments of the present application, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided figures. Furthermore, apparatus may be shown in block diagram form in order to avoid obscuring the embodiments of the present application, which also reflects the fact that details of the implementation of such block-diagram apparatus depend heavily on the platform on which the embodiments are to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present application, it will be apparent to those skilled in the art that the embodiments may be practiced without these specific details or with variations of them. Accordingly, these descriptions should be regarded as illustrative rather than restrictive.

Although the present application has been described in conjunction with specific embodiments, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.

The embodiments of the present application are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the embodiments of the present application shall fall within the protection scope of the present application.

Claims (10)

1. A method for determining shared data, comprising:
S1: receiving generated record data and sensitive record data of a current batch;
S2: updating a local discriminator network, using the local discriminator network, according to the generated record data and the sensitive record data of the current batch;
S3: constructing a local discriminator response using the updated local discriminator network, and, using a data sharing platform, training a relation discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, so as to update a generator network;
S4: inputting pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of pieces of generated record data; and
S5: constructing target shared data according to a weight of each piece of generated record data.
2. The method of claim 1, wherein receiving the generated record data and the sensitive record data of the current batch comprises:
receiving the generated record data of the current batch from the generator network; and
sampling from a pre-acquired sensitive data set to obtain the sensitive record data of the current batch.
3. The method of claim 1, wherein updating the local discriminator network, using the local discriminator network, according to the generated record data and the sensitive record data of the current batch comprises:
determining a current-batch loss function, using the local discriminator network, according to the generated record data and the sensitive record data of the current batch;
determining gradient information of the local discriminator network according to the current-batch loss function, and clipping the gradient information;
perturbing the clipped gradient information, according to an adaptive noise generation technique, with Gaussian noise sampled from a Gaussian distribution, so as to determine update parameters; and
updating the local discriminator network according to the update parameters.
4. The method of claim 3, wherein determining the current-batch loss function, using the local discriminator network, according to the generated record data and the sensitive record data of the current batch further comprises:
initializing parameters of the discriminator network, the parameters of the discriminator network comprising a first-order momentum estimate
Figure FDA0003768040510000021
and a second-order momentum estimate
Figure FDA0003768040510000022
5. The method of claim 4, wherein perturbing the clipped gradient information with Gaussian noise sampled from the Gaussian distribution to determine the update parameters comprises:
updating the first-order momentum estimate
Figure FDA0003768040510000023
and the second-order momentum estimate
Figure FDA0003768040510000024
according to an update formula to determine the update parameters, wherein the update formula is expressed as
Figure FDA0003768040510000025
Figure FDA0003768040510000026
where
Figure FDA0003768040510000027
denotes the updated first-order momentum estimate,
Figure FDA0003768040510000028
denotes the updated second-order momentum estimate, β₁ denotes a first decay rate, β₂ denotes a second decay rate,
Figure FDA0003768040510000029
denotes the first-order momentum estimate of round t-1,
Figure FDA00037680405100000210
denotes the second-order momentum estimate of round t-1, and
Figure FDA00037680405100000211
denotes the gradient vector of round t-1.
6. The method of claim 1, wherein inputting the pre-collected random vectors into the updated generator network to obtain the generated record data set comprises:
inputting the pre-collected random vectors into the updated generator network, and determining generated record data using the updated generator network; and
repeating steps S1 to S3 until the number of iterations reaches a threshold, and collecting the generated record data determined by the generator network in each iteration to obtain the generated record data set.
7. The method of claim 6, wherein constructing the target shared data according to the weight of each piece of generated record data further comprises:
storing the weight of the generated record data determined by the generator network in each iteration;
inputting hidden vectors drawn from a prior distribution into the generator network in each iteration to determine a plurality of pieces of synthetic record data; and
assigning a weight to each piece of synthetic record data, wherein the update formula for updating the weight in each iteration is expressed as
Figure FDA0003768040510000031
where w_ri denotes the weight,
Figure FDA0003768040510000032
denotes the synthetic record data,
Figure FDA0003768040510000033
denotes the generated record of the r-th generator, R denotes the number of generator networks selected, d_j denotes the distance function, and M denotes the number of features;
and the update formula for updating the synthetic record data in each iteration is expressed as
Figure FDA0003768040510000034
where
Figure FDA0003768040510000035
denotes the generated record of the r-th generator.
8. A shared data determination apparatus, comprising:
a receiving module configured to receive generated record data and sensitive record data of a current batch;
a first updating module configured to update a local discriminator network, using the local discriminator network, according to the generated record data and the sensitive record data of the current batch;
a second updating module configured to construct a local discriminator response using the updated local discriminator network, and, using a data sharing platform, to train a relation discriminator according to pre-acquired real integrated record training data, synthetic integrated record training data, and the discriminator response, so as to update a generator network;
a determining module configured to input pre-collected random vectors into the updated generator network to obtain a generated record data set, the generated record data set comprising a plurality of pieces of generated record data; and
a construction module configured to construct target shared data according to a weight of each piece of generated record data.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to implement the method of any one of claims 1 to 7.
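Claims 3 to 5 describe clipping (pruning) the discriminator gradient, perturbing it with Gaussian noise, and updating first- and second-order momentum estimates. The formulas themselves are images that are not reproduced in this text, so the following Python sketch is only an illustration under stated assumptions: it uses the standard Adam moment updates with bias correction, a fixed noise scale in place of the adaptive noise generation technique (whose details are not given here), and hypothetical names such as `clip_by_l2_norm`, `update_moments`, and `private_step`.

```python
import math
import random


def clip_by_l2_norm(grad, clip_norm):
    """Rescale grad so that its L2 norm does not exceed clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def update_moments(m, v, grad, beta1, beta2):
    """Standard Adam moment updates (assumed form, not taken from the claims)."""
    m_new = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    v_new = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    return m_new, v_new


def private_step(params, grad, m, v, t, clip_norm=1.0, noise_multiplier=1.0,
                 lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, rng=None):
    """One clipped, noised, Adam-style update; t is the 1-based round index."""
    rng = rng if rng is not None else random.Random()
    clipped = clip_by_l2_norm(grad, clip_norm)
    # Assumption: a fixed noise scale proportional to the clipping norm stands
    # in for the adaptive noise generation technique named in claim 3.
    sigma = noise_multiplier * clip_norm
    noisy = [g + rng.gauss(0.0, sigma) for g in clipped]
    m, v = update_moments(m, v, noisy, beta1, beta2)
    # Bias correction as in standard Adam (an assumption here).
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    new_params = [p - lr * mh / (math.sqrt(vh) + eps)
                  for p, mh, vh in zip(params, m_hat, v_hat)]
    return new_params, m, v
```

Calling `private_step` once per batch, and carrying `m`, `v`, and the round index `t` between calls, mirrors the per-batch discriminator update of step S2. Setting `noise_multiplier=0.0` disables the perturbation, which is convenient for checking the clipping and momentum logic in isolation.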
CN202210892219.1A 2022-07-27 2022-07-27 Shared data determination method and device, electronic equipment and storage medium Pending CN115454949A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210892219.1A CN115454949A (en) 2022-07-27 2022-07-27 Shared data determination method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210892219.1A CN115454949A (en) 2022-07-27 2022-07-27 Shared data determination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115454949A true CN115454949A (en) 2022-12-09

Family

ID=84297189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210892219.1A Pending CN115454949A (en) 2022-07-27 2022-07-27 Shared data determination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115454949A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination