CN111340247A - Longitudinal federated learning system optimization method, device and readable storage medium - Google Patents


Info

Publication number
CN111340247A
Authority
CN
China
Prior art keywords
intermediate result
encrypted
data
equipment
learning system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010089045.6A
Other languages
Chinese (zh)
Other versions
CN111340247B (en)
Inventor
郑会钿
范涛
马国强
谭明超
陈天健
杨强
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202010089045.6A priority Critical patent/CN111340247B/en
Publication of CN111340247A publication Critical patent/CN111340247A/en
Priority to PCT/CN2020/129255 priority patent/WO2021159798A1/en
Application granted granted Critical
Publication of CN111340247B publication Critical patent/CN111340247B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06N 20/20: Ensemble learning


Abstract

The invention discloses a vertical federated learning system optimization method, device and readable storage medium. The method comprises: receiving an encrypted reduced intermediate result of a second device sent by the second device; performing data padding on the encrypted reduced intermediate result of the second device to obtain an encrypted padded intermediate result of the second device; calculating, using the encrypted padded intermediate result of the second device, an encrypted first gradient value corresponding to the model parameters of the first device; and updating the model parameters of the first device based on the encrypted first gradient value, iterating until a preset stop condition is detected to be met, thereby obtaining the target model parameters of the trained first device. In vertical federated training, reducing the number of data items contained in the intermediate results of the participating devices reduces the amount of data that must be encrypted and communicated, lowering encryption and communication costs and greatly shortening vertical federated modeling time.

Description

Vertical federated learning system optimization method, device and readable storage medium

Technical Field

The present invention relates to the technical field of machine learning, and in particular to a vertical federated learning system optimization method, device and readable storage medium.

Background Art

With the development of artificial intelligence, the concept of "federated learning" has been proposed to solve the problem of data silos, so that the parties to a federation can train a model and obtain model parameters without handing over their own data, while avoiding leakage of data privacy.

Vertical federated learning applies when the participants' data features overlap little while their users overlap substantially: the users common to the participants, together with their differing data features, are taken out to jointly train a machine learning model. For example, consider two participants A and B in the same region, where participant A is a bank and participant B is an e-commerce platform. A and B share many users in the same region, but their businesses differ, so the user data features they record are different; in particular, the user data features recorded by A and B may be complementary. In such a scenario, vertical federated learning can be used to help A and B build a joint machine learning prediction model and provide better services to their customers.

During modeling in vertical federated learning, the participants exchange, in encrypted form, the intermediate results used to calculate the gradients and the loss function. Each round of model training requires encrypting and exchanging every data item in the intermediate results, and the number of items in an intermediate result equals the number of data items held by the participant. The amount of data to encrypt and exchange is therefore large, encryption and communication costs are high, and vertical federated modeling time is lengthened.

Summary of the Invention

The main purpose of the present invention is to provide a vertical federated learning system optimization method, apparatus, device and readable storage medium, aiming to reduce the encryption and communication costs of the vertical federated learning training process and to shorten modeling time.

To achieve the above object, the present invention provides a vertical federated learning system optimization method, applied to a first device participating in vertical federated learning, the first device being communicatively connected to a second device, the method including the following steps:

receiving the encrypted reduced intermediate result of the second device sent by the second device, wherein the second device is configured to sample the computed original intermediate results corresponding to the individual pieces of sample data of the second device, to obtain a reduced intermediate result corresponding to a subset of the sample data of the second device, and to encrypt the reduced intermediate result of the second device, to obtain the encrypted reduced intermediate result of the second device;

performing data padding on the encrypted reduced intermediate result of the second device, to obtain an encrypted padded intermediate result of the second device, wherein the encrypted padded intermediate result contains the same number of data items as the original intermediate result;

calculating, using the encrypted padded intermediate result of the second device, an encrypted first gradient value corresponding to the model parameters of the first device, updating the model parameters of the first device based on the encrypted first gradient value, and iterating until a preset stop condition is detected to be met, thereby obtaining the target model parameters of the trained first device.

Optionally, the step of performing data padding on the encrypted reduced intermediate result of the second device to obtain the encrypted padded intermediate result includes:

obtaining a sampling lookup table of the second device, and determining, based on the sampling lookup table of the second device, the padding data and the padding positions corresponding to the padding data in the encrypted reduced intermediate result of the second device;

inserting the padding data into the encrypted reduced intermediate result of the second device at the padding positions, to obtain the encrypted padded intermediate result of the second device.
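A minimal sketch of the two padding steps above, under stated assumptions: the exact format of the sampling lookup table is not specified at this point, so the sketch assumes it maps each original position to the index of the retained (representative) ciphertext that fills that position; plain values stand in for ciphertexts, which is harmless here because padding only copies them without decrypting.

```python
def pad_encrypted_result(encrypted_reduced, sampling_table, n_original):
    """Rebuild a full-length encrypted intermediate result from a reduced one.

    encrypted_reduced: ciphertexts for the retained (sampled) positions.
    sampling_table:    assumed mapping from original position to the index in
                       encrypted_reduced whose value fills that position.
    n_original:        number of data items in the original intermediate result.
    """
    padded = [encrypted_reduced[sampling_table[i]] for i in range(n_original)]
    # The padded result has the same item count as the original intermediate result.
    assert len(padded) == n_original
    return padded
```

For example, with two retained ciphertexts each standing for two original positions, `pad_encrypted_result(["c0", "c1"], {0: 0, 1: 0, 2: 1, 3: 1}, 4)` yields a four-item result.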

Optionally, the step of calculating, using the encrypted padded intermediate result of the second device, the encrypted first gradient value corresponding to the model parameters of the first device includes:

calculating an encrypted reduced intermediate result of the first device for use in calculating the gradient value;

calculating an encrypted intermediate result of the first device using the encrypted padded intermediate result of the second device and the encrypted reduced intermediate result of the first device;

calculating, using the encrypted intermediate result of the first device, the encrypted first gradient value corresponding to the model parameters of the first device.

Optionally, the step of calculating the encrypted reduced intermediate result of the first device for use in calculating the gradient value includes:

the first device sampling the computed original intermediate results corresponding to the individual pieces of sample data of the first device, to obtain a reduced intermediate result corresponding to a subset of the sample data of the first device;

encrypting the reduced intermediate result of the first device, to obtain the encrypted reduced intermediate result of the first device.

Optionally, the step of calculating the encrypted intermediate result of the first device using the encrypted padded intermediate result of the second device and the encrypted reduced intermediate result of the first device includes:

performing data padding on the encrypted reduced intermediate result of the first device, to obtain an encrypted padded intermediate result of the first device;

calculating the encrypted intermediate result of the first device using the encrypted padded intermediate result of the first device and the encrypted padded intermediate result of the second device.
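The combining step above operates on two padded results of equal length; a one-line sketch (plain values standing in for ciphertexts, and elementwise addition assumed as the combining operation, which an additively homomorphic scheme supports directly on ciphertexts):

```python
def combine_padded_results(padded_first, padded_second):
    # Elementwise combination of two equal-length padded intermediate results.
    assert len(padded_first) == len(padded_second)
    return [a + b for a, b in zip(padded_first, padded_second)]
```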

To achieve the above object, the present invention further provides a vertical federated learning system optimization method, applied to a second device participating in vertical federated learning, the method including the following steps:

sampling the computed original intermediate results corresponding to the individual pieces of sample data of the second device, to obtain a reduced intermediate result corresponding to a subset of the sample data of the second device;

encrypting the reduced intermediate result of the second device, to obtain the encrypted reduced intermediate result of the second device, and sending it to the first device, so that the first device feeds back the encrypted intermediate result of the first device based on the encrypted reduced intermediate result of the second device, wherein the first device is configured to perform data padding on the received encrypted reduced intermediate result of the second device to obtain the encrypted padded intermediate result of the second device, and to calculate the encrypted intermediate result of the first device using the encrypted padded intermediate result of the second device;

calculating, using the encrypted intermediate result of the first device, an encrypted second gradient value corresponding to the model parameters of the second device, updating the model parameters of the second device based on the encrypted second gradient value, and iterating until a preset stop condition is detected to be met, thereby obtaining the target model parameters of the trained second device.

Optionally, the step of sampling the computed original intermediate results corresponding to the individual pieces of sample data of the second device, to obtain the reduced intermediate result corresponding to a subset of the sample data of the second device, includes:

calculating the original intermediate result of the second device by performing a weighted sum over each piece of sample data of the second device using the model parameters of the second device;

splitting the original intermediate result of the second device based on a threshold, to obtain a first sub-original intermediate result and a second sub-original intermediate result, wherein each data item in the first sub-original intermediate result is less than or equal to the threshold, and each data item in the second sub-original intermediate result is greater than the threshold;

grouping all the data items of the first sub-original intermediate result, determining a representative data item for each group, and forming a third sub-original intermediate result from the representative data items of the groups;

obtaining the reduced intermediate result of the second device based on the third sub-original intermediate result and the second sub-original intermediate result.
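A sketch of this reduction, under stated assumptions: the steps above fix neither how groups are formed nor how a representative is chosen, so the sketch groups the below-threshold values into fixed-size chunks and uses each chunk's mean as its representative; it also returns the position bookkeeping that a sampling lookup table would need to carry.

```python
def reduce_intermediate_result(original, threshold, group_size=2):
    """Split `original` by `threshold`; replace the below-threshold part with
    per-group representatives (assumed: group means); keep the rest verbatim.

    Returns (reduced, table), where table maps each original position to the
    index in `reduced` that stands for it.
    """
    reduced, table = [], {}
    small = [(i, v) for i, v in enumerate(original) if v <= threshold]
    large = [(i, v) for i, v in enumerate(original) if v > threshold]
    # Below-threshold values: one representative per group of group_size items.
    for start in range(0, len(small), group_size):
        group = small[start:start + group_size]
        rep = sum(v for _, v in group) / len(group)
        for i, _ in group:
            table[i] = len(reduced)
        reduced.append(rep)
    # Above-threshold values are kept one-for-one.
    for i, v in large:
        table[i] = len(reduced)
        reduced.append(v)
    return reduced, table
```

Only the shorter `reduced` list then needs to be encrypted and transmitted, which is where the encryption and communication savings come from.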

To achieve the above object, the present invention further provides a vertical federated learning system optimization device, the device including: a memory, a processor, and a vertical federated learning system optimization program stored on the memory and executable on the processor, wherein the vertical federated learning system optimization program, when executed by the processor, implements the steps of the vertical federated learning system optimization method described above.

In addition, to achieve the above object, the present invention further provides a readable storage medium storing a vertical federated learning system optimization program which, when executed by a processor, implements the steps of the vertical federated learning system optimization method described above.

In the present invention, the encrypted reduced intermediate result of the second device sent by the second device is received, wherein the second device samples the computed original intermediate results corresponding to the individual pieces of sample data of the second device to obtain a reduced intermediate result corresponding to a subset of the sample data of the second device, and encrypts the reduced intermediate result to obtain the encrypted reduced intermediate result of the second device. Data padding is then performed on the encrypted reduced intermediate result of the second device to obtain the encrypted padded intermediate result of the second device, wherein the encrypted padded intermediate result contains the same number of data items as the original intermediate result. Next, the encrypted first gradient value corresponding to the model parameters of the first device is calculated using the encrypted padded intermediate result of the second device, the model parameters of the first device are updated based on the encrypted first gradient value, and the process iterates until a preset stop condition is detected to be met, yielding the target model parameters of the trained first device. By reducing the number of data items contained in the intermediate results of the participating devices, the amount of data that needs to be encrypted and communicated is reduced, lowering encryption and communication costs and greatly shortening vertical federated modeling time.

Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the present invention;

FIG. 2 is a schematic flowchart of a first embodiment of the vertical federated learning system optimization method of the present invention;

FIG. 3 is a schematic diagram of sample data involved in an embodiment of the present invention.

The realization of the objects, functional characteristics and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.

Detailed Description of the Embodiments

It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.

As shown in FIG. 1, FIG. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the present invention.

It should be noted that FIG. 1 may serve as a schematic structural diagram of the hardware operating environment of the vertical federated learning system optimization device. In embodiments of the present invention, the vertical federated learning system optimization device may be a PC, or a terminal device with a display function such as a smartphone, a smart TV, a tablet computer or a portable computer.

As shown in FIG. 1, the vertical federated learning system optimization device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used for connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may further include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a Wi-Fi interface). The memory 1005 may be high-speed RAM, or stable non-volatile memory such as disk storage. Optionally, the memory 1005 may also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art will understand that the system structure shown in FIG. 1 does not constitute a limitation on the terminal system, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.

As shown in FIG. 1, the memory 1005, as a readable storage medium, may include an operating system, a network communication module, a user interface module and a vertical federated learning system optimization program.

In the system shown in FIG. 1, the network interface 1004 is mainly used to connect to a backend server and communicate data with it; the user interface 1003 is mainly used to connect to a client and communicate data with it; and the processor 1001 may be used to invoke the vertical federated learning system optimization program stored in the memory 1005.

In this embodiment, the terminal system includes: a memory 1005, a processor 1001, and a vertical federated learning system optimization program stored on the memory 1005 and executable on the processor 1001; when the processor 1001 invokes the vertical federated learning system optimization program stored in the memory 1005, the steps of the vertical federated learning system optimization method provided by the embodiments of the present application are performed.

Based on the above structure, various embodiments of the vertical federated learning system optimization method are proposed.

The embodiments of the present invention provide embodiments of the vertical federated learning system optimization method. It should be noted that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from that given here. The first device and the second device involved in the embodiments of the present invention may be devices participating in vertical federated learning, such as smartphones, personal computers and servers.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the vertical federated learning system optimization method of the present invention. In this embodiment, the vertical federated learning system optimization method includes:

Step S10: receiving the encrypted reduced intermediate result of the second device sent by the second device, wherein the second device is configured to sample the computed original intermediate results corresponding to the individual pieces of sample data of the second device, to obtain a reduced intermediate result corresponding to a subset of the sample data of the second device, and to encrypt the reduced intermediate result of the second device, to obtain the encrypted reduced intermediate result of the second device;

In this embodiment, the first device and the second device establish a communication connection in advance. The local data of the first device and of the second device overlap in the user dimension and differ (possibly entirely) in data features. The first device and the second device use their respective local data to perform sample alignment, determining the users shared by both parties and the differing data features. The first device takes the data of the shared users in its local data as training data, and the second device takes, from the data of the shared users in its local data, the data whose features differ from those of the first device as training data; that is, the users in the finally determined first sample data and second sample data are the same, while the data features differ. The sample alignment between the first device and the second device may use existing sample alignment techniques, which are not described in detail here. For example, FIG. 3 is a schematic diagram of the sample data in the first device and the second device: the local data of the first device includes three users {U1, U2, U3} with data features {X1, X2, X3}, and the local data of the second device includes three users {U1, U2, U4} with data features {X4, X5}. After sample alignment, the training data determined by the first device are the data of users U1 and U2 under data features X1, X2 and X3, and the training data determined by the second device are the data of users U1 and U2 under data features X4 and X5.
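The alignment in the example above amounts to intersecting the two parties' user-ID sets while each party keeps its own features; a minimal sketch (a real system would use a privacy-preserving set-intersection protocol rather than plaintext sets, which this sketch does not attempt):

```python
def align_samples(first_users, second_users):
    """Return the shared user IDs in a deterministic order, so both parties
    index their training rows identically (plaintext stand-in for PSI)."""
    return sorted(set(first_users) & set(second_users))
```

With the FIG. 3 data, `align_samples(["U1", "U2", "U3"], ["U1", "U2", "U4"])` returns the shared users U1 and U2.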

Specifically, in one round of model-parameter updating in vertical federated linear model learning, the first device and the second device exchange, in encrypted form, intermediate results used to calculate the gradients and the loss function. The encryption uses a homomorphic encryption algorithm: a third-party coordinator trusted by both the first device and the second device generates a public key and a private key, and sends the public key to the first device and the second device for encryption; the first device and the second device send the encrypted gradient values and the encrypted loss function to the coordinator for decryption, and then update the local model of the first device and the local model of the second device according to the decrypted gradient values.
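The protocol above relies only on the additive homomorphism of the encryption: [[a]] + [[b]] decrypts to a + b, and a ciphertext can be scaled by a plaintext constant. A toy stand-in illustrating just that interface (deliberately insecure; a real deployment would use an additively homomorphic scheme such as Paillier):

```python
class ToyCipher:
    """Insecure stand-in exposing the additively homomorphic interface the
    text relies on: ciphertext + ciphertext, and plaintext-scalar * ciphertext."""

    def __init__(self, value):
        self._v = value  # a real scheme would hide this value

    def __add__(self, other):
        return ToyCipher(self._v + other._v)

    def __rmul__(self, scalar):
        return ToyCipher(scalar * self._v)

    def decrypt(self):
        return self._v

# [[u_A]] + [[u_B]] decrypts to u_A + u_B; 0.25 * [[u]] decrypts to 0.25 * u.
```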

The linear models involved in this application include, but are not limited to, logistic regression, linear regression, Poisson regression and other linear, weight-based model algorithms. For convenience of description, the present invention takes vertical logistic regression model training as an example: the first device and the second device participating in vertical federated learning jointly build a logistic regression model. The second device owns the data

D_A = {x_i^A, i = 1, ..., n},

where D_A denotes the data set of the second device; the first device owns the data

D_B = {x_i^B, i = 1, ..., n}

and the labels

{y_i, i = 1, ..., n},

where D_B denotes the data set of the first device. Here x_i^A and x_i^B are multidimensional vectors, and y_i is a scalar (for example, a scalar taking the value 0 or 1, indicating yes or no). Define

u_i^A = w_A · x_i^A,  u_i^B = w_B · x_i^B,

where w_A and w_B are the machine learning model parameters corresponding to x_i^A and x_i^B respectively; then

u_i = u_i^A + u_i^B.

The loss function loss (also called the cost function) is taken, under the second-order Taylor approximation of the logistic loss that makes exchanging only u_i^A and (u_i^A)^2 sufficient, as:

loss = (1/n) Σ_i [ log 2 − (1/2) y_i u_i + (1/8) u_i^2 ]

loss is calculated at the first device. From the definition of the loss function, the second device needs to send the intermediate results u_A and u_A^2 to the first device, so that the first device can calculate the loss value. During federated training, the intermediate results need to be encrypted to avoid leaking data privacy, so the second device sends the encrypted intermediate results [[u_A]] and [[u_A^2]] to the first device, where [[·]] denotes homomorphic encryption.

Further, define

d_i = (1/4) u_i − (1/2) y_i.

After homomorphic encryption,

[[d_i]] = (1/4) [[u_i^A]] + [[ (1/4) u_i^B − (1/2) y_i ]].

The gradient function G is G = d·x = ∑_i d_i·x_i, so [[G]] = [[d·x]] = ∑_i [[d_i]]·x_i. The first device computes [[d]] from the encrypted intermediate result [[u_A]] received from the second device together with its own u_B, and from [[d]] further computes the encrypted gradient value [[G_B]] of its local model; at the same time, the first device sends [[d]] to the second device so that the second device can compute the encrypted gradient value [[G_A]] of its local model.
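As a rough illustration of this exchange, the sketch below computes d and the first device's gradient G_B in plaintext (standing in for the homomorphic ciphertexts [[·]]), assuming the commonly used residual form d_i = (1/4)(u_i^A + u_i^B) − (1/2)y_i. All function names and sample values are illustrative assumptions, not the patent's implementation.

```python
def residual(u_a, u_b, y):
    # d_i = (1/4)(u_i^A + u_i^B) - (1/2) y_i, per the definition of d
    return [0.25 * (ua + ub) - 0.5 * yi for ua, ub, yi in zip(u_a, u_b, y)]

def gradient(d, x):
    # G = sum_i d_i * x_i, accumulated per feature dimension
    dims = len(x[0])
    return [sum(d[i] * x[i][j] for i in range(len(x))) for j in range(dims)]

u_a = [0.2, -0.1, 0.4]                        # intermediate results from the second device
u_b = [0.1, 0.3, -0.2]                        # first device's local intermediate results
y = [1, 0, 1]                                 # labels held by the first device
x_b = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]   # first device's features

d = residual(u_a, u_b, y)
g_b = gradient(d, x_b)
```

In the real protocol, d would be computed on ciphertexts ([[u_A]] encrypted, u_B and y in plaintext), which additively homomorphic schemes support since only additions and scalar multiplications are involved.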

具体地,uA

Figure BDA0002383091210000084
的数量分别与
Figure BDA0002383091210000085
的样本数量相同,通常情况下样本数量是非常大的,第二设备需要对uA
Figure BDA0002383091210000086
进行加密再发送至第一设备进行交互,因此整个加密过程非常耗时并且通信量很大。第二设备对原始中间结果uA
Figure BDA0002383091210000087
进行抽样处理,得到精简中间结果u'A
Figure BDA0002383091210000088
原始中间结果到精简中间结果是数据维度上的数据量减少,即精简中间结果中的数据个数小于原始中间结果的数据个数,因此减少了需要加密及通信的数据量,降低了加密和通信成本。进一步,第二设备对精简中间结果进行加密,得到第二设备的加密精简中间结果,然后发送该加密精简中间结果至第一设备。原始中间结果uA
Figure BDA0002383091210000089
的处理过程相似,在本实施例中,以uA为例进行说明。Specifically, u A and
Figure BDA0002383091210000084
The number of and
Figure BDA0002383091210000085
The number of samples is the same, usually the number of samples is very large, the second device needs to
Figure BDA0002383091210000086
The encryption is performed and then sent to the first device for interaction, so the entire encryption process is very time-consuming and has a large amount of communication. The second device compares the original intermediate result u A and
Figure BDA0002383091210000087
Sampling is performed to obtain the simplified intermediate results u' A and
Figure BDA0002383091210000088
From the original intermediate result to the reduced intermediate result, the amount of data in the data dimension is reduced, that is, the number of data in the reduced intermediate result is smaller than the number of data in the original intermediate result, thus reducing the amount of data that needs to be encrypted and communicated, reducing encryption and communication. cost. Further, the second device encrypts the condensed intermediate result to obtain the encrypted condensed intermediate result of the second device, and then sends the encrypted condensed intermediate result to the first device. The original intermediate result u A and
Figure BDA0002383091210000089
The processing process is similar, and in this embodiment, u A is used as an example for description.
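The sampling step described above can be sketched as follows: values with small absolute value are grouped and each group is replaced by one representative, larger values are kept as-is, and a comparison table records which original positions each retained value stands for. The threshold, group count, and grouping rule here are illustrative assumptions.

```python
def reduce_intermediate(u, threshold=0.5, n_groups=2):
    # Split by absolute value: only "small" entries are sampled/replaced.
    small = [(i, v) for i, v in enumerate(u) if abs(v) <= threshold]
    large = [(i, v) for i, v in enumerate(u) if abs(v) > threshold]
    reduced, table = [], []   # table[k] = original positions replaced by reduced[k]
    # Group the small values (here: sort, split evenly, use each group's mean).
    small.sort(key=lambda p: p[1])
    size = max(1, (len(small) + n_groups - 1) // n_groups)
    for g in range(0, len(small), size):
        group = small[g:g + size]
        reduced.append(sum(v for _, v in group) / len(group))
        table.append([i for i, _ in group])
    for i, v in large:        # large values keep their original value
        reduced.append(v)
        table.append([i])
    return reduced, table

reduced, table = reduce_intermediate([0.1, 0.2, 0.9, -0.3, 0.4, -1.2])
```

The reduced list (4 items here instead of 6) is what would be encrypted and transmitted, together with the comparison table.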

Step S20, performing data completion on the encrypted reduced intermediate result of the second device to obtain the encrypted padded intermediate result of the second device, wherein the number of data items in the encrypted padded intermediate result is the same as the number of data items in the original intermediate result;

In this embodiment, evaluating

[[d_i]] = (1/4)·[[u_i^A]] + (1/4)·u_i^B − (1/2)·y_i

requires the intermediate results of the first device and the second device to be data-aligned before the related calculation. Therefore, after the first device receives the encrypted reduced intermediate result of the second device, it must perform data completion to obtain the encrypted padded intermediate result, ensuring that the number of data items in the encrypted padded intermediate result is the same as in the original intermediate result.

Specifically, step S20 includes:

Step S21, obtaining the sampling comparison table of the second device and, based on the sampling comparison table of the second device, determining the filling data and the filling positions corresponding to the filling data in the encrypted reduced intermediate result of the second device;

Step S22, inserting the filling data at the filling positions in the encrypted reduced intermediate result of the second device, to obtain the encrypted padded intermediate result of the second device.

In this embodiment, the sampling comparison table is generated when the second device performs sampling on the original intermediate result. The table records the substitution relationship between each data item in the reduced intermediate result and the data items in the original intermediate result. For example, if data item a in the reduced intermediate result is the substitute for data items 1, 2, and 3 in the original intermediate result, then data item a can be used to restore data items 1, 2, and 3. Because a homomorphic encryption algorithm is applied to the reduced intermediate result, the order of the data is not affected by encryption, so data completion can be performed on the encrypted reduced intermediate result according to the sampling comparison table, ensuring that the encrypted padded intermediate result of the second device is aligned with the corresponding data in the first device.

Specifically, the sampling comparison table of the second device, which the second device sends to the first device, is obtained. Based on this table, the filling data are selected from the encrypted reduced intermediate result of the second device, and it is determined which data items each filling data item substitutes for. For example, suppose the filling data item is data a, which substitutes for data 1, data 2, and data 3. Note that data 1, data 2, and data 3 are not present in the encrypted reduced intermediate result; the sampling comparison table merely records that a substitution relationship exists between data a and data 1, data 2, and data 3. During completion of the encrypted reduced intermediate result, data a must be filled into the positions where data 1, data 2, and data 3 were located. The filling positions corresponding to the filling data are then determined, and the corresponding filling data are inserted at those positions in the encrypted reduced intermediate result of the second device, yielding the encrypted padded intermediate result of the second device.
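The completion step just described amounts to writing each retained value back to every original position it stands for, so that the padded result regains the original length. A minimal sketch (plaintext stand-ins for ciphertexts; names are illustrative):

```python
def pad_intermediate(reduced, table, total_len):
    # table[k] lists the original positions that reduced[k] substitutes for.
    padded = [None] * total_len
    for value, positions in zip(reduced, table):
        for pos in positions:      # insert the filling data at each filling position
            padded[pos] = value
    return padded

# e.g. -0.1 substitutes for original positions 0 and 2:
padded = pad_intermediate([-0.1, 0.3, 0.9], [[0, 2], [1], [3]], 4)
```

Since the operation only copies values into positions, it works unchanged on homomorphically encrypted values, consistent with the order-preservation point above.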

Step S30, calculating the encrypted first gradient value corresponding to the model parameters of the first device by using the encrypted padded intermediate result of the second device, updating the model parameters of the first device based on the encrypted first gradient value, and iterating in a loop until it is detected that a preset stop condition is satisfied, thereby obtaining the trained target model parameters of the first device.

In this embodiment, after the encrypted padded intermediate result of the second device is obtained, it is combined with the encrypted padded intermediate result of the first device to compute the encrypted first gradient value corresponding to the model parameters of the first device. The encrypted first gradient value is sent to the coordinator for decryption; the coordinator sends the decrypted first gradient value back to the first device, and the first device updates its local model parameters with the first gradient value. Meanwhile, the encrypted loss function is computed from the encrypted padded intermediate results of the first device and the second device and sent to the coordinator, which decrypts the encrypted loss function and checks whether the preset stop condition is satisfied; if it is not, the next round of iterative training continues.
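The update-until-stop flow can be skeletonized as below. Encryption and the coordinator round-trip are elided (the coordinator would decrypt the gradient and loss between the two calls); the stop condition, learning rate, and toy objective are illustrative assumptions.

```python
def training_loop(params, lr, compute_gradient, compute_loss, tol=1e-9, max_iter=1000):
    prev_loss = float("inf")
    for _ in range(max_iter):
        grad = compute_gradient(params)      # in the protocol: decrypted by the coordinator
        params = [p - lr * g for p, g in zip(params, grad)]
        loss = compute_loss(params)          # in the protocol: also decrypted centrally
        if abs(prev_loss - loss) < tol:      # one possible preset stop condition
            break
        prev_loss = loss
    return params

# Toy 1-parameter problem: minimize (p - 2)^2 by gradient descent.
final = training_loop([0.0], 0.1,
                      lambda w: [2 * (w[0] - 2)],
                      lambda w: (w[0] - 2) ** 2)
```

Other stop conditions mentioned in such schemes (a fixed iteration budget, a gradient-norm threshold) slot into the same loop.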

Specifically, step S30 includes:

Step S31, calculating the encrypted reduced intermediate result of the first device to be used in computing the gradient value;

Specifically, step S31 includes:

Step a, the first device performs sampling on the calculated original intermediate results corresponding to each sample of the first device, to obtain the reduced intermediate result corresponding to part of the sample data of the first device;

Step b, encrypting the reduced intermediate result of the first device to obtain the encrypted reduced intermediate result of the first device.

In this embodiment, the original intermediate result of the first device must be encrypted before it can be used in calculations with the encrypted padded intermediate result of the second device. To reduce encryption time and speed up model training, the original intermediate result of the first device is also sampled, which reduces the amount of data to be encrypted and saves encryption cost and model training time.

According to the model parameters of the first device and the data owned by the first device, the original intermediate result corresponding to each sample of the first device is calculated as

u_i^B = w_B·x_i^B

where w_B is the model parameter of the first device and x_i^B is the data owned by the first device. Sampling is then performed on the original intermediate result of the first device to obtain the reduced intermediate result of the first device.

To avoid loss of accuracy in training the vertical logistic regression model, only u_i values with a small absolute value are replaced; u_i values with larger absolute value keep their original values. The specific sampling procedure is as follows: the original intermediate result of the first device is split according to a threshold into two subsets, where each data item in the first subset is less than or equal to the threshold (in absolute value) and each data item in the second subset is greater than the threshold; the threshold is determined according to the actual situation, and only the data in the first subset are sampled. The data in the first subset are grouped, a representative data item is determined for each group, and the representative data of all groups form a third subset. The data of the third subset and the second subset together constitute the reduced intermediate result of the first device. The reduced intermediate result of the first device is then encrypted, using a homomorphic encryption algorithm, to obtain the encrypted reduced intermediate result of the first device.

Step S32, calculating the encrypted intermediate result of the first device by using the encrypted padded intermediate result of the second device and the encrypted reduced intermediate result of the first device;

Specifically, step S32 includes:

Step c, performing data completion on the encrypted reduced intermediate result of the first device to obtain the encrypted padded intermediate result of the first device;

Step d, calculating the encrypted intermediate result of the first device by using the encrypted padded intermediate result of the first device and the encrypted padded intermediate result of the second device.

In this embodiment, data completion is performed on the encrypted reduced intermediate result of the first device to obtain the encrypted padded intermediate result of the first device. The specific process is: obtain the sampling comparison table of the first device; based on it, determine the filling data and the corresponding filling positions in the encrypted reduced intermediate result of the first device; then insert the filling data at those positions in the encrypted reduced intermediate result of the first device to obtain the encrypted padded intermediate result of the first device. The number of data items in the encrypted padded intermediate result of the first device is the same as the number of data items in the original intermediate result of the first device.

Further, the encrypted padded intermediate result of the first device is data-aligned with the encrypted padded intermediate result of the second device. Using both padded results, the encrypted intermediate result [[d]] of the first device is calculated.

Step S33, calculating the encrypted first gradient value corresponding to the model parameters of the first device by using the encrypted intermediate result of the first device.

In this embodiment, given the encrypted intermediate result [[d]] of the first device, the encrypted first gradient value [[G_B]] corresponding to the model parameters of the first device is

[[G_B]] = ∑_i [[d_i]]·x_i^B

i.e., the encrypted first gradient value is calculated from the encrypted intermediate result of the first device and the data owned by the first device.

In the vertical federated learning system optimization method proposed in this embodiment, the first device receives the encrypted reduced intermediate result sent by the second device, performs data completion on it to obtain the encrypted padded intermediate result of the second device, and finally uses the encrypted padded intermediate result of the second device to calculate the encrypted first gradient value corresponding to the model parameters of the first device, updating the model parameters of the first device based on the encrypted first gradient value and iterating in a loop until a preset stop condition is detected, at which point the trained target model parameters of the first device are obtained. In vertical federated training, reducing the number of data items contained in the intermediate results of the participating devices reduces the amount of data that needs to be encrypted and communicated, lowering encryption and communication costs and greatly shortening vertical federated modeling time.

Further, based on the first embodiment, a second embodiment of the vertical federated learning system optimization method of the present invention provides a vertical federated learning system optimization method applied to a second device, which may be a smartphone, a personal computer, or a similar device. The vertical federated learning system optimization method includes:

Step A10, performing sampling on the calculated original intermediate results corresponding to each sample of the second device, to obtain the reduced intermediate result corresponding to part of the sample data of the second device;

In this embodiment, during one round of model parameter updating in vertical federated learning, the first device and the second device exchange, in encrypted form, the intermediate results used to compute the gradients and the loss function. Homomorphic encryption is used: a third-party coordinator trusted by both the first device and the second device generates a public/private key pair and sends the public key to the two devices for encryption. The first device and the second device send the encrypted gradient values and the encrypted loss function to the coordinator for decryption, and then update their local models according to the decrypted gradient values.
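The coordinator-key scheme above relies on an additively homomorphic cipher such as Paillier: ciphertexts can be multiplied to add the underlying plaintexts, so [[u_A]] and other intermediate results can be combined without decryption. The toy, integer-only textbook Paillier below (tiny fixed primes, g = n+1) is for illustration only and is not secure or part of the patent text.

```python
import math
import random

p, q = 293, 433                      # toy primes; a real system uses large primes
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)         # Carmichael lambda(n) for p*q

def L(x):
    return (x - 1) // n

mu = pow(L(pow(n + 1, lam, n2)), -1, n)   # modular inverse via 3-arg pow

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    # c = (1+n)^m * r^n mod n^2
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

def e_add(c1, c2):
    # homomorphic addition: multiplying ciphertexts adds plaintexts
    return (c1 * c2) % n2
```

Libraries such as python-paillier provide the same operations (plus float encoding and scalar multiplication) for production use.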

The linear models involved in this application include, but are not limited to, logistic regression, linear regression, Poisson regression, and other linear, weight-learning-based model algorithms. For convenience of description, the present invention takes vertical logistic regression model training as an example, in which the first device and the second device participating in vertical federated learning jointly build a logistic regression model. The second device owns the data {x_i^A, i ∈ D_A}, where D_A represents the data set of the second device; the first device owns the data {x_i^B, i ∈ D_B} and the labels {y_i, i ∈ D_B}, where D_B represents the data set of the first device. Both x_i^A and x_i^B are multidimensional vectors, and y_i is a scalar (e.g., a scalar taking the value 0 or 1, indicating yes or no). Define u_i^A = w_A·x_i^A and u_i^B = w_B·x_i^B, where w_A and w_B are the machine learning model parameters corresponding to x_i^A and x_i^B respectively; then

u_i = u_i^A + u_i^B = w_A·x_i^A + w_B·x_i^B

The loss function loss (also known as the cost function) is, using the second-order Taylor expansion of the logistic loss so that it can be evaluated under additively homomorphic encryption:

loss = (1/n)·∑_i [ log 2 − (1/2)·y_i·u_i + (1/8)·u_i² ]

The loss is computed at the first device. From the definition of the loss function, the second device must send the intermediate results u_A and (u_A)² to the first device so that the first device can compute the loss value. During federated training, the intermediate results must be encrypted to avoid leakage of data privacy, so the second device sends the encrypted intermediate results [[u_A]] and [[(u_A)²]] to the first device, where [[·]] denotes homomorphic encryption.

Define

d_i = (1/4)·u_i − (1/2)·y_i = (1/4)·(u_i^A + u_i^B) − (1/2)·y_i

After homomorphic encryption,

[[d_i]] = (1/4)·[[u_i^A]] + (1/4)·u_i^B − (1/2)·y_i

The gradient function G is G = d·x = ∑_i d_i·x_i, so [[G]] = [[d·x]] = ∑_i [[d_i]]·x_i. The first device computes [[d]] from the encrypted intermediate result [[u_A]] received from the second device together with its own u_B, and from [[d]] further computes the encrypted gradient value [[G_B]] of its local model; at the same time, the first device sends [[d]] to the second device so that the second device can compute the encrypted gradient value [[G_A]] of its local model.

具体地,uA

Figure BDA00023830912100001215
的数量分别与
Figure BDA00023830912100001216
的样本数量相同,通常情况下样本数量是非常大的,第二设备需要对uA
Figure BDA00023830912100001217
进行加密再发送至第一设备进行交互,因此整个加密过程非常耗时并且通信量很大。第二设备对原始中间结果uA
Figure BDA00023830912100001218
进行抽样处理,得到精简中间结果u'A
Figure BDA00023830912100001219
原始中间结果到精简中间结果是数据维度上的数据量减少,即精简中间结果中的数据个数小于原始中间结果的数据个数,因此减少了需要加密及通信的数据量,降低了加密和通信成本。进一步,第二设备对精简中间结果进行加密,得到第二设备的加密精简中间结果,然后发送该加密精简中间结果至第一设备。原始中间结果uA
Figure BDA0002383091210000131
的处理过程相似,在本实施例中,以uA为例进行说明。Specifically, u A and
Figure BDA00023830912100001215
The number of and
Figure BDA00023830912100001216
The number of samples is the same, usually the number of samples is very large, the second device needs to
Figure BDA00023830912100001217
The encryption is performed and then sent to the first device for interaction, so the entire encryption process is very time-consuming and has a large amount of communication. The second device compares the original intermediate result u A and
Figure BDA00023830912100001218
Sampling is performed to obtain the simplified intermediate results u' A and
Figure BDA00023830912100001219
From the original intermediate result to the reduced intermediate result, the amount of data in the data dimension is reduced, that is, the number of data in the reduced intermediate result is smaller than the number of data in the original intermediate result, thus reducing the amount of data that needs to be encrypted and communicated, reducing encryption and communication. cost. Further, the second device encrypts the condensed intermediate result to obtain the encrypted condensed intermediate result of the second device, and then sends the encrypted condensed intermediate result to the first device. The original intermediate result u A and
Figure BDA0002383091210000131
The processing process is similar, and in this embodiment, u A is used as an example for description.

Specifically, step A10 includes:

Step A12, splitting the original intermediate result of the second device based on a threshold to obtain a first sub-original intermediate result and a second sub-original intermediate result, wherein each data item in the first sub-original intermediate result is less than or equal to the threshold, and each data item in the second sub-original intermediate result is greater than the threshold;

Step A13, grouping all the data of the first sub-original intermediate result, determining a representative data item for each group, and forming the third sub-original intermediate result from the representative data of all groups;

Step A14, obtaining the reduced intermediate result of the second device based on the third sub-original intermediate result and the second sub-original intermediate result.

In this embodiment, to avoid loss of accuracy in training the vertical logistic regression model, only u_i values with a small absolute value are replaced, and u_i values with larger absolute value keep their original values. The specific sampling procedure splits the original intermediate result of the second device according to a threshold into two subsets, the first sub-original intermediate result and the second sub-original intermediate result, where each data item in the first sub-original intermediate result is less than or equal to the threshold (in absolute value) and each data item in the second sub-original intermediate result is greater than the threshold; the threshold is determined according to the actual situation, and only the data in the first sub-original intermediate result are sampled. The data in the first sub-original intermediate result are grouped, and a representative data item is determined for each group; the representative data of all groups form the third sub-original intermediate result, and the data of the third sub-original intermediate result and the second sub-original intermediate result together constitute the reduced intermediate result of the second device.
The specific method for grouping the data in the first sub-original intermediate result and determining the representative data can be chosen according to the actual situation. For example, the data in the first sub-original intermediate result can be arranged in descending order and then divided evenly into N groups, with the mean of each group used as that group's representative data; alternatively, N initial cluster centers can be set manually and k-means used to obtain the final cluster centers, which then serve as the representative data of each group.
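The k-means alternative mentioned above can be sketched with a tiny 1-D implementation: initial centers are set manually, values are assigned to the nearest center, and the converged centers become each group's representative data. Values and initial centers below are illustrative assumptions.

```python
def kmeans_1d(values, centers, iters=20):
    centers = list(centers)
    for _ in range(iters):
        groups = [[] for _ in centers]
        for v in values:
            # assign each value to its nearest current center
            j = min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            groups[j].append(v)
        # recompute each center as its group's mean (keep empty centers as-is)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

# two manually chosen initial centers; the final centers are the representatives
centers = kmeans_1d([0.1, 0.12, 0.5, 0.52], [0.0, 0.6])
```

The sorted-split-into-N-groups-with-means variant described first is cheaper and deterministic; k-means adapts group boundaries to the data distribution.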

Step A20, encrypting the reduced intermediate result of the second device to obtain the encrypted reduced intermediate result corresponding to part of the sample data of the second device and sending it to the first device, so that the first device returns the encrypted intermediate result of the first device based on the encrypted reduced intermediate result of the second device, wherein the first device performs data completion on the received encrypted reduced intermediate result of the second device to obtain the encrypted padded intermediate result of the second device, and uses the encrypted padded intermediate result of the second device to calculate the encrypted intermediate result of the first device;

In this embodiment, the reduced intermediate result of the second device is further encrypted, using a homomorphic encryption algorithm, to obtain the encrypted reduced intermediate result of the second device, which is sent to the first device. On receiving the encrypted reduced intermediate result of the second device, the first device performs data completion on it to obtain the encrypted padded intermediate result of the second device, and then uses the encrypted padded intermediate results of the first device and the second device to calculate the encrypted intermediate result [[d]] of the first device, which the first device sends to the second device. The encrypted padded intermediate result of the first device is data-aligned with that of the second device.

Step A30, calculating the encrypted second gradient value corresponding to the model parameters of the second device by using the encrypted intermediate result of the first device, updating the model parameters of the second device based on the encrypted second gradient value, and iterating in a loop until it is detected that a preset stop condition is satisfied, thereby obtaining the trained target model parameters of the second device.

In this embodiment, given the encrypted intermediate result [[d]] of the first device, the encrypted second gradient value [[G_A]] corresponding to the model parameters of the second device is

[[G_A]] = ∑_i [[d_i]]·x_i^A

i.e., the encrypted second gradient value is calculated from the encrypted intermediate result of the first device and the data owned by the second device. The encrypted second gradient value is sent to the coordinator for decryption; the coordinator sends the decrypted second gradient value back to the second device, and the second device updates its local model parameters with the second gradient value. Meanwhile, the coordinator checks whether the preset stop condition is satisfied; if it is not, the next round of iterative training continues.

In the vertical federated learning system optimization method proposed in this embodiment, the second device performs sampling on the calculated original intermediate results corresponding to each of its samples to obtain the reduced intermediate result corresponding to part of its sample data, encrypts the reduced intermediate result, and sends the resulting encrypted reduced intermediate result to the first device, so that the first device returns the encrypted intermediate result of the first device based on it. The encrypted second gradient value corresponding to the model parameters of the second device is then calculated from the encrypted intermediate result of the first device, the model parameters of the second device are updated based on the encrypted second gradient value, and the loop iterates until a preset stop condition is detected, yielding the trained target model parameters of the second device. In vertical federated training, reducing the number of data items contained in the intermediate results of the participating devices reduces the amount of data that needs to be encrypted and communicated, lowering encryption and communication costs and greatly shortening vertical federated modeling time.

In addition, an embodiment of the present invention also provides a readable storage medium on which a vertical federated learning system optimization program is stored; when the vertical federated learning system optimization program is executed by a processor, the steps of the vertical federated learning system optimization method described above are implemented.

For the embodiments of the vertical federated learning system optimization device and the readable storage medium of the present invention, reference may be made to the embodiments of the vertical federated learning system optimization method of the present invention; details are not repeated here.

It should be noted that, herein, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes the element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. Any equivalent structural or process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (10)

1. A longitudinal federated learning system optimization method, applied to a first device participating in longitudinal federated learning, the first device being in communication connection with a second device, the method comprising the following steps:
receiving an encrypted and simplified intermediate result of the second device sent by the second device, wherein the second device is configured to sample the calculated original intermediate results corresponding to each piece of sample data of the second device to obtain a simplified intermediate result corresponding to part of the sample data of the second device, and to encrypt the simplified intermediate result of the second device to obtain the encrypted and simplified intermediate result of the second device;
performing data supplementation on the encrypted and simplified intermediate result of the second device to obtain an encrypted and supplemented intermediate result of the second device, wherein the data quantity of the encrypted and supplemented intermediate result is the same as that of the original intermediate result; and
calculating an encrypted first gradient value corresponding to the model parameters of the first device by using the encrypted and supplemented intermediate result of the second device, updating the model parameters of the first device based on the encrypted first gradient value, and iterating until a preset stop condition is met, to obtain target model parameters of the trained first device.
2. The longitudinal federated learning system optimization method according to claim 1, wherein the step of performing data supplementation on the encrypted and simplified intermediate result of the second device to obtain the encrypted and supplemented intermediate result comprises:
acquiring a sampling comparison table of the second device, and determining, based on the sampling comparison table of the second device, filling data and the filling positions corresponding to the filling data in the encrypted and simplified intermediate result of the second device; and
inserting the filling data into the encrypted and simplified intermediate result of the second device based on the filling positions to obtain the encrypted and supplemented intermediate result of the second device.
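Claim 2's supplementation step might look like the following sketch, where `fill_table` plays the role of the sampling comparison table; the names and the exact table layout are assumptions.

```python
def pad_encrypted_result(reduced, fill_table):
    """Data-supplementation sketch. `reduced` is the list of retained
    ciphertexts; `fill_table` maps each insertion position in the
    full-length result to the index (within `reduced`) of the
    representative ciphertext that stands in for a dropped sample."""
    full = list(reduced)
    for pos in sorted(fill_table):        # ascending order, so earlier
        full.insert(pos, reduced[fill_table[pos]])  # inserts keep later positions valid
    return full
```

After padding, the result has the same data quantity as the original intermediate result, so the gradient computation downstream needs no changes.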
3. The longitudinal federated learning system optimization method according to any one of claims 1 to 2, wherein the step of calculating the encrypted first gradient value corresponding to the model parameters of the first device by using the encrypted and supplemented intermediate result of the second device comprises:
calculating an encrypted and simplified intermediate result of the first device for use in calculating the gradient value;
calculating an encrypted intermediate result of the first device by using the encrypted and supplemented intermediate result of the second device and the encrypted and simplified intermediate result of the first device; and
calculating the encrypted first gradient value corresponding to the model parameters of the first device by using the encrypted intermediate result of the first device.
4. The longitudinal federated learning system optimization method according to claim 3, wherein the step of calculating the encrypted and simplified intermediate result of the first device used to calculate the gradient value comprises:
sampling, by the first device, the calculated original intermediate results corresponding to each piece of sample data of the first device to obtain a simplified intermediate result corresponding to part of the sample data of the first device; and
encrypting the simplified intermediate result of the first device to obtain the encrypted and simplified intermediate result of the first device.
5. The longitudinal federated learning system optimization method according to claim 3, wherein the step of calculating the encrypted intermediate result of the first device by using the encrypted and supplemented intermediate result of the second device and the encrypted and simplified intermediate result of the first device comprises:
performing data supplementation on the encrypted and simplified intermediate result of the first device to obtain an encrypted and supplemented intermediate result of the first device; and
calculating the encrypted intermediate result of the first device by using the encrypted and supplemented intermediate result of the first device and the encrypted and supplemented intermediate result of the second device.
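Claim 5's combination step could be sketched as below. Elementwise addition is an assumption (the actual combination depends on the model being trained), and plain numbers stand in for ciphertexts, which is a fair stand-in only because the scheme is additively homomorphic.

```python
def combine_padded_results(enc_pad_first, enc_pad_second):
    """Combine the two padded intermediate results sample-by-sample.
    Both inputs have been supplemented back to full length, so they
    align one entry per sample and can be combined under the additive
    homomorphism without decryption."""
    assert len(enc_pad_first) == len(enc_pad_second)
    return [a + b for a, b in zip(enc_pad_first, enc_pad_second)]
```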
6. A longitudinal federated learning system optimization method, applied to a second device participating in longitudinal federated learning, the method comprising the following steps:
sampling the calculated original intermediate results corresponding to each piece of sample data of the second device to obtain a simplified intermediate result corresponding to part of the sample data of the second device;
encrypting the simplified intermediate result of the second device to obtain an encrypted and simplified intermediate result corresponding to part of the sample data of the second device, and sending it to a first device, so that the first device feeds back an encrypted intermediate result of the first device based on the encrypted and simplified intermediate result of the second device, wherein the first device is configured to perform data supplementation on the received encrypted and simplified intermediate result of the second device to obtain an encrypted and supplemented intermediate result of the second device, and to calculate the encrypted intermediate result of the first device by using the encrypted and supplemented intermediate result of the second device; and
calculating an encrypted second gradient value corresponding to the model parameters of the second device by using the encrypted intermediate result of the first device, updating the model parameters of the second device based on the encrypted second gradient value, and iterating until a preset stop condition is met, to obtain target model parameters of the trained second device.
7. The longitudinal federated learning system optimization method according to claim 6, wherein the step of sampling the calculated original intermediate results corresponding to each piece of sample data of the second device to obtain the simplified intermediate result corresponding to part of the sample data of the second device comprises:
splitting the original intermediate result of the second device based on a threshold value to obtain a first sub-original intermediate result and a second sub-original intermediate result, wherein each piece of data in the first sub-original intermediate result is smaller than or equal to the threshold value, and each piece of data in the second sub-original intermediate result is larger than the threshold value;
grouping the data of the first sub-original intermediate result, determining representative data for each group, and forming a third sub-original intermediate result from the representative data of each group; and
obtaining the simplified intermediate result of the second device based on the third sub-original intermediate result and the second sub-original intermediate result.
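The split-group-represent sampling of claim 7 can be sketched as follows. The fixed group size and the use of the group mean as each group's representative data are assumptions, since the claim does not fix either choice.

```python
def reduce_intermediate_result(values, threshold, group_size):
    """Claim-7-style sampling sketch: values at or below the threshold
    are grouped and each group is replaced by one representative; values
    above the threshold are kept unchanged."""
    small = [v for v in values if v <= threshold]   # first sub-result
    large = [v for v in values if v > threshold]    # second sub-result
    representatives = [
        sum(small[i:i + group_size]) / len(small[i:i + group_size])
        for i in range(0, len(small), group_size)
    ]                                               # third sub-result
    return representatives + large                  # simplified result
```

Keeping the above-threshold values intact preserves the samples that contribute most to the gradient, while the many small values are compressed into a handful of representatives before encryption.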
8. A longitudinal federated learning system optimization apparatus, comprising: a memory, a processor, and a longitudinal federated learning system optimization program stored on the memory and executable on the processor, the longitudinal federated learning system optimization program when executed by the processor implementing the steps of the longitudinal federated learning system optimization method of any one of claims 1 to 5.
9. A longitudinal federated learning system optimization apparatus, comprising: a memory, a processor, and a longitudinal federated learning system optimization program stored on the memory and executable on the processor, the longitudinal federated learning system optimization program when executed by the processor implementing the steps of the longitudinal federated learning system optimization method of any one of claims 6 to 7.
10. A readable storage medium having stored thereon a longitudinal federated learning system optimization program that, when executed by a processor, performs the steps of the longitudinal federated learning system optimization method of any of claims 1-7.
CN202010089045.6A 2020-02-12 2020-02-12 Longitudinal federal learning system optimization method, device and readable storage medium Active CN111340247B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010089045.6A CN111340247B (en) 2020-02-12 2020-02-12 Longitudinal federal learning system optimization method, device and readable storage medium
PCT/CN2020/129255 WO2021159798A1 (en) 2020-02-12 2020-11-17 Method for optimizing longitudinal federated learning system, device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010089045.6A CN111340247B (en) 2020-02-12 2020-02-12 Longitudinal federal learning system optimization method, device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111340247A true CN111340247A (en) 2020-06-26
CN111340247B CN111340247B (en) 2024-10-15

Family

ID=71183882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010089045.6A Active CN111340247B (en) 2020-02-12 2020-02-12 Longitudinal federal learning system optimization method, device and readable storage medium

Country Status (2)

Country Link
CN (1) CN111340247B (en)
WO (1) WO2021159798A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700010A (en) * 2020-12-30 2021-04-23 深圳前海微众银行股份有限公司 Feature completion method, device, equipment and storage medium based on federal learning
CN112949866A (en) * 2021-03-25 2021-06-11 深圳前海微众银行股份有限公司 Poisson regression model training method and device, electronic equipment and storage medium
CN113011603A (en) * 2021-03-17 2021-06-22 深圳前海微众银行股份有限公司 Model parameter updating method, device, equipment, storage medium and program product
CN113240127A (en) * 2021-04-07 2021-08-10 睿蜂群(北京)科技有限公司 Federal learning-based training method and device, electronic equipment and storage medium
WO2021159798A1 (en) * 2020-02-12 2021-08-19 深圳前海微众银行股份有限公司 Method for optimizing longitudinal federated learning system, device and readable storage medium
CN113704776A (en) * 2021-07-15 2021-11-26 杭州医康慧联科技股份有限公司 Machine learning method based on federal learning
CN113762531A (en) * 2021-03-19 2021-12-07 北京沃东天骏信息技术有限公司 A federated learning system
CN113806759A (en) * 2020-12-28 2021-12-17 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN114077901A (en) * 2021-11-23 2022-02-22 山东大学 A User Location Prediction Framework Based on Clustering Graph Federated Learning
CN114140144A (en) * 2020-09-04 2022-03-04 京东科技控股股份有限公司 Information delivery method and device, and information delivery decision model acquisition method and device
CN114330759A (en) * 2022-03-08 2022-04-12 富算科技(上海)有限公司 Training method and system for longitudinal federated learning model
CN114692859A (en) * 2020-12-29 2022-07-01 阿里巴巴集团控股有限公司 Data processing method and device, computing equipment and test reduction equipment
WO2022142366A1 (en) * 2020-12-31 2022-07-07 华为技术有限公司 Method and apparatus for updating machine learning model
CN115358826A (en) * 2022-08-29 2022-11-18 中国银行股份有限公司 Product recommendation method and device, storage medium and electronic equipment

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118644423B (en) * 2024-04-30 2025-03-21 中国科学院自动化研究所 Data recovery method and device based on federated learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101375284A (en) * 2004-10-25 2009-02-25 Rick L. Orsini Secure data analysis method and system
US20120198241A1 (en) * 2011-01-27 2012-08-02 Security First Corp. Systems and methods for securing data
US20160344428A1 (en) * 2014-01-30 2016-11-24 Hewlett Packard Enterprise Development Lp Joint encryption and error correction encoding
CN110288094A (en) * 2019-06-10 2019-09-27 深圳前海微众银行股份有限公司 Model parameter training method and device based on federated learning
CN110633806A (en) * 2019-10-21 2019-12-31 深圳前海微众银行股份有限公司 Vertical federated learning system optimization method, device, equipment and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492420B (en) * 2018-12-28 2021-07-20 深圳前海微众银行股份有限公司 Model parameter training method, terminal, system and medium based on federated learning
CN110245510B (en) * 2019-06-19 2021-12-07 北京百度网讯科技有限公司 Method and apparatus for predicting information
CN110428058B (en) * 2019-08-08 2024-04-26 深圳前海微众银行股份有限公司 Federal learning model training method, device, terminal equipment and storage medium
KR102781104B1 (en) * 2019-08-15 2025-03-18 엘지전자 주식회사 Method and apparatus for recognizing a business card using federated learning
CN111340247B (en) * 2020-02-12 2024-10-15 深圳前海微众银行股份有限公司 Longitudinal federal learning system optimization method, device and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101375284A (en) * 2004-10-25 2009-02-25 Rick L. Orsini Secure data analysis method and system
US20120198241A1 (en) * 2011-01-27 2012-08-02 Security First Corp. Systems and methods for securing data
US20160344428A1 (en) * 2014-01-30 2016-11-24 Hewlett Packard Enterprise Development Lp Joint encryption and error correction encoding
CN110288094A (en) * 2019-06-10 2019-09-27 深圳前海微众银行股份有限公司 Model parameter training method and device based on federated learning
CN110633806A (en) * 2019-10-21 2019-12-31 深圳前海微众银行股份有限公司 Vertical federated learning system optimization method, device, equipment and readable storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FARZIN HADDADPOUR ET AL., "On the Convergence of Local Descent Methods in Federated Learning", arXiv, 31 October 2019 (2019-10-31), pages 1 - 49 *
WEI YATING ET AL., "Federated Visualization: A New Privacy-Preserving Visualization Model", Chinese Journal of Intelligent Science and Technology, vol. 1, no. 4, 31 December 2019 (2019-12-31), pages 415 - 420 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021159798A1 (en) * 2020-02-12 2021-08-19 深圳前海微众银行股份有限公司 Method for optimizing longitudinal federated learning system, device and readable storage medium
CN114140144B (en) * 2020-09-04 2025-03-21 京东科技控股股份有限公司 Information delivery method and device, information delivery decision model acquisition method and device
CN114140144A (en) * 2020-09-04 2022-03-04 京东科技控股股份有限公司 Information delivery method and device, and information delivery decision model acquisition method and device
CN113806759A (en) * 2020-12-28 2021-12-17 京东科技控股股份有限公司 Federal learning model training method and device, electronic equipment and storage medium
CN114692859A (en) * 2020-12-29 2022-07-01 阿里巴巴集团控股有限公司 Data processing method and device, computing equipment and test reduction equipment
CN112700010A (en) * 2020-12-30 2021-04-23 深圳前海微众银行股份有限公司 Feature completion method, device, equipment and storage medium based on federal learning
WO2022142366A1 (en) * 2020-12-31 2022-07-07 华为技术有限公司 Method and apparatus for updating machine learning model
CN113011603A (en) * 2021-03-17 2021-06-22 深圳前海微众银行股份有限公司 Model parameter updating method, device, equipment, storage medium and program product
CN113762531A (en) * 2021-03-19 2021-12-07 北京沃东天骏信息技术有限公司 A federated learning system
CN112949866A (en) * 2021-03-25 2021-06-11 深圳前海微众银行股份有限公司 Poisson regression model training method and device, electronic equipment and storage medium
CN113240127A (en) * 2021-04-07 2021-08-10 睿蜂群(北京)科技有限公司 Federal learning-based training method and device, electronic equipment and storage medium
CN113704776A (en) * 2021-07-15 2021-11-26 杭州医康慧联科技股份有限公司 Machine learning method based on federal learning
CN114077901A (en) * 2021-11-23 2022-02-22 山东大学 A User Location Prediction Framework Based on Clustering Graph Federated Learning
CN114077901B (en) * 2021-11-23 2024-05-24 山东大学 User position prediction method based on clustering graph federation learning
CN114330759A (en) * 2022-03-08 2022-04-12 富算科技(上海)有限公司 Training method and system for longitudinal federated learning model
CN114330759B (en) * 2022-03-08 2022-08-02 富算科技(上海)有限公司 Training method and system for longitudinal federated learning model
CN115358826A (en) * 2022-08-29 2022-11-18 中国银行股份有限公司 Product recommendation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN111340247B (en) 2024-10-15
WO2021159798A1 (en) 2021-08-19

Similar Documents

Publication Publication Date Title
CN111340247A (en) Longitudinal federated learning system optimization method, device and readable storage medium
CN109165725B (en) Neural network federal modeling method, equipment and storage medium based on transfer learning
CN110704860B (en) Longitudinal federal learning method, equipment, system and storage medium for improving safety
US20230109352A1 (en) Node group-based data processing method and system, device, and medium
US12254121B2 (en) Data processing method and apparatus, device, and computer-readable storage medium
CN110851786B (en) Inter-enterprise data interaction method, device, equipment and storage medium based on longitudinal federal learning
CN110633806A (en) Vertical federated learning system optimization method, device, equipment and readable storage medium
CN107147720B (en) Traceable effective public auditing method and traceable effective public auditing system in cloud storage data sharing
CN107145792B (en) Multi-user privacy protection data clustering method and system based on ciphertext data
CN111598254A (en) Federated learning modeling method, device and readable storage medium
CN113542228B (en) Federated learning-based data transmission method, device and readable storage medium
CN112347500B (en) Machine learning method, device, system, equipment and storage medium of distributed system
US20160020904A1 (en) Method and system for privacy-preserving recommendation based on matrix factorization and ridge regression
WO2022142366A1 (en) Method and apparatus for updating machine learning model
CN112000987B (en) Method, device and readable storage medium for building factor decomposition machine classification model
CN114696990B (en) Multi-party computing method, system and related equipment based on fully homomorphic encryption
CN111291273A (en) Recommendation system optimization method, device, equipment and readable storage medium
CN116502732B (en) Federal learning method and system based on trusted execution environment
CN111325352A (en) Model update method, device, equipment and medium based on vertical federated learning
CN112000988B (en) Factor decomposition machine regression model construction method, device and readable storage medium
CN112016698A (en) Factorization machine model construction method and device and readable storage medium
CN111324812A (en) Federal recommendation method, device, equipment and medium based on transfer learning
CN112199697A (en) Information processing method, device, equipment and medium based on shared root key
CN111343265A (en) Information pushing method, device, equipment and readable storage medium
CN111368196A (en) Method, apparatus, device and readable storage medium for updating model parameters

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant