CN112101946B - Method and device for jointly training a business model


Info

Publication number
CN112101946B
CN112101946B (application CN202011310524.2A)
Authority
CN
China
Prior art keywords
model
data
local
party
business
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011310524.2A
Other languages
Chinese (zh)
Other versions
CN112101946A (en)
Inventor
熊涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011310524.2A
Publication of CN112101946A
Application granted
Publication of CN112101946B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/382 Payment protocols; Details thereof insuring higher security of transaction
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks


Abstract

The embodiments of this specification provide a method for jointly training a business model. Under an architecture that combines a local coding model at each data party with a global model at the service party, each data party transmits only a small-dimension representation vector, and the service party sends back only the gradient data corresponding to that vector, which greatly reduces the number of parameters transmitted during large-scale model training. In addition, when a data party determines the representation vector of its local data, it can use a custom coding network to fully accommodate data heterogeneity, and a noise layer added to the coding model keeps the model's predictions robust at a predetermined privacy cost, so that data privacy is effectively protected. In short, the method improves the effectiveness of jointly training a large-scale business model.

Description

Method and device for jointly training business model
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method and apparatus for jointly training a business model.
Background
With the development of computer technology, data can be acquired through ever more channels, which makes data privacy protection increasingly important in business processing. It is especially important in multi-party joint computation, where the business data of each party must be protected. In federated learning, multiple data parties jointly train a business model, with a service party or a trusted third party performing auxiliary computation when necessary. When the data volume is huge, the model is large, and the data structures of the parties differ (heterogeneity), how to balance accuracy, privacy protection, and tractability in federated learning is a problem worth studying.
Disclosure of Invention
One or more embodiments of the present specification describe a method and apparatus for jointly training a business model to solve one or more of the problems identified in the background.
According to a first aspect, a method for jointly training a business model is provided, wherein the business model is used for processing related business data to determine a corresponding business processing result. The method is jointly executed by a plurality of data parties and a service party. The business model comprises at least one local coding model, each corresponding to a single data party, and a global model arranged at the service party; a local coding model encodes the local training samples held as private data by the corresponding data party to obtain corresponding representation vectors, and the global model processes the representation vectors determined by the data parties through their local coding models to obtain the business processing result. The method comprises the following steps: each data party encodes its local training samples through its local coding model, obtaining a representation vector of a predetermined dimension for each local training sample, wherein at least one layer in a single coding network is a noise layer that superimposes, on the current representation vector of the current local training sample, a noise vector generated according to a predetermined distribution; the service party processes the representation vectors with the global model to obtain the business processing result for each training sample, back-propagates the gradients of the model parameters of the global model based on the comparison between each business processing result and the corresponding sample label, and adjusts the model parameters it holds according to the obtained gradient data; and each data party determines the gradients of the model parameters in its local coding model from the gradient data derived by the service party, so as to adjust its local model parameters according to the obtained gradient data.
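The three steps of this aspect can be sketched in miniature. The following Python sketch is an illustration, not part of the specification: the linear encoder, the squared loss, and all dimensions are assumptions chosen for concreteness. It shows one joint training step in which the data party sends only a low-dimensional representation vector and the service party returns only the gradient for that vector:

```python
import numpy as np

rng = np.random.default_rng(0)

class LocalEncoder:
    """A data party's local coding model: a linear layer plus a noise layer."""
    def __init__(self, in_dim, token_dim, sigma=0.1):
        self.W = rng.normal(0.0, 0.1, size=(in_dim, token_dim))
        self.sigma = sigma  # scale of the predetermined (Gaussian) distribution

    def encode(self, x):
        self.x = x
        h = x @ self.W
        # Noise layer: superimpose a noise vector drawn from the
        # predetermined distribution on the current representation vector.
        return h + rng.normal(0.0, self.sigma, size=h.shape)

    def apply_gradient(self, grad_z, lr=0.1):
        # Derive local parameter gradients from the gradient data the
        # service party sends back for the representation vector.
        self.W -= lr * np.outer(self.x, grad_z)

class GlobalModel:
    """The service party's global model: a linear head with squared loss."""
    def __init__(self, token_dim):
        self.v = rng.normal(0.0, 0.1, size=token_dim)

    def gradients(self, z, y):
        err = z @ self.v - y                  # compare prediction with label
        return 2 * err * z, 2 * err * self.v  # d loss / d v, d loss / d z

encoder, global_model = LocalEncoder(in_dim=8, token_dim=3), GlobalModel(3)
x, y = rng.normal(size=8), 1.0
z = encoder.encode(x)                  # only the 3-dim vector leaves the party
g_v, g_z = global_model.gradients(z, y)
global_model.v -= 0.1 * g_v            # service party adjusts its parameters
encoder.apply_gradient(g_z)            # data party adjusts its parameters
```

Note that only `z` travels to the service party and only `g_z` travels back; the raw sample `x` and the encoder weights `W` never leave the data party.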
According to a second aspect, a method for jointly training a business model is provided, wherein the business model is used for processing related business data to determine a corresponding business processing result. The method is jointly performed by a plurality of data parties, which include a first party, and a service party. The business model comprises at least one local coding model, each corresponding to a single data party, and a global model provided at the service party; a single local coding model encodes the local training samples held as private data by the corresponding data party to obtain corresponding representation vectors, and the global model processes the representation vectors determined by the data parties through their local coding models to obtain the business processing result. The local coding model corresponding to the first party is a first coding model. In this method, the first party performs operations comprising: processing a local first sample with the first coding network to obtain a first representation vector, and sending the first representation vector to the service party, so that the service party processes it with the global model to obtain a first business processing result corresponding to the first sample and determines a first gradient of the global model based on the comparison between the first business processing result and a first label corresponding to the first sample, wherein at least one layer in the first coding network is a noise layer for superimposing, on the current representation vector of the first sample, a noise vector generated according to a predetermined distribution; and determining, based on the first gradient, the gradients of the model parameters in the first coding network, so as to adjust the local model parameters according to the obtained gradient data.
According to an embodiment, processing a local first sample with the first coding network to obtain a first representation vector comprises: performing s rounds of coding on the first sample with the first coding network to obtain s corresponding representation vectors, wherein each round of coding uses its own noise vector drawn from the predetermined distribution; and determining the first representation vector based on the average of the s representation vectors.
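A minimal sketch of this averaging embodiment, under the assumption of a linear coding network and Gaussian noise (both illustrative choices, not mandated by the text):

```python
import numpy as np

rng = np.random.default_rng(7)

def encode_once(x, W, sigma=0.5):
    # One pass of the coding network; the noise layer adds a fresh noise
    # vector drawn from the predetermined (here Gaussian) distribution.
    return x @ W + rng.normal(0.0, sigma, size=W.shape[1])

def encode_s_times(x, W, s, sigma=0.5):
    # s coding passes, each with its own noise vector; the first
    # representation vector is the average of the s results.
    return np.mean([encode_once(x, W, sigma) for _ in range(s)], axis=0)

x = np.ones(4)
W = np.eye(4, 2)               # hypothetical tiny encoder weights
z_avg = encode_s_times(x, W, s=64)
# Averaging shrinks the noise variance by 1/s, so z_avg lies close to the
# noise-free encoding x @ W while every individual pass stays perturbed.
```

This is why the embodiment averages: the representation vector is stabilized without ever exposing an unperturbed encoding.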
According to one embodiment, the predetermined distribution is a Gaussian distribution or a Laplace distribution.
According to one embodiment, the noise layer is an independent neural network layer, and the current expression vector is the output vector of the previous layer of the first coding network.
According to one embodiment, in the noise layer, each element of the output vector is obtained by processing the elements of all dimensions of the previous layer's output vector and superimposing the corresponding element of the noise vector on that processing result.
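A hedged illustration of such a noise layer as an independent layer: the processing of the previous layer's output is taken here to be a linear combination (an assumption for concreteness; the embodiment does not fix the processing), and the predetermined distribution may be Gaussian or Laplace as in the embodiments above:

```python
import numpy as np

rng = np.random.default_rng(3)

def noise_layer(prev_output, weights, scale=0.1, dist="gaussian"):
    # Each output element processes the elements of every dimension of the
    # previous layer's output (here: a linear combination) ...
    processed = weights @ prev_output
    # ... and superimposes the corresponding element of a noise vector
    # drawn from the predetermined distribution.
    if dist == "gaussian":
        noise = rng.normal(0.0, scale, size=processed.shape)
    elif dist == "laplace":
        noise = rng.laplace(0.0, scale, size=processed.shape)
    else:
        raise ValueError("distribution must be 'gaussian' or 'laplace'")
    return processed + noise

h = np.array([0.2, -1.0, 3.5])   # previous layer's output vector
M = rng.normal(size=(2, 3))      # hypothetical processing weights
z_g = noise_layer(h, M)          # Gaussian-perturbed 2-dim output
z_l = noise_layer(h, M, dist="laplace")
```

Because the noise is added inside the coding network, every representation vector that leaves the data party is already perturbed.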
According to a third aspect, a method for jointly training a business model is provided, wherein the business model is used for processing related business data to determine a corresponding business processing result. The method is jointly executed by a plurality of data parties and a service party. The business model comprises at least one local coding model, each corresponding to a single data party, and a global model arranged at the service party; a single local coding model encodes the local training samples held as private data by the corresponding data party to obtain corresponding representation vectors, and the global model processes the representation vectors determined by the data parties through their local coding models to obtain the business processing result. In this method, the service party performs operations comprising: obtaining the representation vectors that the data parties produce by processing their local training samples with their local coding models, and processing these vectors to obtain the business processing results, wherein at least one layer in a single local coding model is a noise layer for superimposing, on the current representation vector of the corresponding training sample, a noise vector generated according to a predetermined distribution; back-propagating the gradients of the model parameters of the global model based on the comparison between each business processing result and the sample label corresponding to each training sample, so as to adjust the model parameters held by the service party according to the obtained gradient data; and sending the gradient data corresponding to each training sample to the corresponding data party, so that the data party can derive the gradients of the model parameters in its local coding model from the corresponding gradient data and adjust its local model parameters accordingly.
According to a fourth aspect, a business processing method is provided, wherein related business data is processed by a business model jointly trained in advance by a plurality of data parties and a service party to determine a corresponding business processing result, the business model comprising at least one local coding model corresponding to a single data party and a global model corresponding to the service party. The method comprises: obtaining a representation vector produced by processing the to-be-processed business data with at least one local coding model, wherein at least one layer in a single local coding model is a noise layer for superimposing, on the current representation vector of the to-be-processed business data, a noise vector generated according to a predetermined distribution; inputting the representation vector of the to-be-processed business data into the pre-trained global model to obtain a corresponding output result, wherein the global model is trained together with the at least one local coding model by the method of any one of the first, second, and third aspects; and determining the business processing result of the to-be-processed business data according to the output result of the global model.
In an embodiment, the representation vector of the business data is determined either by the local coding model of a single corresponding data party processing the to-be-processed business data, or as the average of the representation vectors obtained by the local coding networks of all the data parties processing the to-be-processed business data.
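The two inference options of this embodiment can be sketched as follows; the thresholded linear global model, the toy encoders `enc_a`/`enc_b`, and all values are hypothetical stand-ins for the trained components:

```python
import numpy as np

def global_predict(z, v):
    # Hypothetical trained global model: linear score with a 0 threshold.
    return 1 if z @ v > 0 else 0

def infer_single_party(x, encode, v):
    # Option 1: representation vector from one data party's local coding model.
    return global_predict(encode(x), v)

def infer_all_parties(x, encoders, v):
    # Option 2: average the representation vectors produced by every data
    # party's local coding network, then feed the mean into the global model.
    z_bar = np.mean([encode(x) for encode in encoders], axis=0)
    return global_predict(z_bar, v)

v = np.array([1.0, -0.5])
enc_a = lambda x: x[:2]          # toy stand-ins for trained local encoders
enc_b = lambda x: x[:2] * 0.5
x = np.array([2.0, 1.0, 0.0])    # to-be-processed business data
r1 = infer_single_party(x, enc_a, v)
r2 = infer_all_parties(x, [enc_a, enc_b], v)
```

Both paths end at the same global model; only the source of the representation vector differs.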
According to a fifth aspect, a system for jointly training a business model is provided, wherein the business model is configured to process related business data to determine a corresponding business processing result. The system includes a service party and a plurality of data parties. The business model includes at least one local coding model, each corresponding to a single data party, and a global model provided at the service party; a local coding model encodes the local training samples held as private data by the corresponding data party to obtain corresponding representation vectors, and the global model processes the representation vectors determined by the data parties through their local coding models to obtain the business processing result. The system is configured so that:
each data party encodes its local training samples through its local coding model, obtaining a representation vector of a predetermined dimension for each local training sample, wherein at least one layer in a single coding network is a noise layer that superimposes, on the current representation vector of the current local training sample, a noise vector generated according to a predetermined distribution;
the service party processes the representation vectors with the global model to obtain the business processing result for each training sample, back-propagates the gradients of the model parameters of the global model based on the comparison between each business processing result and the corresponding sample label, and adjusts the model parameters it holds according to the obtained gradient data;
and each data party determines the gradients of the model parameters in its local coding model from the gradient data derived by the service party, so as to adjust its local model parameters according to the obtained gradient data.
According to a sixth aspect, an apparatus for jointly training a business model is provided, wherein the business model is configured to process related business data to determine a corresponding business processing result, and the joint training is performed by a plurality of data parties, which include a first party, and a service party. The business model includes at least one local coding model, each corresponding to a single data party, and a global model provided at the service party; a single local coding model encodes the local training samples held as private data by the corresponding data party to obtain corresponding representation vectors, and the global model processes the representation vectors determined by the data parties through their local coding models to obtain the business processing result. The local coding model corresponding to the first party is a first coding model, and the apparatus, provided at the first party, comprises:
an encoding unit, configured to process a local first sample with the first coding network to obtain a first representation vector, and to send the first representation vector to the service party, so that the service party processes it with the global model to obtain a first business processing result corresponding to the first sample and determines a first gradient of the global model based on the comparison between the first business processing result and a first label corresponding to the first sample, wherein at least one layer in the first coding network is a noise layer for superimposing, on the current representation vector of the first sample, a noise vector generated according to a predetermined distribution; and
a gradient determining unit configured to determine gradients of the respective model parameters in the first coding network based on the first gradient, so as to adjust local model parameters according to the obtained gradient data.
In one embodiment, the encoding unit is further configured to:
performing s rounds of coding on the first sample with the first coding network to obtain s corresponding representation vectors, wherein each round of coding uses its own noise vector drawn from the predetermined distribution;
and determining the first representation vector based on the average of the s representation vectors.
According to a seventh aspect, an apparatus for jointly training a business model is provided, wherein the business model is configured to process related business data to determine a corresponding business processing result, and the joint training is performed by a plurality of data parties and a service party. The business model includes at least one local coding model, each corresponding to a single data party, and a global model provided at the service party; a single local coding model encodes the local training samples held as private data by the corresponding data party to obtain corresponding representation vectors, and the global model processes the representation vectors determined by the data parties through their local coding models to obtain the business processing result. The apparatus, provided at the service party, comprises:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire each characterization vector obtained by processing a local training sample by each data party according to a local coding model to obtain each service processing result, at least one layer in a single local coding model is a noise layer, and the noise layer is used for superposing noise vectors generated according to preset distribution on current expression vectors of corresponding training samples;
a gradient determining unit, configured to back-propagate the gradients of the model parameters of the global model based on the comparison between each business processing result and the sample label corresponding to each training sample, so as to adjust the model parameters held by the service party according to the obtained gradient data; and
a gradient feedback unit, configured to send the gradient data corresponding to each training sample to the corresponding data party, so that the data party derives the gradients of the model parameters in its local coding model from the corresponding gradient data and adjusts its local model parameters accordingly.
According to an eighth aspect, a business processing apparatus is provided for processing related business data with a business model jointly trained in advance by a plurality of data parties and a service party to determine a corresponding business processing result, wherein the business model includes at least one local coding model corresponding to a single data party and a global model corresponding to the service party. The apparatus comprises:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is configured to acquire a characterization vector obtained by processing service data to be processed by using at least one local coding model, at least one layer in a single local coding model is a noise layer, and the noise layer is used for superposing noise vectors generated according to preset distribution on a current expression vector of the service data to be processed;
the prediction unit is configured to input the characterization vector of the service data to be processed into a pre-trained global model to obtain a corresponding output result, wherein the global model is a global model trained together with at least one local coding model;
and the result determining unit is configured to determine a service processing result of the service data to be processed according to the output result of the global model.
In an embodiment, the representation vector of the business data is determined either by the local coding model of a single corresponding data party processing the to-be-processed business data, or as the average of the representation vectors obtained by the local coding networks of all the data parties processing the to-be-processed business data.
According to a ninth aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed in a computer, the computer performs the method of any one of the first to fourth aspects.
According to a tenth aspect, a computing device is provided, comprising a memory and a processor, wherein the memory stores executable code, and the processor, when executing the executable code, implements the method of any one of the first to fourth aspects.
According to the method, apparatus, and system provided by the embodiments of this specification, under an architecture that combines a local coding model at each data party with a global model at the service party, each data party transmits only representation vectors of relatively small dimension, and the service party sends back only the gradient data corresponding to those vectors, which greatly reduces the number of parameters transmitted during large-scale model training. In addition, when a data party determines the representation vector of its local data, it can use a custom coding network to fully accommodate data heterogeneity, and a noise layer added to the coding model keeps the model's predictions robust at a predetermined privacy cost, so that data privacy is effectively protected. In short, the scheme improves the effectiveness of jointly training a large-scale business model.
Drawings
To illustrate the technical solutions of the embodiments more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a specific implementation architecture under the technical concept of the present specification;
FIG. 2 illustrates a flow diagram of a method of jointly training a business model, according to one embodiment;
FIG. 3 is a schematic diagram of a specific model architecture under the technical concept of the present specification;
FIG. 4 illustrates a partial coding model diagram according to one embodiment;
FIG. 5 illustrates a flow diagram of a method for a joint training business model for a data party, according to one embodiment;
FIG. 6 illustrates a flow diagram of a method for a joint training business model for a server according to one embodiment;
FIG. 7 illustrates a method flow diagram of business processing according to one embodiment;
FIG. 8 illustrates a system block diagram of a joint training business model according to one embodiment;
FIG. 9 shows a schematic block diagram of a business processing apparatus according to one embodiment.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
First, a description is given with reference to the embodiment shown in FIG. 1, which depicts a specific implementation scenario in which a plurality of data parties jointly train a business model. The business model may be a machine learning model for performing business processing, such as classification or scoring, on given business data, and may include, for example, a neural network, a decision tree, or a support vector machine.
In this implementation scenario, data parties 1, 2, 3, and so on may jointly train the business model, with each data party holding its own business data. The business data may be of various types, such as text, images, voice, or video, and its specific content depends on the business scenario. For example, in a scenario where an electronic payment platform or a banking institution analyzes its users, the business data of the electronic payment platform may be data on users' electronic payments, transfers, and debit records maintained by the platform, while the business data of the banking institution may be data on users' credit records, income, and remittances maintained by the institution.
Under this implementation architecture, the data parties jointly train the business model. To do so, a service party (or a trusted third party) can participate to aggregate the parties' data and assist with complex computations while protecting each data party's privacy. In the conventional approach, the business model is usually trained jointly by combining the local models of the data parties with a global model (or global parameters) maintained by the service party: each data party passes the model parameters of its local model to the service party, which determines the global model or global parameters and directs each data party to update its local model.
However, with large-scale data and model requirements, each data party may hold massive data with not necessarily consistent structures, which can make the business model very large. The transferred parameter information may therefore have a large data size, and the conventional approach of transferring model parameters to update the global model may not meet the requirements on data volume and efficiency.
To this end, the technical idea of this specification provides a joint training approach based on local representation learning. Specifically, each data party applies dimension-reducing coding to its training samples through its local model, so that each local training sample is represented by a representation vector of fewer dimensions. Each data party sends the representation vectors determined by its local model to the service party; the service party further processes the representation vectors sent by the data parties and back-propagates the gradients of the model parameters; and each data party then determines the gradients of its local model's parameters and adjusts the corresponding model parameters, realizing parallel training of the business model.
In this joint training process, to ensure the data security of each data party, differential privacy can be adopted: each data party perturbs its own data at a certain privacy cost. Meanwhile, representing each party's large-scale data with small representation vectors containing few elements effectively reduces the data scale, accommodates different data structures, and makes large-scale training scalable. This way of jointly training the business model both reduces the data scale and protects data privacy, improving the practicality of large-scale business model training across multiple business parties.
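To get a rough sense of the communication savings claimed here, the following back-of-the-envelope comparison uses illustrative numbers that are assumptions, not figures from the specification:

```python
# Illustrative arithmetic only; the dimensions below are assumptions.
param_count = 10_000_000      # parameters of a hypothetical large local model
token_dim = 128               # predetermined representation-vector dimension
batch_size = 1_000            # samples exchanged in one training round

floats_sending_parameters = param_count          # parameter-passing scheme
floats_sending_tokens = token_dim * batch_size   # this scheme (per direction)
ratio = floats_sending_parameters / floats_sending_tokens
print(ratio)  # 78.125: roughly 78x fewer values per direction per round
```

The gap widens as the local models grow, since the representation-vector cost depends only on the token dimension and batch size, not on the model size.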
The technical idea of the present specification is described in detail below.
FIG. 2 illustrates a flow diagram of jointly training a business model according to one embodiment of this specification. The business model may combine one or more machine learning models, such as a neural network, a support vector machine, and a decision tree. In this specification, a business model can be divided into two parts: a coding part, usually provided at the data party and called the local coding model, implemented for example by a coding network such as BERT or a graph neural network; and a part that further processes the encoded representation vectors, usually provided at the service party and called the global model, implemented by a machine learning model such as a convolutional neural network, a support vector machine, or a decision tree. FIG. 3 provides a schematic diagram of a model architecture of an embodiment of this specification.
It is understood that, during machine learning, the data parties may hold data of similar or different structures. For example, the data parties may all be banking institutions, each holding data such as users' ages, fund flows, deposit amounts, loan records, and repayment records. As another example, the data parties may include a social finance platform, a banking institution, and so on: the social finance platform holds data such as users' ages, friend relationships, transfer records, resident geographic locations, credit consumption amounts, and credit repayment records, while the banking institution holds data such as users' ages, fund flows, deposit amounts, loan records, and repayment records. The data held by the parties can be represented in a unified format, for example by vectors of unified dimensions whose per-dimension data follow an agreed formatting rule (a feature-value extraction rule), or in personalized formats, for example by individually customized per-dimension formatting rules. Accordingly, each data party can correspond to at least one local coding model for encoding its local business data. Referring to FIG. 3, the data parties may encode their local business data (e.g., picture business data) using different coding models, which may adopt different principles and architectures, such as the unsupervised, supervised, and adaptive models shown in FIG. 3.
Each data party can provide the service party with a characterization vector representing its local business data, and the service party can use the characterization vectors provided by the data parties as input vectors to obtain the output result of the business model. In the model training stage, the service party can also compare the output results with the corresponding sample labels to reversely derive the gradient of each model parameter of the global model.
As shown in FIG. 2, in an embodiment of the present specification, a process of jointly training a business model by multiple data parties may include: step 201, each data party performs coding processing on local training samples through its corresponding local coding model to obtain a characterization vector of predetermined dimension for each local training sample, wherein at least one layer in a single coding network is a noise layer that superimposes, on the current expression vector of the current local training sample, a noise vector generated according to a predetermined distribution; step 202, the service party processes each characterization vector using the global model to obtain the business processing result corresponding to each training sample, and, in step 203, reversely derives the gradient of each model parameter of the global model based on the comparison between each business processing result and the corresponding sample label, so as to adjust the model parameters held by the service party according to the obtained gradient data; and step 204, each data party determines the gradient of each model parameter in its local coding model according to the gradient data reversely derived by the service party, so as to adjust the local model parameters according to the obtained gradient data.
First, in step 201, each data party performs coding processing on its local training samples through the corresponding local coding model, obtaining a characterization vector of predetermined dimension for each local training sample. At least one layer in the single coding network is a noise layer, which superimposes, on the current expression vector of the current local training sample, a noise vector generated according to a predetermined distribution.
It will be appreciated that the characterization vector is a vector used to describe the business data of a data party. To reduce data throughput, especially in large-scale data processing, characterization vectors are typically of relatively low dimension. The characterization vector may thus be obtained by an encoding process on the training samples. The training samples may include various types of sample data, described in the form of pictures, text, video, or animation, or in the form of feature vectors obtained by feature value extraction.
The data party's encoding of local training samples involves two aspects. On one hand, data with many dimensions or a large data amount is reduced in dimension, i.e., the training sample is characterized by a vector of smaller dimension. On the other hand, the training samples are perturbed to preserve data privacy.
The characterization of training samples may be implemented by a coding network, for example. The coding network can be an existing one, such as BERT, or a network designed as needed, such as a graph neural network. An existing coding network may be a neural network with fixed parameters, or one with non-fixed parameters (the network architecture is determined, but the parameters need to be trained). If the coding network has non-fixed parameters or is designed as needed, its model parameters can be adjusted and determined in the process of jointly training the business model.
According to one embodiment, a customized coding network can be adopted in the data characterization process of a single data party. For example, a first party (any one of the data parties) employs a fully connected neural network, a second party employs an adaptive neural network, a third party employs an unsupervised neural network, and so on.
In order to further protect local data privacy, a single data party can also perform differential privacy processing in the data characterization process. Differential privacy is a means in cryptography that aims to maximize the accuracy of queries from a statistical database while minimizing the chance of identifying individual records. Given a random algorithm M, let P_M be the set of all possible outputs of M. For any two adjacent data sets D and D' and any subset S_M of P_M, if the random algorithm M satisfies:

Pr[M(D) ∈ S_M] ≤ e^ε × Pr[M(D') ∈ S_M]

then algorithm M is said to provide ε-differential privacy protection, where the parameter ε, called the privacy protection budget, balances the degree of privacy protection against accuracy. ε is generally predetermined. The closer ε is to 0, the closer e^ε is to 1, the closer the processing results of the random algorithm on the two adjacent data sets D and D', and the stronger the degree of privacy protection.
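As an illustrative sketch (not part of the specification), the ε-differential-privacy inequality above can be checked numerically for the classic Laplace mechanism, which adds Lap(0, 1/ε) noise to a sensitivity-1 query; the query values 3.0 and 4.0 for the adjacent data sets are hypothetical:

```python
import numpy as np

# Illustrative check: the Laplace mechanism M(D) = f(D) + Lap(0, 1/eps) for a
# sensitivity-1 query f satisfies the eps-DP density inequality
# Pr[M(D) = x] <= e^eps * Pr[M(D') = x] at every output x.
eps = 0.5
b = 1.0 / eps  # Laplace scale for sensitivity 1

def lap_pdf(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2.0 * b)

f_D, f_Dp = 3.0, 4.0            # hypothetical query results on adjacent data sets
xs = np.linspace(-20.0, 20.0, 2001)
ratio = lap_pdf(xs, f_D, b) / lap_pdf(xs, f_Dp, b)
print(ratio.max() <= np.exp(eps) + 1e-9)  # True: the eps-DP bound holds
```

For the Laplace mechanism the density ratio is exactly bounded by e^ε, which is why the check passes for every output value on the grid.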
Differential privacy applied by a single data party may also be referred to as Local Differential Privacy (LDP). In the encoding process, differential privacy processing can be performed by adding a perturbation to an intermediate vector. Taking a multilayer neural network as an example, a perturbation vector can be superimposed on the normal output vector of a given layer of the network. For machine learning models, it is desirable that, for algorithms satisfying differential privacy, the expectation of the model prediction also satisfies differential privacy. In fact, for an algorithm A satisfying ε-differential privacy, if for any adjacent data sets D and D' and any output subset S:

Pr[A(D) ∈ S] ≤ e^ε × Pr[A(D') ∈ S]

holds, it can be mathematically proven that:

E[A(D)] ≤ e^ε × E[A(D')]

That is to say, for a prediction model whose algorithm satisfies ε-differential privacy, the expected value of the model prediction also satisfies differential privacy, and the influence of input perturbation on the output result is controllable, so that the model has reliable robustness.
According to this theory, noise satisfying differential privacy can be added on at least one layer of the local coding model. As shown in fig. 4, the layer of the coding model to which noise is added may be referred to as a noise layer. By means of the noise layer, noise vectors generated according to a predetermined distribution can be superimposed on the current expression vector of the current training sample. Where the noise vector may be generated by a mechanism that satisfies differential privacy.
Common differential privacy mechanisms include the Laplace mechanism, the Gaussian mechanism, and the exponential mechanism. In a noise layer of the coding network, a noise vector satisfying the Laplace or Gaussian distribution may be superimposed on the current output vector of the coding network to obtain a perturbed vector, and subsequent processing continues to obtain the characterization vector. The noise vector has the same dimension as the current output vector. For example, as shown in FIG. 4, a noise vector may be superimposed on the output vector of the first layer.
The noise vector may be a vector having the same dimension (hereinafter, n) as the current output vector. Taking the Gaussian mechanism as a specific example, the probability density function of Gaussian-distributed noise is:

f(x) = (1 / (σ·sqrt(2π))) · exp(−(x − μ)² / (2σ²))

where σ² is a predetermined variance and μ is a predetermined mean. Sampling x randomly n times generates n noise elements following the Gaussian distribution N(μ, σ²), and these noise elements constitute the noise vector. It is understood that the mean of the Gaussian distribution controls the value around which the noise elements fluctuate, and the variance controls the magnitude of the fluctuation around the mean. A mean of 0 means each noise element fluctuates around 0; a variance of 1 means the expected squared deviation of each noise element from the mean is 1, ensuring the noise is small enough not to affect accuracy. Optionally, the variance σ² of the Gaussian distribution may be determined by a preset privacy protection budget (also called a privacy factor) ε, e.g., the variance may be set to ε squared.
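A minimal sketch of such a Gaussian noise layer step with NumPy; the values of μ and σ are illustrative (the specification leaves the exact variance choice open, e.g., tied to ε):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(expression_vec, mu=0.0, sigma=1.0):
    """Superimpose an i.i.d. Gaussian noise vector drawn from N(mu, sigma^2),
    with the same dimension n as the current expression vector."""
    noise = rng.normal(mu, sigma, size=expression_vec.shape)
    return expression_vec + noise

h = np.array([0.2, -1.3, 0.7, 2.1])   # hypothetical layer output
h_noisy = add_gaussian_noise(h)        # perturbed vector passed to the next layer
```

The perturbed vector `h_noisy` would then continue through the remaining layers of the coding network to produce the characterization vector.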
According to another specific example, under the Laplace mechanism, the element values of the noise vector may conform to the Laplace distribution, whose probability density is:

f(x) = (1 / (2b)) · exp(−|x − μ| / b)

where μ is the mean of the Laplace distribution; when used as noise, the mean may be taken as 0 to ensure the noise elements fluctuate around 0 (small enough to have little effect on the result). Substituting the privacy factor ε and a sensitivity of 1 for the Laplace noise gives a scale parameter b = 1/ε, i.e., the Laplace distribution Lap(0, 1/ε) is used. Here the privacy factor ε, also called the privacy protection budget, balances the degree of privacy protection against accuracy and may typically be predetermined (e.g., 0.1). Drawing n random values from this distribution with a preselected random algorithm yields n values that form the n-dimensional noise vector.
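A minimal sketch of generating such a Laplace noise vector with NumPy, assuming sensitivity 1 so that the scale is 1/ε as described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_noise_vector(n, eps=0.1, mu=0.0):
    """Draw an n-dimensional noise vector from Lap(mu, 1/eps), the Laplace
    mechanism scale for a sensitivity-1 quantity."""
    return rng.laplace(loc=mu, scale=1.0 / eps, size=n)

noise = laplace_noise_vector(100, eps=0.1)
# Elements fluctuate around 0; the spread is governed by the scale 1/eps = 10,
# so a smaller eps (stronger privacy) produces larger noise.
```

Note the trade-off this makes visible: decreasing ε widens the distribution, strengthening privacy protection at the cost of accuracy.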
It will be appreciated that the noise layer may be a separate layer in the model, e.g., a layer inserted into a multilayer neural network dedicated to superimposing the noise vector (such as the layer after the first layer in FIG. 4). In this case, the vector addition with the noise vector is performed on the expression vector produced by a given operation stage of the local coding model, after that stage. In some embodiments, the noise layer may instead be fused into a stage of the local coding model, e.g., the i-th layer itself is a noise layer. In this case, the computation logic of that stage (the noise layer) includes the operation of superimposing the noise vector. For example, if the original computation logic of a neural network layer is Y = XW, then as a noise layer its computation logic may be modified to Y = XW + Noise, where X is the input vector of the noise layer (which may be the output vector of the previous layer or the result of the previous computation stage), W is the parameter matrix, and Noise is the noise vector.
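The fused-layer variant Y = XW + Noise might be sketched as follows; the class name, weight initialization, and the choice of Laplace noise with scale 1/ε are illustrative assumptions, not prescribed by the specification:

```python
import numpy as np

rng = np.random.default_rng(7)

class NoisyLinear:
    """Sketch of a fused noise layer: forward computes Y = X @ W + Noise,
    where Noise is drawn per forward pass from Lap(0, 1/eps)."""
    def __init__(self, in_dim, out_dim, eps=0.5):
        self.W = rng.normal(0.0, 0.1, size=(in_dim, out_dim))
        self.scale = 1.0 / eps

    def forward(self, X):
        # fresh noise on every pass, same shape as the linear output
        noise = rng.laplace(0.0, self.scale, size=(X.shape[0], self.W.shape[1]))
        return X @ self.W + noise

layer = NoisyLinear(in_dim=8, out_dim=4)
Y = layer.forward(np.ones((2, 8)))     # perturbed output of shape (2, 4)
```

Because the noise is added inside the layer's own computation, no separate noise-insertion layer is needed in the network definition.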
After encoding through its local coding model, each data party obtains a corresponding characterization vector for each training sample and sends the characterization vectors to the service party (or a trusted third party). In the model training stage, because the adjustment of model parameters is a progressive process, each data party may determine characterization vectors for a predetermined number (e.g., 1, 10, etc.) of training samples at a time and send them to the service party.
Then, in step 202, the service party processes each characterization vector using the global model to obtain the business processing result corresponding to each training sample, and, through step 203, reversely derives the gradient of each model parameter of the global model based on the comparison between each business processing result and the corresponding sample label, so as to adjust the model parameters held by the service party according to the obtained gradient data.
It can be understood that, under the implementation architecture of the present specification, the service party may use the characterization vector provided by each data party as the input vector of the global model (i.e., as a feature vector for the global model) and obtain the corresponding output result. The output result of the global model is the business processing result for the business data of the corresponding training sample. The service party can determine the gradient of each model parameter in the global model based on the comparison between the output result of the business model and the sample label, so as to adjust the model parameters it holds according to the obtained gradient data.
It is understood that the business model is divided into at least two stages: the local coding model stage on each data party, and the global model stage on the service party. According to the calculation principle of the gradient (usually the partial derivative of the loss function with respect to a model parameter, where the loss function is determined from the comparison between the output result of the business model and the sample label), the gradients of the model parameters can be derived backwards: the model loss is determined from the comparison of the output result with the sample label, and each parameter gradient of the global model is determined from the partial derivative of the model loss with respect to that parameter of the service party's global model. The service party may then adjust the model parameters of the global model according to the gradients, using, for example, gradient descent or Newton's method, which is not limited herein.
Further, the gradients of the model parameters of each data party's local model may be determined from the gradient data provided by the service party. The principle is as follows: for a business model divided into two stages, assume the first stage computes Y1 = f1(W1, X1) with X2 = Y1, and the second stage computes Y2 = f2(W2, X2). The model loss is determined from Y2; the gradient of the second stage is based on the partial derivative of Y2 with respect to W2, while the gradient of a first-stage model parameter is determined by multiplying the partial derivative of Y2 with respect to f1 by the partial derivative of f1 with respect to W1. That is, the gradients of the model parameters of each data party's local coding model are determined based on the partial derivative of Y2 with respect to f1 (i.e., the gradient of Y2 with respect to X2) determined by the service party.
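The two-stage gradient split can be made concrete with a toy example, assuming f1 and f2 are single linear layers and a squared-error loss (the noise layer and realistic network architectures are omitted for clarity; all shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

X1 = rng.normal(size=(4, 6))           # data party's private features
y  = rng.normal(size=(4, 2))           # labels held at the service party
W1 = rng.normal(size=(6, 3)) * 0.1     # local coding model parameters
W2 = rng.normal(size=(3, 2)) * 0.1     # global model parameters

# --- data party: first stage, sends only X2 (the characterization vectors) ---
X2 = X1 @ W1

# --- service party: second stage, loss, and reverse derivation ---
Y2 = X2 @ W2
dY2 = Y2 - y                           # gradient of 0.5*||Y2 - y||^2 w.r.t. Y2
dW2 = X2.T @ dY2                       # global-model parameter gradient
dX2 = dY2 @ W2.T                       # gradient w.r.t. X2, sent back down

# --- data party: local parameter gradient via the chain rule ---
dW1 = X1.T @ dX2                       # raw data X1 never leaves the data party

lr = 0.1                               # one gradient-descent step on each side
W2 -= lr * dW2
W1 -= lr * dW1
```

The only quantities crossing the party boundary are X2 (upward) and dX2 (downward), matching the interaction pattern described above.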
Therefore, after the gradients of the model parameters of the global model are determined, each data party may, via step 204, determine the gradient of each model parameter in its local coding model according to the gradient data reversely derived by the service party, so as to adjust the local model parameters according to the obtained gradient data, using, for example, gradient descent or Newton's method, which is not limited herein. Here, the gradient data reversely derived by the service party is the gradient of the business processing result (model loss) with respect to the characterization vector.
It can be understood that the coding network structures of the data parties can differ, and the parties need not exchange their model parameter adjustments. The above process involves only data interaction between the data parties and the service party, and the exchanged data consists of the characterization vectors provided by the data parties to the service party and the gradient data about those characterization vectors issued by the service party to the data parties. Because the dimension of the characterization vector is controllable (e.g., 100), the amount of exchanged data during model training on large-scale models or large-scale data can be greatly reduced, improving training efficiency.
It should be noted that, as can be seen from the model architecture shown in FIG. 3, the local coding model of any data party together with the global model of the service party constitutes a complete business model. The global model can also be used as an independent business model to process a characterization vector that any data party obtains from its business data through its trained local coding model.
The technical concept provided by this specification can be applied to business scenarios in which the data parties share a consistent target, such as the classification scenario for elephants and goats shown in FIG. 3. Each data party can hold data of the same or different structure, which corresponds to horizontal partitioning in the joint training process (each party independently holds complete individual training samples).
If any one of the multiple data parties is called the first party, and the local coding model corresponding to the first party is assumed to be the first coding model, then, as shown in FIG. 5, in the process of jointly training the business model, the operations performed by the first party may include the following steps:
Step 501, process a local first sample using the first coding network to obtain a first characterization vector, and send the first characterization vector to the service party. At least one layer of the first coding network is a noise layer for superimposing, on the current expression vector of the first sample, a noise vector generated according to a predetermined distribution.
The noise vector may be determined using a distribution satisfying differential privacy (e.g., the Gaussian or Laplace distribution), which is not described again here. The noise vector may be obtained from a pre-generated vector set, or generated in real time during execution of the local coding model, which is not limited herein. Since the noise vector has a certain randomness in each execution, in an alternative implementation a single training sample, such as the first sample, may be encoded a predetermined s times (s being a positive integer greater than 1) to obtain s corresponding characterization vectors, each encoding pass using one noise vector drawn from the predetermined distribution; the first characterization vector is then determined based on the average vector of the s characterization vectors, e.g., positively correlated with that average vector. The first characterization vector characterizes the first sample and is the characterization vector the first party determines for the first sample to send to the service party.
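A minimal sketch of the s-pass averaging, with a hypothetical one-layer encoder whose noise layer draws fresh Laplace noise on every pass (encoder form and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def encode_with_noise(x, W, eps=0.5):
    # hypothetical one-layer encoder with a Laplace noise layer (scale 1/eps)
    return x @ W + rng.laplace(0.0, 1.0 / eps, size=W.shape[1])

def averaged_characterization(x, W, s=10):
    """Encode the same sample s times with fresh noise each pass and average,
    smoothing out the per-pass randomness of the noise layer."""
    return np.mean([encode_with_noise(x, W) for _ in range(s)], axis=0)

x = np.ones(6)
W = np.full((6, 3), 0.5)
v1  = averaged_characterization(x, W, s=1)
v50 = averaged_characterization(x, W, s=50)
# In expectation, v50 lies closer to the noiseless encoding x @ W than v1 does.
```

Averaging reduces the variance of the superimposed noise by a factor of s while keeping each individual pass perturbed.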
Optionally, the first party may send the first token vector to the server together with a first tag corresponding to the first sample.
The server may process the first characterization vector using the global model, and obtain a first service processing result corresponding to the first sample, thereby determining a first gradient of the global model based on a comparison result of the first service processing result and a first label corresponding to the first sample.
Step 502, determining gradients of each model parameter in the first coding network based on the first gradient of the global model determined by the server, so as to adjust the local model parameter according to the obtained gradient data.
On the other hand, for the service side, in the method for jointly training the business model, the operations performed may be as shown in fig. 6, including:
Step 601, obtain the characterization vectors produced by each data party processing its local training samples with its local coding model, and process them to obtain the corresponding business processing results. At least one layer of a single local coding model is a noise layer for superimposing, on the current expression vector of the respective training sample, a noise vector generated according to a predetermined distribution.
Step 602, based on the comparison result between each service processing result and the sample label corresponding to each training sample, reversely deducing the gradient of each model parameter of the global model, so as to adjust the model parameter held by the service party according to the obtained gradient data.
Step 603, the gradient data corresponding to each training sample is respectively sent to the corresponding data side, so that the corresponding data side can deduce the gradient of each model parameter in the local coding model according to the corresponding gradient data, and thus the local model parameter is adjusted according to the obtained gradient data.
It should be noted that fig. 5 and fig. 6 describe operations in the process of jointly training the service model from the perspective of a data party and a service party, respectively, and the flows shown in fig. 5 and fig. 6 are part of the overall flow of fig. 2, so that the description related to fig. 2 is also applicable to the corresponding parts of fig. 5 and fig. 6, and is not repeated here.
In terms of data transmission, each data party transmits only characterization vectors of relatively small dimension, and the service party transmits back only the gradient data corresponding to those characterization vectors, greatly reducing the number of parameters transmitted during large-scale model training. In addition, when a data party determines the characterization vector of its local data, it can, on one hand, use a custom coding network to fully accommodate data heterogeneity, and on the other hand add a noise layer in the coding model to keep the model prediction results robust at a preset privacy cost, effectively protecting data privacy. In short, the method can improve the effectiveness of jointly training large-scale business models.
Further, the business model trained through the process of FIG. 2 may be verified with test samples, or used to process new business data to obtain business processing results. To this end, embodiments of the present specification further provide a method for processing a business, in which at least one local coding model and the global model trained by the methods shown in FIGS. 2, 5, and 6 are used for business processing.
It can be understood that after the training of each local coding model and global model is completed, the local coding models and global models may still be distributed to each data party and service party, and the data party performs business processing with the aid of the service party, or may be all laid out in a device or device cluster for use by a single data party (or service party).
As shown in fig. 7, the service processing flow provided by the present specification includes the following steps:
step 701, obtaining a characterization vector obtained by processing to-be-processed service data by using at least one local coding model. At least one layer in the single local coding model is a noise layer, and the noise layer is used for superposing noise vectors generated according to preset distribution on a current expression vector of the service data to be processed.
In an embodiment, a data party corresponding to the service data to be processed may be known, and at this time, the service data to be processed may be processed by using a local coding model corresponding to the data party to obtain a corresponding characterization vector.
In another embodiment, the data party corresponding to the service data to be processed may be unknown, and at this time, the service data to be processed may be processed by using each local coding model corresponding to each data party, so as to obtain each characterization vector corresponding to each local coding model. And then, averaging or weighted averaging the plurality of characterization vectors to obtain a final characterization vector of the service data to be processed.
Optionally, when a single local coding model is used to process the business data to be processed, the processing may be performed multiple times to obtain multiple characterization vectors, and the average of these vectors is used as the characterization vector produced by that local coding model for the business data. This is because the noise vector used in a single pass has a certain randomness, and averaging over multiple passes can mitigate the error that single-pass randomness may cause.
And 702, inputting the characterization vector of the service data to be processed into a pre-trained global model to obtain a corresponding output result. The global model may be a global model trained with at least one local coding model using any of the embodiments described in fig. 2, 4, 5 and described for them.
And 703, determining a service processing result of the service data to be processed according to the output result of the global model.
According to an embodiment of another aspect, a system for jointly training a business model is also provided. The business model is used for processing the relevant business data to determine a corresponding business processing result. As shown in fig. 8, system 800 includes a service party 82 and a plurality of data parties (fig. 8 shows only one data party 81). The business model comprises at least one local coding model corresponding to a single data party (such as a data party 81) and a global model arranged on the service party 82, wherein the local coding model is used for coding local training samples which are held by the corresponding single data party and are used as privacy data to obtain corresponding characterization vectors, and the global model is used for processing the characterization vectors determined by the corresponding local coding model by the single data party and obtaining a business processing result. The system 800 is configured to:
each data party (such as the data party 81) performs coding processing on local training samples through its corresponding local coding model, obtaining a characterization vector of predetermined dimension for each local training sample, wherein at least one layer in a single coding network is a noise layer that superimposes, on the current expression vector of the current local training sample, a noise vector generated according to a predetermined distribution;
the server 82 processes each characterization vector by using the global model to obtain each service processing result corresponding to each training sample, and reversely deduces the gradient of each model parameter of the global model based on the comparison between each service processing result and the corresponding sample label so as to adjust the model parameter held by the server according to the obtained gradient data;
each data side (including data side 81) determines the gradient of each model parameter in the local coding model according to the gradient data reversely derived by the service side 82, so as to adjust the local model parameter according to the obtained gradient data.
Further, an apparatus for jointly training a business model is provided. Assuming that the apparatus is provided on a first one of the data parties (e.g., data party 81), the apparatus may include:
the encoding unit 811 is configured to process a local first sample by using a first encoding network to obtain a first characterization vector, and send the first characterization vector to a server, so that the server processes the first characterization vector by using a global model, and obtains a first service processing result corresponding to the first sample, thereby determining a first gradient of the global model based on a comparison result between the first service processing result and a first label corresponding to the first sample, at least one layer in the first encoding network being a noise layer, the noise layer being configured to superimpose noise vectors generated according to predetermined distribution on a current expression vector of the first sample;
a gradient determining unit 812 configured to determine gradients of the respective model parameters in the first coding network based on the first gradient to adjust the local model parameters according to the obtained gradient data.
In one embodiment, the encoding unit 811 is further configured to:
performing predetermined s times of coding processing on the first sample by using a first coding network to obtain corresponding s characterization vectors, wherein the single coding processing corresponds to a single noise vector with predetermined distribution;
a first token vector is determined based on an average vector of the s token vectors.
According to one embodiment, the predetermined distribution is one of a gaussian distribution or a laplace distribution.
In another aspect, the apparatus for jointly training the business model provided by the server 82 may include:
an obtaining unit 821, configured to obtain each characterization vector obtained by each data party processing a local training sample according to a local coding model, and obtain each service processing result, where at least one layer in a single local coding model is a noise layer, and the noise layer is used to superimpose noise vectors generated according to predetermined distribution on a current expression vector of a corresponding training sample;
a gradient determining unit 822 configured to reversely derive gradients of model parameters of the global model based on comparison results of the respective service processing results and sample labels corresponding to the respective training samples, so as to adjust the model parameters held by the service party according to the obtained gradient data;
the gradient feedback unit 823 is configured to send the gradient data corresponding to each training sample to the corresponding data party, so that the corresponding data party derives the gradient of each model parameter in the local coding model according to the corresponding gradient data, and adjusts the local model parameter according to the obtained gradient data.
According to an embodiment in an aspect, there is further provided a business processing apparatus, which processes relevant business data by using a business model jointly trained by a plurality of data parties and a service party in advance to determine a corresponding business processing result. Here, the business model includes at least one local coding model corresponding to a single data side, and a global model corresponding to a service side. As shown in fig. 9, the apparatus 900 may include:
an obtaining unit 91, configured to obtain a characterization vector obtained by processing the to-be-processed service data with at least one local coding model, where at least one layer in a single local coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the to-be-processed service data, a noise vector generated according to a predetermined distribution;
a prediction unit 92, configured to input the characterization vector of the to-be-processed service data into a pre-trained global model to obtain a corresponding output result, where the global model is trained together with the at least one local coding model;
and a result determining unit 93, configured to determine the service processing result of the to-be-processed service data according to the output result of the global model.
According to one possible design, the characterization vector of the service data is determined based on the processing of the to-be-processed service data by the local coding model of the relevant data party, or is determined based on the average vector of the characterization vectors obtained by each data party processing the to-be-processed service data with its local coding model.
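The two inference options in this design (a single party's characterization vector, or the average of all parties' vectors) can be sketched as follows. This is an illustrative sketch only, assuming hypothetical linear encoders with Gaussian noise layers and a linear global model; none of the shapes or names come from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_encode(x, W_enc, rng, scale=0.1):
    # Local coding model: a linear encoding whose noise layer superimposes
    # a Gaussian noise vector (one choice of predetermined distribution).
    h = W_enc @ x                                    # current expression vector
    return h + rng.normal(scale=scale, size=h.shape)

# Hypothetical setup: three data parties with 4x5 linear encoders and a
# linear global model.
encoders = [rng.normal(size=(4, 5)) for _ in range(3)]
W_glob = rng.normal(size=4)
x = rng.normal(size=5)                               # to-be-processed data

vec_single = noisy_encode(x, encoders[0], rng)       # option 1: one party only
vec_avg = np.mean([noisy_encode(x, W, rng) for W in encoders], axis=0)  # option 2
score = float(W_glob @ vec_avg)                      # global model output
```

Averaging across parties (option 2) also averages the independent noise draws, so the global model sees a less perturbed characterization vector at inference time.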
It should be noted that the system 800, the data party 81, the service party 82, and the apparatus 900 shown in fig. 8 are product embodiments corresponding to the method embodiments shown in fig. 2, fig. 5, fig. 6, and fig. 7, respectively, and corresponding descriptions in the method embodiments are also applicable to the product embodiments and are not repeated herein.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2, 5, 6 or 7.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory having stored therein executable code, and a processor that, when executing the executable code, implements the method described in connection with fig. 2, 5, 6 or 7.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of this specification may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The embodiments above describe the purpose, technical solutions, and advantages of this specification in further detail. It should be understood that they are merely specific embodiments of the technical idea of this specification and are not intended to limit its scope; any modification, equivalent replacement, or improvement made on the basis of the technical solutions of these embodiments shall fall within the scope of the technical idea of this specification.

Claims (18)

1. A method for jointly training a business model, wherein the business model is used for processing related business data to determine a corresponding business processing result, the method is jointly executed by a plurality of data parties and a service party, the business model comprises at least one local coding model corresponding to a single data party and a global model arranged on the service party, the local coding model is used for coding a local training sample which is held by the corresponding single data party and is used as privacy data to obtain a corresponding characterization vector, and the global model is used for processing the characterization vector determined by the corresponding local coding model by the single data party and obtaining the business processing result; the method comprises the following steps:
each data party respectively encodes its local training samples through its corresponding local coding model to obtain, for each local training sample, a characterization vector of a predetermined dimension, wherein at least one layer in a single local coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the current local training sample, a noise vector generated according to a predetermined distribution;
the service party processes each characterization vector using the global model to obtain the service processing result corresponding to each training sample, derives by back-propagation the gradients of the model parameters of the global model based on the comparison between each service processing result and the corresponding sample label, and adjusts the model parameters held by the service party according to the obtained gradient data;
and each data party determines the gradients of the model parameters in its local coding model according to the gradient data obtained from the service party's back-propagation, so as to adjust its local model parameters according to the obtained gradient data.
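The data-party side of the training round in claim 1 (noisy encoding, then chaining the gradient received from the service party into the local parameters) can be sketched as follows. This is an illustrative sketch, not the claimed method: the one-layer 4x6 linear coding model, noise scale, and learning rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical one-layer local coding model followed by an additive noise layer.
W_local = rng.normal(size=(4, 6))

def encode(x, rng, scale=0.1):
    h = W_local @ x                                   # current expression vector
    return h + rng.normal(scale=scale, size=h.shape)  # noise layer output

def local_update(x, grad_from_service_party, lr=0.1):
    # An additive noise layer has identity Jacobian, so the gradient the
    # service party sends back passes through unchanged; chain it into the
    # gradient of the local parameters and take one descent step.
    grad_W = np.outer(grad_from_service_party, x)
    return W_local - lr * grad_W

x = rng.normal(size=6)
vec = encode(x, rng)            # characterization vector sent to the service party
g = rng.normal(size=4)          # gradient data received back from the service party
W_updated = local_update(x, g)
```

Note that only the characterization vector leaves the data party; the raw sample `x` and the local parameters stay local, which is the point of the split architecture.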
2. A method for jointly training a business model, wherein the business model is used for processing related business data to determine a corresponding business processing result, the method is jointly executed by a plurality of data parties and a service party, the data parties include a first party, the business model comprises at least one local coding model corresponding to a single data party and a global model arranged on the service party, the single local coding model is used for coding local training samples which are held by the corresponding single data party and are used as privacy data to obtain corresponding characterization vectors, and the global model is used for processing the characterization vectors which are determined by the single data party through the corresponding local coding model and obtaining the business processing result; the local coding model corresponding to the first party is a first coding model, in which method the first party performs operations comprising:
processing a local first sample using the first coding model to obtain a first characterization vector, and sending the first characterization vector to the service party, so that the service party processes the first characterization vector with the global model and obtains a first service processing result corresponding to the first sample, thereby determining a first gradient of the global model based on the comparison result between the first service processing result and a first label corresponding to the first sample, wherein at least one layer in the first coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the first sample, a noise vector generated according to a predetermined distribution;
determining gradients of the model parameters in the first coding model based on the first gradients, so as to adjust local model parameters according to the obtained gradient data.
3. The method of claim 2, wherein said processing a local first sample using the first coding model to obtain a first characterization vector comprises:
performing a predetermined number s of coding passes on the first sample using the first coding model to obtain s corresponding characterization vectors, wherein each single coding pass corresponds to a single noise vector of the predetermined distribution;
determining the first characterization vector based on the average vector of the s characterization vectors.
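The s-pass averaging in claim 3 can be illustrated with a short sketch. This is not the claimed implementation: the linear coding model, noise scale, and s = 8 are hypothetical. Because each pass draws an independent noise vector, averaging the s characterization vectors reduces the noise variance by a factor of s while each single pass remains perturbed.

```python
import numpy as np

rng = np.random.default_rng(3)

W_enc = rng.normal(size=(4, 6))   # hypothetical first coding model (linear)
x = rng.normal(size=6)            # the local first sample
s = 8                             # predetermined number of coding passes

# Each pass superimposes an independent noise vector drawn from the
# predetermined distribution (Gaussian here); the first characterization
# vector is the average of the s noisy encodings.
vectors = [W_enc @ x + rng.normal(scale=0.1, size=4) for _ in range(s)]
first_characterization_vector = np.mean(vectors, axis=0)
```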
4. The method of claim 2, wherein the predetermined distribution is one of a Gaussian distribution or a Laplace distribution.
5. The method of claim 2, wherein the noise layer is an independent neural network layer, and the current expression vector is the output vector of the preceding layer in the first coding model.
6. The method according to claim 2, wherein, in the noise layer, an element in a single dimension of the output vector is the superposition of a processing result over the elements of each dimension of the previous layer's output vector and the corresponding element of the noise vector.
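One way to read claim 6 is sketched below: each output element is computed from all dimensions of the previous layer's output (here via a hypothetical weight row, an assumption not specified by the claim), and the corresponding noise element is superimposed on it.

```python
import numpy as np

rng = np.random.default_rng(4)

prev_out = rng.normal(size=5)     # output vector of the previous layer
A = rng.normal(size=(3, 5))       # hypothetical per-dimension processing weights
noise = rng.normal(scale=0.1, size=3)  # noise vector, predetermined distribution

# Element i of the noise layer's output: a processing result over every
# dimension of the previous layer's output, superimposed with the i-th
# element of the noise vector.
out = np.array([A[i] @ prev_out + noise[i] for i in range(3)])
```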
7. A method for jointly training a business model, wherein the business model is used for processing related business data to determine a corresponding business processing result, the method is jointly executed by a plurality of data parties and a service party, the business model comprises at least one local coding model corresponding to a single data party and a global model arranged on the service party, the single local coding model is used for coding a local training sample which is held by the corresponding single data party and is used as privacy data to obtain a corresponding characterization vector, and the global model is used for processing the characterization vector determined by the corresponding local coding model by the single data party and obtaining the business processing result; in the method, the operation performed by the server side comprises:
obtaining the characterization vectors that each data party produces by processing its local training samples with its local coding model, so as to obtain the corresponding service processing results, wherein at least one layer in a single local coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the corresponding training sample, a noise vector generated according to a predetermined distribution;
deriving, by back-propagation, the gradients of the model parameters of the global model based on the comparison results between the service processing results and the sample labels corresponding to the training samples, so as to adjust the model parameters held by the service party according to the obtained gradient data;
and sending the gradient data corresponding to each training sample to the corresponding data party, so that that data party derives the gradients of the model parameters in its local coding model from the corresponding gradient data and adjusts its local model parameters according to the obtained gradient data.
8. A method of business processing, wherein relevant business data is processed by a business model jointly trained by a plurality of data parties and a service party in advance to determine a corresponding business processing result, the business model comprises at least one local coding model corresponding to a single data party and a global model corresponding to the service party, the method comprises:
obtaining a characterization vector obtained by processing the to-be-processed service data with at least one local coding model, wherein at least one layer in a single local coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the to-be-processed service data, a noise vector generated according to a predetermined distribution;
inputting the characterization vector of the to-be-processed service data into a pre-trained global model to obtain a corresponding output result, wherein the global model is trained together with the at least one local coding model using the method of any one of claims 1 to 7;
and determining a service processing result of the service data to be processed according to the output result of the global model.
9. The method according to claim 8, wherein the characterization vector of the service data is determined based on the processing of the to-be-processed service data by the local coding model of the relevant data party, or is determined based on the average vector of the characterization vectors obtained by each data party processing the to-be-processed service data with its local coding model.
10. A system for jointly training a business model, wherein the business model is used for processing related business data to determine a corresponding business processing result, the system comprises a service party and a plurality of data parties, the business model comprises at least one local coding model corresponding to a single data party and a global model arranged on the service party, the local coding model is used for encoding local training samples, which are held by the corresponding single data party as private data, to obtain corresponding characterization vectors, and the global model is used for processing the characterization vectors determined by each single data party through its corresponding local coding model to obtain the business processing result; the system is configured to:
each data party respectively encodes its local training samples through its corresponding local coding model to obtain, for each local training sample, a characterization vector of a predetermined dimension, wherein at least one layer in a single local coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the current local training sample, a noise vector generated according to a predetermined distribution;
the service party processes each characterization vector using the global model to obtain the service processing result corresponding to each training sample, derives by back-propagation the gradients of the model parameters of the global model based on the comparison between each service processing result and the corresponding sample label, and adjusts the model parameters held by the service party according to the obtained gradient data;
and each data party determines the gradients of the model parameters in its local coding model according to the gradient data obtained from the service party's back-propagation, so as to adjust its local model parameters according to the obtained gradient data.
11. An apparatus for jointly training a business model, wherein the business model is configured to process related business data to determine a corresponding business processing result, a method for jointly training the business model is performed by a plurality of data parties and a service party, the plurality of data parties includes a first party, the business model includes at least one local coding model corresponding to a single data party, and a global model provided to the service party, the single local coding model is configured to encode a local training sample, which is held by the corresponding single data party and is used as private data, to obtain a corresponding token vector, and the global model is configured to process the token vector determined by the corresponding local coding model by the single data party and obtain the business processing result; the local coding model corresponding to the first party is a first coding model, and the apparatus is provided at the first party and includes:
an encoding unit, configured to process a local first sample using the first coding model to obtain a first characterization vector, and to send the first characterization vector to the service party, so that the service party processes the first characterization vector with the global model and obtains a first service processing result corresponding to the first sample, thereby determining a first gradient of the global model based on the comparison result between the first service processing result and a first label corresponding to the first sample, wherein at least one layer in the first coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the first sample, a noise vector generated according to a predetermined distribution;
a gradient determining unit configured to determine gradients of the respective model parameters in the first coding model based on the first gradient, so as to adjust local model parameters according to the obtained gradient data.
12. The apparatus of claim 11, wherein the encoding unit is further configured to:
performing a predetermined number s of coding passes on the first sample using the first coding model to obtain s corresponding characterization vectors, wherein each single coding pass corresponds to a single noise vector of the predetermined distribution;
determining the first characterization vector based on the average vector of the s characterization vectors.
13. The apparatus of claim 11, wherein the predetermined distribution is one of a Gaussian distribution or a Laplace distribution.
14. An apparatus for jointly training a business model, wherein the business model is used for processing related business data to determine a corresponding business processing result, a method for jointly training the business model is jointly executed by a plurality of data parties and a service party, the business model comprises at least one local coding model corresponding to a single data party and a global model arranged on the service party, the single local coding model is used for encoding local training samples, which are held by the corresponding single data party as private data, to obtain corresponding characterization vectors, and the global model is used for processing the characterization vectors determined by each single data party through its corresponding local coding model to obtain the business processing result; the apparatus is arranged on the service party and comprises:
an acquisition unit, configured to obtain the characterization vectors that each data party produces by processing its local training samples with its local coding model, so as to obtain the corresponding service processing results, wherein at least one layer in a single local coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the corresponding training sample, a noise vector generated according to a predetermined distribution;
a gradient determining unit, configured to derive, by back-propagation, the gradients of the model parameters of the global model based on the comparison results between the service processing results and the sample labels corresponding to the training samples, so as to adjust the model parameters held by the service party according to the obtained gradient data;
and a gradient feedback unit, configured to send the gradient data corresponding to each training sample to the corresponding data party, so that that data party derives the gradients of the model parameters in its local coding model from the corresponding gradient data and adjusts its local model parameters according to the obtained gradient data.
15. An apparatus for business processing, wherein relevant business data is processed by a business model jointly trained by a plurality of data parties and a service party in advance to determine a corresponding business processing result, the business model comprises at least one local coding model corresponding to a single data party and a global model corresponding to the service party, the apparatus comprises:
an acquisition unit, configured to obtain a characterization vector obtained by processing the to-be-processed service data with at least one local coding model, wherein at least one layer in a single local coding model is a noise layer, and the noise layer superimposes, on the current expression vector of the to-be-processed service data, a noise vector generated according to a predetermined distribution;
a prediction unit, configured to input the characterization vector of the to-be-processed service data into a pre-trained global model to obtain a corresponding output result, wherein the global model is trained together with the at least one local coding model using the system of claim 10 or the apparatus of any one of claims 11 to 14;
and the result determining unit is configured to determine a service processing result of the service data to be processed according to the output result of the global model.
16. The apparatus according to claim 15, wherein the characterization vector of the service data is determined based on the processing of the to-be-processed service data by the local coding model of the relevant data party, or is determined based on the average vector of the characterization vectors obtained by each data party processing the to-be-processed service data with its local coding model.
17. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-9.
18. A computing device comprising a memory and a processor, wherein the memory has stored therein executable code that, when executed by the processor, performs the method of any of claims 1-9.
CN202011310524.2A 2020-11-20 2020-11-20 Method and device for jointly training business model Active CN112101946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310524.2A CN112101946B (en) 2020-11-20 2020-11-20 Method and device for jointly training business model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011310524.2A CN112101946B (en) 2020-11-20 2020-11-20 Method and device for jointly training business model

Publications (2)

Publication Number Publication Date
CN112101946A CN112101946A (en) 2020-12-18
CN112101946B true CN112101946B (en) 2021-02-19

Family

ID=73785755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011310524.2A Active CN112101946B (en) 2020-11-20 2020-11-20 Method and device for jointly training business model

Country Status (1)

Country Link
CN (1) CN112101946B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11017322B1 (en) * 2021-01-28 2021-05-25 Alipay Labs (singapore) Pte. Ltd. Method and system for federated learning
CN113961967B (en) * 2021-12-13 2022-03-22 支付宝(杭州)信息技术有限公司 Method and device for jointly training natural language processing model based on privacy protection

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104640217B (en) * 2015-01-27 2017-12-15 浙江大学 OFDMA network up and down Resource co-allocation methods based on network code
CN111191709B (en) * 2019-12-25 2023-10-31 清华大学 Continuous learning framework and continuous learning method of deep neural network
CN111539769A (en) * 2020-04-27 2020-08-14 支付宝(杭州)信息技术有限公司 Training method and device of anomaly detection model based on differential privacy
CN111737755B (en) * 2020-07-31 2020-11-13 支付宝(杭州)信息技术有限公司 Joint training method and device for business model

Also Published As

Publication number Publication date
CN112101946A (en) 2020-12-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant