CN109308418B - Model training method and device based on shared data


Info

Publication number
CN109308418B
Authority
CN
China
Prior art keywords
data
value
trusted execution
execution environment
model
Legal status
Active
Application number
CN201710632357.5A
Other languages
Chinese (zh)
Other versions
CN109308418A (en)
Inventor
王力
周俊
李小龙
Current Assignee
Advanced New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Application filed by Advanced New Technologies Co Ltd
Priority to CN201710632357.5A
Publication of CN109308418A
Application granted
Publication of CN109308418B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Abstract

A model training method and device based on shared data are disclosed. On one hand, joint training can be performed according to data provided by a plurality of data providers, so that a more accurate and comprehensive data model is obtained; on the other hand, operations related to private data in the model training process (such as data decryption and local model parameter updating) are encapsulated in a trusted execution environment of the data provider for execution. No data plaintext can be obtained outside the trusted execution environment.

Description

Model training method and device based on shared data
Technical Field
The embodiment of the specification relates to the technical field of data mining, in particular to a model training method and device based on shared data.
Background
In the big data era, mining massive data can yield useful information in various forms, so the importance of data is self-evident. Different organizations own their respective data, but the data mining effect of any single organization is limited by the amount and kinds of data it owns. A direct solution to this problem is for multiple organizations to cooperate and share data with each other, thereby achieving a better data mining effect and a win-win outcome.
However, data itself is a highly valuable asset for its owner, and owners are often unwilling to provide data directly, due to requirements such as protecting privacy and preventing leaks, which makes "data sharing" difficult to put into practice. Therefore, how to implement data sharing while sufficiently ensuring data security has become a problem of great concern in the industry.
Disclosure of Invention
In view of the above technical problems, embodiments of the present specification provide a method and an apparatus for model training based on shared data, and the technical solution is as follows:
according to a first aspect of embodiments herein, there is provided a method for model training based on shared data, the method comprising:
performing iterative training by using the following steps until the model training requirement is met:
respectively obtaining ciphertext data provided by at least 1 data provider;
respectively inputting the ciphertext data of each data provider into a trusted execution environment of the data provider;
obtaining an output value of each trusted execution environment, wherein the output value is obtained by calculation according to the ciphertext data;
calculating a deviation value between a model predicted value and a true value according to a given training target model; the model predicted value is determined according to the output value of each trusted execution environment, and the true value is a global label value determined according to the data of each data provider;
respectively returning the deviation values to the trusted execution environments, so that the trusted execution environments respectively update the local model parameters;
wherein the following steps are performed inside any trusted execution environment:
decrypting the input ciphertext data to obtain a plaintext data characteristic value;
calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and updating the local model parameters according to the returned deviation value.
According to a second aspect of the embodiments of the present specification, there is provided a shared data-based model training method, including:
respectively inputting ciphertext data of at least 1 data provider to a trusted execution environment of the data provider; in each trusted execution environment, respectively decrypting the input ciphertext data to obtain each plaintext data characteristic value;
performing iterative training by using the following steps until the model training requirement is met:
in each trusted execution environment, calculating an output value corresponding to a plaintext data characteristic value according to a current local model parameter;
calculating a deviation value between a model predicted value and a true value according to a given training target model; the model predicted value is determined according to the output value of each trusted execution environment, and the true value is a global label value determined according to the data of each data provider;
respectively returning the deviation values to the trusted execution environments;
and in each trusted execution environment, updating local model parameters according to the returned deviation value.
According to a third aspect of embodiments herein, there is provided a shared data-based model training method, the method including:
performing iterative training by using the following steps until the model training requirement is met:
respectively obtaining data provided by a plurality of data providers, wherein the data provided by at least 1 data provider is ciphertext data, and the data provided by other data providers is plaintext data;
if the data form provided by the data provider is ciphertext data, inputting the ciphertext data into a trusted execution environment of the data provider correspondingly;
obtaining an output value of each trusted execution environment, wherein the output value is obtained by calculation according to the ciphertext data;
calculating the deviation value between the model predicted value and the true value by using the output values of the trusted execution environments and the plaintext data provided by the other data providers; the model predicted value is determined according to the output value of each trusted execution environment and the plaintext data feature values; the true value is a global label value determined according to the data of each data provider;
respectively returning the deviation values to the trusted execution environments, so that the trusted execution environments respectively update the local model parameters;
wherein the following steps are performed inside any trusted execution environment:
decrypting the input ciphertext data to obtain a plaintext data characteristic value;
calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and updating the local model parameters according to the returned deviation value.
According to a fourth aspect of the embodiments of the present specification, there is provided a data prediction method based on shared data modeling, the method including:
respectively obtaining ciphertext data provided by at least 1 data provider;
respectively inputting the ciphertext data of each data provider into a trusted execution environment of the data provider;
obtaining an output value of each trusted execution environment, wherein the output value is obtained by calculation according to the ciphertext data;
inputting the output value of each trusted execution environment into a pre-trained prediction model, and calculating to obtain a predicted value;
wherein the following steps are performed inside any trusted execution environment:
decrypting the input ciphertext data to obtain a plaintext data characteristic value;
and calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter.
According to a fifth aspect of the embodiments of the present specification, there is provided a shared data-based model training apparatus, including the following modules for implementing iterative training:
the data acquisition module is used for respectively acquiring ciphertext data provided by at least 1 data provider;
the data input module is used for correspondingly inputting the ciphertext data of each data provider into the trusted execution environment of the data provider;
an output value obtaining module, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the deviation value calculation module is used for calculating the deviation value between the model predicted value and the true value according to a given training target model; the model predicted value is determined according to the output value of each trusted execution environment, and the true value is a global label value determined according to the data of each data provider;
the deviation value returning module is used for returning the deviation values to the trusted execution environments respectively so that the trusted execution environments update the local model parameters respectively;
wherein the inside of any trusted execution environment includes:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and the parameter updating submodule is used for updating the local model parameters according to the returned deviation value.
According to a sixth aspect of the embodiments of the present specification, there is provided a shared data-based model training apparatus, including the following modules for implementing iterative training:
the data acquisition module is used for respectively acquiring data provided by a plurality of data providers, wherein the data provided by at least 1 data provider is in a ciphertext data form, and the data provided by other data providers is in a plaintext data form;
the data input module is used for correspondingly inputting the ciphertext data into a trusted execution environment of a data provider if the data provided by the data provider is ciphertext data;
an output value obtaining module, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the deviation value calculation module is used for calculating the deviation value between the model predicted value and the true value by using the output value of each trusted execution environment and the plaintext data provided by the other data providers; the model predicted value is determined according to the output value of each trusted execution environment and the plaintext data feature values; the true value is a global label value determined according to the data of each data provider;
the deviation value returning module is used for returning the deviation values to the trusted execution environments respectively so that the trusted execution environments update the local model parameters respectively;
wherein the inside of any trusted execution environment includes:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and the parameter updating submodule is used for updating the local model parameters according to the returned deviation value.
According to a seventh aspect of the embodiments of the present specification, there is provided a data prediction apparatus based on shared data modeling, the apparatus including:
the data acquisition module is used for respectively acquiring ciphertext data provided by at least 1 data provider;
the data input module is used for correspondingly inputting the ciphertext data of each data provider into the trusted execution environment of the data provider;
an output value obtaining module, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the predicted value calculation module is used for inputting the output value of each trusted execution environment into a pre-trained prediction model and calculating to obtain a predicted value;
wherein the inside of any trusted execution environment E_u includes:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
and the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter.
The technical scheme provided by the embodiments of the present specification is as follows: on one hand, joint training can be performed according to data provided by a plurality of data providers, so that a more accurate and comprehensive data model is obtained; on the other hand, the operations related to private data in the model training process (such as data decryption and local model parameter updating) are encapsulated in a trusted execution environment of the data provider for execution. That is to say, the data plaintext cannot be acquired outside the trusted execution environment, so the data security of each shared-data provider is effectively ensured.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the invention as claimed.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in the present disclosure, and those skilled in the art can obtain other drawings based on them.
FIG. 1 is a schematic diagram of a data sharing collaboration mode;
FIG. 2 is a schematic diagram of the architecture of the model training system disclosed in this specification;
FIG. 3 is a schematic diagram of the architecture of the data prediction system disclosed herein;
FIG. 4a is a schematic diagram of the architecture of a model training system of one embodiment of the present description;
FIG. 4b is a schematic diagram of the architecture of a model training system of another embodiment of the present description;
FIG. 5 is a schematic diagram of a shared data based model training apparatus disclosed in the present specification;
FIG. 6 is a schematic diagram of a data prediction device based on shared data modeling as disclosed in the present specification;
FIG. 7 is a schematic structural diagram of a computer device disclosed in the present specification.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions in the embodiments of the present specification, these solutions are described in detail below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification shall fall within the scope of protection of the embodiments of the present specification.
As shown in FIG. 1, the cooperative mode of "data sharing" involves several roles: data providers, a data miner, and data attackers. Multiple data providers jointly give data to a data miner for shared data mining, but to protect data privacy they do not wish to hand the data over as-is. On the other hand, the data providers also need to prevent data attackers from stealing the data. In a broad sense, for any data provider, the data miner and the other data providers actually constitute potential attackers.
Thus, one of the basic requirements for implementing secure data sharing is: a data provider provides its data to the data miner after some encryption processing. The encrypted data still retains a certain amount of information, so that the data miner can still perform data mining with it, but cannot obtain the specific data content.
In view of the above-mentioned needs, the embodiments of the present specification provide a data sharing scheme based on trusted execution environments. The scheme trains a data model from massive data samples, where the samples come from multiple data providers and different providers can each contribute sample features of different dimensions; after the shared data of all the providers are integrated, data samples with richer feature dimensions can be formed, so that a better-performing data model can be trained.
First, the Trusted Execution Environment (TEE) technique involved is briefly described. A trusted execution environment is a secure area on a device's processor that can guarantee the security, confidentiality, and integrity of the code and data loaded inside it. A trusted execution environment provides an isolated execution environment, with security features including: isolated execution, integrity of trusted applications, confidentiality of trusted data, secure storage, and so on. In general, a trusted execution environment can provide a higher level of security than the operating system. Trusted execution environments were originally proposed for mobile devices (e.g., smart phones, tablet computers, set-top boxes, smart televisions), but their application scenarios are no longer limited to the mobile domain. Common trusted execution environment implementations include AMD's PSP (Platform Security Processor), ARM's TrustZone, and Intel's SGX (Software Guard Extensions).
In an application scenario of data sharing, for any data provider, every party other than itself is considered untrustworthy. Trusted execution environments can therefore be created for the data providers respectively, and the operations carrying data security risks are encapsulated in the trusted execution environments for execution, thereby meeting the data providers' requirements on data security.
The architecture of the data sharing system provided by the embodiments of the present specification is shown in FIG. 2. Assume there are U data providers in total: 1, 2, … U, jointly providing data to the data miner so that the data miner can train a global model. The overall working principle of the data sharing is as follows:
each data provider provides the encrypted data to the data miner;
and the data mining party respectively creates a trusted execution environment for each data provider, and then correspondingly inputs the ciphertext data into each trusted execution environment.
In the trusted execution environment, the ciphertext data is decrypted and an output value is then calculated using the model parameters stored inside it. The output value of the trusted execution environment can be used for the data miner's model training, but the data miner cannot obtain specific data content from it.
And the data mining party performs joint training according to the output values of the trusted execution environments, so as to obtain a global data model.
Throughout the whole process, the operations involving data plaintext are encapsulated in the respective trusted execution environments, which are completely isolated from the outside, so the data security of each data provider can be effectively guaranteed.
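Before walking through the step-by-step flow, the working principle above can be summarized in a minimal Python sketch. All names here (Enclave, decrypt_record, h, z) are illustrative assumptions standing in for a real TEE and for the unspecified algorithms of this scheme, not an actual SGX/TrustZone API:

```python
import math

def z(outputs):
    # Union function of the outputs O_1 ... O_U; here simply their sum.
    return sum(outputs)

def h(value):
    # Global model function, e.g. the logistic form h(z) = 1 / (1 + e^-z).
    return 1.0 / (1.0 + math.exp(-value))

def decrypt_record(key, blob):
    # Placeholder for the provider-specific decryption logic; one concrete
    # possibility (AES-GCM) is sketched later in this description.
    raise NotImplementedError

class Enclave:
    """Stand-in for E_u: the key, plaintext and parameters W_u stay inside."""
    def __init__(self, key, n_features, alpha=0.01):
        self._key = key                 # decryption key, never exported
        self._w = [0.0] * n_features    # local model parameters W_u
        self._alpha = alpha             # learning rate
        self._x = None                  # plaintext features X_u

    def load_ciphertext(self, blob):
        # Function 1: decryption happens only inside the enclave.
        self._x = decrypt_record(self._key, blob)

    def output_value(self):
        # Function 2: O_u = sum_j w_j * x_j, computed from the local W_u.
        return sum(w * x for w, x in zip(self._w, self._x))

    def update(self, delta):
        # Function 3: W_u <- W_u - alpha * delta * X_u.
        self._w = [w - self._alpha * delta * x
                   for w, x in zip(self._w, self._x)]

def train_step(enclaves, ciphertexts, y):
    for e, c in zip(enclaves, ciphertexts):
        e.load_ciphertext(c)                          # feed ciphertext in
    outputs = [e.output_value() for e in enclaves]    # collect O_1 ... O_U
    delta = y - h(z(outputs))                         # deviation value
    for e in enclaves:
        e.update(delta)                               # return it to each E_u
```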
The following describes a model training method provided in the embodiments of the present specification with reference to a specific flow of steps:
Training a data model means searching for optimal model parameter values through repeated iteration: the model parameters are updated in each iteration, and training ends when the updated parameters meet the training requirement. Referring to FIG. 2, a complete iteration of an embodiment of the present disclosure is described below:

S101: respectively obtaining ciphertext data provided by the U data providers;

The data providers provide the sample data for model training. The data provided by different data providers may have completely different features, or the same or partially overlapping features. In practical applications, each data provider and the data miner can agree in advance on which features need to be uploaded for model training; the embodiments of the present specification do not limit this.
Assuming that there are a total of U data providers (where U ≧ 2), the data each provides is represented as follows:

Data provider 1: (x_1^1, x_2^1, x_3^1, …), noted as (x_1, x_2, x_3, …)^1 → X_1

Data provider 2: (x_1^2, x_2^2, x_3^2, …), noted as (x_1, x_2, x_3, …)^2 → X_2

……

Data provider U: (x_1^U, x_2^U, x_3^U, …), noted as (x_1, x_2, x_3, …)^U → X_U

Here x_1, x_2, x_3, … respectively represent different feature values of a piece of data, and the superscripts 1, 2, … U identify the respective data providers. For convenience of description, the notation may be uniformly written with a single superscript applied to the whole tuple. It will be appreciated that the features provided by the various data providers may have the same or different meanings, and the number of features uploaded by each data provider may vary. However, from the data obtained from the multiple data providers, multiple groups of data describing the same object can be extracted to form global training samples.
For example, suppose there are 3 data providers, each providing data samples with different features:

The features uploaded by data provider 1 include: age, occupation;

The features uploaded by data provider 2 include: sex, height, weight;

The features uploaded by data provider 3 include: sex, blood pressure, heart rate;

If the feature data of any person i can be obtained from all 3 data providers, the data provided by the 3 providers can be combined into a large number of training samples, so that a model with 7 features (age, occupation, sex, height, weight, blood pressure, heart rate) can be jointly trained, as illustrated by the sketch below.
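As a toy illustration of how such a 7-feature global sample could be assembled (the record contents and the person identifier are invented for this sketch):

```python
# Hypothetical per-provider records, keyed by a shared person identifier.
provider1 = {"p001": {"age": 35, "occupation": "teacher"}}
provider2 = {"p001": {"sex": "F", "height": 165, "weight": 58}}
provider3 = {"p001": {"sex": "F", "blood_pressure": 118, "heart_rate": 72}}

def build_global_samples(*providers):
    """Join feature dicts on person id; duplicated features (sex) collapse."""
    shared_ids = set.intersection(*(set(p) for p in providers))
    samples = {}
    for pid in shared_ids:
        merged = {}
        for p in providers:
            merged.update(p[pid])
        samples[pid] = merged
    return samples

# One joined sample with the 7 features: age, occupation, sex,
# height, weight, blood_pressure, heart_rate.
print(build_global_samples(provider1, provider2, provider3))
```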
In order to ensure data security, each data provider encrypts its data and provides it to the data miner in ciphertext form. It can be considered that the encryption logic used by each data provider is known only to itself, so the encrypted data can be securely stored or transmitted in an untrusted environment.
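This description does not prescribe a particular cipher; as one hedged sketch, a provider could protect each record with an authenticated cipher such as AES-GCM (via the third-party `cryptography` package), so that only the key holder, i.e. the provider and its trusted execution environment, can recover the plaintext:

```python
import json
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_record(key: bytes, record: dict) -> bytes:
    """Run by the data provider before handing data to the miner."""
    nonce = os.urandom(12)                    # unique nonce per record
    plaintext = json.dumps(record).encode()
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

def decrypt_record(key: bytes, blob: bytes) -> dict:
    """To be executed only inside the provider's trusted execution environment."""
    nonce, ciphertext = blob[:12], blob[12:]
    return json.loads(AESGCM(key).decrypt(nonce, ciphertext, None))

key = AESGCM.generate_key(bit_length=128)     # known only to the provider
blob = encrypt_record(key, {"x1": 35.0, "x2": 1.0, "x3": 0.6})
assert decrypt_record(key, blob) == {"x1": 35.0, "x2": 1.0, "x3": 0.6}
```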
In specific implementation, the data miner can request data from a data provider via network transmission, with the data provider encrypting the data and then sending the ciphertext to the data miner over the network; in another embodiment, the ciphertext data may also be stored in a storage device of the data miner in advance, so that the data miner reads the ciphertext directly from local storage.

S102: respectively inputting the ciphertext data of each data provider into the trusted execution environment of that data provider;

The data miner creates a trusted execution environment E_1, E_2, … E_U for each data provider 1, 2, … U respectively, to ensure that data-security-related operations on the data provided by any data provider u (u = 1, 2, … U) can only be performed in its corresponding E_u; outside E_u, these operations can neither be perceived nor influenced.
Under different implementation schemes, specific ways of creating the trusted execution environment are different, and the embodiments of the present specification do not limit the specific ways of creating the trusted execution environment. In addition, the operation of creating the trusted execution environment by the data mining party can be executed after ciphertext data of the data providing party is obtained for the first time, or can be executed in advance.
Each trusted execution environment provides an input interface and an output interface to the outside; one function of the input interface is to receive ciphertext data from outside. After the data miner obtains the ciphertext data of each data provider u, it determines the corresponding E_u and inputs the ciphertext data into E_u through the ciphertext data input interface.

S103: obtaining the output values O_1, O_2, … O_U of the trusted execution environments;

Inside each E_u, the input ciphertext data is first decrypted to obtain the plaintext data X_u; then, according to a preset algorithm and the internal local model parameters W_u, the corresponding output value O_u is calculated from X_u, and O_u is output to the outside of E_u through one of E_u's output interfaces. In this way, the data miner can respectively obtain the output values O_1, O_2, … O_U of the multiple trusted execution environments.

To fully describe the implementation of the system, this embodiment first describes the overall processing flow of data model training from the perspective of the data miner. Since the operations inside E_u are not visible to the outside, each E_u can be regarded as a black box in this embodiment; the specific implementation inside each E_u will be described in detail in later embodiments.

S104: calculating the deviation value Δ = Y - h[z(O_1, O_2, … O_U)] according to a given training target model;

The deviation value is an important quantity to be calculated in each model training iteration. Suppose a data sample i is of the form (X_i, y_i), wherein:

X_i = (x_i1, x_i2, …), where x_i1, x_i2, … are the feature values of data sample i;

y_i is the label value of data sample i.

Assuming the training target model is of the form y = h(X), then for the data sample (X_i, y_i), the prediction deviation value equals the difference between the label value y_i and the model predicted value h(X_i), namely:

Δ_i = h(X_i) - y_i or Δ_i = y_i - h(X_i)
The deviation value Δ_i plays two main roles in model training:

On one hand, it is used to evaluate the fitting effect of the model on the training sample set: for any data sample i, the smaller Δ_i is, the better the model fits; if there are n groups of data samples, the smaller the n values of Δ_i are as a whole, the better the fitting effect. In practical applications, generally speaking, Σ_i Δ_i is calculated to evaluate the fitting effect of the function on the training sample set as a whole.

On the other hand, it participates in the iterative update of the model parameters: assume there is a set of model parameters W = (w_1, w_2, …); the basic form of the iterative parameter update (various variations are possible in practical applications) is as follows:

W ← W - αΔX
in the whole model training process, model parameters are continuously updated in an iterative manner, so that the fitting effect of the model on the training sample set meets the training requirement (for example, the deviation value is small enough). The following is a brief description of the parameter update formula, and for the specific derivation process of the parameter update formula, reference may be made to the description of the prior art.
In the above update formula, "W" on the right side of the arrow indicates the parameter value before each update, and "W" on the left side of the arrow indicates the parameter value after each update, and it can be seen that the change amount of each update is the product of α, Δ, and X.
α represents the learning rate, also called the step size, which determines the update amplitude of the parameters in each iteration. If the learning rate is too small, the process of meeting the training requirement may be slow; if it is too large, the update may overshoot the minimum, that is, the model cannot approach a good fit as the updates proceed. As for how to select an appropriate learning rate, reference may be made to the prior art; in the embodiments of this specification, α is regarded as a preset value.
X represents the characteristic value of the data sample, and may represent different forms of the characteristic value according to the selected update formula, which will be further exemplified in the embodiments later in this specification.
In the embodiments of the present specification, the data miner can respectively obtain the output values O_1, O_2, … O_U of the multiple trusted execution environments. Let y = h(z) be the global training target model function, where z is a function of O_1, O_2, … O_U, denoted z(O_1, O_2, … O_U), i.e. a union function of the output values of the U data providers. Since O_1, O_2, … O_U are in turn functions of X_1, X_2, … X_U respectively, it can be seen that y = h(z) is also a function of X_1, X_2, … X_U.

Define Δ = Y - h[z(O_1, O_2, … O_U)] or Δ = h[z(O_1, O_2, … O_U)] - Y;

wherein h[z(O_1, O_2, … O_U)] is the model predicted value for z(O_1, O_2, … O_U); Y is the true value corresponding to z(O_1, O_2, … O_U), namely the global label value determined according to the data of each data provider; and the difference Δ between them is the deviation value.
The actual form of h(z) can be selected according to the training requirement, for example a linear regression model or a logistic regression model; the embodiments in this specification do not limit it.

In addition, for each group O_1, O_2, … O_U, the corresponding global label value Y may be determined in a variety of ways, as will be described in detail in later embodiments.

S105: respectively returning Δ to E_1, E_2, … E_U, so that E_1, E_2, … E_U respectively update the local model parameters W_1, W_2, … W_U. The updated parameters are used to calculate the output values O_u in the next iteration.
In the above, from the perspective of the data mining party, the overall processing flow of the data model training is introduced, and the processing logic inside the trusted execution environment is specifically described as follows:
As shown in FIG. 2, any trusted execution environment E_u implements 3 basic functions internally:
1) Data decryption:

Corresponding to the encryption logic used by data provider u, the matching decryption logic is stored in E_u, such as decryption algorithm information and key information. Based on this information, the input ciphertext data can be decrypted inside E_u to obtain the plaintext data feature values X_u = (x_1, x_2, …)^u.

The data decryption operation is performed after S102.
2) Calculating an output value:
Local model parameters W_u = (w_1, w_2, …)^u are stored in E_u; inside E_u, the output value O_u corresponding to X_u can be calculated according to the current local model parameters W_u. During the whole training process, W_u is continuously updated iteratively, and initialized parameter values are used in the first iterative calculation.

The specific way O_u is calculated depends on the global model y = h(z) = h[z(O_1, O_2, … O_U)]. For example, for both the linear regression and the logistic regression model, the global model can be expressed as y = h(z) = h(w_1x_1 + w_2x_2 + …); then the corresponding O_u and the union function z(O_1, O_2, … O_U) may take the form:

O_u = w_1^u x_1^u + w_2^u x_2^u + …, noted as (w_1x_1 + w_2x_2 + …)^u

z(O_1, O_2, … O_U) = O_1 + O_2 + … + O_U
In practice, the above O_u may also include a constant term parameter b_u, namely:

O_u = b_u + w_1^u x_1^u + w_2^u x_2^u + …

In fact, if we let b_u = w_0^u, understand w_0^u as the parameter corresponding to a feature x_0^u, and let x_0^u be constantly equal to 1, the expression of O_u can be written as:

O_u = w_0^u x_0^u + w_1^u x_1^u + w_2^u x_2^u + …

It can be seen that the form of the overall expression is uniform regardless of whether the constant term parameter is present; therefore the expression O_u = w_1^u x_1^u + w_2^u x_2^u + … should be understood to cover both the case with a constant term parameter and the case without one. In practical applications, for any u, the model parameters may or may not include the constant term parameter.
Of course, the above forms of O_u and of the union function z(O_1, O_2, … O_U) are merely illustrative and should not be construed as limiting the embodiments of the present invention.
The output value calculation operation is performed after the data decryption operation, and after the output value is calculated, S104 is continuously performed.
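A small sketch of this output-value calculation, with the constant term folded in as w_0 via a constant feature x_0 = 1 as described above (the parameter and feature values are invented):

```python
def output_value(w, x):
    """O_u = w_0*x_0 + w_1*x_1 + ... with x_0 = 1 folding in the bias b_u."""
    x = [1.0] + list(x)              # prepend the constant feature x_0
    assert len(w) == len(x)
    return sum(wj * xj for wj, xj in zip(w, x))

# W_u = (b_u, w_1, w_2) = (0.5, 2.0, -1.0) and X_u = (3.0, 4.0):
print(output_value([0.5, 2.0, -1.0], [3.0, 4.0]))   # 0.5 + 6.0 - 4.0 = 2.5
```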
3) Updating parameters:
Current local model parameters W_u are stored inside each E_u. After the deviation value Δ returned from outside E_u is received, W_u is updated according to a parameter update formula of the form W_u ← W_u - αΔX_u (initialized parameter values are used before the first update). Of course, the parameter update formula actually used is not limited to the above form. For example:

If 1 piece of data i is read from the data source and input to E_u each time as the training sample, the parameter update formula is: W_u ← W_u - αΔ_i X_i^u;

If multiple pieces of data are read from the data source and input to E_u each time as training samples, and the gradient descent method (gradient descent) is used for parameter updating, the parameter update formula is:

W_u ← W_u - α Σ_i Δ_i X_i^u

that is, all training samples participate in the update operation;

If multiple pieces of data are read from the data source and input to E_u each time as training samples, and the stochastic gradient descent method (stochastic gradient descent) is used for parameter updating, the parameter update formula is: W_u ← W_u - αΔ_i X_i^u, where i is an arbitrary value, that is, a training sample is randomly selected to participate in the update operation;

The above update algorithms are for illustration only and should not be construed as limiting the scheme. For example, to reduce the over-fitting phenomenon, a regularization correction term may be added to the update formula. Other update algorithms are also available; this specification does not enumerate them exhaustively. A brief code sketch of these variants is given after this subsection.
The parameter updating operation is executed after the above step S105. After the parameters are updated, one iteration is complete, and the updated parameters are used to calculate the output value O_u in the next iteration.
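The update variants above admit the following sketch (assumptions: `samples` is a list of (X_i, Δ_i) pairs already decrypted inside E_u, the features include the folded-in constant term, and `lam` is a hypothetical regularization weight):

```python
import random

def update_single(w, x_i, delta_i, alpha):
    """W_u <- W_u - alpha * Δ_i * X_i^u: one sample per update."""
    return [wj - alpha * delta_i * xj for wj, xj in zip(w, x_i)]

def update_batch(w, samples, alpha):
    """Gradient descent: all N samples participate in each update."""
    grad = [sum(d * x[j] for x, d in samples) for j in range(len(w))]
    return [wj - alpha * gj for wj, gj in zip(w, grad)]

def update_sgd(w, samples, alpha):
    """Stochastic gradient descent: one randomly chosen sample per update."""
    x_i, delta_i = random.choice(samples)
    return update_single(w, x_i, delta_i, alpha)

def update_batch_l2(w, samples, alpha, lam):
    """Batch update with an added L2 regularization correction term."""
    grad = [sum(d * x[j] for x, d in samples) + lam * w[j]
            for j in range(len(w))]
    return [wj - alpha * gj for wj, gj in zip(w, grad)]
```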
A complete iteration process has been introduced above. Iteration proceeds through the above steps until the model training requirement is met, which may be, for example: the deviation value Δ of the global model is small enough; the difference between the Δ values of two adjacent iterations is small enough; the difference between the O_u values calculated in two successive iterations inside E_u is small enough; or a preset number of iterations is reached; and so on. Of course, an additional validation set may also be used for verification; this specification does not limit the specific model training requirement.
It can be seen that, by applying the above scheme, the operations related to private data in the model training process (such as data decryption and local model parameter updating) are all encapsulated in the trusted execution environments of the data providers for execution. That is to say, the data plaintext cannot be acquired outside the trusted execution environment, and in some specific embodiments even the specific local model parameters cannot be acquired outside it, so the data security of the shared-data providers is effectively guaranteed.
The model training scheme based on shared data provided by the embodiments of the present specification has been introduced above as a whole. In terms of detailed design, combined with practical application requirements, there are several alternative implementations, for example:

In S101 to S102, either one piece of data or multiple pieces of data may be read into the trusted execution environment at a time. In each iteration, N pieces of ciphertext data are respectively obtained from each data provider, where N may be a preset value not less than 1, and the training samples are varied by obtaining data with different contents each time.
In addition, it can be understood that the acquisition of the training sample data may be performed successively along with the iterative process, or may be performed once. For example, if the number of data required for each iteration is N, N pieces of data may be obtained in each iteration and decrypted in the trusted execution environment; or the data with the quantity larger than N (for example, the full quantity of data, or a multiple of N, etc.) is acquired at one time and then input into the trusted execution environment, and N pieces of data are decrypted at the trusted execution environment as required each time; the method can also be implemented by inputting the data with the quantity larger than N (such as full quantity data, or multiple of N, etc.) into the trusted execution environment at one time, and decrypting the input data at the trusted execution environment at one time; and so on.
It can be seen that the steps that must be executed in every iteration are S103 to S105 together with the "output value calculation" and "parameter update" steps inside the trusted execution environment, while S101 and S102 and the "data decryption" step inside the trusted execution environment do not have to be executed in every iteration. In a word, the manner of acquiring sample data can be set flexibly according to the actual situation, and it does not affect the implementation of the whole scheme.
The association of data among multiple data providers can be realized through some common identification features, such as identification numbers, which ensure that the data obtained from multiple data providers describe the same person. The identification feature does not need to participate in model training, and its security can be improved by hashing or similar means, as in the sketch below.
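For instance, a join key could be derived by hashing the identification number, so that records can be aligned without exchanging the raw identifier (a sketch only; in practice a salted or keyed hash agreed among the providers would be needed to resist dictionary attacks):

```python
import hashlib

def join_key(id_number: str, salt: bytes = b"agreed-shared-salt") -> str:
    """Derive an alignment key; the raw ID number itself is never shared."""
    return hashlib.sha256(salt + id_number.encode()).hexdigest()

# Two providers compute identical keys for the same person, so their
# records can be associated without revealing the identification number.
assert join_key("110101199001011234") == join_key("110101199001011234")
```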
Each E_u is created based on information provided by the corresponding data provider itself. The overall design of each E_u should meet the basic design criteria, but the specific implementations need not be exactly the same; for example, different data decryption algorithms, different parameter update algorithms, and so on may be employed.
For each group O_1, O_2, … O_U, the global label value Y may be determined in a number of ways, for example:

1) a label value Y_u provided by a certain data provider u is taken as Y;

2) Y is determined from the label values Y_u1, Y_u2, … provided by a plurality of data providers u1, u2, …, for example by calculating a weighted average, a logical AND, a logical OR, and so on;

3) Y is determined through channels other than the data providers.

For example, suppose a prediction model is to be established for the prevalence of a disease known to be associated with 5 features (age, occupation, sex, height, weight), and:

Institution 1 can provide the feature data: age, occupation;

Institution 2 can provide the feature data: sex, height, weight.

The prediction model is assumed to be a binary classification model, i.e. the model output value is either "diseased" or "not diseased" (the corresponding prediction results may be presented as "high risk" and "low risk"). Along with the feature data it provides, each institution may or may not be able to provide a label value, i.e. the "diseased or not" result. Accordingly, there may be various strategies for determining the global label value, such as:

The global label value follows the label value provided by a certain institution; in practice this may be because that institution is more authoritative, or because the other institution cannot provide a label value.

The global label value is determined jointly from the label values provided by the two institutions, for example: if the label value provided by at least one institution is "diseased", the global label value is determined to be "diseased".

In addition, in some cases the data miner may directly know the "diseased or not" result of a group of users from other channels, and the requirement is to further mine the relationship between this result and other features; the "other features" can be obtained from the data providers, and the previously known result can be directly used as the global label value.
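The label-determination strategies above could be sketched as follows (the function names, weights, and the 0/1 encoding of "diseased" are illustrative assumptions):

```python
def label_from_authority(labels: dict, authority: str) -> int:
    """Strategy 1: the global label follows one designated provider."""
    return labels[authority]

def label_logical_or(labels: dict) -> int:
    """Strategy 2, logical OR: 'diseased' (1) if any provider reports 1."""
    return int(any(labels.values()))

def label_weighted(labels: dict, weights: dict, threshold: float = 0.5) -> int:
    """Strategy 2, weighted average of provider labels against a threshold."""
    score = (sum(weights[u] * y for u, y in labels.items())
             / sum(weights.values()))
    return int(score >= threshold)

labels = {"institution1": 1, "institution2": 0}
print(label_logical_or(labels))                                            # 1
print(label_weighted(labels, {"institution1": 0.7, "institution2": 0.3}))  # 1
```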
After training is finished, each E_u can output the parameters obtained in the last update to the data miner, so that the data miner maintains a complete data model. Alternatively, the parameters can be kept stored in distributed fashion in the respective E_u, which further improves security.
If the scheme in which each E_u keeps its parameters inside E_u is adopted, then in the model using stage each data provider uploads ciphertext data to its E_u, E_u decrypts the ciphertext data and calculates an output value, and the data miner finally calculates the output result of the global model according to the outputs of the E_u. FIG. 3 illustrates a data prediction method based on shared data modeling, which may include the following steps:

S201: respectively obtaining ciphertext data provided by U data providers, where U ≧ 2;

S202: respectively inputting the ciphertext data of each data provider into the corresponding trusted execution environment E_1, E_2, … E_U;

S203: obtaining the output values O_1, O_2, … O_U of the trusted execution environments;

S204: calculating the predicted value Y = h[z(O_1, O_2, … O_U)] according to the pre-trained prediction model;
Comparing FIG. 2 and FIG. 3, it can be seen that the model using stage still adopts a system architecture similar to that of the model training stage, except that no iterative parameter update is required; the predicted result value y is output directly from the input data.
Accordingly, for any trusted execution environment E_u, the local model parameters W_u are stored in advance, and in the model using stage it implements 2 basic functions:

1) decrypting the input ciphertext data to obtain the plaintext data feature values X_u;

2) calculating the output value O_u corresponding to X_u according to the current local model parameters W_u.
In the model using phase, specific implementation of each step may refer to a corresponding step in the model training phase, and a description of this embodiment is not repeated.
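Reusing the hypothetical Enclave, z, and h from the earlier training sketch, the prediction stage S201 to S204 reduces to a loop with no parameter update:

```python
def predict(enclaves, ciphertexts):
    """S201-S204: ciphertext in, O_u out of each E_u, then Y = h[z(...)]."""
    for e, c in zip(enclaves, ciphertexts):
        e.load_ciphertext(c)                        # S202: decrypt inside E_u
    outputs = [e.output_value() for e in enclaves]  # S203: collect O_1 ... O_U
    return h(z(outputs))                            # S204: predicted value
```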
In the above embodiments, an implementation scheme in which 2 or more data providers jointly provide data for joint model training has been introduced. It can be understood that other adaptations can be made on this basis to meet the application requirements of some specific scenarios, for example:

When only 1 data provider provides data to the data miner and has a confidentiality requirement on the data, model training can be realized with the following scheme:

The data miner performs iterative training using the following steps until the model training requirement is met:

S101′: obtaining ciphertext data provided by the 1 data provider;

S102′: inputting the ciphertext data of the data provider into the trusted execution environment E of that data provider;

S103′: obtaining the output value O of the trusted execution environment E;

S104′: calculating the deviation value Δ between the model predicted value and the true value according to a given training target model;

S105′: returning the deviation value Δ to the trusted execution environment E, so that the trusted execution environment updates the model parameters.

This embodiment can be applied to scenarios in which a certain data provider entrusts a data miner to perform data mining but does not want to reveal data details to the data miner.
Compared with S101 to S105, S101′ to S105′ are the case where the U data providers are simplified to 1 data provider; the other implementation details are basically the same and will not be repeated in this embodiment. Inside the trusted execution environment E, the three functions of data decryption, output value calculation, and parameter update are still implemented.

When there are multiple data providers providing data to the data miner and some of them have no confidentiality requirement for their data, model training can be implemented with the following scheme:

The data miner performs iterative training using the following steps until the model training requirement is met:

S101″: respectively obtaining data provided by U data providers (where U ≧ 2), wherein the data provided by at least 1 data provider is ciphertext data and the data provided by the other data providers is plaintext data;

S102″: if the data provided by a data provider u is in ciphertext form, inputting the ciphertext data into the trusted execution environment E_u of that provider; here u designates a data provider with data confidentiality requirements;

S103″: obtaining the output value O_u of each trusted execution environment;

S104″: calculating the deviation value Δ between the model predicted value and the true value by using the output values O_u of the trusted execution environments E_u and the plaintext data provided by the other data providers;

The difference between this step and S104 is: for a provider supplying plaintext data, the data miner can directly obtain the corresponding plaintext data to participate in the global calculation, without going through a trusted execution environment.

S105″: respectively returning the deviation value to each trusted execution environment, so that each trusted execution environment respectively updates its local model parameters;

The difference between this step and S105 is: for providers supplying plaintext data, the data miner can directly be responsible for maintaining and updating the corresponding local model parameters.
In contrast to S101 to S105, S101″ to S105″ divide the U data providers into two categories: for a data provider without data confidentiality requirements, the data miner can directly acquire the plaintext data it provides for model training; for a data provider with data confidentiality requirements, the ciphertext data it provides still needs to be processed through a trusted execution environment, inside which the three functions of data decryption, output value calculation, and parameter update are implemented.
The scheme of this embodiment is suitable for scenarios in which some of the data features required by the global model have no confidentiality requirement. Of course, from the data privacy perspective, "no confidentiality requirement" here is generally not meant in an absolute sense, but rather means no confidentiality requirement with respect to the data miner. For example, if a data provider has a deep cooperative relationship with the data miner, or if the data miner itself owns data that can participate in the global model training (the data miner can then be considered one of the data providers), such data without confidentiality requirements can be used directly by the data miner without going through a trusted execution environment.
The scheme of the embodiments of the present specification is described below with reference to a specific example.

Assume the overall training requirement is: establish a model for predicting whether a user has the ability to repay a large loan on schedule, according to the user property data provided by two banking institutions.

Bank 1 can provide the data features x_1, x_2, x_3;

Bank 2 can provide the data features x_4, x_5.
The global modeling uses a logistic regression model, of the functional form:

y = h(z) = 1 / (1 + e^(-z))   (1)

wherein:

z = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5   (2)

w_1, w_2, w_3 are the local parameters of bank 1, and w_4, w_5 are the local parameters of bank 2.

Define:

sum1 = w_1x_1 + w_2x_2 + w_3x_3   (3)

sum2 = w_4x_4 + w_5x_5   (4)

Then, according to equations (1) to (4), the deviation value calculation function of the global model can be obtained:

Δ = Y - 1 / (1 + e^(-(sum1 + sum2)))   (5)
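As a quick numeric check of equation (5), with invented values sum1 = 1.1, sum2 = -0.3 and global label Y = 1:

```python
import math

sum1, sum2, Y = 1.1, -0.3, 1                           # invented illustration values
prediction = 1.0 / (1.0 + math.exp(-(sum1 + sum2)))    # h(z), equations (1)-(2)
delta = Y - prediction                                 # equation (5)
print(round(prediction, 3), round(delta, 3))           # 0.69 0.31
```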
the trusted execution environment is realized by adopting the SGX technology of Intel, the created trusted execution environment is called enclave, specifically, the method is to encapsulate the security operation of legal software in one enclave, and once the software and data are located in the enclave, even an operating system or a VMM (Hypervisor) cannot influence the code and data in the enclave. The security boundary of enclave contains only the CPU and itself.
The overall architecture of the system implementing model training is shown in FIG. 4a; the following describes the implementation from the perspectives of the data providers and the data miner:
1) The data provider:

Each bank encrypts the data it needs to provide to the data miner, and the encrypted data can be stored in a hard disk of the data provider. Of course, depending on the actual application requirements, some parts of the data may also be provided in plaintext.

Each bank respectively provides an enclave definition file (.edl) and a corresponding dynamic link library (.dll or .so); the generated enclave includes the following functions or interfaces:
1.1) Decrypt the ciphertext data that is input from outside the enclave and was encrypted in advance by the bank, obtaining the plaintext data. N pieces of ciphertext data are input in each iteration, each piece corresponding to one user; for any user i, the plaintext data of bank 1 is x_i1, x_i2, x_i3, and the plaintext data of bank 2 is x_i4, x_i5.

1.2) According to the current local parameter values, respectively calculate the output value of each piece of data and output it to the outside of the enclave. For any user i, the output value of bank 1 is sum1_i and the output value of bank 2 is sum2_i.
1.3) According to the Δ_i returned from outside the enclave, update the local parameters using the gradient descent method; all N pieces of data participate in the operation in each iteration, and the update formula is:

W ← W - α Σ_i Δ_i X_i   (6)

namely:

w_1 ← w_1 - α Σ_i Δ_i x_i1

w_2 ← w_2 - α Σ_i Δ_i x_i2

w_3 ← w_3 - α Σ_i Δ_i x_i3

w_4 ← w_4 - α Σ_i Δ_i x_i4

w_5 ← w_5 - α Σ_i Δ_i x_i5

wherein

Δ_i = Y_i - 1 / (1 + e^(-(sum1_i + sum2_i)))   (7)
α is a preset learning rate; the α adopted by bank 1 and by bank 2 can be the same or different.
2) The data miner:

The data miner first unifies a global label value Y, which represents whether a user who has already taken out a large loan can repay it on schedule. This information may be obtained from the two banks or from other lending institutions.
The data miner loads the enclave information provided by the two banks, creates enclave1 and enclave2 respectively, and builds a model training application based on enclave1 and enclave2. The running mechanism of the application is as follows:
2.1) In each iteration, read a batch of ciphertext data from the hard disk; assume the number of pieces read each time is N. The data of the two banks can be read in association through the identification card number. The ciphertext data of bank 1 is input into enclave1, and the ciphertext data of bank 2 is input into enclave2.
2.2) Decrypt the ciphertext data in enclave1 and enclave2 respectively; according to the current local parameters (using the initial parameter values in the first iteration), calculate sum1_i and sum2_i using equations (3) and (4) respectively, and output them to the outside.

2.3) According to the sum1_i and sum2_i output by enclave1 and enclave2, calculate Δ_i from equation (7), and return Δ_i to enclave1 and enclave2 respectively;
2.4) within enclave1 and enclave2, the parameters are updated using equation (6), respectively.
Repeat the iteration until the model training condition is met, obtaining the final parameter values w_1, w_2, w_3, w_4, w_5; substituting these values into equations (1) and (2) yields the model to be trained.
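Steps 2.1) to 2.4) could be driven by a loop of the following shape (a sketch only: `read_batch` and the enclave objects are hypothetical stand-ins, with `output_values` returning the per-user sum1_i or sum2_i and `update` applying equation (6) inside the enclave):

```python
import math

def train(enclave1, enclave2, n_iterations, batch_size):
    for _ in range(n_iterations):
        # 2.1) read N associated ciphertext records (joined via ID number)
        batch1, batch2, labels = read_batch(batch_size)  # hypothetical I/O
        enclave1.load_ciphertext(batch1)
        enclave2.load_ciphertext(batch2)
        # 2.2) decrypt inside each enclave; compute sum1_i / sum2_i, eqs (3)-(4)
        sum1 = enclave1.output_values()
        sum2 = enclave2.output_values()
        # 2.3) Δ_i = Y_i - 1 / (1 + e^-(sum1_i + sum2_i)), equation (7)
        deltas = [y - 1.0 / (1.0 + math.exp(-(s1 + s2)))
                  for y, s1, s2 in zip(labels, sum1, sum2)]
        # 2.4) each enclave applies W <- W - α Σ_i Δ_i X_i, equation (6)
        enclave1.update(deltas)
        enclave2.update(deltas)
```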
FIG. 4b shows another overall architecture of a system implementing model training. The corresponding overall training requirement is: the data miner itself owns some user property data, and a joint model for predicting whether a user has the ability to repay a large loan needs to be established according to the property data provided by bank 1 together with the data the miner owns, wherein:

Bank 1 can provide the data features x_1, x_2, x_3; the corresponding local parameters are w_1, w_2, w_3.

The data miner owns the data features x_4, x_5; the corresponding local parameters are w_4, w_5.

Compared with the previous embodiment, the overall model training idea is basically the same; the difference is that an enclave is created only for bank 1. For the features x_4, x_5, the data miner can directly read the plaintext data to participate in the model training calculation.
Corresponding to the above method embodiment, an embodiment of the present specification further provides a shared data-based model training apparatus, and as shown in fig. 5, the apparatus may include the following modules for implementing iterative training:
a data obtaining module 110, configured to obtain ciphertext data provided by at least 1 data provider;
the data input module 120 is configured to respectively input the ciphertext data of each data provider to a trusted execution environment of the data provider;
an output value obtaining module 130, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the deviation value calculation module 140 is used for calculating a deviation value between the model predicted value and the real value according to a given training target model; the model predicted value is determined according to the output value of each trusted execution environment, and the real value is a global tag value determined according to the data of each data provider;
an offset value returning module 150, configured to return the offset value to each trusted execution environment respectively, so that each trusted execution environment updates the local model parameter respectively;
wherein, arbitrary trusted execution environment includes internally:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and the parameter updating submodule is used for updating the local model parameters according to the returned deviation value.
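The three submodules can be pictured together as a single in-enclave object, as in the illustrative grouping below; the class layout, names, and the injected `decrypt_fn` are hypothetical, and the code merely restates the decrypt / weighted-sum / parameter-update behavior just described.

```python
import numpy as np

class EnclaveModules:
    """Illustrative grouping of the three submodules; names are hypothetical."""

    def __init__(self, W, alpha, decrypt_fn):
        self.W = np.asarray(W, dtype=float)   # current local model parameters
        self.alpha = alpha                    # preset learning rate
        self.decrypt_fn = decrypt_fn          # decryption available only in-enclave
        self._X = None

    def decryption_submodule(self, cipher_batch):
        # decrypt the input ciphertext data into plaintext feature values
        self._X = np.array([self.decrypt_fn(c) for c in cipher_batch])

    def output_value_submodule(self):
        # output value of each record under the current local model parameters
        return self._X @ self.W

    def parameter_update_submodule(self, deltas):
        # update the local model parameters from the returned deviation values
        self.W = self.W - self.alpha * (deltas @ self._X)
```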
In one embodiment provided in this specification, when there are a plurality of data providers providing data to the data miner and some of those data providers have no secrecy requirement for their data, the module functions of the above apparatus may be configured as follows:
the data obtaining module 110 is configured to obtain data provided by multiple data providers, where at least 1 data provider provides data in a ciphertext data form, and other data providers provide data in a plaintext data form;
the data input module 120 is configured to, when the data provided by a data provider is in ciphertext form, input the ciphertext data into the corresponding trusted execution environment of that data provider;
an output value obtaining module 130, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the deviation value calculation module 140 is configured to calculate a deviation value between the model predicted value and the real value by using the output value of each trusted execution environment and plaintext data provided by another data provider; the model prediction value is determined according to the output value of each trusted execution environment and the characteristic value of plaintext data; the real value is a global label value determined according to the data of each data provider;
an offset value returning module 150, configured to return the offset value to each trusted execution environment respectively, so that each trusted execution environment updates the local model parameter respectively;
wherein, arbitrary trusted execution environment includes internally:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and the parameter updating submodule is used for updating the local model parameters according to the returned deviation value.
Referring to fig. 6, an embodiment of the present specification further provides a data prediction apparatus based on shared data modeling, where the apparatus may include:
a data obtaining module 210, configured to obtain ciphertext data provided by at least 1 data provider;
the data input module 220 is configured to respectively input the ciphertext data of each data provider to a trusted execution environment of the data provider;
an output value obtaining module 230, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the predicted value calculation module 240 is configured to input the output value of each trusted execution environment into a pre-trained prediction model, and calculate to obtain a predicted value;
wherein the inside of any trusted execution environment E_u includes:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
and the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter.
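The prediction flow of this apparatus can be sketched as below, reusing the hypothetical `sigmoid` and enclave wrappers from the training sketches; treating the pre-trained prediction model as the logistic model of the embodiments is an assumption.

```python
def predict(enclaves, cipher_batches):
    """Prediction with a trained model; enclave wrappers as in the sketches
    above, and the logistic form of the prediction model is assumed."""
    # each enclave decrypts its own ciphertext and returns only its weighted
    # sum; the data miner merely combines the outputs
    total = sum(e.compute(c) for e, c in zip(enclaves, cipher_batches))
    return sigmoid(total)   # predicted value, e.g. repayment probability
```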
Embodiments of the present specification further provide a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor can implement the aforementioned model training method or data prediction method when executing the program.
Fig. 7 is a more specific hardware structure diagram of a computing device provided in an embodiment of the present specification, where the device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Embodiments of the present specification also provide a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the aforementioned model training method or data prediction method:
computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in the present specification are described in a progressive manner; for the same and similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus embodiment is described relatively simply since it is substantially similar to the method embodiment, and reference may be made to the relevant descriptions of the method embodiment. The above-described apparatus embodiments are merely illustrative; the modules described as separate components may or may not be physically separate, and when implementing the embodiments of the present disclosure, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Part or all of the modules can also be selected according to actual needs to achieve the purpose of the embodiment's scheme. Those of ordinary skill in the art can understand and implement the embodiments without inventive effort.
The foregoing describes only specific embodiments of the present specification. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the embodiments of the present disclosure, and such modifications and refinements should also be regarded as falling within the protection scope of the embodiments of the present disclosure.

Claims (42)

1. A method of model training based on shared data, the method comprising:
performing iterative training by using the following steps until the model training requirement is met:
obtaining ciphertext data provided by a plurality of data providers;
respectively inputting the ciphertext data of each data provider into a trusted execution environment of the data provider;
obtaining an output value of each trusted execution environment, wherein the output value is obtained by calculation according to the ciphertext data;
calculating a deviation value between a model predicted value and a true value according to a given training target model; the model predicted value is determined according to the output value of each trusted execution environment, and the real value is a global tag value determined according to the data of each data provider;
respectively returning the deviation values to the trusted execution environments, so that the trusted execution environments respectively update the local model parameters;
wherein the following steps are performed inside any trusted execution environment:
decrypting the input ciphertext data to obtain a plaintext data characteristic value;
calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and updating the local model parameters according to the returned deviation value.
2. The method according to claim 1, wherein the calculating an output value corresponding to the plaintext data feature value according to the current local model parameter comprises:
according to the current local model parameters W_u = (w_1, w_2, …)_u, calculating a weighted sum O_u = (w_1x_1 + w_2x_2 + …)_u of the plaintext data feature values, wherein u represents the identification of the data provider and x represents a data feature value.
3. The method of claim 1, the training target model being a logistic regression model.
4. The method of claim 1, the trusted execution environment being: enclave created using software protection extension SGX technology.
5. The method of claim 1, said updating local model parameters based on returned bias values, comprising:
updating local model parameters by using a gradient descent method according to the returned deviation value; or
And updating local model parameters by using a random gradient descent method according to the returned deviation value.
6. The method of claim 1, wherein the global tag value is determined from a tag value provided by one data provider or jointly determined from tag values provided by a plurality of data providers.
7. A method of model training based on shared data, the method comprising:
respectively inputting ciphertext data of a data provider to a trusted execution environment of the data provider, wherein the number of the data providers is multiple; in each trusted execution environment, respectively decrypting the input ciphertext data to obtain each plaintext data characteristic value;
performing iterative training by using the following steps until the model training requirement is met:
in each trusted execution environment, calculating an output value corresponding to a plaintext data characteristic value according to a current local model parameter;
calculating a deviation value between a model predicted value and a true value according to a given training target model; the model predicted value is determined according to the output value of each trusted execution environment, and the real value is a global tag value determined according to the data of each data provider;
respectively returning the deviation values to the trusted execution environments;
and in each trusted execution environment, updating local model parameters according to the returned deviation value.
8. The method according to claim 7, wherein calculating the output value corresponding to the plaintext data feature value according to the current local model parameter comprises:
according to the current local model parameters W_u = (w_1, w_2, …)_u, calculating a weighted sum O_u = (w_1x_1 + w_2x_2 + …)_u of the plaintext data feature values, wherein u represents the identification of the data provider and x represents a data feature value.
9. The method of claim 7, the training target model being a logistic regression model.
10. The method of claim 7, the trusted execution environment being: enclave created using software protection extension SGX technology.
11. The method of claim 7, wherein updating local model parameters based on the returned bias values comprises:
updating local model parameters by using a gradient descent method according to the returned deviation value; or
And updating local model parameters by using a random gradient descent method according to the returned deviation value.
12. The method of claim 7, wherein the global tag value is determined from a tag value provided by one data provider or jointly determined from tag values provided by a plurality of data providers.
13. A method of model training based on shared data, the method comprising:
performing iterative training by using the following steps until the model training requirement is met:
respectively obtaining data provided by a plurality of data providers, wherein the data provided by at least 1 data provider is ciphertext data, and the data provided by other data providers is plaintext data;
if the data form provided by the data provider is ciphertext data, inputting the ciphertext data into a trusted execution environment of the data provider correspondingly;
obtaining an output value of each trusted execution environment, wherein the output value is obtained by calculation according to the ciphertext data;
calculating deviation values of the model predicted values and the real values by utilizing the output values of the trusted execution environments and plaintext data provided by other data providers; the model prediction value is determined according to the output value of each trusted execution environment and the characteristic value of plaintext data; the real value is a global label value determined according to the data of each data provider;
respectively returning the deviation values to the trusted execution environments, so that the trusted execution environments respectively update the local model parameters;
wherein the following steps are performed inside any trusted execution environment:
decrypting the input ciphertext data to obtain a plaintext data characteristic value;
calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and updating the local model parameters according to the returned deviation value.
14. The method according to claim 13, wherein calculating the output value corresponding to the plaintext data feature value according to the current local model parameter comprises:
according to the current local model parameters W_u = (w_1, w_2, …)_u, calculating a weighted sum O_u = (w_1x_1 + w_2x_2 + …)_u of the plaintext data feature values, wherein u represents the identification of the data provider and x represents a data feature value.
15. The method of claim 13, the training target model is a logistic regression model.
16. The method of claim 13, the trusted execution environment being: enclave created using software protection extension SGX technology.
17. The method of claim 13, wherein updating local model parameters based on the returned bias values comprises:
updating local model parameters by using a gradient descent method according to the returned deviation value; or
And updating local model parameters by using a random gradient descent method according to the returned deviation value.
18. The method of claim 13, wherein the global tag value is determined from a tag value provided by one data provider or jointly determined from tag values provided by a plurality of data providers.
19. A method of data prediction based on shared data modeling, the method comprising:
obtaining ciphertext data provided by a plurality of data providers;
respectively inputting the ciphertext data of each data provider into a trusted execution environment of the data provider;
obtaining an output value of each trusted execution environment, wherein the output value is obtained by calculation according to the ciphertext data;
inputting the output value of each trusted execution environment into a pre-trained prediction model, and calculating to obtain a predicted value;
wherein the following steps are performed inside any trusted execution environment:
decrypting the input ciphertext data to obtain a plaintext data characteristic value;
and calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter.
20. The method of claim 19, wherein calculating the output value corresponding to the plaintext data feature value according to the current local model parameter comprises:
according to the current local model parameters W_u = (w_1, w_2, …)_u, calculating a weighted sum O_u = (w_1x_1 + w_2x_2 + …)_u of the plaintext data feature values, wherein u represents the identification of the data provider and x represents a data feature value.
21. The method of claim 19, the training target model is a logistic regression model.
22. The method of claim 19, the trusted execution environment being: enclave created using software protection extension SGX technology.
23. A shared data based model training apparatus, the apparatus comprising the following means for performing iterative training:
the data acquisition module is used for acquiring ciphertext data provided by data providers, and the number of the data providers is multiple;
the data input module is used for correspondingly inputting the ciphertext data of each data provider into the trusted execution environment of the data provider;
an output value obtaining module, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the deviation value calculation module is used for calculating the deviation value between the model predicted value and the actual value according to a given training target model; the model predicted value is determined according to the output value of each trusted execution environment, and the real value is a global tag value determined according to the data of each data provider;
the deviation value returning module is used for returning the deviation values to the trusted execution environments respectively so that the trusted execution environments update the local model parameters respectively;
wherein, arbitrary trusted execution environment includes internally:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and the parameter updating submodule is used for updating the local model parameters according to the returned deviation value.
24. The apparatus of claim 23, wherein the output value calculation module is specifically configured to:
according to the current local model parameters W_u = (w_1, w_2, …)_u, calculating a weighted sum O_u = (w_1x_1 + w_2x_2 + …)_u of the plaintext data feature values, wherein u represents the identification of the data provider and x represents a data feature value.
25. The apparatus of claim 23, the training target model is a logistic regression model.
26. The apparatus of claim 23, the trusted execution environment being: enclave created using software protection extension SGX technology.
27. The apparatus according to claim 23, wherein the parameter update sub-module is specifically configured to:
updating local model parameters by using a gradient descent method according to the returned deviation value; or
And updating local model parameters by using a random gradient descent method according to the returned deviation value.
28. The apparatus of claim 23, wherein the global tag value is determined from a tag value provided by one data provider or jointly determined from tag values provided by a plurality of data providers.
29. A shared data based model training apparatus, the apparatus comprising the following means for performing iterative training:
the data acquisition module is used for respectively acquiring data provided by a plurality of data providers, wherein the data provided by at least 1 data provider is in a ciphertext data form, and the data provided by other data providers is in a plaintext data form;
the data input module is used for correspondingly inputting the ciphertext data into a trusted execution environment of a data provider if the data provided by the data provider is ciphertext data;
an output value obtaining module, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the deviation value calculation module is used for calculating the deviation value between the model predicted value and the real value by utilizing the output value of each trusted execution environment and plaintext data provided by other data providers; the model prediction value is determined according to the output value of each trusted execution environment and the characteristic value of plaintext data; the real value is a global label value determined according to the data of each data provider;
the deviation value returning module is used for returning the deviation values to the trusted execution environments respectively so that the trusted execution environments update the local model parameters respectively;
wherein, arbitrary trusted execution environment includes internally:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter;
and the parameter updating submodule is used for updating the local model parameters according to the returned deviation value.
30. The apparatus of claim 29, wherein the output value calculation module is specifically configured to:
according to the current local model parameters W_u = (w_1, w_2, …)_u, calculating a weighted sum O_u = (w_1x_1 + w_2x_2 + …)_u of the plaintext data feature values, wherein u represents the identification of the data provider and x represents a data feature value.
31. The apparatus of claim 29, the training target model is a logistic regression model.
32. The apparatus of claim 29, the trusted execution environment being: enclave created using software protection extension SGX technology.
33. The apparatus according to claim 29, wherein the parameter update sub-module is specifically configured to:
updating local model parameters by using a gradient descent method according to the returned deviation value; or
And updating local model parameters by using a random gradient descent method according to the returned deviation value.
34. The apparatus of claim 29, wherein the global tag value is determined from a tag value provided by one data provider or jointly determined from tag values provided by a plurality of data providers.
35. A data prediction apparatus based on shared data modeling, the apparatus comprising:
the data acquisition module is used for acquiring ciphertext data provided by data providers, and the number of the data providers is multiple;
the data input module is used for correspondingly inputting the ciphertext data of each data provider into the trusted execution environment of the data provider;
an output value obtaining module, configured to obtain an output value of each trusted execution environment, where the output value is obtained through calculation according to the ciphertext data;
the predicted value calculation module is used for inputting the output value of each trusted execution environment into a pre-trained prediction model and calculating to obtain a predicted value;
wherein the inside of any trusted execution environment E_u includes:
the decryption submodule is used for decrypting the input ciphertext data to obtain a plaintext data characteristic value;
and the output value operator module is used for calculating an output value corresponding to the characteristic value of the plaintext data according to the current local model parameter.
36. The apparatus of claim 35, wherein the output value obtaining module is specifically configured to:
according to the current local model parameters W_u = (w_1, w_2, …)_u, calculating a weighted sum O_u = (w_1x_1 + w_2x_2 + …)_u of the plaintext data feature values.
37. The apparatus of claim 35, the training target model is a logistic regression model.
38. The apparatus of claim 35, the trusted execution environment being: enclave created using software protection extension SGX technology.
39. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 6 when executing the program.
40. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 7 to 12 when executing the program.
41. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 13 to 18 when executing the program.
42. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 19 to 22 when executing the program.
CN201710632357.5A 2017-07-28 2017-07-28 Model training method and device based on shared data Active CN109308418B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710632357.5A CN109308418B (en) 2017-07-28 2017-07-28 Model training method and device based on shared data

Publications (2)

Publication Number Publication Date
CN109308418A CN109308418A (en) 2019-02-05
CN109308418B true CN109308418B (en) 2021-09-24

Family

ID=65205429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710632357.5A Active CN109308418B (en) 2017-07-28 2017-07-28 Model training method and device based on shared data

Country Status (1)

Country Link
CN (1) CN109308418B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261358A (en) * 2014-07-17 2016-01-20 中国科学院声学研究所 N-gram grammar model constructing method for voice identification and voice identification system
CN106664563A (en) * 2014-08-29 2017-05-10 英特尔公司 Pairing computing devices according to a multi-level security protocol
CN105989441A (en) * 2015-02-11 2016-10-05 阿里巴巴集团控股有限公司 Model parameter adjustment method and device
CN104732247A (en) * 2015-03-09 2015-06-24 北京工业大学 Human face feature positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20191211

Address after: P.O. Box 31119, grand exhibition hall, hibiscus street, 802 West Bay Road, Grand Cayman, Cayman Islands

Applicant after: Innovative advanced technology Co., Ltd

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Co., Ltd.

REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40004192

Country of ref document: HK

GR01 Patent grant