Summary of the Invention
In view of the above technical problems, the embodiments of this specification provide a model training method and apparatus based on shared data. The technical solution is as follows:
According to a first aspect of the embodiments of this specification, a model training method based on shared data is provided, the method comprising:
Performing iterative training using the following steps until a model training requirement is met:
Obtaining ciphertext data provided by at least one data provider, respectively;
Inputting the ciphertext data of each data provider into the trusted execution environment corresponding to that data provider;
Obtaining an output value of each trusted execution environment, the output value being calculated from the ciphertext data;
Calculating, according to a given training target model, the deviation between a model prediction value and a true value, where the model prediction value is determined from the output values of the trusted execution environments, and the true value is a global label value determined from the data of the data providers;
Returning the deviation to each trusted execution environment, so that each trusted execution environment updates its local model parameters;
Wherein the following steps are executed inside any trusted execution environment:
Decrypting the input ciphertext data to obtain plaintext data feature values;
Calculating, according to the current local model parameters, the output value corresponding to the plaintext data feature values;
Updating the local model parameters according to the returned deviation.
According to a second aspect of the embodiments of this specification, a model training method based on shared data is provided, the method comprising:
Inputting the ciphertext data of at least 1 data provider into the trusted execution environment corresponding to that data provider;
Decrypting, in each trusted execution environment, the input ciphertext data to obtain the respective plaintext data feature values;
Performing iterative training using the following steps until a model training requirement is met:
Calculating, in each trusted execution environment and according to the current local model parameters, the output value corresponding to the plaintext data feature values;
Calculating, according to a given training target model, the deviation between a model prediction value and a true value, where the model prediction value is determined from the output values of the trusted execution environments, and the true value is a global label value determined from the data of the data providers;
Returning the deviation to each trusted execution environment;
Updating, in each trusted execution environment, the local model parameters according to the returned deviation.
According to a third aspect of the embodiments of this specification, a model training method based on shared data is provided, the method comprising:
Performing iterative training using the following steps until a model training requirement is met:
Obtaining data provided by multiple data providers, respectively, where the data provided by at least one data provider is ciphertext data and the data provided by the other data providers is plaintext data;
If the data provided by a data provider is ciphertext data, inputting the ciphertext data into the trusted execution environment corresponding to that data provider;
Obtaining an output value of each trusted execution environment, the output value being calculated from the ciphertext data;
Calculating the deviation between a model prediction value and a true value using the output values of the trusted execution environments and the plaintext data provided by the other data providers, where the model prediction value is determined from the output values of the trusted execution environments and the feature values of the plaintext data, and the true value is a global label value determined from the data of the data providers;
Returning the deviation to each trusted execution environment, so that each trusted execution environment updates its local model parameters;
Wherein the following steps are executed inside any trusted execution environment:
Decrypting the input ciphertext data to obtain plaintext data feature values;
Calculating, according to the current local model parameters, the output value corresponding to the plaintext data feature values;
Updating the local model parameters according to the returned deviation.
According to a fourth aspect of the embodiments of this specification, a data prediction method based on shared-data modeling is provided, the method comprising:
Obtaining ciphertext data provided by at least one data provider, respectively;
Inputting the ciphertext data of each data provider into the trusted execution environment corresponding to that data provider;
Obtaining an output value of each trusted execution environment, the output value being calculated from the ciphertext data;
Inputting the output values of the trusted execution environments into a pre-trained prediction model to calculate a prediction value;
Wherein the following steps are executed inside any trusted execution environment:
Decrypting the input ciphertext data to obtain plaintext data feature values;
Calculating, according to the current local model parameters, the output value corresponding to the plaintext data feature values.
According to a fifth aspect of the embodiments of this specification, a model training apparatus based on shared data is provided, the apparatus comprising the following modules for implementing iterative training:
A data obtaining module, configured to obtain ciphertext data provided by at least one data provider, respectively;
A data input module, configured to input the ciphertext data of each data provider into the trusted execution environment corresponding to that data provider;
An output value obtaining module, configured to obtain an output value of each trusted execution environment, the output value being calculated from the ciphertext data;
A deviation calculation module, configured to calculate, according to a given training target model, the deviation between a model prediction value and a true value, where the model prediction value is determined from the output values of the trusted execution environments, and the true value is a global label value determined from the data of the data providers;
A deviation return module, configured to return the deviation to each trusted execution environment, so that each trusted execution environment updates its local model parameters;
Wherein any trusted execution environment internally comprises:
A decryption submodule, configured to decrypt the input ciphertext data to obtain plaintext data feature values;
An output value calculation submodule, configured to calculate, according to the current local model parameters, the output value corresponding to the plaintext data feature values;
A parameter update submodule, configured to update the local model parameters according to the returned deviation.
According to a sixth aspect of the embodiments of this specification, a model training apparatus based on shared data is provided, the apparatus comprising the following modules for implementing iterative training:
A data obtaining module, configured to obtain data provided by multiple data providers, respectively, where the data provided by at least one data provider is ciphertext data and the data provided by the other data providers is plaintext data;
A data input module, configured to, if the data provided by a data provider is ciphertext data, input the ciphertext data into the trusted execution environment corresponding to that data provider;
An output value obtaining module, configured to obtain an output value of each trusted execution environment, the output value being calculated from the ciphertext data;
A deviation calculation module, configured to calculate the deviation between a model prediction value and a true value using the output values of the trusted execution environments and the plaintext data provided by the other data providers, where the model prediction value is determined from the output values of the trusted execution environments and the feature values of the plaintext data, and the true value is a global label value determined from the data of the data providers;
A deviation return module, configured to return the deviation to each trusted execution environment, so that each trusted execution environment updates its local model parameters;
Wherein any trusted execution environment internally comprises:
A decryption submodule, configured to decrypt the input ciphertext data to obtain plaintext data feature values;
An output value calculation submodule, configured to calculate, according to the current local model parameters, the output value corresponding to the plaintext data feature values;
A parameter update submodule, configured to update the local model parameters according to the returned deviation.
According to a seventh aspect of the embodiments of this specification, a data prediction apparatus based on shared-data modeling is provided, the apparatus comprising:
A data obtaining module, configured to obtain ciphertext data provided by at least one data provider, respectively;
A data input module, configured to input the ciphertext data of each data provider into the trusted execution environment corresponding to that data provider;
An output value obtaining module, configured to obtain an output value of each trusted execution environment, the output value being calculated from the ciphertext data;
A prediction value calculation module, configured to input the output values of the trusted execution environments into a pre-trained prediction model to calculate a prediction value;
Wherein any trusted execution environment E_u internally comprises:
A decryption submodule, configured to decrypt the input ciphertext data to obtain plaintext data feature values;
An output value calculation submodule, configured to calculate, according to the current local model parameters, the output value corresponding to the plaintext data feature values.
The technical solution provided by the embodiments of this specification has two advantages. On the one hand, joint training can be performed on the data provided by multiple data providers, yielding a more accurate and comprehensive data model. On the other hand, the operations that involve private data during model training (such as data decryption and local model parameter updating) are all encapsulated in and executed by the data provider's trusted execution environment. That is, the plaintext data cannot be obtained outside the trusted execution environment, which effectively guarantees the data security of the shared-data providers.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the embodiments of this specification.
In a specific implementation, the data mining party may request data from a data provider over a network; after encrypting the data, the data provider sends the ciphertext data to the data mining party over the network. In another embodiment, the ciphertext data may instead be stored in a storage device of the data mining party, so that the data mining party reads it directly from local storage.
S102: input the ciphertext data of each data provider into the trusted execution environment corresponding to that data provider;
The data mining party creates trusted execution environments E_1, E_2, …, E_U for data providers 1, 2, …, U, respectively, to guarantee that, for the data provided by any data provider u (u = 1, 2, …, U), operations involving data security can be carried out only in the corresponding E_u; outside E_u, these operations can be neither observed nor influenced.
The specific way of creating a trusted execution environment differs between implementations, and the embodiments of this specification do not limit it. In addition, the data mining party may create the trusted execution environment after obtaining a data provider's ciphertext data for the first time, or may create it in advance.
Each trusted execution environment provides input interfaces and output interfaces to the outside; the function of one of the input interfaces is to receive externally input ciphertext data. After obtaining the ciphertext data of the data providers, the data mining party determines the E_u corresponding to each data provider u, and then inputs the ciphertext data into the ciphertext data input interface of each E_u.
S103: obtain the output values O_1, O_2, …, O_U of the trusted execution environments;
Inside each E_u, the input ciphertext data is first decrypted to obtain the plaintext data X_u; X_u is then processed according to a preset algorithm and the internal local model parameters W_u to obtain the corresponding output value O_u, which is exported outside E_u through one of E_u's output interfaces. In this way, the data mining party obtains the output values O_1, O_2, …, O_U of the trusted execution environments.
To describe the implementation of the system completely, this embodiment first presents the overall data model training process from the perspective of the data mining party. Because the operations inside E_u are invisible to the outside, each E_u can be treated as a black box at this point. The specific internal implementation of each E_u is described in detail later.
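The black-box view of E_u can be sketched as follows. This is a minimal illustration only: `MockTEE` and `toy_encrypt` are hypothetical names, the XOR cipher is a placeholder for whatever encryption the data provider actually uses (it is not secure), and a real trusted execution environment enforces isolation by platform mechanisms rather than by Python attribute naming.

```python
import json
from itertools import cycle


def toy_encrypt(features, key: bytes) -> bytes:
    """Placeholder cipher used by the data provider (XOR, NOT secure)."""
    raw = json.dumps(features).encode("utf-8")
    return bytes(b ^ k for b, k in zip(raw, cycle(key)))


class MockTEE:
    """Hypothetical stand-in for E_u, seen from outside as a black box."""

    def __init__(self, key: bytes, weights):
        self._key = key            # decryption information kept inside E_u
        self._w = list(weights)    # local model parameters W_u
        self._x = None             # plaintext feature values X_u

    def input_ciphertext(self, blob: bytes) -> None:
        """Input interface: receive ciphertext and decrypt it internally."""
        raw = bytes(b ^ k for b, k in zip(blob, cycle(self._key)))
        self._x = json.loads(raw.decode("utf-8"))

    def output_value(self) -> float:
        """Output interface: expose only O_u, never X_u or W_u."""
        return sum(w * x for w, x in zip(self._w, self._x))
```

The data mining party only ever calls `input_ciphertext` and `output_value`; the key, the plaintext, and the parameters stay inside the object.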
S104: calculate the deviation Δ = Y - h[z(O_1, O_2, …, O_U)] according to the given training target model;
The deviation is an important value that must be calculated during the iterations of model training. Suppose that a data sample i has the form (X_i, y_i), where:
X_i = (x_i1, x_i2, …), with x_i1, x_i2, … being the feature values of data sample i;
y_i is the label value of data sample i;
Suppose the training target model has the form y = h(X). Then, for data sample (X_i, y_i), the prediction deviation equals the difference between the label value y_i and the model prediction value h(X_i), that is:
Δ_i = h(X_i) - y_i or Δ_i = y_i - h(X_i)
The deviation Δ_i plays two main roles in model training:
The first role is to evaluate how well the model fits the training sample set: for any data sample i, the smaller the magnitude of Δ_i, the better the model fits; given n data samples, the smaller the n values of Δ_i are overall, the better the model fits. In practical applications, the fit of the model to the training sample set as a whole is generally evaluated by calculating ΣΔ_i.
The second role is to take part in the iterative update of the model parameters: suppose there is a set of model parameters W = (w_1, w_2, …); the basic form of the iterative parameter update (many variants may exist in practical applications) is then:
W ← W - αΔX
The whole model training process consists of iteratively updating the model parameters until the model's fit to the training sample set meets the training requirement (for example, the deviation is sufficiently small). The parameter update formula is briefly explained below; for its detailed derivation, refer to the prior art.
In the above update formula, the "W" to the right of the arrow denotes the parameter value before each update, and the "W" to the left of the arrow denotes the parameter value after each update. The change applied in each update is the product of α, Δ, and X.
α denotes the learning rate, also called the step size, which determines how far the parameters move in each iteration. If the learning rate is too small, reaching the training requirement may be slow; if it is too large, the update may overshoot the minimum, so that the model fails to approach a good fit as the updates proceed. For how to choose a suitable learning rate, refer to the prior art; in the embodiments of this specification, α is regarded as a preset value.
X denotes the feature values of the data sample. Depending on the update formula chosen, X may also denote other forms of the feature values, as further illustrated in later embodiments of this specification.
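As an illustration of the basic update W ← W - αΔX, the following sketch applies one update to a linear model. It assumes h(X) is the plain weighted sum and takes Δ = h(X) - y, the sign convention under which subtracting αΔX reduces the deviation; the function names are hypothetical.

```python
def predict(w, x):
    """Linear model h(X) = w_1*x_1 + w_2*x_2 + ... (assumed form)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def update_once(w, x, y, alpha):
    """Apply W <- W - alpha * delta * X once and return (new W, delta)."""
    delta = predict(w, x) - y  # deviation, using the h(X) - y convention
    new_w = [wj - alpha * delta * xj for wj, xj in zip(w, x)]
    return new_w, delta
```

Starting from W = (0, 0) with X = (1, 2), y = 1 and α = 0.1, the deviation is -1 and the parameters move to (0.1, 0.2), which brings the prediction closer to the label.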
In the scheme of the embodiments of this specification, the data mining party obtains the output values O_1, O_2, …, O_U of the trusted execution environments. Suppose y = h(z) is the global training target model function, where z is a function of O_1, O_2, …, O_U, denoted z(O_1, O_2, …, O_U), that is, a joint function of the output values of the U data providers. Since O_1, O_2, …, O_U are in turn functions of X_1, X_2, …, X_U respectively, y = h(z) is also a function of X_1, X_2, …, X_U.
Define Δ = Y - h[z(O_1, O_2, …, O_U)] or Δ = h[z(O_1, O_2, …, O_U)] - Y;
where h[z(O_1, O_2, …, O_U)] is the model prediction value for z(O_1, O_2, …, O_U), and Y is the true value corresponding to z(O_1, O_2, …, O_U), namely the global label value determined from the data of the data providers; the difference Δ between the two is the deviation.
The actual form of y = h(z) can be chosen according to the actual training needs, for example a linear regression model or a logistic regression model. The embodiments of this specification do not limit this.
In addition, for each group O_1, O_2, …, O_U, the corresponding global label value Y can be determined in various ways, which are explained in detail in later embodiments.
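The deviation calculation performed by the data mining party in S104 can be sketched as below. The sketch assumes an additive joint function z = O_1 + … + O_U and, by default, a logistic h; both are illustrative choices rather than requirements of the scheme, and the function names are hypothetical.

```python
import math


def joint(outputs):
    """Joint function z(O_1, ..., O_U), assumed here to be a plain sum."""
    return sum(outputs)


def h_logistic(z):
    """Logistic regression link h(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def deviation(outputs, y_true, h=h_logistic):
    """Delta = h[z(O_1, ..., O_U)] - Y, computed outside the enclaves."""
    return h(joint(outputs)) - y_true
```

Passing `h=lambda z: z` gives the linear regression case. Only the output values and the global label are needed here; no plaintext feature value is involved.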
S105: return Δ to E_1, E_2, …, E_U, so that E_1, E_2, …, E_U update the local model parameters W_1, W_2, …, W_U, respectively. The updated parameters are used to calculate the output values O_u in the next iteration.
The above describes the overall data model training process from the perspective of the data mining party. The processing logic inside a trusted execution environment is described below:
As shown in Fig. 2, any trusted execution environment E_u implements three basic functions internally:
1) Data decryption:
Corresponding to the encryption logic used by data provider u, E_u stores the matching decryption logic, such as decryption algorithm information and key information. Based on this information, the input ciphertext data can be decrypted inside E_u to obtain the plaintext data feature values X_u = (x_1, x_2, …)_u.
The data decryption operation is executed after S102.
2) Output value calculation:
E_u stores the local model parameters W_u = (w_1, w_2, …)_u. Inside E_u, the output value O_u corresponding to X_u can be calculated according to the current local model parameters W_u. W_u is iteratively updated throughout the training process; the first iteration uses the initialized parameter values.
The specific way of calculating O_u is determined by the form of the global model y = h(z) = h[z(O_1, O_2, …, O_U)]. For example, for linear regression and logistic regression models, the global model can be written in the form y = h(z) = h(w_1 x_1 + w_2 x_2 + …), and the corresponding O_u and joint function z(O_1, O_2, …, O_U) can take the following forms, respectively:
O_u = w_1^u x_1^u + w_2^u x_2^u + …, denoted (w_1 x_1 + w_2 x_2 + …)_u
z(O_1, O_2, …, O_U) = O_1 + O_2 + … + O_U
In practical applications, the above expression for O_u may also include a constant term parameter b_u, that is:
O_u = b_u + w_1^u x_1^u + w_2^u x_2^u + …
In fact, if we let b_u = w_0^u and interpret w_0^u as the parameter of a feature x_0^u whose value is always 1, the expression for O_u becomes:
O_u = w_0^u x_0^u + w_1^u x_1^u + w_2^u x_2^u + …
The overall form of the expression is therefore the same whether or not a constant term parameter is present, so the expression O_u = w_1^u x_1^u + w_2^u x_2^u + … should be understood as covering both the "with constant term parameter" and "without constant term parameter" cases. In practice, for any u, the model parameters may or may not include a constant term parameter.
Of course, the above forms of O_u and the joint function z(O_1, O_2, …, O_U) are for illustration only and should not be construed as limiting the scheme of the embodiments of this specification.
The output value calculation is executed after the data decryption operation above; once the output value has been calculated, processing continues with S104.
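The output value calculation, including the treatment of the constant term as the parameter of a feature fixed at 1, can be sketched as follows. The function name and the `with_constant` flag are hypothetical; the sketch assumes the weighted-sum form of O_u given above.

```python
def output_value(w, x, with_constant=False):
    """Compute O_u as the weighted sum of the local feature values.

    When with_constant is True, w[0] plays the role of the constant term
    and a feature x_0 = 1 is prepended, so both cases share one formula.
    """
    x = list(x)
    if with_constant:
        x = [1.0] + x
    if len(w) != len(x):
        raise ValueError("parameter and feature counts differ")
    return sum(wj * xj for wj, xj in zip(w, x))
```

With W_u = (0.5, 2) and X_u = (2, 1) the output is 3; adding a constant term of 1 through `with_constant=True` and W_u = (1, 0.5, 2) gives 4.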
3) Parameter update:
Each E_u holds its current local model parameters W_u. After receiving the deviation Δ returned from outside E_u, it updates W_u according to a parameter update formula of the form W_u ← W_u - αΔX_u (the initialized parameter values are used before the first update). Of course, the parameter update formula actually used is not limited to this form. For example:
If 1 data record i is read from the data source and input into E_u as the training sample each time, the parameter update formula is: W_u ← W_u - αΔ_i X_i^u;
If multiple data records are read from the data source and input into E_u as training samples each time, and gradient descent is used for the parameter update, the parameter update formula is: W_u ← W_u - α Σ_i Δ_i X_i^u, that is, all training samples take part in the update;
If multiple data records are read from the data source and input into E_u as training samples each time, and stochastic gradient descent is used for the parameter update, the parameter update formula is: W_u ← W_u - αΔ_i X_i^u, where i is arbitrary, that is, one training sample is selected at random to take part in the update;
The above update algorithms are for illustration only and should not be construed as limiting the scheme. For example, to reduce overfitting, a regularization term can be added to the update formula. Other update algorithms are also available, and this specification does not enumerate them.
The parameter update operation is executed after S105 above. Once the parameters have been updated, one iteration is complete, and the updated parameters are used to calculate the output value O_u in the next iteration.
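The three update variants can be compared side by side in a short sketch. The helper names are hypothetical, and the batch form sums Δ_i X_i over the samples of the batch, which is the usual reading of a gradient descent step and is an assumption here; `deltas[i]` is the deviation returned for sample i.

```python
import random


def update_single(w, x, delta, alpha):
    """One record per iteration: W <- W - alpha * delta_i * X_i."""
    return [wj - alpha * delta * xj for wj, xj in zip(w, x)]


def update_batch(w, xs, deltas, alpha):
    """Gradient descent: every sample in the batch contributes."""
    grad = [sum(d * x[j] for x, d in zip(xs, deltas)) for j in range(len(w))]
    return [wj - alpha * g for wj, g in zip(w, grad)]


def update_sgd(w, xs, deltas, alpha, rng=random):
    """Stochastic gradient descent: one randomly chosen sample contributes."""
    i = rng.randrange(len(xs))
    return update_single(w, xs[i], deltas[i], alpha)
```

All three functions would run inside E_u, since they read the plaintext feature values; only the deviations come from outside.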
The above describes one complete iteration. The above steps are repeated until the model training requirement is met. The requirement may be, for example: the deviation Δ of the global model is sufficiently small, the difference between the Δ values of two consecutive iterations is sufficiently small, the difference between the O_u values calculated inside E_u in two consecutive iterations is sufficiently small, or a preset number of iterations has been reached. Of course, an additional validation set may also be used for verification. This specification does not limit the specific model training requirement.
As can be seen, with the above scheme, the operations that involve private data during model training (such as data decryption and local model parameter updating) are all encapsulated in and executed by the data provider's trusted execution environment. That is, the plaintext data cannot be obtained outside the trusted execution environment and, in some embodiments, even the specific local model parameters cannot be obtained outside it, which effectively guarantees the data security of the shared-data providers.
The model training scheme based on shared data provided by the embodiments of this specification has been described above as a whole. Depending on practical application needs, some details of the overall design admit alternative embodiments, examples of which follow:
In S101 to S102, either one data record or multiple data records may be read into the trusted execution environment at a time. That is, in each iteration, N ciphertext data records are obtained from each data provider, where N is a preset value not less than 1; the training samples are replaced by obtaining different data each time.
It should also be understood that the training sample data may be obtained gradually as the iterations proceed, or all at once. For example, if each iteration requires N data records, then: N records may be obtained in each iteration and decrypted in the trusted execution environment; or more than N records (for example the full data set, or a multiple of N) may be obtained at once and input into the trusted execution environment, which then decrypts N records on demand each time; or more than N records (for example the full data set, or a multiple of N) may be obtained at once, input into the trusted execution environment, and decrypted there all at once; and so on.
As can be seen, in practical applications the steps that must be executed in every iteration are S103 to S105 together with the "output value calculation" and "parameter update" steps inside the trusted execution environment, whereas steps S101 and S102 and the "data decryption" step inside the trusted execution environment need not be executed in every iteration. In short, the way sample data is obtained can be set flexibly according to the actual situation, without affecting the implementation of the overall scheme.
Data from multiple data providers can be associated through a feature that is common to them and serves as an identifier, such as an ID card number, which ensures that the data obtained from the different data providers describes the same person. The identifying feature need not take part in model training, and its security can be improved by means such as hashing.
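One way to hash the identifying feature before records are matched is sketched below. It uses a keyed SHA-256 (HMAC) from the Python standard library; the shared key `shared_salt` and the function names are assumptions made for illustration, since the specification only says that hashing or similar means may be used.

```python
import hashlib
import hmac


def blind_id(id_number: str, shared_salt: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest."""
    return hmac.new(shared_salt, id_number.encode("utf-8"), hashlib.sha256).hexdigest()


def align(records_a: dict, records_b: dict) -> list:
    """Return the blinded identifiers present in both providers' records."""
    return sorted(records_a.keys() & records_b.keys())
```

Each data provider would apply `blind_id` with the same key before handing over its records, so that rows describing the same person can be matched without exposing the raw identifier.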
Each E_u is created on the basis of information provided by the data provider itself. The overall design of each E_u should meet basic design standards, but the specific implementations need not be identical; for example, different data decryption algorithms or different parameter update algorithms may be used.
For each group O_1, O_2, …, O_U, the global label value Y can be determined in various ways, for example:
1) The label value Y_u provided by a particular data provider u is taken as Y;
2) Y is determined jointly from the label values Y_u1, Y_u2, … provided by multiple data providers u1, u2, …; the specific method may be, for example, a weighted average, logical AND, or logical OR;
3) Y is determined through channels other than the data providers;
For example, suppose a prediction model is to be built for the prevalence of a certain disease, and the disease is known to be related to 5 features (age, occupation, gender, height, weight), where:
Institution 1 can provide the feature data: age, occupation;
Institution 2 can provide the feature data: gender, height, weight;
Suppose the prediction model is a binary classification model, that is, the model output takes one of two values, "diseased" or "not diseased" (the corresponding prediction result may be presented as "high risk" or "low risk"). In addition to providing feature data, each institution may or may not also provide a label value, that is, the "diseased or not" result. The global label value can then be determined by various strategies, for example:
The global label value follows the label value provided by one particular institution; in practice, this may be because that institution is more authoritative, or because the other institution cannot provide label values.
The global label value is determined jointly from the label values provided by the two institutions, for example: if at least one institution provides the label value "diseased", the global label value is determined to be "diseased".
In addition, in some cases the data mining party may learn the "diseased or not" result for a group of users directly from other channels, and its need is to further mine the relationship between that result and other features. The "other features" can be obtained from the data providers, and the result known in advance can be used directly as the global label value.
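The label strategies listed above can be sketched as small functions over a mapping from institution to label value (1 for "diseased", 0 for "not diseased"). The function names, the institution keys, and the weights are hypothetical.

```python
def label_from_provider(labels, provider):
    """Strategy 1: take the label value of one chosen institution."""
    return labels[provider]


def label_weighted_average(labels, weights):
    """Strategy 2a: weighted average of the provided label values."""
    total = sum(weights[p] for p in labels)
    return sum(weights[p] * y for p, y in labels.items()) / total


def label_logical_or(labels):
    """Strategy 2b: 'diseased' if at least one institution says so."""
    return int(any(labels.values()))


def label_logical_and(labels):
    """Strategy 2c: 'diseased' only if every institution says so."""
    return int(all(labels.values()))
```

For labels `{"institution_1": 0, "institution_2": 1}`, the logical OR yields 1, matching the rule that one "diseased" label is enough, while the logical AND yields 0.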
After training ends, each E_u may export the parameters obtained in the last update to the data mining party, so that the data mining party maintains the complete data model. Alternatively, the parameters may remain stored in a distributed manner inside the respective E_u, to further improve security.
If the parameters of each E_u remain stored inside E_u, then in the model use stage each data provider uploads its ciphertext data to its E_u; E_u decrypts the ciphertext data and calculates the output value; finally, the data mining party calculates the output result of the global model from the output values of the E_u. Fig. 3 shows a data prediction method based on shared-data modeling, which may include the following steps:
S201: obtain the ciphertext data provided by U data providers, respectively, where U ≥ 2;
S202: input the ciphertext data of each data provider into the corresponding trusted execution environment E_1, E_2, …, E_U;
S203: obtain the output values O_1, O_2, …, O_U of the trusted execution environments;
S204: calculate the prediction value Y = h[z(O_1, O_2, …, O_U)] according to the pre-trained prediction model;
Comparing Fig. 2 and Fig. 3 shows that the model use stage still uses a system architecture similar to that of the model training stage; the difference is that no iterative parameter update is needed, that is, the prediction result value y is output in a single pass from the input data.
Correspondingly, for any trusted execution environment E_u, the local model parameters W_u have been saved in advance. In the model use stage, E_u implements two basic functions internally:
1) Decrypting the input ciphertext data to obtain the plaintext data feature values X_u;
2) Calculating the output value O_u corresponding to X_u according to the current local model parameters W_u;
For the specific implementation of each step in the model use stage, refer to the corresponding steps of the model training stage; they are not repeated in this embodiment.
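The single-pass prediction of S203 to S204 can be sketched as below, assuming a logistic model with an additive joint function and a hypothetical decision threshold of 0.5; the "high risk" and "low risk" wording follows the disease example above.

```python
import math


def predict_risk(tee_outputs, threshold=0.5):
    """Combine the enclave outputs O_1..O_U into one prediction.

    Returns the probability h(z) and a readable risk category.
    """
    z = sum(tee_outputs)               # assumed additive joint function
    p = 1.0 / (1.0 + math.exp(-z))     # logistic h(z)
    return p, ("high risk" if p >= threshold else "low risk")
```

No deviation is returned to the trusted execution environments at this stage, so the local model parameters stay unchanged.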
The above embodiments describe schemes in which 2 or more data providers jointly provide data to train a model. It should be understood that further modifications can be made on the basis of these schemes to meet the application needs of certain special scenarios, examples of which follow:
When only 1 data provider provides data to the data mining party and has a privacy requirement for that data, model training can be implemented with the following scheme:
The data mining party performs iterative training using the following steps until the model training requirement is met:
S101': obtain the ciphertext data provided by the 1 data provider;
S102': input the ciphertext data of the data provider into the trusted execution environment E of that data provider;
S103': obtain the output value O of trusted execution environment E;
S104': calculate the deviation Δ between the model prediction value and the true value according to the given training target model;
S105': return the deviation Δ to trusted execution environment E, so that the trusted execution environment updates the model parameters;
This embodiment is applicable to scenarios in which a data provider entrusts the data mining party with data mining but does not want to disclose data details to the data mining party.
Compared with S101~S105, S101'~S105' reduce the U data providers to a single data provider; the implementation is otherwise essentially the same and is not repeated in this embodiment. Inside the trusted execution environment E, the three functions of data decryption, output value calculation, and parameter update are still implemented.
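The single-provider loop S101' to S105' can be sketched as below. The sketch makes several assumptions that are not stated in this specification: the model is logistic regression, the stop rule is a threshold on the mean absolute deviation, and `ToyEnclave` with its XOR-based `seal` helper is a hypothetical stand-in for a real trusted execution environment and a real cipher.

```python
import json
import math


class ToyEnclave:
    """Simulated E: decrypts, computes outputs, updates its own parameters."""

    def __init__(self, n_features: int, learning_rate: float):
        self._w = [0.0] * n_features
        self._lr = learning_rate
        self._rows = []

    @staticmethod
    def seal(rows) -> bytes:
        """Provider-side placeholder 'encryption' (byte-wise XOR, not secure)."""
        return bytes(b ^ 0x5A for b in json.dumps(rows).encode("utf-8"))

    def output(self, ciphertext: bytes):
        # data decryption, then output value calculation
        self._rows = json.loads(bytes(b ^ 0x5A for b in ciphertext).decode("utf-8"))
        return [sum(w * v for w, v in zip(self._w, r)) for r in self._rows]

    def update(self, deltas):
        # parameter update from the returned deviations (one gradient descent step)
        for j in range(len(self._w)):
            grad = sum(d * r[j] for d, r in zip(deltas, self._rows))
            self._w[j] -= self._lr * grad


def train(enclave, ciphertext, labels, max_iters=200, tol=0.05):
    """Data mining party loop: it handles only ciphertext, O and delta."""
    mean_abs = float("inf")
    for _ in range(max_iters):
        outputs = enclave.output(ciphertext)              # S102', S103'
        deltas = [1.0 / (1.0 + math.exp(-o)) - y
                  for o, y in zip(outputs, labels)]        # S104'
        mean_abs = sum(abs(d) for d in deltas) / len(deltas)
        if mean_abs < tol:                                # assumed stop rule
            break
        enclave.update(deltas)                            # S105'
    return mean_abs


rows = [[2.0, 1.0], [1.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]]  # last column: bias term
labels = [1, 1, 0, 0]
enclave = ToyEnclave(n_features=2, learning_rate=0.1)
final_deviation = train(enclave, ToyEnclave.seal(rows), labels)
```

The plaintext rows and the parameters never leave the simulated enclave; the training loop sees only the output values and the deviations it computes from them.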
When multiple data providers provide data to the data mining party and some of them have no privacy requirements for their data, model training can be implemented with the following scheme:
The data mining party performs iterative training using the following steps until the model training requirement is met:
S101'': obtain the data provided by each of U data providers (where U ≥ 2), where the data provided by at least one data provider is in ciphertext form and the data provided by the other data providers is in plaintext form;
S102'': if the data provided by data provider u is in ciphertext form, input the ciphertext data into the trusted execution environment $E_u$ of that provider; here u refers specifically to a data provider with data confidentiality requirements;
S103'': obtain the output value $O_u$ of each trusted execution environment;
S104'': compute the deviation Δ between the model predicted value and the true value using the output value $O_u$ of each trusted execution environment $E_u$ and the plaintext data provided by the other data providers;
This step differs from S104 in that, for a provider of plaintext data, the data mining party can obtain the plaintext data directly and use it in the global calculation without going through a trusted execution environment.
S105'': return the deviation to each trusted execution environment, so that each trusted execution environment updates its local model parameters;
This step differs from S105 in that, for a provider of plaintext data, the data mining party can itself maintain and update the corresponding local model parameters directly.
Compared with S101~S105, S101''~S105'' divide the U data providers into two classes: for data providers without data privacy requirements, the data mining party can directly obtain the plaintext data they provide and use it for model training; for data providers with data confidentiality requirements, the ciphertext data they provide must still be processed by a trusted execution environment. Inside each trusted execution environment, the three functions of data decryption, output value calculation, and parameter update are still implemented.
The scheme of this embodiment is suitable for scenarios in which some of the data features required by the global model have no privacy requirements. Of course, from the perspective of data privacy, "no privacy requirements" here is usually not meant in an absolute sense; it means no privacy requirements with respect to the data mining party. For example, a data provider may have a deep cooperative relationship with the data mining party, or the data mining party may itself hold data that can be used in global model training (the data mining party can then be regarded as one of the data providers). For the data mining party, such data without privacy requirements can be used directly without going through a trusted execution environment.
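One iteration of the mixed ciphertext and plaintext scheme can be sketched as follows. The sketch assumes a logistic regression model whose linear score is the sum of the enclave output and a plaintext contribution, which this specification does not fix at this point; `ToyEnclave` and `mixed_iteration` are hypothetical names, and decryption inside the enclave is omitted for brevity.

```python
import math


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


class ToyEnclave:
    """Simulated E_u for a confidential provider (decryption omitted here)."""

    def __init__(self, n_features, learning_rate):
        self._w = [0.0] * n_features
        self._lr = learning_rate
        self._rows = []

    def output(self, rows):
        self._rows = rows  # stands in for the decrypted ciphertext
        return [dot(self._w, r) for r in rows]

    def update(self, deltas):
        self._w = [w - self._lr * sum(d * r[j] for d, r in zip(deltas, self._rows))
                   for j, w in enumerate(self._w)]


def mixed_iteration(enclave, secret_rows, plain_rows, plain_w, labels, lr):
    enclave_out = enclave.output(secret_rows)            # confidential provider
    plain_out = [dot(plain_w, r) for r in plain_rows]    # plaintext used directly
    deltas = [1.0 / (1.0 + math.exp(-(a + b))) - y
              for a, b, y in zip(enclave_out, plain_out, labels)]
    enclave.update(deltas)                               # enclave updates its own W_u
    # parameters for plaintext features are maintained by the data mining party
    new_plain_w = [w - lr * sum(d * r[j] for d, r in zip(deltas, plain_rows))
                   for j, w in enumerate(plain_w)]
    return deltas, new_plain_w


enclave = ToyEnclave(n_features=1, learning_rate=0.5)
plain_w = [0.0]
secret_rows, plain_rows, labels = [[1.0], [-1.0]], [[1.0], [-1.0]], [1, 0]
first, plain_w = mixed_iteration(enclave, secret_rows, plain_rows, plain_w, labels, 0.5)
second, plain_w = mixed_iteration(enclave, secret_rows, plain_rows, plain_w, labels, 0.5)
```

The only difference from the all-ciphertext case is where the update happens: inside the enclave for confidential features, and in the data mining party's own code for plaintext features.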
The scheme of the embodiments of this specification is described below with reference to a specific example.
Assume the overall training requirement is: based on user asset data provided by two banking institutions, build a model for "predicting whether a user is able to repay a large loan on schedule".
The data features that bank 1 can provide are $x_1, x_2, x_3$;
The data features that bank 2 can provide are $x_4, x_5$;
The overall model is a logistic regression model with the functional form:
$$y = \frac{1}{1 + e^{-z}} \tag{1}$$
where:
$$z = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 \tag{2}$$
$w_1, w_2, w_3$ are the local parameters of bank 1, and $w_4, w_5$ are the local parameters of bank 2.
Define:
$$\mathrm{sum}_1 = w_1 x_1 + w_2 x_2 + w_3 x_3 \tag{3}$$
$$\mathrm{sum}_2 = w_4 x_4 + w_5 x_5 \tag{4}$$
Then, from formulas (1)~(4), the deviation calculation function of the global model is obtained:
$$\Delta = \frac{1}{1 + e^{-(\mathrm{sum}_1 + \mathrm{sum}_2)}} - Y \tag{5}$$
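One way to see why returning only the deviation lets each party update its own parameters is the following short derivation. It assumes that the training objective is the usual cross-entropy loss of logistic regression (the loss is not named in this specification, so this is an assumption), with $h(z) = 1/(1+e^{-z})$, per-user label $Y_i$, and per-user deviation $\Delta_i = h(z_i) - Y_i$:

$$
\begin{aligned}
L(W) &= -\sum_i \Bigl[\, Y_i \ln h(z_i) + (1 - Y_i)\ln\bigl(1 - h(z_i)\bigr) \Bigr], \qquad z_i = \sum_j w_j x_{ij},\\
\frac{\partial L}{\partial w_j} &= \sum_i \bigl(h(z_i) - Y_i\bigr)\, x_{ij} = \sum_i \Delta_i\, x_{ij}.
\end{aligned}
$$

Under this assumption, a gradient descent step for bank 1 needs only $\Delta_i$ and its own features $x_{i1}, x_{i2}, x_{i3}$, and likewise for bank 2; neither party needs the other's features or weights.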
The trusted execution environment is implemented using Intel's SGX technology; a trusted execution environment created this way is called an enclave. Specifically, this approach encapsulates the security-sensitive operations of legitimate software in an enclave. Once the software and data are inside the enclave, even the operating system or the VMM (hypervisor) cannot affect the code and data inside the enclave. The security boundary of an enclave includes only the CPU and the enclave itself.
The overall architecture of a system implementing model training is shown in Fig. 4a. The implementation of the system is described below from the perspectives of the data provider and the data mining party, respectively:
1) Data provider:
Each bank separately encrypts the data to be provided to the data mining party; after encryption, the data can be stored on the hard disk of the data provider. Of course, depending on practical application requirements, some parts of the data may also be provided in plaintext.
Each bank separately provides an enclave definition file (.edl) and its corresponding dynamic link library (.dll or .so); the provided enclave includes the following functions or interfaces:
1.1) Decrypt the bank's pre-encrypted ciphertext data input from outside the enclave to obtain plaintext data. Each iteration inputs N pieces of ciphertext data, each corresponding to one user. For any user i, the plaintext data of bank 1 is $x_{i1}, x_{i2}, x_{i3}$, and the plaintext data of bank 2 is $x_{i4}, x_{i5}$;
1.2) Compute the output value of each piece of data according to the current local parameter values and output it outside the enclave. For any user i, the output value of bank 1 is $\mathrm{sum}_{1i}$, and the output value of bank 2 is $\mathrm{sum}_{2i}$;
1.3) Update the local parameters according to the $\Delta_i$ values returned from outside the enclave. The update uses gradient descent, and all N pieces of data participate in each iteration. The update formula is:
$$W \leftarrow W - \alpha \sum_i \Delta_i X_i \tag{6}$$
That is:
$$w_1 \leftarrow w_1 - \alpha \sum_i \Delta_i x_{i1}$$
$$w_2 \leftarrow w_2 - \alpha \sum_i \Delta_i x_{i2}$$
$$w_3 \leftarrow w_3 - \alpha \sum_i \Delta_i x_{i3}$$
$$w_4 \leftarrow w_4 - \alpha \sum_i \Delta_i x_{i4}$$
$$w_5 \leftarrow w_5 - \alpha \sum_i \Delta_i x_{i5}$$
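where
$$\Delta_i = \frac{1}{1 + e^{-(\mathrm{sum}_{1i} + \mathrm{sum}_{2i})}} - Y_i \tag{7}$$
$\alpha$ is the preset learning rate; bank 1 and bank 2 may use the same or different values of $\alpha$.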
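The parameter update of formula (6) can be sketched in a few lines. The function name `update_local_params` and the sample numbers are hypothetical; the sketch covers only the arithmetic that would run inside an enclave, not the enclave itself.

```python
def update_local_params(w, deltas, rows, alpha):
    """One application of formula (6): w_j <- w_j - alpha * sum_i delta_i * x_ij."""
    if len(deltas) != len(rows):
        raise ValueError("need one deviation per data row")
    return [
        w_j - alpha * sum(d * row[j] for d, row in zip(deltas, rows))
        for j, w_j in enumerate(w)
    ]


# Bank 1 holds three features per user; N = 2 users in this made-up batch.
rows = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
deltas = [0.5, -0.5]
new_w = update_local_params([0.0, 0.0, 0.0], deltas, rows, alpha=0.1)
```

Because the function takes only the party's own rows and the returned deviations, bank 1 and bank 2 can each run it independently, with their own learning rates if desired.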
2) Data mining party:
The data mining party first unifies the global label value Y. The value of Y indicates whether a user with large-loan behavior is able to repay the loan on schedule. This information can be obtained from the two banks, or from other lending institutions.
The data mining party loads the enclave information provided by each of the two banks, creates enclave1 and enclave2, and builds a model training application based on enclave1 and enclave2. The application operates as follows:
2.1) In each iteration, read a batch of ciphertext data from the hard disk; assume N records are read each time. The data of the two banks can be associated by ID card number when reading. Input the ciphertext data of bank 1 into enclave1 and the ciphertext data of bank 2 into enclave2.
2.2) Inside enclave1 and enclave2, decrypt the ciphertext data respectively and, according to the current local parameters (the initial parameter values are used in the first iteration), compute $\mathrm{sum}_{1i}$ and $\mathrm{sum}_{2i}$ using formula (3) and formula (4), respectively, and output them to the outside.
2.3) According to $\mathrm{sum}_{1i}$ and $\mathrm{sum}_{2i}$ output by enclave1 and enclave2, compute $\Delta_i$ using formula (7), and return $\Delta_i$ to enclave1 and enclave2, respectively;
2.4) Inside enclave1 and enclave2, update the parameters using formula (6), respectively.
Repeat the above iteration until the model training condition is met to obtain the final parameter values $w_1, w_2, w_3, w_4, w_5$; substituting these values into formula (1) and formula (2) yields the required trained model.
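The data mining party's side of steps 2.1) and 2.3) can be sketched as below: aligning the two banks' records by ID number and turning the enclave outputs into per-user deviations. The helper names, the dictionary-based stores, and the byte strings standing in for ciphertext are hypothetical, and the deviation is assumed to be the logistic prediction minus the label.

```python
import math


def aligned_batch(bank1, bank2, labels, n):
    """Select up to n user IDs known to all sources and return aligned lists."""
    ids = sorted(set(bank1) & set(bank2) & set(labels))[:n]
    return (ids,
            [bank1[i] for i in ids],
            [bank2[i] for i in ids],
            [labels[i] for i in ids])


def deviations(sum1, sum2, ys):
    """Assumed form: delta_i = logistic(sum1_i + sum2_i) - Y_i."""
    return [1.0 / (1.0 + math.exp(-(a + b))) - y
            for a, b, y in zip(sum1, sum2, ys)]


# Hypothetical stores keyed by ID number; values are opaque ciphertext blobs.
bank1 = {"id-001": b"\x8f\x02", "id-002": b"\x11\x7a"}
bank2 = {"id-002": b"\x45\x9c", "id-003": b"\x30\x01"}
labels = {"id-001": 0, "id-002": 1, "id-003": 1}

ids, c1, c2, ys = aligned_batch(bank1, bank2, labels, n=10)
# c1 would be passed to enclave1 and c2 to enclave2; their outputs are faked here.
deltas = deviations([0.0], [0.0], ys)
```

Only users present in both banks and in the label source contribute to a batch, so the i-th output of enclave1, the i-th output of enclave2, and the i-th label always refer to the same user.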
Fig. 4 b shows the system overall architecture of another implementation model training, and corresponding whole training demand is: number
The assets for possessing some user's asset datas according to excavation side oneself and needing to be provided according to one's own data and bank 1
Data establish the conjunctive model of one " whether prediction user has the ability to repay great number loan on schedule ", in which:
The data characteristics that bank 1 can provide is x1,x2,x3;Corresponding local parameter is w1,w2,w3;
The one's own data characteristics of data providing is x4,x5;Corresponding local parameter is w4,w5;
Compared with a upper embodiment, whole model training thinking is almost the same, and difference is pointed out to be only that: only for bank
1 creation enclave, for feature x4,x5For, data providing oneself can be read directly clear data and participate in model training
It calculates.
Corresponding to the above method embodiments, the embodiments of this specification also provide a model training apparatus based on shared data. Referring to Fig. 5, the apparatus may include the following modules for implementing iterative training:
Data obtaining module 110, configured to obtain the ciphertext data provided by at least one data provider;
Data input module 120, configured to input the ciphertext data of each data provider into the trusted execution environment of the corresponding data provider;
Output value obtaining module 130, configured to obtain the output value of each trusted execution environment, the output value being computed from the ciphertext data;
Deviation computing module 140, configured to compute the deviation between the model predicted value and the true value according to the given training objective model; the model predicted value is determined from the output values of the trusted execution environments, and the true value is the global label value determined from the data of the data providers;
Deviation return module 150, configured to return the deviation to each trusted execution environment, so that each trusted execution environment updates its local model parameters;
Each trusted execution environment internally includes:
a decryption submodule, configured to decrypt the input ciphertext data to obtain plaintext feature values;
an output value computing submodule, configured to compute the output value corresponding to the plaintext feature values according to the current local model parameters;
a parameter update submodule, configured to update the local model parameters according to the returned deviation.
In one specific implementation provided by this specification, when multiple data providers provide data to the data mining party and some of them have no privacy requirements for their data, the module functions of the above apparatus may be configured as follows:
Data obtaining module 110, configured to obtain the data provided by each of multiple data providers, where the data provided by at least one data provider is in ciphertext form and the data provided by the other data providers is in plaintext form;
Data input module 120, configured to, when the data provided by a data provider is in ciphertext form, input the ciphertext data into the trusted execution environment of that data provider;
Output value obtaining module 130, configured to obtain the output value of each trusted execution environment, the output value being computed from the ciphertext data;
Deviation computing module 140, configured to compute the deviation between the model predicted value and the true value using the output value of each trusted execution environment and the plaintext data provided by the other data providers; the model predicted value is determined from the output values of the trusted execution environments and the feature values of the plaintext data; the true value is the global label value determined from the data of the data providers;
Deviation return module 150, configured to return the deviation to each trusted execution environment, so that each trusted execution environment updates its local model parameters;
Each trusted execution environment internally includes:
a decryption submodule, configured to decrypt the input ciphertext data to obtain plaintext feature values;
an output value computing submodule, configured to compute the output value corresponding to the plaintext feature values according to the current local model parameters;
a parameter update submodule, configured to update the local model parameters according to the returned deviation.
Referring to Fig. 6, the embodiments of this specification also provide a data prediction apparatus based on shared-data modeling. The apparatus may include:
Data obtaining module 210, configured to obtain the ciphertext data provided by at least one data provider;
Data input module 220, configured to input the ciphertext data of each data provider into the trusted execution environment of the corresponding data provider;
Output value obtaining module 230, configured to obtain the output value of each trusted execution environment, the output value being computed from the ciphertext data;
Predicted value computing module 240, configured to input the output values of the trusted execution environments into the pre-trained prediction model and compute the predicted value;
Each trusted execution environment $E_u$ internally includes:
a decryption submodule, configured to decrypt the input ciphertext data to obtain plaintext feature values;
an output value computing submodule, configured to compute the output value corresponding to the plaintext feature values according to the current local model parameters.
The embodiments of this specification also provide a computer device including at least a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the aforementioned model training method or data prediction method when executing the program.
Fig. 7 shows a more specific schematic diagram of the hardware structure of a computing device provided by the embodiments of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively connected to one another inside the device through the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs to implement the technical solutions provided by the embodiments of this specification.
The memory 1020 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module to implement information input and output. The input/output module may be configured in the device as a component (not shown in the figure) or may be externally connected to the device to provide the corresponding functions. Input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, and the like; output devices may include a display, a loudspeaker, a vibrator, an indicator light, and the like.
The communication interface 1040 is used to connect a communication module (not shown in the figure) to implement communication between this device and other devices. The communication module may communicate in a wired manner (for example, USB or a network cable) or in a wireless manner (for example, a mobile network, Wi-Fi, or Bluetooth).
The bus 1050 includes a path that transfers information between the components of the device (for example, the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
It should be noted that although the above device shows only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may include only the components necessary to implement the solutions of the embodiments of this specification, and need not include all the components shown in the figure.
The embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the aforementioned model training method or data prediction method.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
From the above description of the implementations, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of this specification or in certain parts of the embodiments.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementing device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an e-mail transceiver, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments. The apparatus embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and when the solutions of the embodiments of this specification are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of an embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is only a specific implementation of the embodiments of this specification. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the embodiments of this specification, and these improvements and refinements shall also be regarded as falling within the protection scope of the embodiments of this specification.