CN109165683A - Sample prediction method, apparatus and storage medium based on federated training - Google Patents

Sample prediction method, apparatus and storage medium based on federated training

Info

Publication number
CN109165683A
CN109165683A · CN201810913869.3A
Authority
CN
China
Prior art keywords
sample
training
node
split
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810913869.3A
Other languages
Chinese (zh)
Other versions
CN109165683B (en)
Inventor
范涛
成柯葳
马国强
刘洋
陈天健
杨强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN201810913869.3A
Publication of CN109165683A
Priority to PCT/CN2019/080297 (WO2020029590A1)
Application granted
Publication of CN109165683B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers

Abstract

The invention discloses a sample prediction method based on federated training, comprising the following steps: performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples; and performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample class of the sample to be predicted or obtain a prediction score for the sample to be predicted. The invention also discloses a sample prediction apparatus based on federated training and a computer-readable storage medium. The invention realizes federated training and modeling using the training samples of different data parties, and then realizes sample prediction based on the established model.

Description

Sample prediction method, apparatus and storage medium based on federated training
Technical field
The present invention relates to the field of machine learning technology, and more particularly to a sample prediction method, apparatus and computer-readable storage medium based on federated training.
Background technique
In the current information age, certain behaviors of people, such as consumer behavior, can be represented by data. Big data analysis has been derived from this: corresponding behavior analysis models are constructed through machine learning, which can then be used to classify people's behavior or to make predictions based on users' behavioral characteristics.
In existing machine learning techniques, sample data is usually trained by a single party on a standalone basis, that is, single-party modeling. Based on the established mathematical model, the features of relatively high importance in the sample feature set can be determined. However, in many cross-domain big data analysis scenarios, a user may have both consumption behavior and borrowing behavior, where the consumption behavior data is generated at a consumer service provider and the borrowing behavior data is generated at a financial service provider. If the financial service provider needs to predict a user's borrowing behavior based on the user's consumption behavior features, it needs to use the consumption behavior data of the consumer service provider together with its own borrowing behavior data to perform machine learning and construct a prediction model.
Therefore, for the above application scenario, a new modeling approach is needed to realize joint training on the sample data of different data providers, so that both parties can jointly participate in modeling.
Summary of the invention
The main purpose of the present invention is to provide a sample prediction method, apparatus and computer-readable storage medium based on federated training, aiming to solve the technical problem that the prior art cannot realize joint training on the sample data of different data providers, and thus cannot realize joint participation of both parties in modeling and sample prediction.
To achieve the above object, the present invention provides a sample prediction method based on federated training, the method comprising the following steps:
performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample class of the sample to be predicted or obtain a prediction score for the sample to be predicted.
Optionally, the sample prediction method based on federated training comprises:
before performing federated training, interacting and encrypting the IDs of the sample data using a blind signature and the RSA encryption algorithm;
identifying the intersection of the two parties' samples by comparing the encrypted ID strings of both parties, and using the intersection as the aligned training samples.
Optionally, the two aligned training samples are a first training sample and a second training sample, respectively;
the attributes of the first training sample include sample IDs and a part of the sample features, and the attributes of the second training sample include sample IDs, another part of the sample features, and data labels;
the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
Optionally, performing federated training on the two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model comprises:
at the second data party, obtaining the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order and second-order gradients and sending them, together with the sample IDs of the sample set, to the first data party, so that the first data party calculates, based on the encrypted first-order and second-order gradients, the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing a regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party reuses the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
the second data party receiving and decrypting the encrypted gain values of all split nodes returned by the first data party;
at the second data party, calculating, based on the first-order and second-order gradients, the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
determining the globally best split node for the current round of node splitting based on the gain values of all split nodes calculated by the two parties;
partitioning the sample set corresponding to the current node based on the globally best split node of the current round, and generating new nodes so as to construct a regression tree of the gradient boosting tree model.
Optionally, before the step of obtaining, at the second data party, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the method further comprises:
when performing node splitting, judging whether the current round of node splitting corresponds to constructing the first regression tree;
if the current round of node splitting corresponds to constructing the first regression tree, judging whether it is the first round of node splitting for constructing the first regression tree;
if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, at the second data party, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round; if it is not the first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
if the current round of node splitting corresponds to constructing a regression tree other than the first, judging whether it is the first round of node splitting for constructing that regression tree;
if the current round of node splitting is the first round of node splitting for constructing a regression tree other than the first, updating the first-order and second-order gradients according to the previous round of federated training; if it is not the first round of node splitting for that regression tree, reusing the first-order and second-order gradients used in the first round of node splitting.
Optionally, the sample prediction method based on federated training further comprises:
when generating a new node to construct a regression tree of the gradient boosting tree model, judging, at the second data party, whether the depth of the current regression tree reaches a preset depth threshold;
if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise, continuing with the next round of node splitting;
when node splitting is stopped, judging, at the second data party, whether the total number of regression trees reaches a preset quantity threshold;
if the total number of regression trees reaches the preset quantity threshold, stopping the federated training; otherwise, continuing with the next round of federated training.
Optionally, the sample prediction method based on federated training further comprises:
at the second data party, recording the related information of the globally best split node determined in each round of node splitting;
wherein the related information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
Optionally, counting the average gain value of the split nodes corresponding to the same feature in the gradient boosting tree model comprises:
at the second data party, using each globally best split node as a split node of each regression tree in the gradient boosting tree model, and counting the average gain value of the split nodes corresponding to the same feature code.
Optionally, performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample class of the sample to be predicted or obtain a prediction score for the sample to be predicted, comprises:
at the second data party, traversing the regression trees corresponding to the gradient boosting tree model;
if the attribute value of the currently traversed node is recorded at the second data party, determining the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node;
if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party determines the next node to traverse by comparing the data point of its local sample to be predicted with the attribute value of the currently traversed node, and returns the node information to the second data party;
when the traversal of the regression trees corresponding to the gradient boosting tree model is completed, determining the sample class of the sample to be predicted based on the labels of the sample data corresponding to the node to which the sample to be predicted belongs, or obtaining a prediction score for the sample to be predicted based on the weight value of the node to which it belongs.
Further, to achieve the above object, the present invention also provides a sample prediction apparatus based on federated training, the apparatus comprising a memory, a processor, and a sample prediction program stored on the memory and executable on the processor, wherein the sample prediction program, when executed by the processor, implements the steps of the sample prediction method based on federated training as described in any one of the above.
Further, to achieve the above object, the present invention also provides a computer-readable storage medium having a sample prediction program stored thereon, wherein the sample prediction program, when executed by a processor, implements the steps of the sample prediction method based on federated training as described in any one of the above.
The present invention performs federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model is a set of regression trees comprising multiple regression trees, and each split node of each regression tree corresponds to one feature of the training samples; finally, joint prediction is performed on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample class of the sample to be predicted or obtain a prediction score for it. The present invention realizes federated training and modeling using the training samples of different data parties, and can thus realize prediction on samples whose features come from multiple parties' sample data.
Detailed description of the invention
Fig. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the sample prediction apparatus based on federated training of the present invention;
Fig. 2 is a schematic flowchart of an embodiment of the sample prediction method based on federated training of the present invention;
Fig. 3 is a schematic flowchart of sample alignment in an embodiment of the sample prediction method based on federated training of the present invention;
Fig. 4 is a detailed flowchart of an embodiment of step S10 in Fig. 2;
Fig. 5 is a schematic diagram of training results in an embodiment of the sample prediction method based on federated training of the present invention.
The realization of the objects, functional features and advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.
Specific embodiment
It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
The present invention provides a sample prediction apparatus based on federated training.
As shown in Fig. 1, Fig. 1 is a schematic structural diagram of the hardware operating environment involved in embodiments of the sample prediction apparatus based on federated training of the present invention.
The sample prediction apparatus based on federated training of the present invention may be a PC, or a device with computing capability such as a server.
As shown in Fig. 1, the sample prediction apparatus based on federated training may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and optionally may also include standard wired and wireless interfaces. The network interface 1004 may optionally include standard wired and wireless interfaces (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a stable non-volatile memory such as a magnetic disk memory, and may optionally be a storage device independent of the aforementioned processor 1001.
Those skilled in the art will understand that the apparatus structure shown in Fig. 1 does not constitute a limitation on the apparatus, which may include more or fewer components than illustrated, or combine certain components, or have a different component arrangement.
As shown in Fig. 1, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a sample prediction program.
In the sample prediction apparatus based on federated training shown in Fig. 1, the network interface 1004 is mainly used to connect to a background server and perform data communication with it; the user interface 1003 is mainly used to connect to a client (user side) and perform data communication with it; and the processor 1001 may be used to call the sample prediction program stored in the memory 1005 and perform the following operations:
performing federated training on two aligned training samples using the XGBoost algorithm to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises multiple regression trees, and each split node of a regression tree corresponds to one feature of the training samples;
performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample class of the sample to be predicted or obtain a prediction score for the sample to be predicted.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to also perform the following operations:
before performing federated training, interacting and encrypting the IDs of the sample data using a blind signature and the RSA encryption algorithm;
identifying the intersection of the two parties' samples by comparing the encrypted ID strings of both parties, and using the intersection as the aligned training samples.
Further, the two aligned training samples are a first training sample and a second training sample, respectively; the attributes of the first training sample include sample IDs and a part of the sample features, and the attributes of the second training sample include sample IDs, another part of the sample features, and data labels; the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party. The processor 1001 calls the sample prediction program stored in the memory 1005 to also perform the following operations:
at the second data party, obtaining the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order and second-order gradients and sending them, together with the sample IDs of the sample set, to the first data party, so that the first data party calculates, based on the encrypted first-order and second-order gradients, the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing a regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party reuses the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
the second data party receiving and decrypting the encrypted gain values of all split nodes returned by the first data party;
at the second data party, calculating, based on the first-order and second-order gradients, the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
determining the globally best split node for the current round of node splitting based on the gain values of all split nodes calculated by the two parties;
partitioning the sample set corresponding to the current node based on the globally best split node of the current round, and generating new nodes so as to construct a regression tree of the gradient boosting tree model.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to also perform the following operations:
when performing node splitting, judging whether the current round of node splitting corresponds to constructing the first regression tree;
if the current round of node splitting corresponds to constructing the first regression tree, judging whether it is the first round of node splitting for constructing the first regression tree;
if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, at the second data party, the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round; if it is not the first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients used in the first round of node splitting;
if the current round of node splitting corresponds to constructing a regression tree other than the first, judging whether it is the first round of node splitting for constructing that regression tree;
if the current round of node splitting is the first round of node splitting for constructing a regression tree other than the first, updating the first-order and second-order gradients according to the previous round of federated training; if it is not the first round of node splitting for that regression tree, reusing the first-order and second-order gradients used in the first round of node splitting.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to also perform the following operations:
at the first data party, calculating, based on the encrypted first-order and second-order gradients, the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
or, at the first data party, reusing the first-order and second-order gradients used in the first round of node splitting to calculate the gain value of the split node under each partition mode for its local training samples corresponding to the sample IDs;
encrypting the gain values of all split nodes and sending them to the second data party.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to also perform the following operations:
when generating a new node to construct a regression tree of the gradient boosting tree model, judging, at the second data party, whether the depth of the current regression tree reaches a preset depth threshold;
if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise, continuing with the next round of node splitting;
when node splitting is stopped, judging, at the second data party, whether the total number of regression trees reaches a preset quantity threshold;
if the total number of regression trees reaches the preset quantity threshold, stopping the federated training; otherwise, continuing with the next round of federated training.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to also perform the following operations:
at the second data party, recording the related information of the globally best split node determined in each round of node splitting;
wherein the related information includes: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
Further, the processor 1001 calls the sample prediction program stored in the memory 1005 to also perform the following operations:
at the second data party, traversing the regression trees corresponding to the gradient boosting tree model;
if the attribute value of the currently traversed node is recorded at the second data party, determining the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node;
if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party determines the next node to traverse by comparing the data point of its local sample to be predicted with the attribute value of the currently traversed node, and returns the node information to the second data party;
when the traversal of the regression trees corresponding to the gradient boosting tree model is completed, determining the sample class of the sample to be predicted based on the labels of the sample data corresponding to the node to which the sample to be predicted belongs, or obtaining a prediction score for the sample to be predicted based on the weight value of the node to which it belongs.
Based on the hardware operating environment involved in the above embodiments of the sample prediction apparatus based on federated training, the following embodiments of the sample prediction method based on federated training of the present invention are proposed.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of an embodiment of the sample prediction method based on federated training of the present invention. In this embodiment, the sample prediction method based on federated training comprises the following steps:
Step S10 carries out federal training using the training sample that XGboost algorithm is aligned two, is mentioned with constructing gradient Rise tree-model, wherein it includes more regression trees, the corresponding instruction of a split vertexes of the regression tree that the gradient, which promotes tree-model, Practice a feature of sample;
XGboost (eXtreme Gradient Boosting) algorithm is in GBDT (Gradient Boosting Decision Tree, gradient boosted tree) improvement that Boosting algorithm is carried out on the basis of algorithm, the use of internal decision making tree Be regression tree, it includes more regression trees that algorithm output, which is the set of regression tree, and the basic ideas of training study are traversal instructions All dividing methods (namely mode of node split) for practicing all features of sample, select the dividing method of loss reduction, obtain Two leaves (namely split vertexes and generate new node), then proceed to traverse, until:
(1) stop splitting condition if meeting, export a regression tree;
(2) stop iterated conditional if meeting, export a regression tree set.
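The loss-minimizing partition described above can be made concrete with the standard XGBoost gain formula over first-order (g) and second-order (h) gradient sums. The patent text does not spell the formula out, so the sketch below is only an illustration of the conventional XGBoost criterion; the regularization parameters `lam` and `gamma` are assumptions.

```python
# Sketch of the standard XGBoost split-gain criterion: a split is kept when
# the regularized score of the two children exceeds that of the parent.

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into (left, right) children, given each
    side's summed first-order (g) and second-order (h) gradients."""
    def score(g, h):
        return g * g / (h + lam)
    g_parent = g_left + g_right
    h_parent = h_left + h_right
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_parent, h_parent)) - gamma

# For logistic loss, each sample contributes g = p - y and h = p * (1 - p),
# where p is the current predicted probability and y is the label.
print(split_gain(-2.0, 1.5, 3.0, 2.5))
```

In the vertically partitioned setting of this patent, each party would evaluate this gain only for partitions over its own features, using the (encrypted) gradient sums supplied by the label-holding party.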
In this embodiment, the training samples used by the XGBoost algorithm are two independent training samples, that is, each training sample belongs to a different data party. If the two training samples are viewed as one whole training sample, then, since they belong to different data parties, this can be viewed as splitting the whole training sample, with each party holding different features of the same samples (the samples are vertically partitioned).
Further, since the two training samples belong to different data parties, sample alignment needs to be performed on the raw sample data provided by both parties in order to realize federated training and modeling.
In this embodiment, federated training means that the sample training process is completed jointly and cooperatively by the two data parties; in the regression trees comprised in the finally trained gradient boosting tree model, the split nodes correspond to features from both parties' training samples.
In the XGBoost algorithm, when traversing all partition methods of all features of the training samples, the quality of a partition method is evaluated by its gain value, and each split node selects the partition method with the minimum loss. Therefore, the gain value of a split node can be used as the basis for evaluating feature importance: the larger the gain value of a split node, the smaller the loss of the node partition, and hence the more important the feature corresponding to that split node.
In this embodiment, since the trained gradient boosting tree model comprises multiple regression trees, and different regression trees may partition nodes with the same feature, it is necessary to count the average gain value of the split nodes corresponding to the same feature across all regression trees comprised in the gradient boosting tree model, and to use the average gain value as the score of the corresponding feature.
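The feature-scoring step described above can be sketched as a simple aggregation over the recorded split information (provider, feature code, gain value). The dictionary field names below are illustrative assumptions, not the patent's own data structures.

```python
# Sketch: average the gain values of all split nodes that share the same
# (provider, feature code) pair across the trees of the trained model.
from collections import defaultdict

def feature_scores(split_records):
    """split_records: list of dicts with keys 'provider', 'feature', 'gain'."""
    totals = defaultdict(lambda: [0.0, 0])
    for rec in split_records:
        key = (rec["provider"], rec["feature"])  # feature codes are per-party
        totals[key][0] += rec["gain"]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}

records = [
    {"provider": "A", "feature": "f1", "gain": 2.0},
    {"provider": "A", "feature": "f1", "gain": 4.0},  # same feature, another tree
    {"provider": "B", "feature": "f7", "gain": 1.5},
]
scores = feature_scores(records)
print(scores)
```

Because each party only ever learns feature codes rather than raw feature names of the other party, this scoring can be computed at the label-holding party without exposing the other party's feature semantics.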
Step S20 promotes tree-model based on the gradient, treats forecast sample and carries out associated prediction, to be predicted with determination The sample class of sample or the prediction score for obtaining sample to be predicted.
In the present embodiment, promoting tree-model using the gradient that the training of XGboost algorithm obtains be may be implemented to forecast sample Associated prediction is carried out, forecast sample is classified or given a mark to realize.
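The joint traversal in step S20 can be sketched for a single regression tree as follows. Nodes whose split attribute is held by the second data party (B) are resolved locally; nodes held by the first data party (A) are resolved by a query to A, simulated here as a callback. All structures and names are illustrative assumptions; a full model would sum the leaf weights over all trees.

```python
# Sketch of jointly traversing one regression tree when the split thresholds
# are distributed across two parties.

def traverse(tree, node_id, sample_b, query_party_a):
    node = tree[node_id]
    if "weight" in node:                      # leaf: return its weight
        return node["weight"]
    if node["owner"] == "B":                  # threshold known locally at B
        go_left = sample_b[node["feature"]] < node["threshold"]
    else:                                     # ask party A to compare privately
        go_left = query_party_a(node_id)
    child = node["left"] if go_left else node["right"]
    return traverse(tree, child, sample_b, query_party_a)

tree = {
    0: {"owner": "B", "feature": "x1", "threshold": 0.5, "left": 1, "right": 2},
    1: {"owner": "A", "left": 3, "right": 4},   # A holds this node's threshold
    2: {"weight": 0.9},
    3: {"weight": -0.2},
    4: {"weight": 0.4},
}
# Party A answers the comparison for its own nodes and returns only the branch.
score = traverse(tree, 0, {"x1": 0.3}, query_party_a=lambda nid: False)
print(score)
```

Note that A returns only the branch decision (node information), never its raw feature values, which matches the query-and-return protocol described above.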
The present embodiment carries out federal training using the training sample that XGboost algorithm is aligned two, is mentioned with constructing gradient Rise tree-model, wherein it is regression tree set that gradient, which promotes tree-model, comprising there are more regression trees, one of every regression tree Split vertexes correspond to a feature of training sample;Tree-model is finally being promoted based on gradient, forecast sample is being treated and is combined Prediction, with the prediction score of the sample class of determination sample to be predicted or acquisition sample to be predicted.The present invention is realized using not Training sample with data side carries out federal training modeling, and then can realize and carry out to the sample with multi-party sample data feature Prediction.
Further, to guarantee that the sample gradients used by the different data parties are consistent during federated modeling, the two data parties first perform sample alignment before federated modeling; the specific process flow is shown in Figure 3.
Sample alignment between the two parties applies an interactive encryption scheme to the sample IDs using blind signatures and the RSA encryption algorithm. By comparing the encrypted ID strings, the intersecting and non-intersecting parts of both parties' samples are identified (the non-intersecting parts remain invisible to the other party), thereby protecting the privacy of the non-intersecting sample data. The invention therefore encrypts the sample data during the alignment procedure.
Suppose the sample IDs of data party A are XA = {u1, u2, u3, u4} and the sample IDs of data party B are XB = {u1, u2, u3, u5}; the blind-signature encoding of a value x is E(x); party B generates the RSA key (n, e, d) and party A obtains the public key (n, e). The procedure is then as follows:
(1) Party A encrypts its IDs: YA = { (r^e mod n) * E(u) | u ∈ XA }, where r is a different random number generated for each distinct sample ID in XA; party A then sends YA to party B;
(2) Party B encrypts the ID strings again: ZA = { y^d mod n | y ∈ YA }, and sends the doubly encrypted strings ZA back to party A;
(3) Party A removes the blinding factor from ZA and hashes the result: DA = { E(z * r^(-1) mod n) | z ∈ ZA }, which equals { E(E(u)^d mod n) | u ∈ XA };
(4) Party B encrypts its own IDs: ZB = { E(E(u)^d mod n) | u ∈ XB }, and sends ZB to party A;
(5) Party A compares DA and ZB. If two encrypted strings are equal, the corresponding IDs are equal; the equal IDs form the sample intersection ({u1, u2, u3}) and are retained. The unequal parts ({u4, u5}) are in encrypted form and invisible to both parties, and can be discarded.
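The five steps above can be sketched end to end. This is a toy illustration under stated assumptions: SHA-256 stands in for the blind-signature encoding E(.), and the RSA modulus is tiny for readability (a real deployment would use at least 2048-bit keys):

```python
import hashlib
import random
from math import gcd

# Toy RSA parameters, for illustration only.
P, Q = 1009, 1013
N = P * Q
E_PUB = 17
D_PRIV = pow(E_PUB, -1, (P - 1) * (Q - 1))   # party B's private exponent

def h(u: str) -> int:
    """Hash a sample ID into Z_n; stands in for the E(.) encoding in the text."""
    return int(hashlib.sha256(u.encode()).hexdigest(), 16) % N

def digest(x: int) -> str:
    return hashlib.sha256(str(x).encode()).hexdigest()

def private_intersection(ids_a, ids_b):
    # (1) Party A blinds each hashed ID with a fresh random unit r: y = r^e * h(u) mod n.
    r = {}
    for u in ids_a:
        while True:
            c = random.randrange(2, N - 1)
            if gcd(c, N) == 1:
                r[u] = c
                break
    y_a = {u: (pow(r[u], E_PUB, N) * h(u)) % N for u in ids_a}
    # (2) Party B signs the blinded values: z = y^d = r * h(u)^d mod n.
    z_a = {u: pow(y, D_PRIV, N) for u, y in y_a.items()}
    # (3) Party A unblinds and re-hashes: D_A = digest(h(u)^d mod n).
    d_a = {u: digest((z * pow(r[u], -1, N)) % N) for u, z in z_a.items()}
    # (4) Party B signs and hashes its own IDs: Z_B = digest(h(u)^d mod n).
    z_b = {digest(pow(h(u), D_PRIV, N)) for u in ids_b}
    # (5) Matching digests identify the intersection; everything else stays opaque.
    return sorted(u for u, dg in d_a.items() if dg in z_b)

print(private_intersection(["u1", "u2", "u3", "u4"], ["u1", "u2", "u3", "u5"]))
# -> ['u1', 'u2', 'u3']
```

The unblinding in step (3) works because z = (r^e * h(u))^d = r * h(u)^d mod n, so multiplying by r^(-1) leaves the deterministic signature h(u)^d that both parties can compare after hashing.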
Further, for ease of describing the specific implementation of the joint training of the invention, this embodiment is illustrated with two independent training samples.
In this embodiment, the first data party provides the first training sample, whose attributes include the sample ID and part of the sample features; the second data party provides the second training sample, whose attributes include the sample ID, the remaining sample features, and the data label.
Here, sample features are the characteristics a sample exhibits or possesses; for example, if the samples are people, the features may be age, gender, income, education, and so on. The data label is used to classify the different samples; the classification result is determined from the features of the sample.
The main significance of the federated training and modeling of the invention is to realize two-way privacy protection of both parties' sample data. Therefore, during federated training, the first training sample is stored locally at the first data party and the second training sample is stored locally at the second data party. For example, the data in Table 1 below is provided by and stored locally at the first data party, and the data in Table 2 is provided by and stored locally at the second data party.
Table 1
As shown in Table 1, the first training sample attributes include the sample ID (X1~X5), the Age feature, the Gender feature, and the Amount of given credit feature.
Table 2
Sample ID Bill Payment Education Lable
X1 3102 2 24
X2 17250 3 14
X3 14027 2 16
X4 6787 1 10
X5 280 1 26
As shown in Table 2 above, the second training sample attributes include the sample ID (X1~X5), the Bill Payment feature, the Education feature, and the data label Lable.
Further, referring to Fig. 4, which is a detailed flow diagram of one embodiment of step S10 in Fig. 2: based on the above embodiment, in this embodiment, step S10 specifically includes:
Step S101: at the second data party, obtain the first-order gradient and second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
The XGBoost algorithm is a machine learning modeling method. A classifier (that is, a classification function) is needed to map sample data to one of the given classes so that the model can be applied to data prediction, and in the process of learning the classification rules with the classifier, a loss function is used to judge the magnitude of the fitting error.
In this embodiment, every time node splitting is performed, the first-order and second-order gradients of each training sample in the sample set corresponding to the current round of splitting are obtained at the second data party.
Here, the gradient boosting tree model requires multiple rounds of federated training; each round of federated training generates one regression tree, and generating one regression tree requires multiple node splits.
Therefore, within each round of federated training, the first node split uses the initially saved training samples, and each subsequent node split uses the training samples of the sample sets corresponding to the new nodes produced by the previous split; within the same round of federated training, every node split reuses the first-order and second-order gradients used in the first split of that round, while the next round of federated training uses the previous round's training result to update the first-order and second-order gradients used in that round.
The XGBoost algorithm supports custom loss functions; taking the first-order and second-order partial derivatives of the objective function under the custom loss function yields the first-order and second-order gradients of the local sample data to be trained.
Therefore, following the explanation of the XGBoost algorithm and the gradient boosting tree model in the above embodiment, constructing a regression tree requires determining split nodes, and split nodes are determined by their gain values. The gain value is computed as:
gain = 1/2 [ (Σ_{i∈I_L} g_i)² / (Σ_{i∈I_L} h_i + λ) + (Σ_{i∈I_R} g_i)² / (Σ_{i∈I_R} h_i + λ) − (Σ_{i∈I} g_i)² / (Σ_{i∈I} h_i + λ) ] − γ
where I_L is the sample set contained in the left child node after the current node is split, I_R is the sample set contained in the right child node, I = I_L ∪ I_R, g_i is the first-order gradient of sample i, h_i is the second-order gradient of sample i, and λ, γ are constants.
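The gain computation can be written directly in code. This is a minimal sketch, not the patent's implementation; `lam` and `gamma` stand in for the constants λ and γ:

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right sample sets I_L and I_R.
    g_* / h_* hold the per-sample first- and second-order gradients."""
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)
    gl, hl = sum(g_left), sum(h_left)
    gr, hr = sum(g_right), sum(h_right)
    # 1/2 [ G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - (G_L+G_R)^2/(H_L+H_R+lam) ] - gamma
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# A split that separates positive from negative gradients has high gain:
print(split_gain([1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, 1.0]))
# 0.5 * (4/3 + 4/3 - 0) ≈ 1.333...
```

Each party can evaluate this formula for every candidate partition of its own features, which is exactly what the following steps distribute between the two data parties.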
Since the sample data to be trained resides separately at the first data party and the second data party, the gain of each candidate split of the respective sample data must be computed separately at the first data party and at the second data party.
In this embodiment, because the first data party has performed sample alignment with the second data party in advance, both parties share the same gradients; and because the data labels reside in the second data party's sample data, the gains of both parties' candidate splits are computed based on the first-order and second-order gradients of the second data party's sample data.
Step S102: if the current round of node splitting is the first split in constructing a regression tree, encrypt the first-order gradient and the second-order gradient and send them, together with the sample IDs of the sample set, to the first data party, so that the first data party can compute, based on the encrypted first-order and second-order gradients, the gain of each candidate split of its local training samples corresponding to those sample IDs;
In this embodiment, to realize two-way privacy protection of both parties' sample data during federated training, if the current round of node splitting is the first split in constructing a regression tree, the first-order and second-order gradients of the sample data computed at the second data party are first encrypted and then sent to the first data party.
At the first data party, based on the received encrypted first-order and second-order gradients and the gain formula above, the gains of all candidate splits of the first data party's local sample data are computed. Since the first-order and second-order gradients are encrypted, the computed gains are also ciphertext values, so the gains themselves need no further encryption.
After the gains of the split candidates under the various partitioning schemes of the sample data are computed, nodes can be split to generate new nodes and thereby construct the regression tree. In this embodiment, construction of the regression trees of the gradient boosting tree model is preferably led by the second data party, which holds the data labels; therefore, the gains of all candidate splits of the first data party's local sample data, computed at the first data party, must be sent to the second data party.
Step S103: if the current round of node splitting is not the first split in constructing the regression tree, send the sample IDs of the sample set to the first data party, so that the first data party can reuse the first-order and second-order gradients used in the first split to compute the gain of each candidate split of its local training samples corresponding to those sample IDs;
In this embodiment, if the current round of node splitting is not the first split of the regression tree, only the sample IDs of the sample set corresponding to this round of splitting need be sent to the first data party; the first data party then reuses the first-order and second-order gradients used in the first split to compute the gains of all candidate splits of its local training samples corresponding to the received sample IDs.
Step S104: the second data party receives the encrypted gains of all split candidates returned by the first data party and decrypts them;
Step S105: at the second data party, based on the first-order gradient and the second-order gradient, compute the gain of each candidate split of the local training samples corresponding to the sample IDs;
At the second data party, based on the computed first-order and second-order gradients of the sample data and the gain formula above, the gains of all candidate splits of the second data party's local sample data to be trained are computed.
Step S106: based on the gains of all split candidates computed by the two parties, determine the globally best split node of the current round of node splitting;
Since the parties' initial sample data has been aligned, the gains of all split candidates computed separately by the two parties can be regarded as the gains of all candidate splits over the parties' combined data samples; therefore, by comparing the gain values, the candidate with the maximum gain is taken as the globally best split node of the current round.
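Step S106 reduces to an argmax over the union of both parties' candidate lists. A minimal sketch (the split descriptions and gain numbers are illustrative, not from the patent):

```python
def global_best_split(gains_a, gains_b):
    """Pick the round's globally best split from both parties' candidates.
    gains_*: lists of (split_description, gain) pairs, already decrypted."""
    return max(gains_a + gains_b, key=lambda cand: cand[1])

best = global_best_split(
    [("Age <= 35", 0.31), ("Gender == 1", 0.12)],                 # party A's candidates
    [("Bill Payment <= 3102", 0.58), ("Education <= 2", 0.07)],   # party B's candidates
)
print(best)  # ('Bill Payment <= 3102', 0.58)
```

In the protocol, only the second data party ever sees both lists in the clear, which is why it leads tree construction.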
It should be noted that the sample feature corresponding to the globally best split node may belong either to the first data party's training sample or to the second data party's training sample.
Optionally, since the construction of the regression trees of the gradient boosting tree model is led by the second data party, the second data party needs to record, for each round of node splitting, the relevant information of the globally best split node; this information includes the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
For example, if data party A holds the feature f_i corresponding to the globally best split point, the record is (Site A, EA(f_i), gain); conversely, if data party B holds f_i, the record is (Site B, EB(f_i), gain). Here EA(f_i) denotes data party A's encoding of feature f_i and EB(f_i) denotes data party B's encoding of f_i; the encoding identifies feature f_i without revealing its original feature data.
Optionally, when performing feature selection in the above embodiment, each globally best split node is preferably treated as a split node of a regression tree in the gradient boosting tree model, and the average gain of the split nodes corresponding to the same feature code is computed.
Step S107: based on the globally best split node of the current round, split the sample set corresponding to the current node and generate new nodes, so as to construct a regression tree of the gradient boosting tree model.
If the sample feature corresponding to the globally best split node of the current round belongs to the first data party's training sample, the sample data corresponding to the current node being split belongs to the first data party. Correspondingly, if it belongs to the second data party's training sample, the sample data corresponding to the current node belongs to the second data party.
Node splitting produces new nodes (a left child node and a right child node), thereby constructing the regression tree. Through multiple rounds of node splitting, new nodes are generated continuously and a deeper regression tree is obtained; when node splitting stops, a regression tree of the gradient boosting tree model is obtained.
In this embodiment, since the data communicated between the parties consists entirely of encrypted intermediate model results, the training process does not leak the original feature data, and the encryption algorithm guarantees data privacy throughout training. A partially homomorphic encryption algorithm supporting additive homomorphism is preferably used.
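The text names only "a partially homomorphic encryption algorithm supporting additive homomorphism"; Paillier is a common choice, though the patent does not name it. A toy Paillier sketch (tiny primes, for illustration only) shows the property needed here: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so encrypted gradients can be aggregated without decryption.

```python
import random
from math import gcd

P, Q = 293, 433                                 # toy primes; real keys are much larger
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // gcd(P - 1, Q - 1)    # lcm(p-1, q-1)
G = N + 1                                       # standard simplified generator
MU = pow(LAM, -1, N)                            # since g = n+1, L(g^LAM mod n^2) = LAM

def encrypt(m: int) -> int:
    while True:
        r = random.randrange(2, N)
        if gcd(r, N) == 1:
            break
    return (pow(G, m, N2) * pow(r, N, N2)) % N2

def decrypt(c: int) -> int:
    l = (pow(c, LAM, N2) - 1) // N              # the L(x) = (x-1)/n function
    return (l * MU) % N

def add_encrypted(c1: int, c2: int) -> int:
    """Homomorphic addition: Dec(c1 * c2 mod n^2) = m1 + m2 (mod n)."""
    return (c1 * c2) % N2

g1, g2 = 17, 25                                 # two integer-encoded gradients
print(decrypt(add_encrypted(encrypt(g1), encrypt(g2))))  # 42
```

In the protocol, the second data party would hold (LAM, MU) as the private key, while the first data party only ever multiplies ciphertexts to form the per-bucket gradient sums that the gain formula needs.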
Further, in one embodiment, depending on the node splitting condition, the first-order and second-order gradients of the training samples used for node splitting are obtained as follows:
1. The current round of node splitting constructs the first regression tree
1.1 If the current split is the first split in constructing the first regression tree, initialize, at the second data party, the first-order and second-order gradients of each training sample in the sample set corresponding to this split;
1.2 If the current split is not the first split in constructing the first regression tree, reuse the first-order and second-order gradients used in the first split.
2. The current round of node splitting constructs a non-first regression tree
2.1 If the current split is the first split in constructing a non-first regression tree, update the first-order and second-order gradients according to the previous round of federated training;
2.2 If the current split is not the first split in constructing a non-first regression tree, reuse the first-order and second-order gradients used in the first split.
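The four cases above amount to a simple caching rule: gradients are (re)computed only at the first split of each tree and reused everywhere else. A minimal sketch, assuming a squared-error loss (the patent allows any custom loss; the class and function names are illustrative):

```python
def grad_hess(y_true, y_pred):
    """First/second-order gradients of the squared-error loss 1/2 (y - yhat)^2."""
    return [p - t for p, t in zip(y_pred, y_true)], [1.0] * len(y_true)

class GradientCache:
    def __init__(self):
        self.g, self.h = None, None

    def for_split(self, is_first_split, y_true, ensemble_pred):
        # Cases 1.1 / 2.1: the first split of a tree (re)computes gradients from
        # the current ensemble prediction (a base score for the very first tree).
        if is_first_split:
            self.g, self.h = grad_hess(y_true, ensemble_pred)
        # Cases 1.2 / 2.2: later splits within the same tree reuse the cache.
        return self.g, self.h

cache = GradientCache()
g1, _ = cache.for_split(True, [1.0, 0.0], [0.5, 0.5])    # first split: compute
g2, _ = cache.for_split(False, [1.0, 0.0], [9.9, 9.9])   # reused; preds ignored
print(g1 == g2)  # True
```

This mirrors the protocol's communication pattern: the encrypted gradients are shipped to the first data party only once per tree, and later splits send only sample IDs.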
Further, in one embodiment, to reduce the complexity of the regression trees, a depth threshold is preset for the regression trees to limit node splitting.
In this embodiment, each time a new node is generated in constructing a regression tree of the gradient boosting tree model, the second data party judges whether the depth of the current regression tree has reached the preset depth threshold;
If the depth of the current regression tree reaches the preset depth threshold, node splitting stops and a regression tree of the gradient boosting tree model is obtained; otherwise, the next round of node splitting continues.
It should be noted that the condition for limiting node splitting may also be to stop when a node can no longer be split, for example when the sample set corresponding to the current node cannot be divided further.
Further, in another embodiment, to avoid overfitting during training, a quantity threshold is preset for the regression trees to limit the number of regression trees generated.
In this embodiment, when node splitting stops, the second data party judges whether the total number of regression trees has reached the preset quantity threshold;
If the total number of regression trees reaches the preset quantity threshold, federated training stops; otherwise, the next round of federated training continues.
It should be noted that the condition limiting the number of regression trees generated may also be to stop constructing regression trees when nodes can no longer be split.
For a better understanding of the invention, the federated training and modeling process of the invention is illustrated below based on the sample data in Tables 1 and 2 of the above embodiment.
First round of federated training: training the first regression tree
(1) First round of node splitting
1.1 At the second data party, compute the first-order gradients (g_i) and second-order gradients (h_i) of the sample data in Table 2; encrypt g_i and h_i and send them to the first data party;
1.2 At the first data party, based on g_i and h_i, compute the gain of the split candidate under every possible partition of the sample data in Table 1, and send the gains to the second data party;
Since in Table 1 the Age feature has 5 partitioning modes, the Gender feature has 2, and the Amount of given credit feature has 5, the sample data in Table 1 has 12 partitioning modes in total, i.e., the gains of 12 candidate split nodes must be computed.
1.3 At the second data party, compute the gain of the split candidate under every possible partition of the sample data in Table 2;
Since in Table 2 the Bill Payment feature has 5 partitioning modes and the Education feature has 3, the sample data in Table 2 has 8 partitioning modes in total, i.e., the gains of 8 candidate split nodes must be computed.
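The counts above follow from one candidate threshold per distinct feature value. A small sketch over party B's Table 2 data (party A's Table 1 values are not reproduced in the text, so only Table 2 is used):

```python
def candidate_split_counts(features):
    """One 'feature <= t' candidate per distinct value of each feature."""
    return {name: len(set(values)) for name, values in features.items()}

table2 = {
    "Bill Payment": [3102, 17250, 14027, 6787, 280],  # 5 distinct values
    "Education":    [2, 3, 2, 1, 1],                  # 3 distinct values
}
counts = candidate_split_counts(table2)
print(counts)                # {'Bill Payment': 5, 'Education': 3}
print(sum(counts.values()))  # 8 candidate splits to score at party B
```

The same counting applied to Table 1 (5 Age values, 2 Gender values, 5 Amount of given credit values) gives the 12 candidates scored at party A.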
1.4 From the gains of the 12 candidate split nodes computed at the first data party and the 8 candidate split nodes computed at the second data party, select the feature with the maximum gain as the globally best split node of this round;
1.5 Based on the globally best split node of this round, split the sample data corresponding to the current node and generate new nodes, so as to construct the regression tree of the gradient boosting tree model;
1.6 Judge whether the depth of the current regression tree reaches the preset depth threshold; if so, stop node splitting and obtain a regression tree of the gradient boosting tree model, otherwise continue with the next round of node splitting;
1.7 Judge whether the total number of regression trees reaches the preset quantity threshold; if so, stop federated training, otherwise enter the next round of federated training.
(2) Second and third rounds of node splitting
2.1 Suppose the feature of the previous round's split is "Bill Payment less than or equal to 3102"; this feature serves as the split node (covering samples X1, X2, X3, X4, X5) and generates two new child nodes: the left node corresponds to the sample set less than or equal to 3102 (X1, X5) and the right node to the sample set greater than 3102 (X2, X3, X4). The sample sets (X1, X5) and (X2, X3, X4) then serve as the new sample sets for the second and third rounds of node splitting respectively, so that each of the two new nodes is split further to generate new nodes;
2.2 Since the second and third rounds of splitting belong to the same round of federated training, the sample gradient values used in the first round of splitting are reused. Suppose one split node of this round corresponds to the feature "Amount of given credit less than or equal to 200" (covering samples X1, X5); it generates two new child nodes, the left corresponding to sample X5 (less than or equal to 200) and the right to sample X1 (greater than 200). Similarly, suppose the other split node corresponds to "Age less than or equal to 35" (covering samples X2, X3, X4); it generates two new child nodes, the left corresponding to samples X2 and X3 (less than or equal to 35) and the right to sample X4 (greater than 35). The specific implementation follows the first-round splitting process.
Second round of federated training: training the second regression tree
3.1 Since this round of node splitting belongs to the next round of federated training, the first-order and second-order gradients used in the previous round are updated with the previous round's training result; the second round of federated training then performs node splitting to generate new nodes and construct the next regression tree. The specific implementation follows the construction process of the previous regression tree.
3.2 As shown in Figure 5, after two rounds of federated training, the sample data in Tables 1 and 2 of the above embodiment produces two regression trees. The first regression tree contains three split nodes: "Bill Payment less than or equal to 3102", "Amount of given credit less than or equal to 200", and "Age less than or equal to 35". The second regression tree contains two split nodes: "Bill Payment less than or equal to 6787" and "Gender == 1".
3.3 Based on the two regression trees of the gradient boosting tree model shown in Figure 5, the average gains of the features of the sample data are: Bill Payment, (gain1 + gain4)/2; Education, 0; Age, gain3; Gender, gain5; Amount of given credit, gain2.
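The per-feature averaging in 3.3 can be sketched as follows; the concrete gain numbers are made up for illustration:

```python
from collections import defaultdict

def average_gain_per_feature(split_records):
    """split_records: (feature, gain) pairs collected from all trees' split nodes."""
    totals = defaultdict(lambda: [0.0, 0])
    for feature, gain in split_records:
        totals[feature][0] += gain
        totals[feature][1] += 1
    return {f: s / c for f, (s, c) in totals.items()}

records = [                                 # hypothetical gains gain1..gain5
    ("Bill Payment", 0.75),                 # gain1, first tree
    ("Amount of given credit", 0.50),       # gain2
    ("Age", 0.30),                          # gain3
    ("Bill Payment", 0.25),                 # gain4, second tree
    ("Gender", 0.20),                       # gain5
]
print(average_gain_per_feature(records)["Bill Payment"])  # (0.75 + 0.25) / 2 = 0.5
```

A feature that never appears as a split node, such as Education in the example, simply has no record and scores 0 by convention.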
Further, in an embodiment of the sample prediction method based on federated training of the invention, the specific implementation flow of joint prediction on a sample to be predicted includes:
(1) at the second data party, traversing the regression trees of the gradient boosting tree model;
(2) if the attribute value of the currently traversed node is recorded at the second data party, determining the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the current node;
(3) if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party determines the next node by comparing the data point of its local sample to be predicted with the attribute value of the current node, and returns that node information to the second data party;
(4) when all regression trees of the gradient boosting tree model have been traversed, determining the sample class of the sample to be predicted based on the sample data labels corresponding to the nodes the sample belongs to, or obtaining its prediction score based on the weight values of those nodes.
In this embodiment, since the split node records of the regression trees are stored at the second data party when the trees are generated, the joint prediction of the sample to be predicted is led by the second data party, specifically by traversing the regression trees of the gradient boosting tree model to determine the nodes the sample belongs to. The node a sample belongs to is determined by comparing the sample's data points with the attribute values of the currently traversed nodes.
After the nodes the sample belongs to have been determined, the sample class can be determined from the training sample data labels corresponding to those nodes, or the prediction score obtained from the weight values of those nodes.
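The traversal protocol can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Node` class, the `query_a` callback standing in for the query request to the first data party, and all thresholds are assumptions.

```python
class Node:
    """One node of a regression tree; owner marks which party holds its threshold."""
    def __init__(self, owner=None, feature=None, threshold=None,
                 left=None, right=None, leaf_value=None):
        self.owner, self.feature, self.threshold = owner, feature, threshold
        self.left, self.right, self.leaf_value = left, right, leaf_value

def predict(node, sample_b, query_a):
    """sample_b: the feature values visible to party B.
    query_a(feature): models the query request; party A compares its own
    local data against the node's threshold and answers 'go left?'."""
    while node.leaf_value is None:
        if node.owner == "B":
            go_left = sample_b[node.feature] <= node.threshold
        else:
            go_left = query_a(node.feature)   # branch decided at party A
        node = node.left if go_left else node.right
    return node.leaf_value

# Toy tree: root owned by party B, one internal node owned by party A.
tree = Node(owner="B", feature="Bill Payment", threshold=3102,
            left=Node(owner="A", feature="Age",
                      left=Node(leaf_value=0.5), right=Node(leaf_value=-0.2)),
            right=Node(leaf_value=0.1))

score = predict(tree, {"Bill Payment": 280}, query_a=lambda f: False)
print(score)  # -0.2
```

Summing the leaf values (or reading the leaf labels) reached across all trees gives the prediction score or class, as in step (4).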
The present invention also provides a computer readable storage medium.
A sample prediction program is stored on the computer readable storage medium of the invention; when executed by a processor, the sample prediction program implements the steps of the sample prediction method based on federated training described in any of the above embodiments.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be realized by software together with the necessary general hardware platform, and naturally also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM) and including instructions that cause a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the invention is not limited to the specific embodiments described, which are merely illustrative rather than restrictive. Inspired by the present invention, those skilled in the art can devise many other forms without departing from the scope protected by the purpose and claims of the invention; all equivalent structures and equivalent process transformations made using the description and drawings of the invention, whether applied directly or indirectly in other related technical fields, fall within the protection of the invention.

Claims (10)

1. A sample prediction method based on federated training, characterized in that the sample prediction method based on federated training comprises the following steps:
performing federated training using the XGBoost algorithm on two aligned training samples to construct a gradient boosting tree model, wherein the gradient boosting tree model comprises a plurality of regression trees, and each split node of a regression tree corresponds to one feature of a training sample;
performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample class of the sample to be predicted or obtain its prediction score.
2. The sample prediction method based on federated training according to claim 1, characterized in that the sample prediction method based on federated training comprises:
before performing federated training, interactively encrypting the IDs of the sample data using blind signatures and the RSA encryption algorithm;
identifying the intersecting part of both parties' samples by comparing the encrypted ID strings of the two parties, and taking the intersecting part of the samples as the aligned training samples.
3. The sample prediction method based on federated training according to claim 2, characterized in that the two aligned training samples are respectively a first training sample and a second training sample;
the attributes of the first training sample comprise a sample ID and part of the sample features, and the attributes of the second training sample comprise the sample ID, the remaining sample features, and a data label;
the first training sample is provided by a first data party and stored locally at the first data party, and the second training sample is provided by a second data party and stored locally at the second data party.
4. The sample prediction method based on federated training as claimed in claim 3, wherein performing federated training on the two aligned training sample sets using the XGBoost algorithm to construct a gradient boosting tree model comprises:
at the second data party, obtaining the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting;
if the current round of node splitting is the first round of node splitting for constructing a regression tree, encrypting the first-order gradients and the second-order gradients and sending them, together with the sample IDs of the sample set, to the first data party, so that the first data party computes, based on the encrypted first-order and second-order gradients, the gain value of each candidate split point under every partition mode for the local training samples corresponding to the sample IDs;
if the current round of node splitting is not the first round of node splitting for constructing the regression tree, sending the sample IDs of the sample set to the first data party, so that the first data party reuses the first-order and second-order gradients from the first round of node splitting to compute the gain value of each candidate split point under every partition mode for the local training samples corresponding to the sample IDs;
the second data party receiving and decrypting the encrypted gain values of all split points returned by the first data party;
at the second data party, computing, based on the first-order and second-order gradients, the gain value of each candidate split point under every partition mode for the local training samples corresponding to the sample IDs;
determining the globally best split point of the current round of node splitting based on the gain values of all split points computed by the two parties;
dividing the sample set corresponding to the current node according to the globally best split point of the current round of node splitting, and generating new nodes so as to construct a regression tree of the gradient boosting tree model.
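The gain value each party computes for a candidate split point can be sketched with the standard XGBoost split-gain formula over first-order (`g`) and second-order (`h`) gradient sums. This is a minimal illustration, not the patent's code; `lam` and `gamma` are the usual XGBoost regularisation parameters, and the example gradients are invented.

```python
def split_gain(g, h, left_idx, lam=1.0, gamma=0.0):
    """Standard XGBoost gain for splitting a node's samples at a candidate point."""
    def score(G, H):
        return G * G / (H + lam)
    gl = sum(g[i] for i in left_idx)   # gradient sums on the left branch
    hl = sum(h[i] for i in left_idx)
    gr = sum(g) - gl                   # remainder goes to the right branch
    hr = sum(h) - hl
    return 0.5 * (score(gl, hl) + score(gr, hr) - score(gl + gr, hl + hr)) - gamma

# Example: four samples; candidate split places samples 0 and 1 on the left.
g = [0.5, -0.2, 0.3, -0.4]
h = [0.25, 0.25, 0.25, 0.25]
print(round(split_gain(g, h, left_idx=[0, 1]), 4))  # 0.0233
```

In the protocol above, the first data party evaluates these sums under homomorphic encryption for its own candidate features, while the second party evaluates them in the clear for its features and the label it holds.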
5. The sample prediction method based on federated training as claimed in claim 4, wherein before the step of obtaining, at the second data party, the first-order gradient and the second-order gradient of each training sample in the sample set corresponding to the current round of node splitting, the method further comprises:
when performing node splitting, judging whether the current round of node splitting corresponds to constructing the first regression tree;
if the current round of node splitting corresponds to constructing the first regression tree, judging whether it is the first round of node splitting for constructing the first regression tree;
if the current round of node splitting is the first round of node splitting for constructing the first regression tree, initializing, at the second data party, the first-order gradient and the second-order gradient of each training sample in the corresponding sample set; if it is not the first round of node splitting for constructing the first regression tree, reusing the first-order and second-order gradients from the first round of node splitting;
if the current round of node splitting corresponds to constructing a regression tree other than the first, judging whether it is the first round of node splitting for that regression tree;
if the current round of node splitting is the first round of node splitting for constructing a regression tree other than the first, updating the first-order and second-order gradients according to the previous round of federated training; if it is not the first round of node splitting for that regression tree, reusing the first-order and second-order gradients from the first round of node splitting.
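The gradient bookkeeping in claim 5 (initialise once for the first tree, refresh at the start of each later tree, reuse within a tree) can be sketched as below. The squared-error gradient rule (`g = pred - y`, `h = 1`) and the base score of 0.5 are assumptions for illustration only; the patent does not fix a loss function here.

```python
def gradients_for_round(tree_index, is_first_split, state):
    """Return (g, h) for the current round of node splitting (illustrative)."""
    if is_first_split:
        if tree_index == 0:
            # First split of the first tree: initialise predictions to a base score.
            state["pred"] = [0.5] * len(state["label"])
        # First split of any tree: (re)compute gradients from current predictions.
        # Squared-error loss assumed: g = pred - y, h = 1.
        state["g"] = [p - y for p, y in zip(state["pred"], state["label"])]
        state["h"] = [1.0] * len(state["label"])
    # Non-first splits of a tree reuse the cached gradients unchanged.
    return state["g"], state["h"]

state = {"label": [1, 0]}
g, h = gradients_for_round(tree_index=0, is_first_split=True, state=state)
print(g, h)  # [-0.5, 0.5] [1.0, 1.0]
```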
6. The sample prediction method based on federated training as claimed in claim 4, wherein the sample prediction method based on federated training further comprises:
when generating new nodes to construct a regression tree of the gradient boosting tree model, judging, at the second data party, whether the depth of the current regression tree reaches a preset depth threshold;
if the depth of the current regression tree reaches the preset depth threshold, stopping node splitting to obtain one regression tree of the gradient boosting tree model; otherwise, continuing with the next round of node splitting;
when node splitting stops, judging, at the second data party, whether the total number of regression trees reaches a preset quantity threshold;
if the total number of regression trees reaches the preset quantity threshold, stopping the federated training; otherwise, continuing with the next round of federated training.
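The two stopping rules in claim 6 nest naturally as two loops: an inner loop bounded by the depth threshold ends one regression tree, and an outer loop bounded by the tree-count threshold ends the federated training. The sketch below is schematic; `split_one_level` stands in for a full round of federated node splitting and is not a name from the patent.

```python
MAX_DEPTH = 3   # preset depth threshold per regression tree
MAX_TREES = 5   # preset quantity threshold for the whole model

def train_model(split_one_level):
    trees = []
    while len(trees) < MAX_TREES:      # stop federated training at the tree limit
        depth = 0
        while depth < MAX_DEPTH:       # stop node splitting at the depth limit
            split_one_level(tree=len(trees), depth=depth)
            depth += 1
        trees.append(depth)            # one finished regression tree
    return trees

calls = []
trees = train_model(lambda tree, depth: calls.append((tree, depth)))
print(len(trees), len(calls))  # 5 15
```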
7. The sample prediction method based on federated training as claimed in claim 4, wherein the sample prediction method based on federated training further comprises:
at the second data party, recording the relevant information of the globally best split point determined in each round of node splitting;
wherein the relevant information comprises: the provider of the corresponding sample data, the feature code of the corresponding sample data, and the gain value.
8. The sample prediction method based on federated training as claimed in claim 7, wherein performing joint prediction on a sample to be predicted based on the gradient boosting tree model, so as to determine the sample class of the sample to be predicted or obtain the prediction score of the sample to be predicted, comprises:
at the second data party, traversing the regression trees of the gradient boosting tree model;
if the attribute value of the currently traversed node is recorded at the second data party, determining the next node to traverse by comparing the data point of the local sample to be predicted with the attribute value of the currently traversed node;
if the attribute value of the currently traversed node is recorded at the first data party, initiating a query request to the first data party, so that the first data party determines the next node to traverse by comparing the data point of its local sample to be predicted with the attribute value of the currently traversed node, and returns the node information to the second data party;
after the regression trees of the gradient boosting tree model have been traversed, determining the sample class of the sample to be predicted based on the data labels of the samples at the node to which the sample to be predicted belongs, or obtaining the prediction score of the sample to be predicted based on the weight values of the nodes to which the sample to be predicted belongs.
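The joint traversal in claim 8 can be sketched as below: the second party walks the tree, and whenever a node's split feature belongs to the first party, it issues a query and the first party resolves the comparison locally. The node layout, feature names, and `query_party_a` helper are all hypothetical, standing in for the patent's query request.

```python
# Toy federated tree: each internal node records which party owns its feature.
tree = {
    0: {"owner": "B", "feature": "balance", "thr": 100.0, "left": 1, "right": 2},
    1: {"owner": "A", "feature": "age", "thr": 30, "left": 3, "right": 4},
    2: {"leaf": 0.8}, 3: {"leaf": -0.2}, 4: {"leaf": 0.5},
}
features_a = {"age": 34}        # held only by the first data party
features_b = {"balance": 80.0}  # held only by the second data party

def query_party_a(node):
    """Stands in for the query request sent to the first data party."""
    return node["left"] if features_a[node["feature"]] < node["thr"] else node["right"]

def predict():
    nid = 0
    while "leaf" not in tree[nid]:
        node = tree[nid]
        if node["owner"] == "B":  # second party compares its own feature value
            nid = node["left"] if features_b[node["feature"]] < node["thr"] else node["right"]
        else:                     # first party resolves the comparison and replies
            nid = query_party_a(node)
    return tree[nid]["leaf"]

print(predict())  # balance 80 < 100 -> node 1; age 34 >= 30 -> node 4 -> 0.5
```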
9. A sample prediction device based on federated training, wherein the sample prediction device based on federated training comprises a memory, a processor, and a sample prediction program stored on the memory and executable on the processor, and when the sample prediction program is executed by the processor, the steps of the sample prediction method based on federated training according to any one of claims 1 to 8 are implemented.
10. A computer-readable storage medium, wherein a sample prediction program is stored on the computer-readable storage medium, and when the sample prediction program is executed by a processor, the steps of the sample prediction method based on federated training according to any one of claims 1 to 8 are implemented.
CN201810913869.3A 2018-08-10 2018-08-10 Sample prediction method, device and storage medium based on federal training Active CN109165683B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810913869.3A CN109165683B (en) 2018-08-10 2018-08-10 Sample prediction method, device and storage medium based on federal training
PCT/CN2019/080297 WO2020029590A1 (en) 2018-08-10 2019-03-29 Sample prediction method and device based on federated training, and storage medium

Publications (2)

Publication Number Publication Date
CN109165683A true CN109165683A (en) 2019-01-08
CN109165683B CN109165683B (en) 2023-09-12

Family

ID=64895662

Country Status (2)

Country Link
CN (1) CN109165683B (en)
WO (1) WO2020029590A1 (en)

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670484A (en) * 2019-01-16 2019-04-23 电子科技大学 A kind of mobile phone individual discrimination method based on bispectrum feature and boosted tree
CN110443378A (en) * 2019-08-02 2019-11-12 深圳前海微众银行股份有限公司 Feature correlation analysis method, device and readable storage medium storing program for executing in federation's study
CN110717671A (en) * 2019-10-08 2020-01-21 深圳前海微众银行股份有限公司 Method and device for determining contribution degree of participants
WO2020029590A1 (en) * 2018-08-10 2020-02-13 深圳前海微众银行股份有限公司 Sample prediction method and device based on federated training, and storage medium
CN110796266A (en) * 2019-10-30 2020-02-14 深圳前海微众银行股份有限公司 Method, device and storage medium for implementing reinforcement learning based on public information
CN110851869A (en) * 2019-11-14 2020-02-28 深圳前海微众银行股份有限公司 Sensitive information processing method and device and readable storage medium
CN110944011A (en) * 2019-12-16 2020-03-31 支付宝(杭州)信息技术有限公司 Joint prediction method and system based on tree model
CN110968886A (en) * 2019-12-20 2020-04-07 支付宝(杭州)信息技术有限公司 Method and system for screening training samples of machine learning model
CN111242385A (en) * 2020-01-19 2020-06-05 苏宁云计算有限公司 Prediction method, device and system of gradient lifting tree model
CN111309848A (en) * 2020-01-19 2020-06-19 苏宁云计算有限公司 Generation method and system of gradient lifting tree model
CN111444956A (en) * 2020-03-25 2020-07-24 平安科技(深圳)有限公司 Low-load information prediction method and device, computer system and readable storage medium
CN111598186A (en) * 2020-06-05 2020-08-28 腾讯科技(深圳)有限公司 Decision model training method, prediction method and device based on longitudinal federal learning
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN111695697A (en) * 2020-06-12 2020-09-22 深圳前海微众银行股份有限公司 Multi-party combined decision tree construction method and device and readable storage medium
CN111915019A (en) * 2020-08-07 2020-11-10 平安科技(深圳)有限公司 Federal learning method, system, computer device, and storage medium
CN112183759A (en) * 2019-07-04 2021-01-05 创新先进技术有限公司 Model training method, device and system
CN112199706A (en) * 2020-10-26 2021-01-08 支付宝(杭州)信息技术有限公司 Tree model training method and business prediction method based on multi-party safety calculation
CN112464287A (en) * 2020-12-12 2021-03-09 同济大学 Multi-party XGboost safety prediction model training method based on secret sharing and federal learning
CN112529101A (en) * 2020-12-24 2021-03-19 深圳前海微众银行股份有限公司 Method and device for training classification model, electronic equipment and storage medium
CN112651458A (en) * 2020-12-31 2021-04-13 深圳云天励飞技术股份有限公司 Method and device for training classification model, electronic equipment and storage medium
WO2021082634A1 (en) * 2019-10-29 2021-05-06 支付宝(杭州)信息技术有限公司 Tree model-based prediction method and apparatus
CN112766514A (en) * 2021-01-22 2021-05-07 支付宝(杭州)信息技术有限公司 Method, system and device for joint training of machine learning model
CN113392164A (en) * 2020-03-13 2021-09-14 京东城市(北京)数字科技有限公司 Method, main server, service platform and system for constructing longitudinal federated tree
CN113554476A (en) * 2020-04-23 2021-10-26 京东数字科技控股有限公司 Training method and system of credit prediction model, electronic device and storage medium
CN113642669A (en) * 2021-08-30 2021-11-12 平安医疗健康管理股份有限公司 Fraud prevention detection method, device and equipment based on feature analysis and storage medium
CN113705727A (en) * 2021-09-16 2021-11-26 四川新网银行股份有限公司 Decision tree modeling method, prediction method, device and medium based on difference privacy
CN113723477A (en) * 2021-08-16 2021-11-30 同盾科技有限公司 Cross-feature federal abnormal data detection method based on isolated forest
CN113807544A (en) * 2020-12-31 2021-12-17 京东科技控股股份有限公司 Method and device for training federated learning model and electronic equipment
CN113822311A (en) * 2020-12-31 2021-12-21 京东科技控股股份有限公司 Method and device for training federated learning model and electronic equipment
EP3975089A1 (en) * 2020-09-25 2022-03-30 Beijing Baidu Netcom Science And Technology Co. Ltd. Multi-model training method and device based on feature extraction, an electronic device, and a medium
CN114362948A (en) * 2022-03-17 2022-04-15 蓝象智联(杭州)科技有限公司 Efficient federal derivative feature logistic regression modeling method
WO2022144001A1 (en) * 2020-12-31 2022-07-07 京东科技控股股份有限公司 Federated learning model training method and apparatus, and electronic device
CN113554476B (en) * 2020-04-23 2024-04-19 京东科技控股股份有限公司 Training method and system of credit prediction model, electronic equipment and storage medium

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414646B (en) * 2020-03-20 2024-03-29 矩阵元技术(深圳)有限公司 Data processing method and device for realizing privacy protection
CN111402095A (en) * 2020-03-23 2020-07-10 温州医科大学 Method for detecting student behaviors and psychology based on homomorphic encrypted federated learning
CN111461874A (en) * 2020-04-13 2020-07-28 浙江大学 Credit risk control system and method based on federal mode
CN111666576B (en) * 2020-04-29 2023-08-04 平安科技(深圳)有限公司 Data processing model generation method and device, and data processing method and device
CN111882054B (en) * 2020-05-27 2024-04-12 杭州中奥科技有限公司 Method for cross training of encryption relationship network data of two parties and related equipment
CN113824546B (en) * 2020-06-19 2024-04-02 百度在线网络技术(北京)有限公司 Method and device for generating information
CN111814985B (en) * 2020-06-30 2023-08-29 平安科技(深圳)有限公司 Model training method under federal learning network and related equipment thereof
CN111898765A (en) * 2020-07-29 2020-11-06 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and readable storage medium
CN111914277B (en) * 2020-08-07 2023-09-01 平安科技(深圳)有限公司 Intersection data generation method and federal model training method based on intersection data
US11914678B2 (en) 2020-09-23 2024-02-27 International Business Machines Corporation Input encoding for classifier generalization
CN112288094B (en) * 2020-10-09 2022-05-17 武汉大学 Federal network representation learning method and system
CN112381307B (en) * 2020-11-20 2023-12-22 平安科技(深圳)有限公司 Meteorological event prediction method and device and related equipment
CN113824677B (en) * 2020-12-28 2023-09-05 京东科技控股股份有限公司 Training method and device of federal learning model, electronic equipment and storage medium
CN113807380B (en) * 2020-12-31 2023-09-01 京东科技信息技术有限公司 Training method and device of federal learning model and electronic equipment
CN112749749B (en) * 2021-01-14 2024-04-16 深圳前海微众银行股份有限公司 Classification decision tree model-based classification method and device and electronic equipment
CN112836830B (en) * 2021-02-01 2022-05-06 广西师范大学 Method for voting and training in parallel by using federated gradient boosting decision tree
CN113807534B (en) * 2021-03-08 2023-09-01 京东科技控股股份有限公司 Model parameter training method and device of federal learning model and electronic equipment
CN114882333A (en) * 2021-05-31 2022-08-09 北京百度网讯科技有限公司 Training method and device of data processing model, electronic equipment and storage medium
CN113204443B (en) * 2021-06-03 2024-04-16 京东科技控股股份有限公司 Data processing method, device, medium and product based on federal learning framework
CN113435537B (en) * 2021-07-16 2022-08-26 同盾控股有限公司 Cross-feature federated learning method and prediction method based on Soft GBDT
CN113722987B (en) * 2021-08-16 2023-11-03 京东科技控股股份有限公司 Training method and device of federal learning model, electronic equipment and storage medium
CN113657996A (en) * 2021-08-26 2021-11-16 深圳市洞见智慧科技有限公司 Method and device for determining feature contribution degree in federated learning and electronic equipment
CN113722739B (en) * 2021-09-06 2024-04-09 京东科技控股股份有限公司 Gradient lifting tree model generation method and device, electronic equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101056166A (en) * 2007-05-28 2007-10-17 北京飞天诚信科技有限公司 A method for improving the data transmission security
CN104009842A (en) * 2014-05-15 2014-08-27 华南理工大学 Communication data encryption and decryption method based on DES encryption algorithm, RSA encryption algorithm and fragile digital watermarking
CN107704966A (en) * 2017-10-17 2018-02-16 华南理工大学 A kind of Energy Load forecasting system and method based on weather big data
CN107767183A (en) * 2017-10-31 2018-03-06 常州大学 Brand loyalty method of testing based on combination learning and profile point
US20180089587A1 (en) * 2016-09-26 2018-03-29 Google Inc. Systems and Methods for Communication Efficient Distributed Mean Estimation
CN107993139A (en) * 2017-11-15 2018-05-04 华融融通(北京)科技有限公司 A kind of anti-fake system of consumer finance based on dynamic regulation database and method
CN108021984A (en) * 2016-11-01 2018-05-11 第四范式(北京)技术有限公司 Determine the method and system of the feature importance of machine learning sample
TWM561279U (en) * 2018-02-12 2018-06-01 林俊良 Blockchain system and node server for processing strategy model scripts of financial assets
CN108257105A (en) * 2018-01-29 2018-07-06 南华大学 A kind of light stream estimation for video image and denoising combination learning depth network model
CN108375808A (en) * 2018-03-12 2018-08-07 南京恩瑞特实业有限公司 Dense fog forecasting procedures of the NRIET based on machine learning

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018031597A1 (en) * 2016-08-08 2018-02-15 Google Llc Systems and methods for data aggregation based on one-time pad based sharing
CN107423339A (en) * 2017-04-29 2017-12-01 天津大学 Popular microblogging Forecasting Methodology based on extreme Gradient Propulsion and random forest
CN107832581B (en) * 2017-12-15 2022-02-18 百度在线网络技术(北京)有限公司 State prediction method and device
CN109165683B (en) * 2018-08-10 2023-09-12 深圳前海微众银行股份有限公司 Sample prediction method, device and storage medium based on federal training

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
H. BRENDAN MCMAHAN 等: "Communication-efficient learning of deep networks from decentralized data", 《ARTIFICIAL INTELLIGENCE AND STATISTICS》 *
JAKUB 等: "Federated learning strategies for improving communication efficiency", 《ARXIV.ORG》 *
STEPHEN HARDY 等: "Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption", 《ARXIV.ORG》 *
TIANQI CHEN 等: "XGBoost: A Scalable Tree Boosting System", 《KDD"16: PROCEEDINGS OF THE 22ND ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING》 *
许裕栗 等: "Xgboost算法在区域用电预测中的应用", 《自动化仪表》 *

Also Published As

Publication number Publication date
WO2020029590A1 (en) 2020-02-13
CN109165683B (en) 2023-09-12

Similar Documents

Publication Publication Date Title
CN109165683A (en) Sample predictions method, apparatus and storage medium based on federation's training
CN109034398A (en) Feature selection approach, device and storage medium based on federation's training
Reardon et al. Income inequality and income segregation
CN105335409B (en) A kind of determination method, equipment and the network server of target user
CN107871087A (en) The personalized difference method for secret protection that high dimensional data is issued under distributed environment
CN102663047B (en) Method and device for mining social relationship during mobile reading
CN111932386B (en) User account determining method and device, information pushing method and device, and electronic equipment
CN109087079A (en) Digital cash Transaction Information analysis method
CN107291815A (en) Recommend method in Ask-Answer Community based on cross-platform tag fusion
CN111666460A (en) User portrait generation method and device based on privacy protection and storage medium
CN107358116A (en) A kind of method for secret protection in multi-sensitive attributes data publication
CN113449048B (en) Data label distribution determining method and device, computer equipment and storage medium
CN111538916B (en) Interest point recommendation method based on neural network and geographic influence
CN109376901A (en) A kind of service quality prediction technique based on decentralization matrix decomposition
CN108416227A (en) Big data platform secret protection evaluation method and device based on Dare Information Entropy
Chao Construction model of E-commerce agricultural product online marketing system based on blockchain and improved genetic algorithm
CN109783805A (en) A kind of network community user recognition methods and device
CN116186754A (en) Federal random forest power data collaborative analysis method based on blockchain
CN112016954A (en) Resource allocation method and device based on block chain network technology and electronic equipment
WO2019237840A1 (en) Data set generating method and apparatus
CN112613601B (en) Neural network model updating method, equipment and computer storage medium
Mecke et al. Some distributions for I‐segments of planar random homogeneous STIT tessellations
CN108647334A (en) A kind of video social networks homology analysis method under spark platforms
CN109472115B (en) Large-scale complex network modeling method and device based on geographic information
Palestini et al. A graph-based approach to inequality assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant