CN112700031A - XGBoost prediction model training method for protecting multi-party data privacy

Info

Publication number
CN112700031A
Authority
CN
China
Prior art keywords
participant
vector
col
order gradient
participants
Prior art date
Legal status
Granted
Application number
CN202011452494.9A
Other languages
Chinese (zh)
Other versions
CN112700031B (en)
Inventor
Shi Qingjiang
Xie Lunchen
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University
Priority to CN202011452494.9A
Publication of CN112700031A
Application granted
Publication of CN112700031B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning


Abstract

The invention relates to an XGBoost prediction model training method that protects multi-party data privacy. The method involves multiple participants and a coordinator. The participant holding the labels first computes first-order and second-order gradient vectors and an indicator vector from the current model's predictions and the label values. All participants, assisted by the coordinator, then jointly compute and construct a joint decision tree model based on the XGBoost algorithm through secret sharing. The participants cooperate to determine the predictions of the training data under the joint decision tree model, and finally all participants and the coordinator iterate together to build multiple joint decision trees, yielding a complete multi-party prediction model. Compared with the prior art, the method trains a multi-party XGBoost ensemble model across data sources while protecting data privacy, improving the model's predictive power while keeping the data secure.

Description

XGBoost prediction model training method for protecting multi-party data privacy
Technical Field
The invention relates to the technical field of machine learning, and in particular to an XGBoost prediction model training method that protects multi-party data privacy.
Background
The XGBoost algorithm is an ensemble learning algorithm characterized by fast model construction and accurate prediction. It was designed for the machine learning setting in which all data features reside on the same machine. It therefore cannot handle the situation in which multiple parties each hold different features of the same batch of data samples and one party holds the label information but may not transmit it to the other parties.
To protect the privacy of each party's data, the machine learning field has adopted vertical federated learning training schemes, which aim to achieve model accuracy close or equal to that obtained when the data sit on a single machine. Current vertical federated learning algorithms mainly involve two parties: they strictly limit the number of institutions that can cooperate and cannot easily be extended to an arbitrary number of parties, and, to make multi-party cooperation feasible, they simplify or approximate the machine learning model, which costs precision in the computed results.
Disclosure of Invention
The object of the invention is to provide an XGBoost prediction model training method that protects multi-party data privacy, overcoming the prior art's obstacles to data interaction and its loss of data precision during cooperation.
The object of the invention is achieved by the following technical scheme:
An XGBoost prediction model training method for protecting multi-party data privacy involves multiple participants and a coordinator. The participant holding the labels first computes first-order and second-order gradient vectors and an indicator vector from the current model's predictions and the label values; the participants, assisted by the coordinator, jointly compute and construct a joint decision tree model based on the XGBoost algorithm through secret sharing; the participants cooperate to determine the predictions of the training data under the joint decision tree model; and finally all participants and the coordinator iterate together to build multiple joint decision trees, obtaining a complete multi-party prediction model.
The specific steps for training the joint decision tree model are as follows:

S1. The first participant sets the initial tree number $t = 1$, the initial depth $d = 1$, the regularization parameter $\lambda$ and the maximum depth $d_{max}$. For the $N$ participants in total, secret-sharing splitting yields $\{\lambda\}_i$, and all set parameters are distributed to every participant $i$. Each participant $i$, owning $num_i$ features, generates $num_i$ random non-repeating feature number indexes. The first participant, which holds the labels, uses the current model prediction result vector $\hat{y}^{(t-1)}$ and the sample label vector $y$ to compute the first-order gradient vector $G$ and the second-order gradient vector $H$, and generates the initial all-ones indicator vector $S$. These are secret-shared and split for the $N$ participants into $N$ first-order gradient vector shards $\{G\}_i$, second-order gradient vector shards $\{H\}_i$ and indicator vector shards $\{S\}_i$, distributed to each participant $i$, $i = 1, \dots, N$;

S2. Each participant $i$, having received $\{G\}_i$, $\{H\}_i$ and $\{S\}_i$, computes its $i$-th shard of the first-order gradient sum, $\{SG\}_i$, and of the second-order gradient sum, $\{SH\}_i$, then uses secret sharing to compute the $i$-th numerator shard and $i$-th denominator shard of the splitting gain for every feature and every value interval. The coordinator determines the maximum splitting gain, the corresponding feature and value interval, and whether to split. If a split occurs and the selected feature belongs to participant $i'$, that participant generates the post-split left-subtree indicator vector $SL$ and right-subtree indicator vector $SR$, which respectively mark the samples of the left and right subsets obtained by dividing the sample set according to the feature and value interval of the maximum splitting gain; the left and right subsets correspond to the left and right subtrees. $SL$ and $SR$ are split by secret sharing into $N$ shards $\{SL\}_i$ and $\{SR\}_i$, $i = 1, \dots, N$, and distributed to each participant $i$. Each participant $i$ uses the received $\{SL\}_i$ and $\{SR\}_i$ together with its own indicator vector shard $\{S\}_i$ to compute the left-subtree first-order gradient vector shard $\{SGL\}_i$ and second-order gradient vector shard $\{SHL\}_i$ after the sample set is divided into the left subtree, and the right-subtree first-order gradient vector shard $\{SGR\}_i$ and second-order gradient vector shard $\{SHR\}_i$ after the sample set is divided into the right subtree. Step S2 is then executed recursively with $\{SGL\}_i$, $\{SHL\}_i$, $\{SL\}_i$ to construct the left subtree and with $\{SGR\}_i$, $\{SHR\}_i$, $\{SR\}_i$ to construct the right subtree, setting the depth $d = d + 1$. If no split occurs or the maximum depth $d_{max}$ is reached, each participant $i$ computes the $i$-th shard $\{w_\sigma\}_i$ of the weight of the current leaf node $\sigma$ of the decision tree;

S3. For each data sample $x_p$, each participant $i$ uses its held partial-feature sample $x_p^{(i)}$ to compute the prediction $f_t(x_p)$ of the current $t$-th tree, which is accumulated onto the results of the first $t-1$ trees to produce the ensemble prediction of the $t$ trees for data sample $x_p$:

$$\hat{y}_p^{(t)} = \sum_{q=1}^{t} f_q(x_p)$$

where $f_q(x_p)$ denotes the prediction of the $q$-th tree for the $p$-th data sample $x_p$ and $\hat{y}_p^{(t)}$ denotes the $p$-th element of $\hat{y}^{(t)}$; for $M$ data samples in total, traversing $p = 1, \dots, M$ yields the complete $\hat{y}^{(t)}$;

S4. Increase the tree count $t = t + 1$ and iterate steps S1 to S3 until all $T$ decision trees have been built.
Further, the secret sharing algorithm used in steps S1, S2 and S3 is a method of splitting a piece of data $\theta$ into multiple shards $\{\theta\}_i$. Different participants $i$ carry out the same type of calculation, in the same steps, on their respective shards to produce $\{\theta'\}_i$; after the calculation finishes, the results are recombined by addition,

$$\theta' = \sum_{i=1}^{N} \{\theta'\}_i,$$

where $\theta'$ is equivalent to the result of executing the same calculation directly on $\theta$. The specific calculations involved are the following:
a. secret sharing splitting
For one-dimensional data $\theta$, when participant $i$ performs secret-sharing splitting among $N$ participants in total, it generates $N-1$ random numbers and assigns them as the shards $\{\theta\}_{i'}$, $i' \neq i$, for the other participants $i'$ to use; participant $i$ keeps as its own data shard $\{\theta\}_i = \theta - \sum_{i' \neq i} \{\theta\}_{i'}$.
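By way of illustration, this additive splitting and its recombination reduce to a few lines of Python (a minimal sketch under our own naming; a deployed system would share values in a finite ring rather than as floating-point numbers):

```python
import random

def share(theta: float, n: int) -> list[float]:
    """Split theta into n additive shards whose sum recovers theta."""
    shards = [random.uniform(-1e6, 1e6) for _ in range(n - 1)]
    shards.append(theta - sum(shards))  # the splitting party keeps this shard
    return shards

def reconstruct(shards: list[float]) -> float:
    """Recombine shards by plain addition."""
    return sum(shards)

shards = share(3.14, 4)
assert abs(reconstruct(shards) - 3.14) < 1e-9
```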
b. Secret sharing addition
For one-dimensional shard data $\{\theta_A\}_1, \dots, \{\theta_A\}_N$ and $\{\theta_B\}_1, \dots, \{\theta_B\}_N$, each participant $i$ holds $\{\theta_A\}_i$ and $\{\theta_B\}_i$ and can directly use ordinary addition to compute $\{\theta_A\}_i + \{\theta_B\}_i = \{\theta'\}_i$; for ease of description, ordinary addition is therefore written directly;
c. secret sharing subtraction
For one-dimensional shard data $\{\theta_A\}_1, \dots, \{\theta_A\}_N$ and $\{\theta_B\}_1, \dots, \{\theta_B\}_N$, each participant $i$ holds $\{\theta_A\}_i$ and $\{\theta_B\}_i$ and can directly use ordinary subtraction to compute $\{\theta_A\}_i - \{\theta_B\}_i = \{\theta'\}_i$; for ease of description, ordinary subtraction is therefore written directly;
d. secret sharing multiplication
For one-dimensional shard data $\{\theta_A\}_1, \dots, \{\theta_A\}_N$ and $\{\theta_B\}_1, \dots, \{\theta_B\}_N$, with each participant $i$ holding $\{\theta_A\}_i$ and $\{\theta_B\}_i$, the coordinator first generates one-dimensional variables $a$, $b$ and $c = a \times b$, splits them by secret sharing into $\{a\}_1, \dots, \{a\}_N$, $\{b\}_1, \dots, \{b\}_N$ and $\{c\}_1, \dots, \{c\}_N$, and sends them to each participant $i$. Each participant $i$ receives $\{a\}_i$, $\{b\}_i$, $\{c\}_i$, computes $\{e\}_i = \{\theta_A\}_i - \{a\}_i$ and $\{f\}_i = \{\theta_B\}_i - \{b\}_i$, and sends them to the first participant. The first participant computes

$$e = \sum_{i=1}^{N} \{e\}_i \quad\text{and}\quad f = \sum_{i=1}^{N} \{f\}_i$$

and sends them to all participants. The first participant then computes $\{\theta'\}_1$ and every other participant $i$ computes $\{\theta'\}_i$, so that the secret-sharing multiplication $\theta' = \theta_A \otimes \theta_B = \sum_{i=1}^{N} \{\theta'\}_i$ is expressed as:

$$\{\theta'\}_1 = e \cdot f + f \cdot \{a\}_1 + e \cdot \{b\}_1 + \{c\}_1$$
$$\{\theta'\}_i = f \cdot \{a\}_i + e \cdot \{b\}_i + \{c\}_i, \quad i = 2, \dots, N$$
for the above steps, the method can be popularized from one-dimensional data to multi-dimensional data.
Further, step S1 specifically comprises:

S1.1. The first participant sets the initial tree number $t = 1$, the initial depth $d = 1$, the regularization parameter $\lambda$ and the maximum depth $d_{max}$, generates $\{\lambda\}_i$ using secret-sharing splitting, and distributes all set parameters to every participant $i$. For each participant $i$ owning $num_i$ features, the coordinator counts the total feature number of the participants, $num_{feature} = \sum_{i=1}^{N} num_i$, generates an array with elements $[1, 2, \dots, num_{feature}]$, and randomly assigns $num_i$ out-of-order array elements to each participant $i$, with no overlap between the elements obtained by different participants. Each participant establishes a one-to-one mapping $map(j)$ from the out-of-order array element $j$ to its own feature number and records and stores it on its own side;

S1.2. All participants compute the maximum number of feature values $k_{selfmax}$ among their own sample features and send it to the coordinator; the coordinator determines the maximum number of feature values over all participants, $k_{max} = \max k_{selfmax}$, and broadcasts it to all participants;

S1.3. Starting from the first participant, which holds the labeled data, every participant uses the same loss function $l(\cdot,\cdot)$. The first participant uses the model prediction result vector $\hat{y}^{(t-1)}$ and the label value vector $y$ to compute the first-order gradient vector

$$G = \partial_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)})$$

and the second-order gradient vector

$$H = \partial^2_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)}),$$

together with the initial all-ones indicator vector $S$. The initial prediction $\hat{y}_p^{(t-1)}$ of each data sample $x_p$ is 0 when $t = 1$, and otherwise is the accumulated prediction weight of the existing $t-1$ trees, $\hat{y}_p^{(t-1)} = \sum_{q=1}^{t-1} f_q(x_p)$. $G$, $H$ and $S$ are secret-shared and split for the $N$ participants into $N$ first-order gradient vector shards $\{G\}_i$, second-order gradient vector shards $\{H\}_i$ and indicator vector shards $\{S\}_i$, $i = 1, \dots, N$, and distributed to participant $i$.
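As an illustration of S1.3, for the squared loss used in the embodiment below, the label holder's gradient and sharding computation might be sketched as follows (NumPy-based; names are ours):

```python
import numpy as np

def gradients(y: np.ndarray, y_pred: np.ndarray):
    """First/second-order gradients of squared loss l = (y - y_pred)^2."""
    G = 2.0 * (y_pred - y)          # dl/dy_pred
    H = 2.0 * np.ones_like(y)       # d2l/dy_pred^2
    return G, H

def share_vector(v: np.ndarray, n: int):
    """Additively split a vector into n shards."""
    shards = [np.random.uniform(-1e3, 1e3, v.shape) for _ in range(n - 1)]
    shards.append(v - sum(shards))
    return shards

y = np.array([0., 1., 1., 0.])
y_pred = np.zeros(4)                # t = 1: the initial prediction is 0
G, H = gradients(y, y_pred)
S = np.ones_like(y)                 # all-ones indicator vector
G_shards = share_vector(G, 3)       # {G}_1, {G}_2, {G}_3
assert np.allclose(sum(G_shards), G)
```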
Further, step S2 specifically comprises:

S2.1. Each participant $i$ receives the $i$-th first-order gradient vector shard $\{G\}_i$, the $i$-th second-order gradient vector shard $\{H\}_i$ and the $i$-th indicator shard $\{S\}_i$, then computes its $i$-th shard of the first-order gradient sum, $\{SG\}_i$, and of the second-order gradient sum, $\{SH\}_i$; $\{SG\}_i$ and $\{SH\}_i$ are obtained by summing the vector elements of the $\{G\}_i$ and $\{H\}_i$ owned by participant $i$, respectively;

S2.2. Each participant $i$ computes its shard of the non-split gain numerator, $\{gain_{up}\}_i$, and of the non-split gain denominator, $\{gain_{down}\}_i$, by:

$$\{gain_{up}\}_i = \{SG\}_i \otimes \{SG\}_i$$
$$\{gain_{down}\}_i = \{SH\}_i + \{\lambda\}_i$$

where $\otimes$ denotes the secret-sharing multiplication (yielding participant $i$'s shard of the product) and $\{\lambda\}_i$ is the $i$-th shard of the hyperparameter $\lambda$;

S2.3. Each participant $i$ uses $\{G\}_i$ and $\{H\}_i$ to compute the first-order gradient sum shard matrix $\{BG\}_i$ and the second-order gradient sum shard matrix $\{BH\}_i$ over all value intervals of all owned features;

S2.4. Each participant $i$ initializes the left-subtree splitting gain numerator shard matrix $\{leftgain_{up}\}_i$, the left-subtree splitting gain denominator shard matrix $\{leftgain_{down}\}_i$, the right-subtree splitting gain numerator shard matrix $\{rightgain_{up}\}_i$ and the right-subtree splitting gain denominator shard matrix $\{rightgain_{down}\}_i$;

S2.5. For feature $j$, each participant $i$ initializes the recorded left-subtree cumulative first-order gradient shard variable $\{g_l\}_i$, left-subtree cumulative second-order gradient shard variable $\{h_l\}_i$, right-subtree cumulative first-order gradient shard variable $\{g_r\}_i$ and right-subtree cumulative second-order gradient shard variable $\{h_r\}_i$, all to 0;

S2.6. Each participant $i$ traverses the value interval $k$ and updates

$$\{g_l\}_i = \{g_l\}_i + \{BG\}_i[j,k], \qquad \{h_l\}_i = \{h_l\}_i + \{BH\}_i[j,k]$$

where $\{BG\}_i[j,k]$ and $\{BH\}_i[j,k]$ denote the $[j,k]$-th elements of the shard matrices $\{BG\}_i$ and $\{BH\}_i$, and updates

$$\{g_r\}_i = \{SG\}_i - \{g_l\}_i, \qquad \{h_r\}_i = \{SH\}_i - \{h_l\}_i$$

For the XGBoost model, the splitting gain formula used is:

$$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

For each participant $i$, the splitting gain numerator and denominator shards of the left and right subtrees at the $k$-th value interval element position of the $j$-th feature are computed directly and written into the matrices:

$$\{leftgain_{up}\}_i[j,k] = \{g_l\}_i \otimes \{g_l\}_i$$
$$\{leftgain_{down}\}_i[j,k] = \{h_l\}_i + \{\lambda\}_i$$
$$\{rightgain_{up}\}_i[j,k] = \{g_r\}_i \otimes \{g_r\}_i$$
$$\{rightgain_{down}\}_i[j,k] = \{h_r\}_i + \{\lambda\}_i$$
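A schematic sketch of the per-feature scan of S2.5 and S2.6 on a single participant's shards; here ss_mul stands in for the interactive secret-sharing multiplication and returns this participant's shard of the product (all names are illustrative):

```python
import numpy as np

def scan_feature(j, BG_i, BH_i, SG_i, SH_i, lam_i, k_max, ss_mul):
    """Accumulate left/right gradient shards over the value intervals of
    feature j and fill this participant's gain shard rows."""
    gl = hl = 0.0
    left_up, left_down = np.zeros(k_max), np.zeros(k_max)
    right_up, right_down = np.zeros(k_max), np.zeros(k_max)
    for k in range(k_max):
        gl += BG_i[j, k]                 # {g_l}_i += {BG}_i[j,k]
        hl += BH_i[j, k]
        gr, hr = SG_i - gl, SH_i - hl    # right side by subtraction
        left_up[k] = ss_mul(gl, gl)      # shard of g_l^2 (interactive round)
        left_down[k] = hl + lam_i
        right_up[k] = ss_mul(gr, gr)
        right_down[k] = hr + lam_i
    return left_up, left_down, right_up, right_down
```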
S2.7. Each participant $i$ uses the left- and right-subtree splitting gain numerator and denominator shard matrices obtained in S2.6 to compute the splitting gain differences between different value intervals $k$ of different features $j$, and the coordinator's comparisons determine the selected feature $j_{best}$ and value interval $k_{best}$ corresponding to the maximum splitting gain;

S2.8. For the maximum-splitting-gain feature $j_{best}$ and value interval $k_{best}$, each participant $i$ uses the left- and right-subtree splitting gain numerator and denominator shards at that position together with the non-split numerator and denominator shards $\{gain_{up}\}_i$ and $\{gain_{down}\}_i$ to compute its total splitting gain denominator shard $\{denominator\}_i$, which is sent to the coordinator, and its total splitting gain numerator shard $\{nominator\}_i$, which is sent to the first participant. The coordinator computes the denominator $denominator = \sum_{i=1}^{N} \{denominator\}_i$ and determines its sign $sign_0$; the first participant computes the numerator $nominator = \sum_{i=1}^{N} \{nominator\}_i$ and determines its sign $sign_1$; the first participant and the coordinator then jointly determine from $sign_0$ and $sign_1$ the sign variable corresponding to the final maximum gain;

S2.9. When the sign variable is 1, for the feature $j_{best}$ owned by the $i'$-th participant, that participant sets up an $M$-dimensional vector $SL$ recording which samples fall into the left subtree after splitting on this feature. It takes out the $k_{best}$-th value interval $(left_{k_{best}}, right_{k_{best}}]$, and for every sample in the sample set whose feature $j_{best}$ value $value_{j_{best}}$ satisfies $left_{k_{best}} < value_{j_{best}} \le right_{k_{best}}$ it sets the corresponding position of $SL$ to 1 and all remaining positions to 0. It likewise sets up the $M$-dimensional vector recording which samples fall into the right subtree after the division, $SR = \neg SL$, the negation of $SL$. For the $N$ participants in total, $SL$ and $SR$ are split by secret sharing into $N$ shards $\{SL\}_i$ and $\{SR\}_i$ and distributed to all participants $i$, $i = 1, \dots, i', \dots, N$;

S2.10. Each participant $i$ receives $\{SL\}_i$ and $\{SR\}_i$ and recomputes its own left-subtree indicator vector shard $\{SL\}_i$ and right-subtree indicator vector shard $\{SR\}_i$:

$$\{SL\}_i = \{S\}_i \odot \{SL\}_i$$
$$\{SR\}_i = \{S\}_i \odot \{SR\}_i$$

where $\odot$ performs secret-sharing multiplication between the co-located elements of two vectors and yields a vector of the same dimension as $\{S\}_i$. It computes its own first-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{GL\}_i = \{G\}_i \odot \{SL\}_i, \qquad \{GR\}_i = \{G\}_i \odot \{SR\}_i$$

and its own second-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{HL\}_i = \{H\}_i \odot \{SL\}_i, \qquad \{HR\}_i = \{H\}_i \odot \{SR\}_i$$

S2.11. Each participant $i$ sets $\{GL\}_i$, $\{HL\}_i$ and $\{SL\}_i$ as the first-order gradient vector shard, second-order gradient vector shard and indicator vector shard used for constructing the left subtree, and sets $\{GR\}_i$, $\{HR\}_i$ and $\{SR\}_i$ as those used for constructing the right subtree;

S2.12. When the current tree depth $d$ reaches the set limit $d_{max}$, or when the sign variable is not 1, the weight value of the leaf node is computed and the construction of left and right subtrees is stopped for the current node;

S2.13. Set $d = d + 1$ and recursively execute steps S2.1 to S2.12 to complete the construction of the XGBoost decision tree.
Further, step S2.3 specifically comprises:

S2.3.1. All participants $i$ initialize the $num_{feature} \times k_{max}$-dimensional matrix $\{BG\}_i$ recording the per-interval first-order gradient sum shards, and the $num_{feature} \times k_{max}$-dimensional matrix $\{BH\}_i$ recording the per-interval second-order gradient sum shards;

S2.3.2. For feature $j$, $j = 1, 2, \dots, num_{feature}$: when the $i'$-th participant owns feature number $j$, it maps $j$ to its own feature $map(j)$ using the feature index of step S1.1, counts all the division values the feature takes, and records their number $k_j$;

S2.3.3. Participant $i'$ sets up a $k_{max} \times M$-dimensional matrix $Matrix_{index}$ recording which samples fall into each division of the feature, where $M$ is the number of samples. For the $j$-th feature, all the feature values $val_k$ are arranged from small to large, $k = 1, \dots, k_j$; set $left_k = val_{k-1}$ with $left_0 = -\infty$, and $right_k = val_k$. Traversing $k_j$, take the $k$-th value interval $(left_k, right_k]$, initialize an all-zero vector $S'$ of dimension $M \times 1$, and for every sample in the sample set whose feature $map(j)$ value $value_{map(j)}$ satisfies $left_k < value_{map(j)} \le right_k$, set the corresponding position of $S'$ to 1; record the $k$-th row vector $Matrix_{index}[k,:] = S'^T$, where $S'^T$ is the transpose of $S'$. After the division traversal finishes, for the $N$ participants in total, participant $i'$ splits $Matrix_{index}$ by secret sharing into $N$ shards $\{Matrix_{index}\}_i$ and distributes them to all participants $i$, $i = 1, \dots, i', \dots, N$;

S2.3.4. Participant $i$ receives $\{Matrix_{index}\}_i$, and for the $j$-th feature traverses $k$ up to the maximum value interval number $k_{max}$, computing the first-order gradient sum shard $\{BG\}_i[j,k]$ and the second-order gradient sum shard $\{BH\}_i[j,k]$:

$$\{BG\}_i[j,k] = sum(\{Matrix_{index}\}_i[k,:] \odot \{G\}_i)$$
$$\{BH\}_i[j,k] = sum(\{Matrix_{index}\}_i[k,:] \odot \{H\}_i)$$

where $[k,:]$ selects all elements of the $k$-th row of the matrix, and $sum(v)$ denotes the summation of the elements of vector $v$;

S2.3.5. Traverse $j$, executing S2.3.2 to S2.3.4, so that all participants $i$ complete the computation of $\{BG\}_i$ and $\{BH\}_i$.
Further, step S2.7 specifically comprises:

S2.7.1. Each participant $i$ sets the initial division index list vector currently taking part in the comparison, $col = [1, 2, \dots, k_{max}]$, records its length $R_{col}$, and initializes the per-feature division index list vector $col_{selected}$;

S2.7.2. In the XGBoost algorithm, for the feature positions $[j, col[2r]]$ and $[j, col[2r+1]]$, $r = 0, 1, \dots, \lfloor R_{col}/2 \rfloor - 1$ (indexing from 0), where $col[r]$ denotes the $r$-th element of the index list $col$, $[j, col[r]]$ denotes the $col[r]$-th element of the $j$-th row of a matrix, and $\lfloor R_{col}/2 \rfloor$ denotes $R_{col}/2$ rounded down, the splitting gain difference between the two feature positions is, traversing $r$:

$$\Delta gain = \frac{leftgain_{up}[j,col[2r]]}{leftgain_{down}[j,col[2r]]} - \frac{leftgain_{up}[j,col[2r+1]]}{leftgain_{down}[j,col[2r+1]]} + \frac{rightgain_{up}[j,col[2r]]}{rightgain_{down}[j,col[2r]]} - \frac{rightgain_{up}[j,col[2r+1]]}{rightgain_{down}[j,col[2r+1]]}$$

Let:

$$nominator1_{col} = leftgain_{up}[j,col[2r]] \cdot leftgain_{down}[j,col[2r+1]] - leftgain_{up}[j,col[2r+1]] \cdot leftgain_{down}[j,col[2r]]$$

$$nominator2_{col} = rightgain_{up}[j,col[2r]] \cdot rightgain_{down}[j,col[2r+1]] - rightgain_{up}[j,col[2r+1]] \cdot rightgain_{down}[j,col[2r]]$$

$$denominator1_{col} = leftgain_{down}[j,col[2r]] \cdot leftgain_{down}[j,col[2r+1]]$$

$$denominator2_{col} = rightgain_{down}[j,col[2r]] \cdot rightgain_{down}[j,col[2r+1]]$$

Then:

$$\Delta gain = \frac{nominator1_{col}}{denominator1_{col}} + \frac{nominator2_{col}}{denominator2_{col}}$$

For $r = 0, 1, \dots, \lfloor R_{col}/2 \rfloor - 1$, all participants $i$ use the above formulas on the left- and right-subtree splitting gain numerator and denominator shard matrices from S2.6 to compute the difference result shards of feature $j$ over all pairs of division positions:

$$\{nominator1_{col}\}_i = \{leftgain_{up}[j,col[2r]]\}_i \otimes \{leftgain_{down}[j,col[2r+1]]\}_i - \{leftgain_{up}[j,col[2r+1]]\}_i \otimes \{leftgain_{down}[j,col[2r]]\}_i$$

$$\{nominator2_{col}\}_i = \{rightgain_{up}[j,col[2r]]\}_i \otimes \{rightgain_{down}[j,col[2r+1]]\}_i - \{rightgain_{up}[j,col[2r+1]]\}_i \otimes \{rightgain_{down}[j,col[2r]]\}_i$$

$$\{denominator1_{col}\}_i = \{leftgain_{down}[j,col[2r]]\}_i \otimes \{leftgain_{down}[j,col[2r+1]]\}_i$$

$$\{denominator2_{col}\}_i = \{rightgain_{down}[j,col[2r]]\}_i \otimes \{rightgain_{down}[j,col[2r+1]]\}_i$$

S2.7.3. Each participant $i$ sends its pairwise-position shard result vectors $\{nominator1_{col}\}_i$, $\{nominator2_{col}\}_i$, $\{denominator1_{col}\}_i$ and $\{denominator2_{col}\}_i$ to the coordinator, which collects them and computes the vector

$$col\_shared\_value = \Big(\sum_i \{nominator1_{col}\}_i\Big) \backslash \Big(\sum_i \{denominator1_{col}\}_i\Big) + \Big(\sum_i \{nominator2_{col}\}_i\Big) \backslash \Big(\sum_i \{denominator2_{col}\}_i\Big)$$

where the operation $\sum_i \{v\}_i$ sums the co-located elements of all shards $\{v\}_i$ into one vector, and the operation $v_1 \backslash v_2$ on vectors $v_1$ and $v_2$ divides co-located elements, i.e. $(v_1 \backslash v_2)[r] = v_1[r] / v_2[r]$;

S2.7.4. Initialize an empty list $new\_col$ and judge the $r$-th element of $col\_shared\_value$ in turn: if it is non-negative, add $col[2r]$ to $new\_col$; otherwise add $col[2r+1]$ to $new\_col$. If the length of $col$ is odd, add the last element of $col$ to $new\_col$ after the traversal finishes. The coordinator then broadcasts $new\_col$ to all participants, and each participant sets $col = new\_col$;

S2.7.5. While the length of $col$ is greater than 1, iterate steps S2.7.2 to S2.7.4 until the length of $col$ becomes 1; take out the only element $col[0]$ of $col$ and record $col_{selected}[j] = col[0]$;

S2.7.6. Traverse all features $j$, iterating steps S2.7.1 to S2.7.5, to obtain the selected division position of every feature, combined into the complete per-feature division index list vector $col_{selected}$. Set the initial feature index list vector currently taking part in the comparison, $row = [1, 2, \dots, num_{feature}]$, and record its length $R_{row}$;

S2.7.7. In the XGBoost algorithm, for the feature positions $a = [row[2r], col_{selected}[row[2r]]]$ and $b = [row[2r+1], col_{selected}[row[2r+1]]]$, $r = 0, 1, \dots, \lfloor R_{row}/2 \rfloor - 1$, where $row[r]$ denotes the $r$-th element of the index list $row$ and $col_{selected}[row[r]]$ denotes the element of $col_{selected}$ at index position $row[r]$, the splitting gain difference between the two feature positions is, traversing $r$:

$$\Delta gain = \frac{leftgain_{up}[a]}{leftgain_{down}[a]} - \frac{leftgain_{up}[b]}{leftgain_{down}[b]} + \frac{rightgain_{up}[a]}{rightgain_{down}[a]} - \frac{rightgain_{up}[b]}{rightgain_{down}[b]}$$

Let:

$$nominator1_{row} = leftgain_{up}[a] \cdot leftgain_{down}[b] - leftgain_{up}[b] \cdot leftgain_{down}[a]$$
$$nominator2_{row} = rightgain_{up}[a] \cdot rightgain_{down}[b] - rightgain_{up}[b] \cdot rightgain_{down}[a]$$
$$denominator1_{row} = leftgain_{down}[a] \cdot leftgain_{down}[b]$$
$$denominator2_{row} = rightgain_{down}[a] \cdot rightgain_{down}[b]$$

Then:

$$\Delta gain = \frac{nominator1_{row}}{denominator1_{row}} + \frac{nominator2_{row}}{denominator2_{row}}$$

For $r = 0, 1, \dots, \lfloor R_{row}/2 \rfloor - 1$, all participants $i$ use the above formulas on the left- and right-subtree splitting gain numerator and denominator shard matrices from S2.6 to compute the difference result shards between the best division positions of all features $[1, 2, \dots, num_{feature}]$:

$$\{nominator1_{row}\}_i = \{leftgain_{up}[a]\}_i \otimes \{leftgain_{down}[b]\}_i - \{leftgain_{up}[b]\}_i \otimes \{leftgain_{down}[a]\}_i$$
$$\{nominator2_{row}\}_i = \{rightgain_{up}[a]\}_i \otimes \{rightgain_{down}[b]\}_i - \{rightgain_{up}[b]\}_i \otimes \{rightgain_{down}[a]\}_i$$
$$\{denominator1_{row}\}_i = \{leftgain_{down}[a]\}_i \otimes \{leftgain_{down}[b]\}_i$$
$$\{denominator2_{row}\}_i = \{rightgain_{down}[a]\}_i \otimes \{rightgain_{down}[b]\}_i$$

S2.7.8. Each participant $i$ sends its pairwise computation result vectors $\{nominator1_{row}\}_i$, $\{nominator2_{row}\}_i$, $\{denominator1_{row}\}_i$ and $\{denominator2_{row}\}_i$ to the coordinator, which collects them and computes the vector

$$row\_shared\_value = \Big(\sum_i \{nominator1_{row}\}_i\Big) \backslash \Big(\sum_i \{denominator1_{row}\}_i\Big) + \Big(\sum_i \{nominator2_{row}\}_i\Big) \backslash \Big(\sum_i \{denominator2_{row}\}_i\Big);$$

S2.7.9. Initialize an empty list $new\_row$ and traverse $row\_shared\_value$: if its $r$-th element is non-negative, add $row[2r]$ to $new\_row$; otherwise add $row[2r+1]$ to $new\_row$. If the length of $row$ is odd, add the last element of $row$ to $new\_row$ after the traversal finishes. The coordinator then broadcasts $new\_row$ to all participants, and each participant sets $row = new\_row$;

S2.7.10. While the length of $row$ is greater than 1, iterate steps S2.7.7 to S2.7.9 until the length becomes 1; take out the only element $row[0]$ of $row$, record $j_{best} = row[0]$, obtain $k_{best} = col_{selected}[j_{best}]$, and broadcast them to all participants, determining the selected best feature number $j_{best}$ and that feature's best division position $k_{best}$.
Further, step S2.8 specifically comprises:

S2.8.1. For the given maximum-splitting-gain feature $j_{best}$ and value interval $k_{best}$, the splitting gain expression in the XGBoost algorithm is:

$$Gain = \frac{leftgain_{up}[j_{best},k_{best}]}{leftgain_{down}[j_{best},k_{best}]} + \frac{rightgain_{up}[j_{best},k_{best}]}{rightgain_{down}[j_{best},k_{best}]} - \frac{gain_{up}}{gain_{down}}$$

Writing $lu$, $ld$, $ru$ and $rd$ for brevity for $leftgain_{up}$, $leftgain_{down}$, $rightgain_{up}$ and $rightgain_{down}$ at position $[j_{best},k_{best}]$, bringing the three terms over a common denominator gives $Gain = nominator/denominator$ with $nominator = lu \cdot rd \cdot gain_{down} + ru \cdot ld \cdot gain_{down} - gain_{up} \cdot ld \cdot rd$ and $denominator = ld \cdot rd \cdot gain_{down}$. Each participant $i$ therefore computes its own splitting gain numerator shard $\{nominator\}_i$,

$$\{nominator\}_i = \{lu\}_i \otimes \{rd\}_i \otimes \{gain_{down}\}_i + \{ru\}_i \otimes \{ld\}_i \otimes \{gain_{down}\}_i - \{gain_{up}\}_i \otimes \{ld\}_i \otimes \{rd\}_i$$

and its own splitting gain denominator shard $\{denominator\}_i$:

$$\{denominator\}_i = \{ld\}_i \otimes \{rd\}_i \otimes \{gain_{down}\}_i$$

S2.8.2. The remaining participants send their splitting gain numerator shards $\{nominator\}_2, \dots, \{nominator\}_N$ to the first participant, which collects them and computes

$$nominator = \sum_{i=1}^{N} \{nominator\}_i$$

The first participant sets its sign $sign_1$ by judging the sign of the numerator, letting:

$$sign_1 = \begin{cases} 1, & nominator > 0 \\ -1, & \text{otherwise} \end{cases}$$

S2.8.3. Each participant $i$ sends its splitting gain denominator shard $\{denominator\}_i$ to the coordinator, which collects $\{denominator\}_i$, $i = 1, \dots, N$, and computes

$$denominator = \sum_{i=1}^{N} \{denominator\}_i$$

The coordinator sets its sign $sign_0$ by judging the sign of the denominator, letting:

$$sign_0 = \begin{cases} 1, & denominator > 0 \\ -1, & \text{otherwise} \end{cases}$$

S2.8.4. The first participant sends $sign_1$ to the coordinator, which receives it, computes $sign = sign_1 \times sign_0$, and broadcasts $sign$ to all participants; all participants take the received value as the currently established sign variable. Since the gain is positive exactly when numerator and denominator share the same sign, $sign = 1$ corresponds to a positive maximum splitting gain;
further, the step S2.12 specifically includes:
s2.12.1, each participant i calculates half of the sum of its second order gradient fragment and the regularization term:
Figure BDA0002831883660000136
each participant i computes its own first-order gradient patch sum:
{g′}i={SG}i
s2.12.2, each participant i determines { h' }iOf order of magnitude muiSo that:
Figure BDA0002831883660000137
s2.12.3, all participating parties send corresponding magnitude digits, and the coordinator receives and selects the maximum magnitude digit as mumDetermining an iteration step size
Figure BDA0002831883660000141
The process parameter tau and the iteration number iter are sent to all the participants;
s2.12.4, setting random initial value for each participant i
Figure BDA0002831883660000142
And a variable with an initial value of 0
Figure BDA0002831883660000143
Starting with κ ═ 1, the iteration proceeds according to the following formula:
Figure BDA0002831883660000144
Figure BDA0002831883660000145
setting kappa as kappa +1 after each iteration, terminating when kappa as iter, and recording weight slicing after the computation of the participant i is finished
Figure BDA0002831883660000146
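A plaintext sketch of the S2.12.4 iteration; the secret-shared version runs the same recurrence on shards, with the $h' \cdot w$ product performed by the multiplication protocol (values and names are illustrative):

```python
def leaf_weight(SG: float, SH: float, lam: float, iters: int = 200) -> float:
    """Gradient descent on f(w) = h'*w^2 + g'*w, h' = (SH + lam)/2, g' = SG.
    Converges to w* = -SG / (SH + lam), the XGBoost leaf weight."""
    h_prime, g_prime = (SH + lam) / 2.0, SG
    tau = 1.0 / (4.0 * h_prime)       # step size satisfying 2*tau*h' < 1
    w = 0.5                            # arbitrary initial value
    for _ in range(iters):
        w -= tau * (2.0 * h_prime * w + g_prime)
    return w

w = leaf_weight(SG=-3.0, SH=5.0, lam=1.0)
assert abs(w - 0.5) < 1e-6             # -(-3) / (5 + 1) = 0.5
```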
Further, step S3 specifically comprises:

S3.1. For the $t$-th tree $tree_t$, each participant $i$, for data sample $x_p$, uses its held partial features $x_p^{(i)}$ to perform leaf node prediction according to its local tree model $tree_t^{(i)}$. At each tree node: if the node's division information $(j_{best}, k_{best})$ belongs to this participant's features, it descends into the left or right subtree according to the feature and its value and continues the prediction, setting the flag bits of all leaf nodes of the subtree not entered to 0; if the division information $(j_{best}, k_{best})$ does not belong to this participant, the prediction continues along both the left and right subtrees of the node until leaves are reached; the flag bits of the leaf nodes the sample is attributed to are set to 1. Finally each participant $i$ obtains the tree prediction flag bits of all leaf nodes $\sigma$, $\sigma = 1, 2, \dots, \Delta$, concatenated into the vector $index_i$; at the same time it concatenates the $\Delta$ leaf weight shards $\{w_\sigma\}_i$ in the same order into the result vector $\{v_w\}_i$;

S3.2. Each participant $i$ splits $index_i$ by secret sharing into $\{index_i\}_{i'}$ and sends the shards to all participants $i'$, $i' = 1, \dots, i, \dots, N$;

S3.3. Each participant $i'$ receives the flag vector shards $\{index_i\}_{i'}$ sent by the participants $i$ and computes the bitwise accumulated product of all vector shards, $\{index\}_{i'} = \{index_1\}_{i'} \odot \{index_2\}_{i'} \odot \dots \odot \{index_N\}_{i'}$, then computes the bitwise product of the flag vector shard with its own weight shard, $\{v_{result}\}_{i'} = \{index\}_{i'} \odot \{v_w\}_{i'}$;

S3.4. Each participant $i'$ sums the elements of $\{v_{result}\}_{i'}$ as $\{weight_p\}_{i'} = sum(\{v_{result}\}_{i'})$ and sends the result to the first participant, which receives them and computes $weight_p = \sum_{i'=1}^{N} \{weight_p\}_{i'}$ and then $\hat{y}_p^{(t)} = \hat{y}_p^{(t-1)} + weight_p$, which becomes the prediction of sample $x_p$ after round $t$ ends;

S3.5. Traverse all $p$ and compute the vector $\hat{y}^{(t)}$ formed by the $t$-th round predictions of all data samples $x_p$.
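A plaintext sketch of the S3.1 to S3.4 aggregation: each party's flag vector keeps a 1 for every leaf its own features cannot rule out, the element-wise product isolates the true leaf, and the dot product with the leaf weights yields the sample's score (in the protocol these products run on shards; the flag vectors below are the FIG. 4 example):

```python
import numpy as np

def predict_sample(index_vectors: list[np.ndarray], leaf_weights: np.ndarray) -> float:
    """index_vectors: one 0/1 flag vector per participant over the D leaves;
    leaf_weights: the D leaf weights (secret-shared in the real protocol)."""
    index = np.ones_like(leaf_weights)
    for idx in index_vectors:
        index = index * idx                 # bitwise product: {index} = ⊙ idx_i
    return float(index @ leaf_weights)      # sum(index ⊙ v_w) = weight_p

flags = [np.array([1, 1, 1, 0, 0]), np.array([0, 0, 1, 1, 1]),
         np.array([0, 1, 1, 0, 0])]
w = np.array([0.1, -0.2, 0.7, 0.3, -0.4])
assert abs(predict_sample(flags, w) - 0.7) < 1e-12   # only leaf 3 survives
```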
Compared with the prior art, the invention has the following beneficial effects:

In the invention, the participant holding the labels computes the first-order and second-order gradient vectors and the indicator vector from the current model's predictions and the label values; each participant constructs a decision tree model based on the XGBoost algorithm with the assistance of secret sharing and the coordinator; the participants jointly determine the predictions of the training data; and iteration completes the construction of multiple decision tree models, yielding a complete, lossless and secure multi-party prediction model. The method trains a multi-party XGBoost ensemble model across data sources while protecting data privacy, improving the model's predictive power while keeping the data secure.
Drawings
FIG. 1 is a schematic diagram of the interaction of participants and a coordinator in accordance with the present invention;
FIG. 2 is a schematic flow chart of a model training process of the present invention;
FIG. 3 is a communication flow diagram of the model training process of the present invention;
FIG. 4 is a diagram illustrating a multi-party tree model and its corresponding equivalent model according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example one
As shown in FIG. 1, an XGBoost prediction model training method for protecting multi-party data privacy involves multiple participants and a coordinator. The participant holding the labels first computes first-order and second-order gradient vectors and an indicator vector from the current model's predictions and the label values; the participants, assisted by the coordinator, jointly compute and construct a joint decision tree model based on the XGBoost algorithm through secret sharing; the participants cooperate to determine the predictions of the training data under the joint decision tree model; and finally all participants and the coordinator iterate together to build multiple joint decision trees, obtaining a complete multi-party prediction model.

In this embodiment, as shown in FIG. 4, the application scenario is that several different types of institutions each hold the same group of samples, but their data features do not overlap. By combining the data of different institutions, a more complete model can be trained by multiple parties. To simulate this effect, a parallel computing framework is used locally with 4 computing nodes, numbered 0, 1, 2 and 3, corresponding to 3 computing participants and 1 coordinator: node 0 is the coordinator, node 1 is the first participant, which holds the labels, and nodes 2 and 3 represent participants 2 and 3. The embodiment uses the Iris data set from the UCI Machine Learning Repository, selecting 100 data items of the two categories with class labels 0 and 1, covering the four features sepal length, sepal width, petal length and petal width; sepal length and petal length are assigned to the first participant, sepal width to participant 2 and petal width to participant 3. All participants treat 80% of the data samples as the training set and the remaining 20% as the test set. The specific flow is shown in FIG. 2 and FIG. 3.

S1. Set $t = 1$, generate the initial tree-building parameters and feature indexes, and compute and generate the gradient vector and indicator vector shards, specifically comprising:

S1.1. Setting the initial tree-building parameters and feature indexes:

The first participant sets the initial tree number $t = 1$, the initial depth $d = 1$, the regularization parameter $\lambda$ and the maximum depth $d_{max}$; in this embodiment $\lambda = 1$ and $d_{max} = 4$. For the 3 participants in total it computes $\{\lambda\}_i = 1/3$ (equivalent to secret-sharing splitting into three) and distributes them to all participants $i$. For each participant $i$ owning $num_i$ features, the coordinator counts the total feature number of the participants, $num_{feature} = \sum_{i=1}^{N} num_i$, generates an array with elements $[1, 2, \dots, num_{feature}]$, and randomly assigns $num_i$ out-of-order array elements to each participant $i$, with no overlap between participants. Each participant establishes a one-to-one mapping $map(j)$ from the out-of-order array element $j$ to its own feature number and records it on its own side. For example, the first participant, owning the first feature sepal length and the third feature petal length, accesses these two features locally through the numbers 0 and 1; if it is assigned the indexes 2 and 0, it establishes the mappings $0 = map(2)$ and $1 = map(0)$, so that when the feature index number 2 appears in a subsequent iteration, the first participant recognizes it as its own and converts it by the mapping into the corresponding local feature number 0 to access the feature;

S1.2. Determining the maximum number of feature values:

All participants compute the maximum number of feature values $k_{selfmax}$ among their own sample features and send it to the coordinator; the coordinator determines the maximum number of feature values over all participants, $k_{max} = \max k_{selfmax}$, and broadcasts it to all participants;

S1.3. Computing and generating the gradient vector and indicator vector shards:

Starting from the first participant, which holds the labeled data, every participant uses the same loss function $l(\cdot,\cdot)$; in the embodiment this is the squared loss function MSE, i.e.

$$l(y_p, \hat{y}_p) = (y_p - \hat{y}_p)^2$$

The first participant uses the model prediction result vector $\hat{y}^{(t-1)}$ and the label value vector $y$ to compute the first-order gradient vector

$$G = \partial_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)}) = 2(\hat{y}^{(t-1)} - y)$$

and the second-order gradient vector

$$H = \partial^2_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)}) = 2 \cdot \mathbf{1},$$

together with the initial all-ones indicator vector $S$. The initial prediction $\hat{y}_p^{(t-1)}$ of each data sample $x_p$ is 0 when $t = 1$, and otherwise is the accumulated prediction weight of the existing $t-1$ trees, $\hat{y}_p^{(t-1)} = \sum_{q=1}^{t-1} f_q(x_p)$. $G$, $H$ and $S$ are secret-shared and split for the $N$ participants into first-order gradient vector shards $\{G\}_i$, second-order gradient vector shards $\{H\}_i$ and indicator vector shards $\{S\}_i$, $i = 1, \dots, N$, and distributed to participant $i$.

S2. The multiple parties construct the $t$-th decision tree based on the XGBoost algorithm, specifically comprising:

S2.1. Each participant receives the $i$-th first-order gradient vector shard $\{G\}_i$, the $i$-th second-order gradient vector shard $\{H\}_i$ and the $i$-th indicator shard $\{S\}_i$, and computes its own first-order and second-order gradient sum shards: the $i$-th shard of the first-order gradient sum, $\{SG\}_i$, and of the second-order gradient sum, $\{SH\}_i$, are obtained by summing the vector elements of the $\{G\}_i$ and $\{H\}_i$ owned by participant $i$, respectively;

S2.2. Each participant computes its non-split gain numerator shard and denominator shard:

For the XGBoost algorithm, at a given tree node, in terms of the first-order gradient sum $SG$, the second-order gradient sum $SH$ and the regularization term $\lambda$ over all data belonging to that node, the non-split gain is expressed as:

$$gain = \frac{SG^2}{SH + \lambda}$$

In the secret-sharing scenario its numerator shard $\{gain_{up}\}_i$ and denominator shard $\{gain_{down}\}_i$ must be computed separately:

$$\{gain_{up}\}_i = \{SG\}_i \otimes \{SG\}_i$$
$$\{gain_{down}\}_i = \{SH\}_i + \{\lambda\}_i$$

where $\otimes$ is the secret-sharing multiplication and $\{\lambda\}_i$ is the $i$-th shard of the hyperparameter $\lambda$;

S2.3. Each participant $i$ uses $\{G\}_i$ and $\{H\}_i$ to compute the first-order gradient sum shard matrix $\{BG\}_i$ and the second-order gradient sum shard matrix $\{BH\}_i$ over all value intervals of all owned features;

S2.4. Each participant $i$ initializes the left-subtree splitting gain numerator shard matrix $\{leftgain_{up}\}_i$, the left-subtree splitting gain denominator shard matrix $\{leftgain_{down}\}_i$, the right-subtree splitting gain numerator shard matrix $\{rightgain_{up}\}_i$ and the right-subtree splitting gain denominator shard matrix $\{rightgain_{down}\}_i$; in this embodiment these matrices must be explicitly initialized at every participant $i$ to avoid execution problems;

S2.5. For feature $j$, each participant $i$ initializes the recorded left-subtree cumulative first-order gradient shard variable $\{g_l\}_i$, left-subtree cumulative second-order gradient shard variable $\{h_l\}_i$, right-subtree cumulative first-order gradient shard variable $\{g_r\}_i$ and right-subtree cumulative second-order gradient shard variable $\{h_r\}_i$, all to 0; in this embodiment these variables likewise must be explicitly initialized at every participant $i$ to avoid execution problems;

S2.6. All participants $i$ traverse the value interval $k$, update $\{g_l\}_i$, $\{h_l\}_i$, $\{g_r\}_i$ and $\{h_r\}_i$, and update the left- and right-subtree splitting gain numerator and denominator shard matrices at the $k$-th value interval position of the $j$-th feature:

Each participant $i$ traverses the value interval $k$ and updates

$$\{g_l\}_i = \{g_l\}_i + \{BG\}_i[j,k], \qquad \{h_l\}_i = \{h_l\}_i + \{BH\}_i[j,k]$$

where $\{BG\}_i[j,k]$ and $\{BH\}_i[j,k]$ denote the $[j,k]$-th elements of the shard matrices $\{BG\}_i$ and $\{BH\}_i$, and updates

$$\{g_r\}_i = \{SG\}_i - \{g_l\}_i, \qquad \{h_r\}_i = \{SH\}_i - \{h_l\}_i$$

For the XGBoost model, the splitting gain formula used is:

$$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

For each participant $i$, the splitting gain numerator and denominator shards of the left and right subtrees at the $k$-th value interval element position of the $j$-th feature are computed directly and written into the matrices:

$$\{leftgain_{up}\}_i[j,k] = \{g_l\}_i \otimes \{g_l\}_i$$
$$\{leftgain_{down}\}_i[j,k] = \{h_l\}_i + \{\lambda\}_i$$
$$\{rightgain_{up}\}_i[j,k] = \{g_r\}_i \otimes \{g_r\}_i$$
$$\{rightgain_{down}\}_i[j,k] = \{h_r\}_i + \{\lambda\}_i$$

S2.7. Each participant $i$ uses the left- and right-subtree splitting gain numerator and denominator shard matrices obtained in S2.6 to compute the splitting gain differences between different value intervals $k$ of different features $j$, and the coordinator's comparisons determine the selected feature $j_{best}$ and value interval $k_{best}$ corresponding to the maximum splitting gain;

S2.8. The sign variable of the maximum gain is determined;

S2.9. For a feature and division position meeting the division criterion, the indicator vector shards produced by the division are determined:

When the sign variable is 1, for the feature $j_{best}$ owned by the $i'$-th participant, that participant sets up an $M$-dimensional vector $SL$ recording which samples fall into the left subtree after splitting on this feature. It takes out the $k_{best}$-th value interval $(left_{k_{best}}, right_{k_{best}}]$, and for every sample in the sample set whose feature $j_{best}$ value $value_{j_{best}}$ satisfies $left_{k_{best}} < value_{j_{best}} \le right_{k_{best}}$ it sets the corresponding position of $SL$ to 1 and all remaining positions to 0. It likewise sets up the $M$-dimensional vector recording which samples fall into the right subtree after the division, $SR = \neg SL$, the negation of $SL$. For the $N$ participants in total, $SL$ and $SR$ are split by secret sharing into $N$ shards $\{SL\}_i$ and $\{SR\}_i$ and distributed to all participants $i$, $i = 1, \dots, i', \dots, N$;

S2.10. Each participant updates its first-order and second-order gradient vector shards and indicator vector shards:

Each participant $i$ receives $\{SL\}_i$ and $\{SR\}_i$ and recomputes its own left-subtree indicator vector shard $\{SL\}_i$ and right-subtree indicator vector shard $\{SR\}_i$:

$$\{SL\}_i = \{S\}_i \odot \{SL\}_i$$
$$\{SR\}_i = \{S\}_i \odot \{SR\}_i$$

where $\odot$ performs secret-sharing multiplication between the co-located elements of two vectors and yields a vector of the same dimension as $\{S\}_i$. It computes its own first-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{GL\}_i = \{G\}_i \odot \{SL\}_i, \qquad \{GR\}_i = \{G\}_i \odot \{SR\}_i$$

and its own second-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{HL\}_i = \{H\}_i \odot \{SL\}_i, \qquad \{HR\}_i = \{H\}_i \odot \{SR\}_i$$

S2.11. The variables of the left and right subtrees are constructed and specified:

Each participant $i$ sets $\{GL\}_i$, $\{HL\}_i$ and $\{SL\}_i$ as the first-order gradient vector shard, second-order gradient vector shard and indicator vector shard used for constructing the left subtree, and sets $\{GR\}_i$, $\{HR\}_i$ and $\{SR\}_i$ as those used for constructing the right subtree;

S2.12. When the current tree depth $d$ reaches the set limit $d_{max}$, or when the sign variable is not 1, the weight value of the leaf node is computed and the construction of left and right subtrees is stopped for the current node;

S2.13. The tree depth is increased and the decision tree is constructed recursively:

Set $d = d + 1$ and recursively execute S2.1 to S2.12 to complete the construction of the XGBoost joint decision tree.

S3. The prediction of each data sample by the $t$-th tree is generated and merged with the results of the previous $t-1$ trees, comprising:

S3.1. Local result prediction:

For the $t$-th tree $tree_t$, each participant $i$, for data sample $x_p$, uses its held partial features $x_p^{(i)}$ to perform leaf node prediction according to its local tree model $tree_t^{(i)}$. At each tree node: if the node's division information $(j_{best}, k_{best})$ belongs to this participant's features, it descends into the left or right subtree according to the feature and its value and continues the prediction, setting the flag bits of all leaf nodes of the subtree not entered to 0; if the division information $(j_{best}, k_{best})$ does not belong to this participant, the prediction continues along both the left and right subtrees of the node until leaves are reached; the flag bits of the leaf nodes the sample is attributed to are set to 1. Finally each participant $i$ obtains the tree prediction flag bits of all leaf nodes $\sigma$, $\sigma = 1, 2, \dots, \Delta$, concatenated into the vector $index_i$; at the same time it concatenates the $\Delta$ leaf weight shards $\{w_\sigma\}_i$ in the same order into the result vector $\{v_w\}_i$.
For example, as shown in FIG. 4, for one data sample the three participants can each determine their corresponding flag vectors 1 to 3 locally, and each participant holds a result vector shard $\{v_w\}_i$. The first participant holds the feature-division pairs $(j_1, k_1)$ and $(j_4, k_4)$, participant 2 holds the feature-division pair $(j_2, k_2)$, and participant 3 holds the feature-division pair $(j_3, k_3)$. The three decision trees are jointly equivalent to the decision tree containing the complete division information that would be obtained by training with the data stored on a single machine. The first to third participants each divide the sample according to the information known to them: when a node's division information is theirs they select the left or right subtree, and otherwise they search both subtrees; finally they respectively give the flag vectors $(1, 1, 1, 0, 0)$, $(0, 0, 1, 1, 1)$ and $(0, 1, 1, 0, 0)$ indicating the leaves to which this data sample may belong;

S3.2. Flag vector splitting and propagation:

Each participant $i$ splits $index_i$ by secret sharing into $\{index_i\}_{i'}$ and sends the shards to all participants $i'$, $i' = 1, \dots, i, \dots, N$;

S3.3. All participants compute their respective prediction result shards:

Each participant $i'$ receives the flag vector shards $\{index_i\}_{i'}$ sent by the participants $i$ and computes the bitwise accumulated product of all vector shards, $\{index\}_{i'} = \{index_1\}_{i'} \odot \{index_2\}_{i'} \odot \dots \odot \{index_N\}_{i'}$, then computes the bitwise product of the flag vector shard with its own weight shard, $\{v_{result}\}_{i'} = \{index\}_{i'} \odot \{v_w\}_{i'}$;

S3.4. Merging the prediction result shards:

Each participant $i'$ sums the elements of $\{v_{result}\}_{i'}$ as $\{weight_p\}_{i'} = sum(\{v_{result}\}_{i'})$ and sends the result to the first participant, which receives them and computes $weight_p = \sum_{i'=1}^{N} \{weight_p\}_{i'}$ and then $\hat{y}_p^{(t)} = \hat{y}_p^{(t-1)} + weight_p$, which becomes the prediction of sample $x_p$ after round $t$ ends;

S3.5. Computing the predictions of all samples:

Traverse all $p$ and compute the vector $\hat{y}^{(t)}$ formed by the $t$-th round predictions of all data samples $x_p$.

S4. The training rounds are increased iteratively and the construction of all the decision trees is completed:

Increase the tree count $t = t + 1$ and iterate steps S1 to S3 until all $T$ decision trees have been built.
Step S2.3 is carried out exactly as described in steps S2.3.1 to S2.3.5 above.
Step S2.7 specifically includes:
S2.7.1, each participant i sets the initial partition index list vector col = [1, 2, …, k_max] of the partitions currently taking part in the comparison, records its length as R_col, and initializes the per-feature selected partition index list vector col_selected;
S2.7.2, following the XGBoost algorithm, for the feature positions [j, col[r]] and [j, col[r+1]], where the elements of col are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_col/2⌋ − 1, col[r] denotes the r-th element of the index list col, [j, col[r]] denotes the col[r]-th element of the j-th row of a matrix, and ⌊R_col/2⌋ denotes R_col/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{col} = \left(\frac{leftgain_{up}[j,col[r]]}{leftgain_{down}[j,col[r]]} + \frac{rightgain_{up}[j,col[r]]}{rightgain_{down}[j,col[r]]}\right) - \left(\frac{leftgain_{up}[j,col[r+1]]}{leftgain_{down}[j,col[r+1]]} + \frac{rightgain_{up}[j,col[r+1]]}{rightgain_{down}[j,col[r+1]]}\right)$$
let:

nominator1_col = leftgain_up[j,col[r]] * leftgain_down[j,col[r+1]] − leftgain_up[j,col[r+1]] * leftgain_down[j,col[r]]
nominator2_col = rightgain_up[j,col[r]] * rightgain_down[j,col[r+1]] − rightgain_up[j,col[r+1]] * rightgain_down[j,col[r]]
denominator1_col = leftgain_down[j,col[r]] * leftgain_down[j,col[r+1]]
denominator2_col = rightgain_down[j,col[r]] * rightgain_down[j,col[r+1]]

then:

$$\Delta gain_{col} = \frac{nominator1_{col}}{denominator1_{col}} + \frac{nominator2_{col}}{denominator2_{col}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments of all partition positions of feature j:

{nominator1_col}_i = {leftgain_up}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]] − {leftgain_up}_i[j,col[r+1]] ⊙ {leftgain_down}_i[j,col[r]]
{nominator2_col}_i = {rightgain_up}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]] − {rightgain_up}_i[j,col[r+1]] ⊙ {rightgain_down}_i[j,col[r]]
{denominator1_col}_i = {leftgain_down}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]]
{denominator2_col}_i = {rightgain_down}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]]

where ⊙ denotes secret-sharing multiplication;
S2.7.3, each participant i sends its co-located calculation result fragment vectors {nominator1_col}_i, {nominator2_col}_i, {denominator1_col}_i and {denominator2_col}_i to the coordinator; the coordinator collects the vectors and calculates

$$col\_shared\_value = \frac{\sum_i \{nominator1_{col}\}_i}{\sum_i \{denominator1_{col}\}_i} + \frac{\sum_i \{nominator2_{col}\}_i}{\sum_i \{denominator2_{col}\}_i}$$

where the operation Σ_i{v}_i performed on the fragment vectors {v}_i sums the co-located elements of all fragments into one vector, and the division of two vectors v1 and v2, written v1\v2, is performed between co-located elements, i.e. (v1\v2)[r] = v1[r] / v2[r];
S2.7.4, initialize an empty list new_col and judge the r-th element of col_shared_value in turn, r = 1, …, ⌊R_col/2⌋: if it is non-negative, add col[2r−1] to new_col, otherwise add col[2r]; after the traversal ends, if the length of col is odd, add the last element of col to new_col; the coordinator then broadcasts new_col to all participants, and each participant sets col = new_col;
S2.7.5, while the length of col is greater than 1, iterate steps S2.7.2 to S2.7.4 until the length of col becomes 1, take out the only element col[0] in col, and record col_selected[j] = col[0];
S2.7.6, traverse all features j and iterate steps S2.7.1 to S2.7.5 to obtain the selected partition position of each feature, combining them into the complete feature partition index list vector col_selected; set the initial feature index list vector currently taking part in the comparison, row = [1, 2, …, num_feature], and record its length as R_row;
S2.7.7, following the XGBoost algorithm, for the feature positions [row[r], col_selected[row[r]]] and [row[r+1], col_selected[row[r+1]]], where the elements of row are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_row/2⌋ − 1, row[r] denotes the r-th element of the index list row, col_selected[row[r]] denotes the element of col_selected at index position row[r], and ⌊R_row/2⌋ denotes R_row/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{row} = \left(\frac{leftgain_{up}[row[r],col_{sel}[row[r]]]}{leftgain_{down}[row[r],col_{sel}[row[r]]]} + \frac{rightgain_{up}[row[r],col_{sel}[row[r]]]}{rightgain_{down}[row[r],col_{sel}[row[r]]]}\right) - \left(\frac{leftgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{leftgain_{down}[row[r+1],col_{sel}[row[r+1]]]} + \frac{rightgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{rightgain_{down}[row[r+1],col_{sel}[row[r+1]]]}\right)$$

where col_sel abbreviates col_selected;
let:

nominator1_row = leftgain_up[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]] − leftgain_up[row[r+1],col_selected[row[r+1]]] * leftgain_down[row[r],col_selected[row[r]]]
nominator2_row = rightgain_up[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]] − rightgain_up[row[r+1],col_selected[row[r+1]]] * rightgain_down[row[r],col_selected[row[r]]]
denominator1_row = leftgain_down[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]]
denominator2_row = rightgain_down[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]]

then:

$$\Delta gain_{row} = \frac{nominator1_{row}}{denominator1_{row}} + \frac{nominator2_{row}}{denominator2_{row}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments between the selected partition positions of all features [1, 2, …, num_feature]:

{nominator1_row}_i = {leftgain_up}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]] − {leftgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {leftgain_down}_i[row[r],col_selected[row[r]]]
{nominator2_row}_i = {rightgain_up}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]] − {rightgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {rightgain_down}_i[row[r],col_selected[row[r]]]
{denominator1_row}_i = {leftgain_down}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]]
{denominator2_row}_i = {rightgain_down}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]]
S2.7.8, each participant i sends its co-located calculation result fragment vectors {nominator1_row}_i, {nominator2_row}_i, {denominator1_row}_i and {denominator2_row}_i to the coordinator; the coordinator collects the vectors and calculates

$$row\_shared\_value = \frac{\sum_i \{nominator1_{row}\}_i}{\sum_i \{denominator1_{row}\}_i} + \frac{\sum_i \{nominator2_{row}\}_i}{\sum_i \{denominator2_{row}\}_i}$$
S2.7.9, initialize an empty list new_row and traverse row_shared_value, judging its r-th element in turn, r = 1, …, ⌊R_row/2⌋: if it is non-negative, add row[2r−1] to new_row, otherwise add row[2r]; after the traversal ends, if the length of row is odd, add the last element of row to new_row; the coordinator then broadcasts new_row to all participants, and each participant sets row = new_row;
S2.7.10, while the length of row is greater than 1, iterate steps S2.7.7 to S2.7.9 until the length of row becomes 1, take out the only element row[0] in row, record j_best = row[0] and k_best = col_selected[j_best], and broadcast them to all participants, thereby determining the selected optimal feature number j_best and the optimal partition position k_best of that feature.
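The pairwise elimination of steps S2.7.2 to S2.7.10 is essentially a tournament argmax in which the coordinator only ever learns sign information about gain differences, never the gains themselves. The following toy simulation (hypothetical names, 0-based indexing, plaintext gains kept only for checking, and the numerator/denominator fragments collapsed into a shared difference) shows the halving reduction for one list:

```python
import numpy as np

def split(x, n):
    """Additive secret sharing: n fragments that sum to x."""
    s = [np.random.randn(*np.shape(x)) for _ in range(n - 1)]
    return s + [np.asarray(x, float) - sum(s)]

rng = np.random.default_rng(1)
N = 3
gains = rng.uniform(0, 5, size=9)   # hypothetical split gains per partition

def coordinator_round(col):
    """One S2.7.2-to-S2.7.4 round: halve col by pairwise comparison."""
    new_col = []
    for r in range(len(col) // 2):
        a, b = col[2 * r], col[2 * r + 1]
        # participants send fragments of gain[a] - gain[b]; the coordinator
        # reconstructs only this difference, never the gains themselves
        diff_frags = split(gains[a] - gains[b], N)
        new_col.append(a if sum(diff_frags) >= 0 else b)
    if len(col) % 2 == 1:
        new_col.append(col[-1])     # odd leftover survives to the next round
    return new_col

col = list(range(len(gains)))
while len(col) > 1:
    col = coordinator_round(col)

print(col[0], int(np.argmax(gains)))  # tournament winner equals the argmax
```

The maximum-gain position wins every pair it appears in, so it survives each halving round; the same reduction is run first over partitions (col) and then over features (row).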
Step S2.8 specifically includes:
S2.8.1, for the given maximum splitting gain feature j_best and value interval k_best, the splitting gain expression in the XGBoost algorithm is:

$$Gain = \frac{leftgain_{up}[j_{best},k_{best}]}{leftgain_{down}[j_{best},k_{best}]} + \frac{rightgain_{up}[j_{best},k_{best}]}{rightgain_{down}[j_{best},k_{best}]} - \frac{gain_{up}}{gain_{down}}$$

each participant i calculates its own splitting gain numerator fragment {nominator}_i over the common denominator:

{nominator}_i = {leftgain_up}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i + {rightgain_up}_i[j_best,k_best] ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {gain_down}_i − {gain_up}_i ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best]

and calculates its own splitting gain denominator fragment {denominator}_i:

{denominator}_i = {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i
S2.8.2, the remaining participants send their splitting gain numerator fragments {nominator}_2, …, {nominator}_N to the first participant; the first participant collects {nominator}_2, …, {nominator}_N and calculates

$$nominator = \sum_{i=1}^{N} \{nominator\}_i$$

sets the first participant sign sign_1, and judges the sign by letting:

$$sign_1 = \begin{cases} 1, & nominator > 0 \\ -1, & nominator \le 0 \end{cases}$$
S2.8.3, each participant i sends its splitting gain denominator fragment {denominator}_i to the coordinator; the coordinator collects {denominator}_i, i = 1, …, N, and calculates

$$denominator = \sum_{i=1}^{N} \{denominator\}_i$$

sets the coordinator sign sign_0, and judges the sign by letting:

$$sign_0 = \begin{cases} 1, & denominator > 0 \\ -1, & denominator \le 0 \end{cases}$$
S2.8.4, the first participant sends sign_1 to the coordinator; the coordinator receives it, calculates sign = sign_1 * sign_0, and broadcasts sign to all participants, and all participants take the received value as the currently established symbol variable;
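Step S2.8 splits the sign decision so that the first participant learns only the sign of the gain numerator and the coordinator only the sign of the denominator; their product is the sign of the full gain, which is all the split decision needs. A minimal sketch under those assumptions (hypothetical names, toy totals):

```python
import numpy as np

def split(x, n):
    """Additive secret sharing of a scalar."""
    s = [np.random.randn() for _ in range(n - 1)]
    return s + [float(x) - sum(s)]

rng = np.random.default_rng(2)
N = 3
gain_num = rng.normal()          # hypothetical total splitting gain numerator
gain_den = abs(rng.normal())     # denominator (positive: sums of H + lambda)

nom_frags = split(gain_num, N)   # {nominator}_i, sent to the first participant
den_frags = split(gain_den, N)   # {denominator}_i, sent to the coordinator

# S2.8.2: the first participant reconstructs only the numerator's sign
sign1 = 1 if sum(nom_frags) > 0 else -1
# S2.8.3: the coordinator reconstructs only the denominator's sign
sign0 = 1 if sum(den_frags) > 0 else -1
# S2.8.4: the product is the sign of the full splitting gain
sign = sign1 * sign0
assert sign == (1 if gain_num / gain_den > 0 else -1)
print("split" if sign == 1 else "stop")
```

Neither party learns the gain's magnitude, and neither learns the other's value, yet together they decide whether the node is worth splitting.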
Step S2.12 specifically includes:
S2.12.1, each participant i calculates half of the sum of its second-order gradient sum fragment and regularization term fragment:

$$\{h'\}_i = \frac{\{SH\}_i + \{\lambda\}_i}{2}$$

each participant i also takes its own first-order gradient sum fragment:

{g'}_i = {SG}_i

S2.12.2, each participant i determines the order-of-magnitude digit μ_i of {h'}_i so that:

$$|\{h'\}_i| \le 10^{\mu_i}$$
S2.12.3, all participants send their order-of-magnitude digits; the coordinator receives them, selects the largest digit as μ_m, determines the iteration step size

$$\tau = 10^{-\mu_m}$$

and sends the step size τ and the iteration count iter to all participants;
S2.12.4, each participant i sets a random initial weight fragment {w^(0)}_i and an auxiliary fragment variable {z^(0)}_i with initial value 0; starting from κ = 1, the iteration proceeds according to:

$$\{z^{(\kappa)}\}_i = 2\{h'\}_i \odot \{w^{(\kappa-1)}\}_i + \{g'\}_i$$

$$\{w^{(\kappa)}\}_i = \{w^{(\kappa-1)}\}_i - \tau \{z^{(\kappa)}\}_i$$

setting κ = κ + 1 after each iteration and terminating when κ = iter; in this way the reconstructed weight w = Σ_i{w}_i approaches the leaf weight −SG/(SH + λ) without any participant ever performing the division on plaintext data; after the calculation ends, participant i records the weight fragment

$$\{w\}_i = \{w^{(iter)}\}_i$$

as the corresponding element of its leaf weight fragment vector {v_w}_i used in step S3.3;
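Under the reconstruction above, the leaf weight is obtained by jointly running gradient descent on f(w) = h'w² + g'w, whose minimizer is exactly the XGBoost leaf weight −SG/(SH + λ). The toy sketch below illustrates this under that assumption, with an idealized share multiplication and hypothetical names; it is not the patent's implementation.

```python
import numpy as np

def split(x, n):
    """Additive secret sharing: n fragments that sum to x."""
    s = [np.random.randn(*np.shape(x)) for _ in range(n - 1)]
    return s + [np.asarray(x, float) - sum(s)]

def mul_shares(xs, ys, n):
    """Idealized stand-in for Beaver-style share multiplication."""
    return split(sum(xs) * sum(ys), n)

rng = np.random.default_rng(3)
N, lam = 3, 1.0
SG, SH = -4.2, 7.5                      # leaf gradient sums (plaintext for checking)
g_frags = split(SG, N)                  # {g'}_i = {SG}_i
h_frags = [0.5 * (sh + l)               # {h'}_i = ({SH}_i + {lambda}_i) / 2
           for sh, l in zip(split(SH, N), split(lam, N))]

# S2.12.2/S2.12.3: step size from the largest fragment magnitude digit
mu_m = max(int(np.ceil(np.log10(abs(h) + 1e-12))) for h in h_frags)
tau, iters = 10.0 ** (-mu_m), 200

# S2.12.4: joint gradient descent on f(w) = h' w^2 + g' w
w_frags = split(rng.normal(), N)        # random initial weight fragments
for _ in range(iters):
    hw = mul_shares(h_frags, w_frags, N)            # fragments of h' * w
    z = [2 * hw[i] + g_frags[i] for i in range(N)]  # gradient fragments
    w_frags = [w_frags[i] - tau * z[i] for i in range(N)]

w = sum(w_frags)
print(w, -SG / (SH + lam))              # converges to the XGBoost leaf weight
```

With τ chosen from the largest fragment magnitude, the contraction factor |1 − 2τh'| stays below 1 for this toy data; the patent's exchange of magnitude digits plays the same role of bounding the step size without revealing h' itself.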
In addition, it should be noted that the specific embodiments described in this specification may differ in naming; the contents described above are only illustrations of the structure of the present invention. All equivalent or simple changes made according to the structure, characteristics and principles of the invention are included in the protection scope of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or adopt similar methods without departing from the scope of the invention as defined in the appended claims.

Claims (10)

1. An XGboost prediction model training method for protecting multi-party data privacy, characterized by comprising a plurality of participants and a coordinator, wherein the participant holding labels first calculates first-order and second-order gradient vectors and an indication vector using the current model prediction results and the label values; the remaining participants, assisted by the coordinator, jointly compute and construct a joint decision tree model based on the XGboost algorithm through secret sharing; the participants cooperate with one another to determine the prediction results of the training data on the joint decision tree model; and finally all participants and the coordinator iterate together to complete the construction of a plurality of joint decision tree models, obtaining a complete multi-party prediction model.
2. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 1, wherein the specific steps of training the joint decision tree model are as follows:
S1, the participant holding labels serves as the first participant and sets the initial tree number, the initial depth, the regularization parameter and the maximum depth of the tree to be constructed; the regularization parameter is split through secret sharing and all set parameters are sent to all participants; random non-repeating feature number indexes are generated for each participant according to its number of features; the first participant holding labels calculates the first-order gradient vector and the second-order gradient vector from the current model prediction result vector and the sample label vector, generates an initial all-ones indication vector, splits each of them through secret sharing into the corresponding number of first-order gradient vector fragments, second-order gradient vector fragments and indication vector fragments, and distributes them to all participants;
S2, after each participant i receives its first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment, it calculates its fragment of the first-order gradient sum and its fragment of the second-order gradient sum, and uses secret sharing to calculate the numerator fragments and denominator fragments of the splitting gain of each group under each feature; the maximum splitting gain, the feature and group it belongs to, and whether to partition are determined with the help of the coordinator; when partitioning is performed, if the selected feature belongs to a specific participant, that participant generates the left sub-tree indication vector and the right sub-tree indication vector after partitioning, which respectively indicate the samples in the left and right subsets obtained by dividing the sample set according to the feature and group corresponding to the maximum splitting gain, the left and right subsets corresponding to the left sub-tree and the right sub-tree respectively; the left sub-tree indication vector and the right sub-tree indication vector are split through secret sharing into a plurality of fragments and distributed to the participants; each participant uses the received fragments and its own indication vector fragment to calculate the left sub-tree first-order and second-order gradient vector fragments after the sample set is divided into the left sub-tree, and the right sub-tree first-order and second-order gradient vector fragments after the sample set is divided into the right sub-tree; the left sub-tree is constructed recursively using the left sub-tree first-order gradient vector fragments, second-order gradient vector fragments and left sub-tree indication vector, and the right sub-tree is constructed recursively using the right sub-tree first-order gradient vector fragments, second-order gradient vector fragments and right sub-tree indication vector; a depth increase condition is set for the recursion, and if no partition is performed or the preset maximum depth is reached, each participant calculates its corresponding fragment of the weight of the current leaf node of the decision tree;
S3, for each data sample, each participant uses the partial features it holds to calculate the prediction result of the current joint decision tree model and accumulates it onto the results of the previously built tree models, generating the comprehensive prediction result of the multiple tree models for the data sample;
and S4, increasing the number of tree models and iterating steps S1 to S3 until the target number of joint decision tree models is constructed.
3. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the secret sharing algorithms in steps S1, S2 and S3 comprise secret sharing splitting, secret sharing addition, secret sharing subtraction and secret sharing multiplication.
4. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the step S1 specifically comprises:
S1.1, the first participant sets the initial tree number, the initial depth, the regularization parameter and the maximum depth of the tree to be constructed, splits the regularization parameter through secret sharing and distributes all set parameters to all participants; the coordinator counts the total number of features across the participants, generates an array with that number of elements, shuffles it, and randomly assigns to each participant a number of array elements equal to its feature count, the array elements obtained by different participants not overlapping; each participant establishes a one-to-one mapping map(j) from its shuffled array elements to its own feature numbers and records and stores the mapping locally;
S1.2, each participant calculates the maximum number of feature values among its own sample features and sends it to the coordinator; the coordinator determines the maximum number of feature values over all participants and broadcasts it to all participants;
S1.3, starting from the first participant holding the label data, the participants use the same loss function; the first participant calculates the first-order gradient vector, the second-order gradient vector and the initial all-ones indication vector from the model prediction result vector and the label value vector, with the initial prediction result of each piece of data set to 0, splits each of them through the secret sharing algorithm into multiple first-order gradient vector fragments, second-order gradient vector fragments and indication vector fragments, and distributes them to the corresponding participants.
5. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the step S2 specifically comprises:
S2.1, after each participant i receives its first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment, it calculates the i-th fragment {SG}_i of the first-order gradient sum and the i-th fragment {SH}_i of the second-order gradient sum, where {SG}_i and {SH}_i are the sums of the elements of the first-order gradient vector fragment and of the second-order gradient vector fragment owned by participant i, respectively;
S2.2, in the XGBoost algorithm, for a given tree node with first-order gradient sum SG and second-order gradient sum SH over all data of the node and regularization term λ, the non-splitting gain is expressed as:

$$gain = \frac{SG^2}{SH + \lambda}$$

each participant calculates its own non-splitting gain numerator fragment and non-splitting gain denominator fragment according to the non-splitting gain formula, specifically:

{gain_up}_i = {SG}_i ⊙ {SG}_i
{gain_down}_i = {SH}_i + {λ}_i

where {gain_up}_i is the non-splitting gain numerator fragment, {gain_down}_i is the non-splitting gain denominator fragment, ⊙ denotes secret-sharing multiplication, and {λ}_i is the i-th fragment of the hyperparameter λ;
S2.3, each participant i uses its first-order and second-order gradient vector fragments to calculate the first-order gradient sum fragment matrix {BG}_i and the second-order gradient sum fragment matrix {BH}_i over all value intervals of all its own features;
S2.4, each participant i initializes the left sub-tree splitting gain numerator fragment matrix {leftgain_up}_i, the left sub-tree splitting gain denominator fragment matrix {leftgain_down}_i, the right sub-tree splitting gain numerator fragment matrix {rightgain_up}_i and the right sub-tree splitting gain denominator fragment matrix {rightgain_down}_i;
S2.5, for the feature j, each participant initializes the left sub-tree accumulated first-order gradient fragment variable {gl}_i, the left sub-tree accumulated second-order gradient fragment variable {hl}_i, the right sub-tree accumulated first-order gradient fragment variable {gr}_i and the right sub-tree accumulated second-order gradient fragment variable {hr}_i, all to 0;
S2.6, each participant i traverses the value intervals k and updates the left sub-tree accumulated first-order and second-order gradient fragment variables as:

{gl}_i = {gl}_i + {BG}_i[j,k]
{hl}_i = {hl}_i + {BH}_i[j,k]

where {BG}_i[j,k] and {BH}_i[j,k] denote the [j,k]-th elements of the fragment matrices {BG}_i and {BH}_i respectively,

and updates the right sub-tree accumulated first-order and second-order gradient fragment variables as:

{gr}_i = {SG}_i − {gl}_i
{hr}_i = {SH}_i − {hl}_i
For the XGBoost model, the splitting gain calculation formula used is:

$$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

where λ is the regularization parameter and G_L, H_L, G_R, H_R are the left- and right-subtree first-order and second-order gradient sums;

each participant directly calculates the left- and right-subtree splitting gain numerator and denominator fragments at the element positions of the corresponding value intervals of the corresponding features and updates them into the matrices:

{leftgain_up}_i[j,k] = {gl}_i ⊙ {gl}_i
{leftgain_down}_i[j,k] = {hl}_i + {λ}_i
{rightgain_up}_i[j,k] = {gr}_i ⊙ {gr}_i
{rightgain_down}_i[j,k] = {hr}_i + {λ}_i

where j is the feature number and k is the value interval number;
S2.7, each participant uses the left- and right-subtree splitting gain numerator and denominator fragment matrices to traverse and calculate the splitting gain differences between the value intervals of every feature j, and the selected feature j_best and value interval k_best corresponding to the maximum splitting gain are determined through comparison by the coordinator;
S2.8, for the maximum splitting gain feature and value interval, each participant uses the left- and right-subtree splitting gain numerator and denominator fragments at that position together with the non-splitting gain numerator and denominator fragments to calculate its total splitting gain numerator fragment, which is sent to the first participant, and its total splitting gain denominator fragment, which is sent to the coordinator; the coordinator sums the denominator fragments and determines their sign, the first participant sums the numerator fragments and determines their sign, and from these two signs the symbol variable corresponding to the final maximum gain is determined;
S2.9, when the symbol variable is 1, the participant owning the maximum splitting gain feature j_best sets an all-zero M-dimensional column vector SL recording which samples fall into the left sub-tree after partitioning by this feature, takes out the value interval (left_{k_best}, right_{k_best}] corresponding to the maximum splitting gain, and sets to 1 the elements of SL at the positions where the value value_{j_best} of sample feature j_best in the sample set satisfies left_{k_best} < value_{j_best} ≤ right_{k_best}; it then sets the M-dimensional vector SR recording which samples fall into the right sub-tree after the partition as SR = 1 − SL, i.e. SL negated; for the N total participants, the left sub-tree indication vector and the right sub-tree indication vector are split through secret sharing into N fragments and distributed to all participants;
S2.10, each participant receives the fragments of the left sub-tree indication vector and the right sub-tree indication vector and recalculates its own left sub-tree indication vector fragment {SL}_i and right sub-tree indication vector fragment {SR}_i:

{SL}_i = {S}_i ⊙ {SL}_i
{SR}_i = {S}_i ⊙ {SR}_i

where ⊙ performs secret-sharing multiplication between co-located elements and yields a vector of the same dimension as {S}_i; each participant calculates its own first-order gradient vector fragments of the samples falling into the left and right sub-trees, {GL}_i and {GR}_i:

{GL}_i = {G}_i ⊙ {SL}_i
{GR}_i = {G}_i ⊙ {SR}_i

and its own second-order gradient vector fragments of the samples falling into the left and right sub-trees, {HL}_i and {HR}_i:

{HL}_i = {H}_i ⊙ {SL}_i
{HR}_i = {H}_i ⊙ {SR}_i

S2.11, each participant i sets {GL}_i, {HL}_i and {SL}_i as the first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment used for constructing the left sub-tree, and sets {GR}_i, {HR}_i and {SR}_i as the first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment used for constructing the right sub-tree;
S2.12, when the current depth of the tree reaches the set limit or the symbol variable is not 1, the weight value of the leaf node is calculated and the construction of left and right sub-trees stops at the current node;
S2.13, setting a depth-increase condition for the recursion and recursively executing steps S2.1 to S2.12 to construct the XGboost joint decision tree model.
6. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.3 specifically comprises:
S2.3.1, each participant initializes the corresponding-dimension matrix {BG}_i recording the interval first-order gradient sum fragments and the corresponding-dimension matrix {BH}_i recording the interval second-order gradient sum fragments;
S2.3.2, for each owned feature number j, j = 1, 2, …, num_feature, the participant maps the feature number using the feature index of step S1.1, i.e. maps j to its own feature map(j), counts all the partition values owned by the feature and records the number of values;
S2.3.3, the participant sets a multi-dimensional matrix Matrix_index recording which samples fall into each feature partition; for the feature j, all feature values val_k are arranged from small to large, k = 1, …, k_j, and left_k = val_{k−1} with left_1 = −∞ and right_k = val_k are set; k is traversed, the k-th value interval (left_k, right_k] is taken out, an all-zero column vector S' is initialized, and the positions where the value value_map(j) of sample feature map(j) in the participant's sample set satisfies left_k < value_map(j) ≤ right_k are set to 1; Matrix_index[k, :] = S'^T is recorded, and after the partition traversal ends the matrix is split through secret sharing into N fragments {Matrix_index}_i and distributed to the respective participants;
S2.3.4, participant i receives {Matrix_index}_i and, for the j-th feature, traverses k up to the maximum value interval number, calculating the first-order gradient sum fragment and the second-order gradient sum fragment:

{BG}_i[j,k] = sum({Matrix_index}_i[k,:] ⊙ {G}_i)
{BH}_i[j,k] = sum({Matrix_index}_i[k,:] ⊙ {H}_i)

where [k, :] selects all elements of the k-th row of the matrix and sum(v) denotes the sum of the elements of vector v;
S2.3.5, traversing all features and executing steps S2.3.2 to S2.3.4 so that all participants complete the calculation of the first-order gradient sum fragments and the second-order gradient sum fragments.
7. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.7 specifically comprises:
S2.7.1, each participant sets the initial partition index list vector col = [1, 2, …, k_max] of the partitions currently taking part in the comparison, where k_max is the maximum number of feature partitions, records the length of col as R_col, and initializes the per-feature selected partition index list vector col_selected;
S2.7.2, following the XGBoost algorithm, for the feature positions [j, col[r]] and [j, col[r+1]], where the elements of col are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_col/2⌋ − 1, col[r] denotes the r-th element of the index list col, [j, col[r]] denotes the col[r]-th element of the j-th row of a matrix, and ⌊R_col/2⌋ denotes R_col/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{col} = \left(\frac{leftgain_{up}[j,col[r]]}{leftgain_{down}[j,col[r]]} + \frac{rightgain_{up}[j,col[r]]}{rightgain_{down}[j,col[r]]}\right) - \left(\frac{leftgain_{up}[j,col[r+1]]}{leftgain_{down}[j,col[r+1]]} + \frac{rightgain_{up}[j,col[r+1]]}{rightgain_{down}[j,col[r+1]]}\right)$$
let:

nominator1_col = leftgain_up[j,col[r]] * leftgain_down[j,col[r+1]] − leftgain_up[j,col[r+1]] * leftgain_down[j,col[r]]
nominator2_col = rightgain_up[j,col[r]] * rightgain_down[j,col[r+1]] − rightgain_up[j,col[r+1]] * rightgain_down[j,col[r]]
denominator1_col = leftgain_down[j,col[r]] * leftgain_down[j,col[r+1]]
denominator2_col = rightgain_down[j,col[r]] * rightgain_down[j,col[r+1]]

then:

$$\Delta gain_{col} = \frac{nominator1_{col}}{denominator1_{col}} + \frac{nominator2_{col}}{denominator2_{col}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments of all partition positions of feature j:

{nominator1_col}_i = {leftgain_up}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]] − {leftgain_up}_i[j,col[r+1]] ⊙ {leftgain_down}_i[j,col[r]]
{nominator2_col}_i = {rightgain_up}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]] − {rightgain_up}_i[j,col[r+1]] ⊙ {rightgain_down}_i[j,col[r]]
{denominator1_col}_i = {leftgain_down}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]]
{denominator2_col}_i = {rightgain_down}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]]
S2.7.3, each participant i sends its co-located calculation result fragment vectors {nominator1_col}_i, {nominator2_col}_i, {denominator1_col}_i and {denominator2_col}_i to the coordinator; the coordinator collects the vectors and calculates

$$col\_shared\_value = \frac{\sum_i \{nominator1_{col}\}_i}{\sum_i \{denominator1_{col}\}_i} + \frac{\sum_i \{nominator2_{col}\}_i}{\sum_i \{denominator2_{col}\}_i}$$

where the operation Σ_i{v}_i performed on the fragment vectors {v}_i sums the co-located elements of all fragments into one vector, and the division of two vectors v1 and v2, written v1\v2, is performed between co-located elements, i.e. (v1\v2)[r] = v1[r] / v2[r];
S2.7.4, initialize an empty list new_col and judge the r-th element of col_shared_value in turn, r = 1, …, ⌊R_col/2⌋: if it is non-negative, add col[2r−1] to new_col, otherwise add col[2r]; after the traversal ends, if the length of col is odd, add the last element of col to new_col; the coordinator then broadcasts new_col to all participants, and each participant sets col = new_col;
S2.7.5, while the length of col is greater than 1, iterating steps S2.7.2 to S2.7.4 until the length of col becomes 1, taking out the only element col[0] in col and recording col_selected[j] = col[0];
S2.7.6, traversing all features j and iterating steps S2.7.1 to S2.7.5 to obtain the selected partition position of each feature, combining them into the complete feature partition index list vector col_selected; the initial feature index list vector currently taking part in the comparison is set as row = [1, 2, …, num_feature] and its length is recorded as R_row, where num_feature is the number of all features;
S2.7.7, following the XGBoost algorithm, for the feature positions [row[r], col_selected[row[r]]] and [row[r+1], col_selected[row[r+1]]], where the elements of row are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_row/2⌋ − 1, row[r] denotes the r-th element of the index list row, col_selected[row[r]] denotes the element of col_selected at index position row[r], and ⌊R_row/2⌋ denotes R_row/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{row} = \left(\frac{leftgain_{up}[row[r],col_{sel}[row[r]]]}{leftgain_{down}[row[r],col_{sel}[row[r]]]} + \frac{rightgain_{up}[row[r],col_{sel}[row[r]]]}{rightgain_{down}[row[r],col_{sel}[row[r]]]}\right) - \left(\frac{leftgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{leftgain_{down}[row[r+1],col_{sel}[row[r+1]]]} + \frac{rightgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{rightgain_{down}[row[r+1],col_{sel}[row[r+1]]]}\right)$$

where col_sel abbreviates col_selected;
let:

nominator1_row = leftgain_up[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]] − leftgain_up[row[r+1],col_selected[row[r+1]]] * leftgain_down[row[r],col_selected[row[r]]]
nominator2_row = rightgain_up[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]] − rightgain_up[row[r+1],col_selected[row[r+1]]] * rightgain_down[row[r],col_selected[row[r]]]
denominator1_row = leftgain_down[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]]
denominator2_row = rightgain_down[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]]

then:

$$\Delta gain_{row} = \frac{nominator1_{row}}{denominator1_{row}} + \frac{nominator2_{row}}{denominator2_{row}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments between the selected partition positions of all features [1, 2, …, num_feature], specifically:

{nominator1_row}_i = {leftgain_up}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]] − {leftgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {leftgain_down}_i[row[r],col_selected[row[r]]]
{nominator2_row}_i = {rightgain_up}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]] − {rightgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {rightgain_down}_i[row[r],col_selected[row[r]]]
{denominator1_row}_i = {leftgain_down}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]]
{denominator2_row}_i = {rightgain_down}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]]
S2.7.8, each participant i sends its co-located calculation result fragment vectors {nominator1_row}_i, {nominator2_row}_i, {denominator1_row}_i and {denominator2_row}_i to the coordinator; the coordinator collects the vectors and calculates

$$row\_shared\_value = \frac{\sum_i \{nominator1_{row}\}_i}{\sum_i \{denominator1_{row}\}_i} + \frac{\sum_i \{nominator2_{row}\}_i}{\sum_i \{denominator2_{row}\}_i}$$
S2.7.9, initialize an empty list new_row and traverse row_shared_value, judging its r-th element in turn, r = 1, …, ⌊R_row/2⌋: if it is non-negative, add row[2r−1] to new_row, otherwise add row[2r]; after the traversal ends, if the length of row is odd, add the last element of row to new_row; the coordinator then broadcasts new_row to all participants, and each participant sets row = new_row;
S2.7.10, while the length of row is greater than 1, iterating steps S2.7.7 to S2.7.9 until the length of row becomes 1, taking out the only element row[0] in row, recording j_best = row[0] and k_best = col_selected[j_best], and broadcasting them to all participants, thereby determining the selected optimal feature number j_best and the optimal partition position k_best of that feature.
8. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.8 specifically comprises:
S2.8.1, the splitting gain at the maximum splitting gain feature and value interval is calculated, specifically:

$$Gain = \frac{leftgain_{up}[j_{best},k_{best}]}{leftgain_{down}[j_{best},k_{best}]} + \frac{rightgain_{up}[j_{best},k_{best}]}{rightgain_{down}[j_{best},k_{best}]} - \frac{gain_{up}}{gain_{down}}$$

each participant i calculates its own splitting gain numerator fragment over the common denominator:

{nominator}_i = {leftgain_up}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i + {rightgain_up}_i[j_best,k_best] ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {gain_down}_i − {gain_up}_i ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best]

and its own splitting gain denominator fragment:

{denominator}_i = {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i
S2.8.2, the remaining participants respectively send their splitting gain numerator fragments to the first participant; the first participant collects them, calculates the total numerator, sets the first participant sign and judges the sign by letting:

$$sign_1 = \begin{cases} 1, & nominator > 0 \\ -1, & nominator \le 0 \end{cases}$$

where sign_1 is the first participant sign;
S2.8.3, all participants send their splitting gain denominator fragments to the coordinator; the coordinator collects them, calculates the total denominator, sets the coordinator sign and judges the sign by letting:

$$sign_0 = \begin{cases} 1, & denominator > 0 \\ -1, & denominator \le 0 \end{cases}$$

where sign_0 is the coordinator sign;
S2.8.4, the first participant sends the first participant sign to the coordinator; the coordinator receives it, calculates the total sign sign = sign_1 * sign_0 and broadcasts it to all participants, and all participants take the received value as the currently established symbol variable.
9. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.12 specifically comprises:
S2.12.1, each participant calculates half of the sum of its second-order gradient sum fragment and regularization term fragment:

$$\{h'\}_i = \frac{\{SH\}_i + \{\lambda\}_i}{2}$$

where {SH}_i is the second-order gradient sum fragment and {λ}_i is the regularization term fragment;

each participant also takes its own first-order gradient sum fragment:

{g'}_i = {SG}_i

where {SG}_i is the first-order gradient sum fragment;

S2.12.2, each participant determines the order-of-magnitude digit μ_i of {h'}_i so that:

$$|\{h'\}_i| \le 10^{\mu_i}$$
S2.12.3, all participants send their order-of-magnitude digits; the coordinator receives them, selects the largest digit as μ_m, determines the iteration step size

$$\tau = 10^{-\mu_m}$$

and sends the step size τ and the iteration count iter to all participants;
S2.12.4, each participant i sets a random initial weight fragment {w^(0)}_i and an auxiliary fragment variable {z^(0)}_i with initial value 0; starting from κ = 1, the iteration proceeds according to:

$$\{z^{(\kappa)}\}_i = 2\{h'\}_i \odot \{w^{(\kappa-1)}\}_i + \{g'\}_i$$

$$\{w^{(\kappa)}\}_i = \{w^{(\kappa-1)}\}_i - \tau \{z^{(\kappa)}\}_i$$

setting κ = κ + 1 after each iteration and terminating when κ = iter; after the calculation ends, participant i records the weight fragment {w}_i = {w^(iter)}_i.
10. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the step S3 specifically comprises:
S3.1, for a data sample, each participant uses the partial features it holds to predict leaf nodes according to its local tree model: for each tree node whose partition information it owns, prediction follows that partition information and the flag bits of all leaf nodes of the branch not entered are set to 0; if the partition information is not a feature owned by the participant, prediction proceeds along both the left and right sub-trees of the node until leaf nodes whose attribution its features can determine are found, and their flag bits are set to 1; finally each participant obtains the flag bits of all leaf nodes generated by predicting with the features it holds, splices them into a mark vector following the leaf-node order of the joint decision tree structure, and likewise splices the leaf weights into a result vector in the same order;
S3.2, each participant splits its mark vector through secret sharing and sends the fragments to all participants;
S3.3, each participant receives the mark vector fragments sent by the other participants, calculates the bitwise cumulative product of all the fragment vectors, and calculates the bitwise product of that result and its own weight fragment;
S3.4, each participant sums the elements of the bitwise product result and sends it to the first participant, which receives the fragments and calculates the prediction result;
S3.5, all data samples are traversed and the vector formed by combining the corresponding prediction results is calculated.
CN202011452494.9A 2020-12-12 2020-12-12 XGboost prediction model training method for protecting multi-party data privacy Active CN112700031B (en)
