CN112700031A - XGBoost prediction model training method for protecting multi-party data privacy

Info

Publication number
CN112700031A
Authority
CN
China
Prior art keywords
participant
vector
col
order gradient
participants
Prior art date
Legal status
Granted
Application number
CN202011452494.9A
Other languages
Chinese (zh)
Other versions
CN112700031B (en)
Inventor
Shi Qingjiang
Xie Lunchen
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University
Priority to CN202011452494.9A
Publication of CN112700031A
Application granted
Publication of CN112700031B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning


Abstract

The invention relates to an XGBoost prediction model training method that protects multi-party data privacy. The method involves multiple participants and a coordinator. The participant holding the labels first computes first-order and second-order gradient vectors and an indicator vector from the current model's predictions and the label values. All participants, assisted by the coordinator, then jointly compute and construct a joint decision tree model based on the XGBoost algorithm through secret sharing. The participants cooperate to determine the predictions of the training data under the joint decision tree model, and finally all participants and the coordinator iterate together to build multiple joint decision trees, yielding a complete multi-party prediction model. Compared with the prior art, the method trains a multi-party XGBoost ensemble model across data sources while protecting data privacy, improving the model's predictive power while keeping the data secure.

Description

XGBoost prediction model training method for protecting multi-party data privacy
Technical Field
The invention relates to the technical field of machine learning, and in particular to an XGBoost prediction model training method that protects multi-party data privacy.
Background
The XGBoost algorithm is an ensemble learning algorithm characterized by fast model construction and accurate prediction. It was designed for the machine learning setting in which all data features reside on the same machine. It therefore cannot handle the situation in which multiple parties each hold different features of the same batch of data samples and one party holds the label information but may not transmit it to the other parties.
To protect the privacy of each party's data, the machine learning field has adopted vertical federated learning training schemes, which aim to achieve model accuracy close or equal to that obtained when the data sit on a single machine. Current vertical federated learning algorithms mainly involve two parties: they strictly limit the number of institutions that can cooperate and cannot easily be extended to an arbitrary number of parties, and, to make multi-party cooperation feasible, they simplify or approximate the machine learning model, which costs precision in the computed results.
Disclosure of Invention
The object of the invention is to provide an XGBoost prediction model training method that protects multi-party data privacy, overcoming the prior art's obstacles to data interaction and its loss of data precision during cooperation.
The object of the invention is achieved by the following technical scheme:
An XGBoost prediction model training method for protecting multi-party data privacy involves multiple participants and a coordinator. The participant holding the labels first computes first-order and second-order gradient vectors and an indicator vector from the current model's predictions and the label values; the participants, assisted by the coordinator, jointly compute and construct a joint decision tree model based on the XGBoost algorithm through secret sharing; the participants cooperate to determine the predictions of the training data under the joint decision tree model; and finally all participants and the coordinator iterate together to build multiple joint decision trees, obtaining a complete multi-party prediction model.
The specific steps for training the joint decision tree model are as follows:

S1. The first participant sets the initial tree number $t = 1$, the initial depth $d = 1$, the regularization parameter $\lambda$ and the maximum depth $d_{max}$. For the $N$ participants in total, secret-sharing splitting yields $\{\lambda\}_i$, and all set parameters are distributed to every participant $i$. Each participant $i$, owning $num_i$ features, generates $num_i$ random non-repeating feature number indexes. The first participant, which holds the labels, uses the current model prediction result vector $\hat{y}^{(t-1)}$ and the sample label vector $y$ to compute the first-order gradient vector $G$ and the second-order gradient vector $H$, and generates the initial all-ones indicator vector $S$. These are secret-shared and split for the $N$ participants into $N$ first-order gradient vector shards $\{G\}_i$, second-order gradient vector shards $\{H\}_i$ and indicator vector shards $\{S\}_i$, distributed to each participant $i$, $i = 1, \dots, N$;

S2. Each participant $i$, having received $\{G\}_i$, $\{H\}_i$ and $\{S\}_i$, computes its $i$-th shard of the first-order gradient sum, $\{SG\}_i$, and of the second-order gradient sum, $\{SH\}_i$, then uses secret sharing to compute the $i$-th numerator shard and $i$-th denominator shard of the splitting gain for every feature and every value interval. The coordinator determines the maximum splitting gain, the corresponding feature and value interval, and whether to split. If a split occurs and the selected feature belongs to participant $i'$, that participant generates the post-split left-subtree indicator vector $SL$ and right-subtree indicator vector $SR$, which respectively mark the samples of the left and right subsets obtained by dividing the sample set according to the feature and value interval of the maximum splitting gain; the left and right subsets correspond to the left and right subtrees. $SL$ and $SR$ are split by secret sharing into $N$ shards $\{SL\}_i$ and $\{SR\}_i$, $i = 1, \dots, N$, and distributed to each participant $i$. Each participant $i$ uses the received $\{SL\}_i$ and $\{SR\}_i$ together with its own indicator vector shard $\{S\}_i$ to compute the left-subtree first-order gradient vector shard $\{SGL\}_i$ and second-order gradient vector shard $\{SHL\}_i$ after the sample set is divided into the left subtree, and the right-subtree first-order gradient vector shard $\{SGR\}_i$ and second-order gradient vector shard $\{SHR\}_i$ after the sample set is divided into the right subtree. Step S2 is then executed recursively with $\{SGL\}_i$, $\{SHL\}_i$, $\{SL\}_i$ to construct the left subtree and with $\{SGR\}_i$, $\{SHR\}_i$, $\{SR\}_i$ to construct the right subtree, setting the depth $d = d + 1$. If no split occurs or the maximum depth $d_{max}$ is reached, each participant $i$ computes the $i$-th shard $\{w_\sigma\}_i$ of the weight of the current leaf node $\sigma$ of the decision tree;

S3. For each data sample $x_p$, each participant $i$ uses its held partial-feature sample $x_p^{(i)}$ to compute the prediction $f_t(x_p)$ of the current $t$-th tree, which is accumulated onto the results of the first $t-1$ trees to produce the ensemble prediction of the $t$ trees for data sample $x_p$:

$$\hat{y}_p^{(t)} = \sum_{q=1}^{t} f_q(x_p)$$

where $f_q(x_p)$ denotes the prediction of the $q$-th tree for the $p$-th data sample $x_p$ and $\hat{y}_p^{(t)}$ denotes the $p$-th element of $\hat{y}^{(t)}$; for $M$ data samples in total, traversing $p = 1, \dots, M$ yields the complete $\hat{y}^{(t)}$;

S4. Increase the tree count $t = t + 1$ and iterate steps S1 to S3 until all $T$ decision trees have been built.
Further, the secret sharing algorithm used in steps S1, S2 and S3 is a method of splitting a piece of data $\theta$ into multiple shards $\{\theta\}_i$. Different participants $i$ carry out the same type of calculation, in the same steps, on their respective shards to produce $\{\theta'\}_i$; after the calculation finishes, the results are recombined by addition,

$$\theta' = \sum_{i=1}^{N} \{\theta'\}_i,$$

where $\theta'$ is equivalent to the result of executing the same calculation directly on $\theta$. The specific calculations involved are the following:
a. secret sharing splitting
For one-dimensional data $\theta$, when participant $i$ performs secret-sharing splitting among $N$ participants in total, it generates $N-1$ random numbers and assigns them as the shards $\{\theta\}_{i'}$, $i' \neq i$, for the other participants $i'$ to use; participant $i$ keeps as its own data shard $\{\theta\}_i = \theta - \sum_{i' \neq i} \{\theta\}_{i'}$.
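By way of illustration, this additive splitting and its recombination reduce to a few lines of Python (a minimal sketch under our own naming; a deployed system would share values in a finite ring rather than as floating-point numbers):

```python
import random

def share(theta: float, n: int) -> list[float]:
    """Split theta into n additive shards whose sum recovers theta."""
    shards = [random.uniform(-1e6, 1e6) for _ in range(n - 1)]
    shards.append(theta - sum(shards))  # the splitting party keeps this shard
    return shards

def reconstruct(shards: list[float]) -> float:
    """Recombine shards by plain addition."""
    return sum(shards)

shards = share(3.14, 4)
assert abs(reconstruct(shards) - 3.14) < 1e-9
```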
b. Secret sharing addition
For one-dimensional shard data $\{\theta_A\}_1, \dots, \{\theta_A\}_N$ and $\{\theta_B\}_1, \dots, \{\theta_B\}_N$, each participant $i$ holds $\{\theta_A\}_i$ and $\{\theta_B\}_i$ and can directly use ordinary addition to compute $\{\theta_A\}_i + \{\theta_B\}_i = \{\theta'\}_i$; for ease of description, ordinary addition is therefore written directly;
c. secret sharing subtraction
For one-dimensional shard data $\{\theta_A\}_1, \dots, \{\theta_A\}_N$ and $\{\theta_B\}_1, \dots, \{\theta_B\}_N$, each participant $i$ holds $\{\theta_A\}_i$ and $\{\theta_B\}_i$ and can directly use ordinary subtraction to compute $\{\theta_A\}_i - \{\theta_B\}_i = \{\theta'\}_i$; for ease of description, ordinary subtraction is therefore written directly;
d. secret sharing multiplication
For one-dimensional shard data $\{\theta_A\}_1, \dots, \{\theta_A\}_N$ and $\{\theta_B\}_1, \dots, \{\theta_B\}_N$, with each participant $i$ holding $\{\theta_A\}_i$ and $\{\theta_B\}_i$, the coordinator first generates one-dimensional variables $a$, $b$ and $c = a \times b$, splits them by secret sharing into $\{a\}_1, \dots, \{a\}_N$, $\{b\}_1, \dots, \{b\}_N$ and $\{c\}_1, \dots, \{c\}_N$, and sends them to each participant $i$. Each participant $i$ receives $\{a\}_i$, $\{b\}_i$, $\{c\}_i$, computes $\{e\}_i = \{\theta_A\}_i - \{a\}_i$ and $\{f\}_i = \{\theta_B\}_i - \{b\}_i$, and sends them to the first participant. The first participant computes

$$e = \sum_{i=1}^{N} \{e\}_i \quad\text{and}\quad f = \sum_{i=1}^{N} \{f\}_i$$

and sends them to all participants. The first participant then computes $\{\theta'\}_1$ and every other participant $i$ computes $\{\theta'\}_i$, so that the secret-sharing multiplication $\theta' = \theta_A \otimes \theta_B = \sum_{i=1}^{N} \{\theta'\}_i$ is expressed as:

$$\{\theta'\}_1 = e \cdot f + f \cdot \{a\}_1 + e \cdot \{b\}_1 + \{c\}_1$$
$$\{\theta'\}_i = f \cdot \{a\}_i + e \cdot \{b\}_i + \{c\}_i, \quad i = 2, \dots, N$$
for the above steps, the method can be popularized from one-dimensional data to multi-dimensional data.
Further, step S1 specifically comprises:

S1.1. The first participant sets the initial tree number $t = 1$, the initial depth $d = 1$, the regularization parameter $\lambda$ and the maximum depth $d_{max}$, generates $\{\lambda\}_i$ using secret-sharing splitting, and distributes all set parameters to every participant $i$. For each participant $i$ owning $num_i$ features, the coordinator counts the total feature number of the participants, $num_{feature} = \sum_{i=1}^{N} num_i$, generates an array with elements $[1, 2, \dots, num_{feature}]$, and randomly assigns $num_i$ out-of-order array elements to each participant $i$, with no overlap between the elements obtained by different participants. Each participant establishes a one-to-one mapping $map(j)$ from the out-of-order array element $j$ to its own feature number and records and stores it on its own side;

S1.2. All participants compute the maximum number of feature values $k_{selfmax}$ among their own sample features and send it to the coordinator; the coordinator determines the maximum number of feature values over all participants, $k_{max} = \max k_{selfmax}$, and broadcasts it to all participants;

S1.3. Starting from the first participant, which holds the labeled data, every participant uses the same loss function $l(\cdot,\cdot)$. The first participant uses the model prediction result vector $\hat{y}^{(t-1)}$ and the label value vector $y$ to compute the first-order gradient vector

$$G = \partial_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)})$$

and the second-order gradient vector

$$H = \partial^2_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)}),$$

together with the initial all-ones indicator vector $S$. The initial prediction $\hat{y}_p^{(t-1)}$ of each data sample $x_p$ is 0 when $t = 1$, and otherwise is the accumulated prediction weight of the existing $t-1$ trees, $\hat{y}_p^{(t-1)} = \sum_{q=1}^{t-1} f_q(x_p)$. $G$, $H$ and $S$ are secret-shared and split for the $N$ participants into $N$ first-order gradient vector shards $\{G\}_i$, second-order gradient vector shards $\{H\}_i$ and indicator vector shards $\{S\}_i$, $i = 1, \dots, N$, and distributed to participant $i$.
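As an illustration of S1.3, for the squared loss used in the embodiment below, the label holder's gradient and sharding computation might be sketched as follows (NumPy-based; names are ours):

```python
import numpy as np

def gradients(y: np.ndarray, y_pred: np.ndarray):
    """First/second-order gradients of squared loss l = (y - y_pred)^2."""
    G = 2.0 * (y_pred - y)          # dl/dy_pred
    H = 2.0 * np.ones_like(y)       # d2l/dy_pred^2
    return G, H

def share_vector(v: np.ndarray, n: int):
    """Additively split a vector into n shards."""
    shards = [np.random.uniform(-1e3, 1e3, v.shape) for _ in range(n - 1)]
    shards.append(v - sum(shards))
    return shards

y = np.array([0., 1., 1., 0.])
y_pred = np.zeros(4)                # t = 1: the initial prediction is 0
G, H = gradients(y, y_pred)
S = np.ones_like(y)                 # all-ones indicator vector
G_shards = share_vector(G, 3)       # {G}_1, {G}_2, {G}_3
assert np.allclose(sum(G_shards), G)
```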
Further, step S2 specifically comprises:

S2.1. Each participant $i$ receives the $i$-th first-order gradient vector shard $\{G\}_i$, the $i$-th second-order gradient vector shard $\{H\}_i$ and the $i$-th indicator shard $\{S\}_i$, then computes its $i$-th shard of the first-order gradient sum, $\{SG\}_i$, and of the second-order gradient sum, $\{SH\}_i$; $\{SG\}_i$ and $\{SH\}_i$ are obtained by summing the vector elements of the $\{G\}_i$ and $\{H\}_i$ owned by participant $i$, respectively;

S2.2. Each participant $i$ computes its shard of the non-split gain numerator, $\{gain_{up}\}_i$, and of the non-split gain denominator, $\{gain_{down}\}_i$, by:

$$\{gain_{up}\}_i = \{SG\}_i \otimes \{SG\}_i$$
$$\{gain_{down}\}_i = \{SH\}_i + \{\lambda\}_i$$

where $\otimes$ denotes the secret-sharing multiplication (yielding participant $i$'s shard of the product) and $\{\lambda\}_i$ is the $i$-th shard of the hyperparameter $\lambda$;

S2.3. Each participant $i$ uses $\{G\}_i$ and $\{H\}_i$ to compute the first-order gradient sum shard matrix $\{BG\}_i$ and the second-order gradient sum shard matrix $\{BH\}_i$ over all value intervals of all owned features;

S2.4. Each participant $i$ initializes the left-subtree splitting gain numerator shard matrix $\{leftgain_{up}\}_i$, the left-subtree splitting gain denominator shard matrix $\{leftgain_{down}\}_i$, the right-subtree splitting gain numerator shard matrix $\{rightgain_{up}\}_i$ and the right-subtree splitting gain denominator shard matrix $\{rightgain_{down}\}_i$;

S2.5. For feature $j$, each participant $i$ initializes the recorded left-subtree cumulative first-order gradient shard variable $\{g_l\}_i$, left-subtree cumulative second-order gradient shard variable $\{h_l\}_i$, right-subtree cumulative first-order gradient shard variable $\{g_r\}_i$ and right-subtree cumulative second-order gradient shard variable $\{h_r\}_i$, all to 0;

S2.6. Each participant $i$ traverses the value interval $k$ and updates

$$\{g_l\}_i = \{g_l\}_i + \{BG\}_i[j,k], \qquad \{h_l\}_i = \{h_l\}_i + \{BH\}_i[j,k]$$

where $\{BG\}_i[j,k]$ and $\{BH\}_i[j,k]$ denote the $[j,k]$-th elements of the shard matrices $\{BG\}_i$ and $\{BH\}_i$, and updates

$$\{g_r\}_i = \{SG\}_i - \{g_l\}_i, \qquad \{h_r\}_i = \{SH\}_i - \{h_l\}_i$$

For the XGBoost model, the splitting gain formula used is:

$$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

For each participant $i$, the splitting gain numerator and denominator shards of the left and right subtrees at the $k$-th value interval element position of the $j$-th feature are computed directly and written into the matrices:

$$\{leftgain_{up}\}_i[j,k] = \{g_l\}_i \otimes \{g_l\}_i$$
$$\{leftgain_{down}\}_i[j,k] = \{h_l\}_i + \{\lambda\}_i$$
$$\{rightgain_{up}\}_i[j,k] = \{g_r\}_i \otimes \{g_r\}_i$$
$$\{rightgain_{down}\}_i[j,k] = \{h_r\}_i + \{\lambda\}_i$$
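A schematic sketch of the per-feature scan of S2.5 and S2.6 on a single participant's shards; here ss_mul stands in for the interactive secret-sharing multiplication and returns this participant's shard of the product (all names are illustrative):

```python
import numpy as np

def scan_feature(j, BG_i, BH_i, SG_i, SH_i, lam_i, k_max, ss_mul):
    """Accumulate left/right gradient shards over the value intervals of
    feature j and fill this participant's gain shard rows."""
    gl = hl = 0.0
    left_up, left_down = np.zeros(k_max), np.zeros(k_max)
    right_up, right_down = np.zeros(k_max), np.zeros(k_max)
    for k in range(k_max):
        gl += BG_i[j, k]                 # {g_l}_i += {BG}_i[j,k]
        hl += BH_i[j, k]
        gr, hr = SG_i - gl, SH_i - hl    # right side by subtraction
        left_up[k] = ss_mul(gl, gl)      # shard of g_l^2 (interactive round)
        left_down[k] = hl + lam_i
        right_up[k] = ss_mul(gr, gr)
        right_down[k] = hr + lam_i
    return left_up, left_down, right_up, right_down
```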
S2.7. Each participant $i$ uses the left- and right-subtree splitting gain numerator and denominator shard matrices obtained in S2.6 to compute the splitting gain differences between different value intervals $k$ of different features $j$, and the coordinator's comparisons determine the selected feature $j_{best}$ and value interval $k_{best}$ corresponding to the maximum splitting gain;

S2.8. For the maximum-splitting-gain feature $j_{best}$ and value interval $k_{best}$, each participant $i$ uses the left- and right-subtree splitting gain numerator and denominator shards at that position together with the non-split numerator and denominator shards $\{gain_{up}\}_i$ and $\{gain_{down}\}_i$ to compute its total splitting gain denominator shard $\{denominator\}_i$, which is sent to the coordinator, and its total splitting gain numerator shard $\{nominator\}_i$, which is sent to the first participant. The coordinator computes the denominator $denominator = \sum_{i=1}^{N} \{denominator\}_i$ and determines its sign $sign_0$; the first participant computes the numerator $nominator = \sum_{i=1}^{N} \{nominator\}_i$ and determines its sign $sign_1$; the first participant and the coordinator then jointly determine from $sign_0$ and $sign_1$ the sign variable corresponding to the final maximum gain;

S2.9. When the sign variable is 1, for the feature $j_{best}$ owned by the $i'$-th participant, that participant sets up an $M$-dimensional vector $SL$ recording which samples fall into the left subtree after splitting on this feature. It takes out the $k_{best}$-th value interval $(left_{k_{best}}, right_{k_{best}}]$, and for every sample in the sample set whose feature $j_{best}$ value $value_{j_{best}}$ satisfies $left_{k_{best}} < value_{j_{best}} \le right_{k_{best}}$ it sets the corresponding position of $SL$ to 1 and all remaining positions to 0. It likewise sets up the $M$-dimensional vector recording which samples fall into the right subtree after the division, $SR = \neg SL$, the negation of $SL$. For the $N$ participants in total, $SL$ and $SR$ are split by secret sharing into $N$ shards $\{SL\}_i$ and $\{SR\}_i$ and distributed to all participants $i$, $i = 1, \dots, i', \dots, N$;

S2.10. Each participant $i$ receives $\{SL\}_i$ and $\{SR\}_i$ and recomputes its own left-subtree indicator vector shard $\{SL\}_i$ and right-subtree indicator vector shard $\{SR\}_i$:

$$\{SL\}_i = \{S\}_i \odot \{SL\}_i$$
$$\{SR\}_i = \{S\}_i \odot \{SR\}_i$$

where $\odot$ performs secret-sharing multiplication between the co-located elements of two vectors and yields a vector of the same dimension as $\{S\}_i$. It computes its own first-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{GL\}_i = \{G\}_i \odot \{SL\}_i, \qquad \{GR\}_i = \{G\}_i \odot \{SR\}_i$$

and its own second-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{HL\}_i = \{H\}_i \odot \{SL\}_i, \qquad \{HR\}_i = \{H\}_i \odot \{SR\}_i$$

S2.11. Each participant $i$ sets $\{GL\}_i$, $\{HL\}_i$ and $\{SL\}_i$ as the first-order gradient vector shard, second-order gradient vector shard and indicator vector shard used for constructing the left subtree, and sets $\{GR\}_i$, $\{HR\}_i$ and $\{SR\}_i$ as those used for constructing the right subtree;

S2.12. When the current tree depth $d$ reaches the set limit $d_{max}$, or when the sign variable is not 1, the weight value of the leaf node is computed and the construction of left and right subtrees is stopped for the current node;

S2.13. Set $d = d + 1$ and recursively execute steps S2.1 to S2.12 to complete the construction of the XGBoost decision tree.
Further, step S2.3 specifically comprises:

S2.3.1. All participants $i$ initialize the $num_{feature} \times k_{max}$-dimensional matrix $\{BG\}_i$ recording the per-interval first-order gradient sum shards, and the $num_{feature} \times k_{max}$-dimensional matrix $\{BH\}_i$ recording the per-interval second-order gradient sum shards;

S2.3.2. For feature $j$, $j = 1, 2, \dots, num_{feature}$: when the $i'$-th participant owns feature number $j$, it maps $j$ to its own feature $map(j)$ using the feature index of step S1.1, counts all the division values the feature takes, and records their number $k_j$;

S2.3.3. Participant $i'$ sets up a $k_{max} \times M$-dimensional matrix $Matrix_{index}$ recording which samples fall into each division of the feature, where $M$ is the number of samples. For the $j$-th feature, all the feature values $val_k$ are arranged from small to large, $k = 1, \dots, k_j$; set $left_k = val_{k-1}$ with $left_0 = -\infty$, and $right_k = val_k$. Traversing $k_j$, take the $k$-th value interval $(left_k, right_k]$, initialize an all-zero vector $S'$ of dimension $M \times 1$, and for every sample in the sample set whose feature $map(j)$ value $value_{map(j)}$ satisfies $left_k < value_{map(j)} \le right_k$, set the corresponding position of $S'$ to 1; record the $k$-th row vector $Matrix_{index}[k,:] = S'^T$, where $S'^T$ is the transpose of $S'$. After the division traversal finishes, for the $N$ participants in total, participant $i'$ splits $Matrix_{index}$ by secret sharing into $N$ shards $\{Matrix_{index}\}_i$ and distributes them to all participants $i$, $i = 1, \dots, i', \dots, N$;

S2.3.4. Participant $i$ receives $\{Matrix_{index}\}_i$, and for the $j$-th feature traverses $k$ up to the maximum value interval number $k_{max}$, computing the first-order gradient sum shard $\{BG\}_i[j,k]$ and the second-order gradient sum shard $\{BH\}_i[j,k]$:

$$\{BG\}_i[j,k] = sum(\{Matrix_{index}\}_i[k,:] \odot \{G\}_i)$$
$$\{BH\}_i[j,k] = sum(\{Matrix_{index}\}_i[k,:] \odot \{H\}_i)$$

where $[k,:]$ selects all elements of the $k$-th row of the matrix, and $sum(v)$ denotes the summation of the elements of vector $v$;

S2.3.5. Traverse $j$, executing S2.3.2 to S2.3.4, so that all participants $i$ complete the computation of $\{BG\}_i$ and $\{BH\}_i$.
Further, step S2.7 specifically comprises:

S2.7.1. Each participant $i$ sets the initial division index list vector currently taking part in the comparison, $col = [1, 2, \dots, k_{max}]$, records its length $R_{col}$, and initializes the per-feature division index list vector $col_{selected}$;

S2.7.2. In the XGBoost algorithm, for the feature positions $[j, col[2r]]$ and $[j, col[2r+1]]$, $r = 0, 1, \dots, \lfloor R_{col}/2 \rfloor - 1$ (indexing from 0), where $col[r]$ denotes the $r$-th element of the index list $col$, $[j, col[r]]$ denotes the $col[r]$-th element of the $j$-th row of a matrix, and $\lfloor R_{col}/2 \rfloor$ denotes $R_{col}/2$ rounded down, the splitting gain difference between the two feature positions is, traversing $r$:

$$\Delta gain = \frac{leftgain_{up}[j,col[2r]]}{leftgain_{down}[j,col[2r]]} - \frac{leftgain_{up}[j,col[2r+1]]}{leftgain_{down}[j,col[2r+1]]} + \frac{rightgain_{up}[j,col[2r]]}{rightgain_{down}[j,col[2r]]} - \frac{rightgain_{up}[j,col[2r+1]]}{rightgain_{down}[j,col[2r+1]]}$$

Let:

$$nominator1_{col} = leftgain_{up}[j,col[2r]] \cdot leftgain_{down}[j,col[2r+1]] - leftgain_{up}[j,col[2r+1]] \cdot leftgain_{down}[j,col[2r]]$$

$$nominator2_{col} = rightgain_{up}[j,col[2r]] \cdot rightgain_{down}[j,col[2r+1]] - rightgain_{up}[j,col[2r+1]] \cdot rightgain_{down}[j,col[2r]]$$

$$denominator1_{col} = leftgain_{down}[j,col[2r]] \cdot leftgain_{down}[j,col[2r+1]]$$

$$denominator2_{col} = rightgain_{down}[j,col[2r]] \cdot rightgain_{down}[j,col[2r+1]]$$

Then:

$$\Delta gain = \frac{nominator1_{col}}{denominator1_{col}} + \frac{nominator2_{col}}{denominator2_{col}}$$

For $r = 0, 1, \dots, \lfloor R_{col}/2 \rfloor - 1$, all participants $i$ use the above formulas on the left- and right-subtree splitting gain numerator and denominator shard matrices from S2.6 to compute the difference result shards of feature $j$ over all pairs of division positions:

$$\{nominator1_{col}\}_i = \{leftgain_{up}[j,col[2r]]\}_i \otimes \{leftgain_{down}[j,col[2r+1]]\}_i - \{leftgain_{up}[j,col[2r+1]]\}_i \otimes \{leftgain_{down}[j,col[2r]]\}_i$$

$$\{nominator2_{col}\}_i = \{rightgain_{up}[j,col[2r]]\}_i \otimes \{rightgain_{down}[j,col[2r+1]]\}_i - \{rightgain_{up}[j,col[2r+1]]\}_i \otimes \{rightgain_{down}[j,col[2r]]\}_i$$

$$\{denominator1_{col}\}_i = \{leftgain_{down}[j,col[2r]]\}_i \otimes \{leftgain_{down}[j,col[2r+1]]\}_i$$

$$\{denominator2_{col}\}_i = \{rightgain_{down}[j,col[2r]]\}_i \otimes \{rightgain_{down}[j,col[2r+1]]\}_i$$

S2.7.3. Each participant $i$ sends its pairwise-position shard result vectors $\{nominator1_{col}\}_i$, $\{nominator2_{col}\}_i$, $\{denominator1_{col}\}_i$ and $\{denominator2_{col}\}_i$ to the coordinator, which collects them and computes the vector

$$col\_shared\_value = \Big(\sum_i \{nominator1_{col}\}_i\Big) \backslash \Big(\sum_i \{denominator1_{col}\}_i\Big) + \Big(\sum_i \{nominator2_{col}\}_i\Big) \backslash \Big(\sum_i \{denominator2_{col}\}_i\Big)$$

where the operation $\sum_i \{v\}_i$ sums the co-located elements of all shards $\{v\}_i$ into one vector, and the operation $v_1 \backslash v_2$ on vectors $v_1$ and $v_2$ divides co-located elements, i.e. $(v_1 \backslash v_2)[r] = v_1[r] / v_2[r]$;

S2.7.4. Initialize an empty list $new\_col$ and judge the $r$-th element of $col\_shared\_value$ in turn: if it is non-negative, add $col[2r]$ to $new\_col$; otherwise add $col[2r+1]$ to $new\_col$. If the length of $col$ is odd, add the last element of $col$ to $new\_col$ after the traversal finishes. The coordinator then broadcasts $new\_col$ to all participants, and each participant sets $col = new\_col$;

S2.7.5. While the length of $col$ is greater than 1, iterate steps S2.7.2 to S2.7.4 until the length of $col$ becomes 1; take out the only element $col[0]$ of $col$ and record $col_{selected}[j] = col[0]$;

S2.7.6. Traverse all features $j$, iterating steps S2.7.1 to S2.7.5, to obtain the selected division position of every feature, combined into the complete per-feature division index list vector $col_{selected}$. Set the initial feature index list vector currently taking part in the comparison, $row = [1, 2, \dots, num_{feature}]$, and record its length $R_{row}$;

S2.7.7. In the XGBoost algorithm, for the feature positions $a = [row[2r], col_{selected}[row[2r]]]$ and $b = [row[2r+1], col_{selected}[row[2r+1]]]$, $r = 0, 1, \dots, \lfloor R_{row}/2 \rfloor - 1$, where $row[r]$ denotes the $r$-th element of the index list $row$ and $col_{selected}[row[r]]$ denotes the element of $col_{selected}$ at index position $row[r]$, the splitting gain difference between the two feature positions is, traversing $r$:

$$\Delta gain = \frac{leftgain_{up}[a]}{leftgain_{down}[a]} - \frac{leftgain_{up}[b]}{leftgain_{down}[b]} + \frac{rightgain_{up}[a]}{rightgain_{down}[a]} - \frac{rightgain_{up}[b]}{rightgain_{down}[b]}$$

Let:

$$nominator1_{row} = leftgain_{up}[a] \cdot leftgain_{down}[b] - leftgain_{up}[b] \cdot leftgain_{down}[a]$$
$$nominator2_{row} = rightgain_{up}[a] \cdot rightgain_{down}[b] - rightgain_{up}[b] \cdot rightgain_{down}[a]$$
$$denominator1_{row} = leftgain_{down}[a] \cdot leftgain_{down}[b]$$
$$denominator2_{row} = rightgain_{down}[a] \cdot rightgain_{down}[b]$$

Then:

$$\Delta gain = \frac{nominator1_{row}}{denominator1_{row}} + \frac{nominator2_{row}}{denominator2_{row}}$$

For $r = 0, 1, \dots, \lfloor R_{row}/2 \rfloor - 1$, all participants $i$ use the above formulas on the left- and right-subtree splitting gain numerator and denominator shard matrices from S2.6 to compute the difference result shards between the best division positions of all features $[1, 2, \dots, num_{feature}]$:

$$\{nominator1_{row}\}_i = \{leftgain_{up}[a]\}_i \otimes \{leftgain_{down}[b]\}_i - \{leftgain_{up}[b]\}_i \otimes \{leftgain_{down}[a]\}_i$$
$$\{nominator2_{row}\}_i = \{rightgain_{up}[a]\}_i \otimes \{rightgain_{down}[b]\}_i - \{rightgain_{up}[b]\}_i \otimes \{rightgain_{down}[a]\}_i$$
$$\{denominator1_{row}\}_i = \{leftgain_{down}[a]\}_i \otimes \{leftgain_{down}[b]\}_i$$
$$\{denominator2_{row}\}_i = \{rightgain_{down}[a]\}_i \otimes \{rightgain_{down}[b]\}_i$$

S2.7.8. Each participant $i$ sends its pairwise computation result vectors $\{nominator1_{row}\}_i$, $\{nominator2_{row}\}_i$, $\{denominator1_{row}\}_i$ and $\{denominator2_{row}\}_i$ to the coordinator, which collects them and computes the vector

$$row\_shared\_value = \Big(\sum_i \{nominator1_{row}\}_i\Big) \backslash \Big(\sum_i \{denominator1_{row}\}_i\Big) + \Big(\sum_i \{nominator2_{row}\}_i\Big) \backslash \Big(\sum_i \{denominator2_{row}\}_i\Big);$$

S2.7.9. Initialize an empty list $new\_row$ and traverse $row\_shared\_value$: if its $r$-th element is non-negative, add $row[2r]$ to $new\_row$; otherwise add $row[2r+1]$ to $new\_row$. If the length of $row$ is odd, add the last element of $row$ to $new\_row$ after the traversal finishes. The coordinator then broadcasts $new\_row$ to all participants, and each participant sets $row = new\_row$;

S2.7.10. While the length of $row$ is greater than 1, iterate steps S2.7.7 to S2.7.9 until the length becomes 1; take out the only element $row[0]$ of $row$, record $j_{best} = row[0]$, obtain $k_{best} = col_{selected}[j_{best}]$, and broadcast them to all participants, determining the selected best feature number $j_{best}$ and that feature's best division position $k_{best}$.
Further, step S2.8 specifically comprises:

S2.8.1. For the given maximum-splitting-gain feature $j_{best}$ and value interval $k_{best}$, the splitting gain expression in the XGBoost algorithm is:

$$Gain = \frac{leftgain_{up}[j_{best},k_{best}]}{leftgain_{down}[j_{best},k_{best}]} + \frac{rightgain_{up}[j_{best},k_{best}]}{rightgain_{down}[j_{best},k_{best}]} - \frac{gain_{up}}{gain_{down}}$$

Writing $lu$, $ld$, $ru$ and $rd$ for brevity for $leftgain_{up}$, $leftgain_{down}$, $rightgain_{up}$ and $rightgain_{down}$ at position $[j_{best},k_{best}]$, bringing the three terms over a common denominator gives $Gain = nominator/denominator$ with $nominator = lu \cdot rd \cdot gain_{down} + ru \cdot ld \cdot gain_{down} - gain_{up} \cdot ld \cdot rd$ and $denominator = ld \cdot rd \cdot gain_{down}$. Each participant $i$ therefore computes its own splitting gain numerator shard $\{nominator\}_i$,

$$\{nominator\}_i = \{lu\}_i \otimes \{rd\}_i \otimes \{gain_{down}\}_i + \{ru\}_i \otimes \{ld\}_i \otimes \{gain_{down}\}_i - \{gain_{up}\}_i \otimes \{ld\}_i \otimes \{rd\}_i$$

and its own splitting gain denominator shard $\{denominator\}_i$:

$$\{denominator\}_i = \{ld\}_i \otimes \{rd\}_i \otimes \{gain_{down}\}_i$$

S2.8.2. The remaining participants send their splitting gain numerator shards $\{nominator\}_2, \dots, \{nominator\}_N$ to the first participant, which collects them and computes

$$nominator = \sum_{i=1}^{N} \{nominator\}_i$$

The first participant sets its sign $sign_1$ by judging the sign of the numerator, letting:

$$sign_1 = \begin{cases} 1, & nominator > 0 \\ -1, & \text{otherwise} \end{cases}$$

S2.8.3. Each participant $i$ sends its splitting gain denominator shard $\{denominator\}_i$ to the coordinator, which collects $\{denominator\}_i$, $i = 1, \dots, N$, and computes

$$denominator = \sum_{i=1}^{N} \{denominator\}_i$$

The coordinator sets its sign $sign_0$ by judging the sign of the denominator, letting:

$$sign_0 = \begin{cases} 1, & denominator > 0 \\ -1, & \text{otherwise} \end{cases}$$

S2.8.4. The first participant sends $sign_1$ to the coordinator, which receives it, computes $sign = sign_1 \times sign_0$, and broadcasts $sign$ to all participants; all participants take the received value as the currently established sign variable. Since the gain is positive exactly when numerator and denominator share the same sign, $sign = 1$ corresponds to a positive maximum splitting gain;
further, the step S2.12 specifically includes:
s2.12.1, each participant i calculates half of the sum of its second order gradient fragment and the regularization term:
Figure BDA0002831883660000136
each participant i computes its own first-order gradient patch sum:
{g′}i={SG}i
s2.12.2, each participant i determines { h' }iOf order of magnitude muiSo that:
Figure BDA0002831883660000137
s2.12.3, all participating parties send corresponding magnitude digits, and the coordinator receives and selects the maximum magnitude digit as mumDetermining an iteration step size
Figure BDA0002831883660000141
The process parameter tau and the iteration number iter are sent to all the participants;
s2.12.4, setting random initial value for each participant i
Figure BDA0002831883660000142
And a variable with an initial value of 0
Figure BDA0002831883660000143
Starting with κ ═ 1, the iteration proceeds according to the following formula:
Figure BDA0002831883660000144
Figure BDA0002831883660000145
setting kappa as kappa +1 after each iteration, terminating when kappa as iter, and recording weight slicing after the computation of the participant i is finished
Figure BDA0002831883660000146
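A plaintext sketch of the S2.12.4 iteration; the secret-shared version runs the same recurrence on shards, with the $h' \cdot w$ product performed by the multiplication protocol (values and names are illustrative):

```python
def leaf_weight(SG: float, SH: float, lam: float, iters: int = 200) -> float:
    """Gradient descent on f(w) = h'*w^2 + g'*w, h' = (SH + lam)/2, g' = SG.
    Converges to w* = -SG / (SH + lam), the XGBoost leaf weight."""
    h_prime, g_prime = (SH + lam) / 2.0, SG
    tau = 1.0 / (4.0 * h_prime)       # step size satisfying 2*tau*h' < 1
    w = 0.5                            # arbitrary initial value
    for _ in range(iters):
        w -= tau * (2.0 * h_prime * w + g_prime)
    return w

w = leaf_weight(SG=-3.0, SH=5.0, lam=1.0)
assert abs(w - 0.5) < 1e-6             # -(-3) / (5 + 1) = 0.5
```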
Further, step S3 specifically comprises:

S3.1. For the $t$-th tree $tree_t$, each participant $i$, for data sample $x_p$, uses its held partial features $x_p^{(i)}$ to perform leaf node prediction according to its local tree model $tree_t^{(i)}$. At each tree node: if the node's division information $(j_{best}, k_{best})$ belongs to this participant's features, it descends into the left or right subtree according to the feature and its value and continues the prediction, setting the flag bits of all leaf nodes of the subtree not entered to 0; if the division information $(j_{best}, k_{best})$ does not belong to this participant, the prediction continues along both the left and right subtrees of the node until leaves are reached; the flag bits of the leaf nodes the sample is attributed to are set to 1. Finally each participant $i$ obtains the tree prediction flag bits of all leaf nodes $\sigma$, $\sigma = 1, 2, \dots, \Delta$, concatenated into the vector $index_i$; at the same time it concatenates the $\Delta$ leaf weight shards $\{w_\sigma\}_i$ in the same order into the result vector $\{v_w\}_i$;

S3.2. Each participant $i$ splits $index_i$ by secret sharing into $\{index_i\}_{i'}$ and sends the shards to all participants $i'$, $i' = 1, \dots, i, \dots, N$;

S3.3. Each participant $i'$ receives the flag vector shards $\{index_i\}_{i'}$ sent by the participants $i$ and computes the bitwise accumulated product of all vector shards, $\{index\}_{i'} = \{index_1\}_{i'} \odot \{index_2\}_{i'} \odot \dots \odot \{index_N\}_{i'}$, then computes the bitwise product of the flag vector shard with its own weight shard, $\{v_{result}\}_{i'} = \{index\}_{i'} \odot \{v_w\}_{i'}$;

S3.4. Each participant $i'$ sums the elements of $\{v_{result}\}_{i'}$ as $\{weight_p\}_{i'} = sum(\{v_{result}\}_{i'})$ and sends the result to the first participant, which receives them and computes $weight_p = \sum_{i'=1}^{N} \{weight_p\}_{i'}$ and then $\hat{y}_p^{(t)} = \hat{y}_p^{(t-1)} + weight_p$, which becomes the prediction of sample $x_p$ after round $t$ ends;

S3.5. Traverse all $p$ and compute the vector $\hat{y}^{(t)}$ formed by the $t$-th round predictions of all data samples $x_p$.
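A plaintext sketch of the S3.1 to S3.4 aggregation: each party's flag vector keeps a 1 for every leaf its own features cannot rule out, the element-wise product isolates the true leaf, and the dot product with the leaf weights yields the sample's score (in the protocol these products run on shards; the flag vectors below are the FIG. 4 example):

```python
import numpy as np

def predict_sample(index_vectors: list[np.ndarray], leaf_weights: np.ndarray) -> float:
    """index_vectors: one 0/1 flag vector per participant over the D leaves;
    leaf_weights: the D leaf weights (secret-shared in the real protocol)."""
    index = np.ones_like(leaf_weights)
    for idx in index_vectors:
        index = index * idx                 # bitwise product: {index} = ⊙ idx_i
    return float(index @ leaf_weights)      # sum(index ⊙ v_w) = weight_p

flags = [np.array([1, 1, 1, 0, 0]), np.array([0, 0, 1, 1, 1]),
         np.array([0, 1, 1, 0, 0])]
w = np.array([0.1, -0.2, 0.7, 0.3, -0.4])
assert abs(predict_sample(flags, w) - 0.7) < 1e-12   # only leaf 3 survives
```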
Compared with the prior art, the invention has the following beneficial effects:

In the invention, the participant holding the labels computes the first-order and second-order gradient vectors and the indicator vector from the current model's predictions and the label values; each participant constructs a decision tree model based on the XGBoost algorithm with the assistance of secret sharing and the coordinator; the participants jointly determine the predictions of the training data; and iteration completes the construction of multiple decision tree models, yielding a complete, lossless and secure multi-party prediction model. The method trains a multi-party XGBoost ensemble model across data sources while protecting data privacy, improving the model's predictive power while keeping the data secure.
Drawings
FIG. 1 is a schematic diagram of the interaction of participants and a coordinator in accordance with the present invention;
FIG. 2 is a schematic flow chart of a model training process of the present invention;
FIG. 3 is a communication flow diagram of the model training process of the present invention;
FIG. 4 is a diagram illustrating a multi-party tree model and its corresponding equivalent model according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example one
As shown in FIG. 1, an XGBoost prediction model training method for protecting multi-party data privacy involves multiple participants and a coordinator. The participant holding the labels first computes first-order and second-order gradient vectors and an indicator vector from the current model's predictions and the label values; the participants, assisted by the coordinator, jointly compute and construct a joint decision tree model based on the XGBoost algorithm through secret sharing; the participants cooperate to determine the predictions of the training data under the joint decision tree model; and finally all participants and the coordinator iterate together to build multiple joint decision trees, obtaining a complete multi-party prediction model.

In this embodiment, as shown in FIG. 4, the application scenario is that several different types of institutions each hold the same group of samples, but their data features do not overlap. By combining the data of different institutions, a more complete model can be trained by multiple parties. To simulate this effect, a parallel computing framework is used locally with 4 computing nodes, numbered 0, 1, 2 and 3, corresponding to 3 computing participants and 1 coordinator: node 0 is the coordinator, node 1 is the first participant, which holds the labels, and nodes 2 and 3 represent participants 2 and 3. The embodiment uses the Iris data set from the UCI Machine Learning Repository, selecting 100 data items of the two categories with class labels 0 and 1, covering the four features sepal length, sepal width, petal length and petal width; sepal length and petal length are assigned to the first participant, sepal width to participant 2 and petal width to participant 3. All participants treat 80% of the data samples as the training set and the remaining 20% as the test set. The specific flow is shown in FIG. 2 and FIG. 3.

S1. Set $t = 1$, generate the initial tree-building parameters and feature indexes, and compute and generate the gradient vector and indicator vector shards, specifically comprising:

S1.1. Setting the initial tree-building parameters and feature indexes:

The first participant sets the initial tree number $t = 1$, the initial depth $d = 1$, the regularization parameter $\lambda$ and the maximum depth $d_{max}$; in this embodiment $\lambda = 1$ and $d_{max} = 4$. For the 3 participants in total it computes $\{\lambda\}_i = 1/3$ (equivalent to secret-sharing splitting into three) and distributes them to all participants $i$. For each participant $i$ owning $num_i$ features, the coordinator counts the total feature number of the participants, $num_{feature} = \sum_{i=1}^{N} num_i$, generates an array with elements $[1, 2, \dots, num_{feature}]$, and randomly assigns $num_i$ out-of-order array elements to each participant $i$, with no overlap between participants. Each participant establishes a one-to-one mapping $map(j)$ from the out-of-order array element $j$ to its own feature number and records it on its own side. For example, the first participant, owning the first feature sepal length and the third feature petal length, accesses these two features locally through the numbers 0 and 1; if it is assigned the indexes 2 and 0, it establishes the mappings $0 = map(2)$ and $1 = map(0)$, so that when the feature index number 2 appears in a subsequent iteration, the first participant recognizes it as its own and converts it by the mapping into the corresponding local feature number 0 to access the feature;

S1.2. Determining the maximum number of feature values:

All participants compute the maximum number of feature values $k_{selfmax}$ among their own sample features and send it to the coordinator; the coordinator determines the maximum number of feature values over all participants, $k_{max} = \max k_{selfmax}$, and broadcasts it to all participants;

S1.3. Computing and generating the gradient vector and indicator vector shards:

Starting from the first participant, which holds the labeled data, every participant uses the same loss function $l(\cdot,\cdot)$; in the embodiment this is the squared loss function MSE, i.e.

$$l(y_p, \hat{y}_p) = (y_p - \hat{y}_p)^2$$

The first participant uses the model prediction result vector $\hat{y}^{(t-1)}$ and the label value vector $y$ to compute the first-order gradient vector

$$G = \partial_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)}) = 2(\hat{y}^{(t-1)} - y)$$

and the second-order gradient vector

$$H = \partial^2_{\hat{y}^{(t-1)}}\, l(y, \hat{y}^{(t-1)}) = 2 \cdot \mathbf{1},$$

together with the initial all-ones indicator vector $S$. The initial prediction $\hat{y}_p^{(t-1)}$ of each data sample $x_p$ is 0 when $t = 1$, and otherwise is the accumulated prediction weight of the existing $t-1$ trees, $\hat{y}_p^{(t-1)} = \sum_{q=1}^{t-1} f_q(x_p)$. $G$, $H$ and $S$ are secret-shared and split for the $N$ participants into first-order gradient vector shards $\{G\}_i$, second-order gradient vector shards $\{H\}_i$ and indicator vector shards $\{S\}_i$, $i = 1, \dots, N$, and distributed to participant $i$.

S2. The multiple parties construct the $t$-th decision tree based on the XGBoost algorithm, specifically comprising:

S2.1. Each participant receives the $i$-th first-order gradient vector shard $\{G\}_i$, the $i$-th second-order gradient vector shard $\{H\}_i$ and the $i$-th indicator shard $\{S\}_i$, and computes its own first-order and second-order gradient sum shards: the $i$-th shard of the first-order gradient sum, $\{SG\}_i$, and of the second-order gradient sum, $\{SH\}_i$, are obtained by summing the vector elements of the $\{G\}_i$ and $\{H\}_i$ owned by participant $i$, respectively;

S2.2. Each participant computes its non-split gain numerator shard and denominator shard:

For the XGBoost algorithm, at a given tree node, in terms of the first-order gradient sum $SG$, the second-order gradient sum $SH$ and the regularization term $\lambda$ over all data belonging to that node, the non-split gain is expressed as:

$$gain = \frac{SG^2}{SH + \lambda}$$

In the secret-sharing scenario its numerator shard $\{gain_{up}\}_i$ and denominator shard $\{gain_{down}\}_i$ must be computed separately:

$$\{gain_{up}\}_i = \{SG\}_i \otimes \{SG\}_i$$
$$\{gain_{down}\}_i = \{SH\}_i + \{\lambda\}_i$$

where $\otimes$ is the secret-sharing multiplication and $\{\lambda\}_i$ is the $i$-th shard of the hyperparameter $\lambda$;

S2.3. Each participant $i$ uses $\{G\}_i$ and $\{H\}_i$ to compute the first-order gradient sum shard matrix $\{BG\}_i$ and the second-order gradient sum shard matrix $\{BH\}_i$ over all value intervals of all owned features;

S2.4. Each participant $i$ initializes the left-subtree splitting gain numerator shard matrix $\{leftgain_{up}\}_i$, the left-subtree splitting gain denominator shard matrix $\{leftgain_{down}\}_i$, the right-subtree splitting gain numerator shard matrix $\{rightgain_{up}\}_i$ and the right-subtree splitting gain denominator shard matrix $\{rightgain_{down}\}_i$; in this embodiment these matrices must be explicitly initialized at every participant $i$ to avoid execution problems;

S2.5. For feature $j$, each participant $i$ initializes the recorded left-subtree cumulative first-order gradient shard variable $\{g_l\}_i$, left-subtree cumulative second-order gradient shard variable $\{h_l\}_i$, right-subtree cumulative first-order gradient shard variable $\{g_r\}_i$ and right-subtree cumulative second-order gradient shard variable $\{h_r\}_i$, all to 0; in this embodiment these variables likewise must be explicitly initialized at every participant $i$ to avoid execution problems;

S2.6. All participants $i$ traverse the value interval $k$, update $\{g_l\}_i$, $\{h_l\}_i$, $\{g_r\}_i$ and $\{h_r\}_i$, and update the left- and right-subtree splitting gain numerator and denominator shard matrices at the $k$-th value interval position of the $j$-th feature:

Each participant $i$ traverses the value interval $k$ and updates

$$\{g_l\}_i = \{g_l\}_i + \{BG\}_i[j,k], \qquad \{h_l\}_i = \{h_l\}_i + \{BH\}_i[j,k]$$

where $\{BG\}_i[j,k]$ and $\{BH\}_i[j,k]$ denote the $[j,k]$-th elements of the shard matrices $\{BG\}_i$ and $\{BH\}_i$, and updates

$$\{g_r\}_i = \{SG\}_i - \{g_l\}_i, \qquad \{h_r\}_i = \{SH\}_i - \{h_l\}_i$$

For the XGBoost model, the splitting gain formula used is:

$$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

For each participant $i$, the splitting gain numerator and denominator shards of the left and right subtrees at the $k$-th value interval element position of the $j$-th feature are computed directly and written into the matrices:

$$\{leftgain_{up}\}_i[j,k] = \{g_l\}_i \otimes \{g_l\}_i$$
$$\{leftgain_{down}\}_i[j,k] = \{h_l\}_i + \{\lambda\}_i$$
$$\{rightgain_{up}\}_i[j,k] = \{g_r\}_i \otimes \{g_r\}_i$$
$$\{rightgain_{down}\}_i[j,k] = \{h_r\}_i + \{\lambda\}_i$$

S2.7. Each participant $i$ uses the left- and right-subtree splitting gain numerator and denominator shard matrices obtained in S2.6 to compute the splitting gain differences between different value intervals $k$ of different features $j$, and the coordinator's comparisons determine the selected feature $j_{best}$ and value interval $k_{best}$ corresponding to the maximum splitting gain;

S2.8. The sign variable of the maximum gain is determined;

S2.9. For a feature and division position meeting the division criterion, the indicator vector shards produced by the division are determined:

When the sign variable is 1, for the feature $j_{best}$ owned by the $i'$-th participant, that participant sets up an $M$-dimensional vector $SL$ recording which samples fall into the left subtree after splitting on this feature. It takes out the $k_{best}$-th value interval $(left_{k_{best}}, right_{k_{best}}]$, and for every sample in the sample set whose feature $j_{best}$ value $value_{j_{best}}$ satisfies $left_{k_{best}} < value_{j_{best}} \le right_{k_{best}}$ it sets the corresponding position of $SL$ to 1 and all remaining positions to 0. It likewise sets up the $M$-dimensional vector recording which samples fall into the right subtree after the division, $SR = \neg SL$, the negation of $SL$. For the $N$ participants in total, $SL$ and $SR$ are split by secret sharing into $N$ shards $\{SL\}_i$ and $\{SR\}_i$ and distributed to all participants $i$, $i = 1, \dots, i', \dots, N$;

S2.10. Each participant updates its first-order and second-order gradient vector shards and indicator vector shards:

Each participant $i$ receives $\{SL\}_i$ and $\{SR\}_i$ and recomputes its own left-subtree indicator vector shard $\{SL\}_i$ and right-subtree indicator vector shard $\{SR\}_i$:

$$\{SL\}_i = \{S\}_i \odot \{SL\}_i$$
$$\{SR\}_i = \{S\}_i \odot \{SR\}_i$$

where $\odot$ performs secret-sharing multiplication between the co-located elements of two vectors and yields a vector of the same dimension as $\{S\}_i$. It computes its own first-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{GL\}_i = \{G\}_i \odot \{SL\}_i, \qquad \{GR\}_i = \{G\}_i \odot \{SR\}_i$$

and its own second-order gradient vector shards for the samples falling into the left and right subtrees,

$$\{HL\}_i = \{H\}_i \odot \{SL\}_i, \qquad \{HR\}_i = \{H\}_i \odot \{SR\}_i$$

S2.11. The variables of the left and right subtrees are constructed and specified:

Each participant $i$ sets $\{GL\}_i$, $\{HL\}_i$ and $\{SL\}_i$ as the first-order gradient vector shard, second-order gradient vector shard and indicator vector shard used for constructing the left subtree, and sets $\{GR\}_i$, $\{HR\}_i$ and $\{SR\}_i$ as those used for constructing the right subtree;

S2.12. When the current tree depth $d$ reaches the set limit $d_{max}$, or when the sign variable is not 1, the weight value of the leaf node is computed and the construction of left and right subtrees is stopped for the current node;

S2.13. The tree depth is increased and the decision tree is constructed recursively:

Set $d = d + 1$ and recursively execute S2.1 to S2.12 to complete the construction of the XGBoost joint decision tree.

S3. The prediction of each data sample by the $t$-th tree is generated and merged with the results of the previous $t-1$ trees, comprising:

S3.1. Local result prediction:

For the $t$-th tree $tree_t$, each participant $i$, for data sample $x_p$, uses its held partial features $x_p^{(i)}$ to perform leaf node prediction according to its local tree model $tree_t^{(i)}$. At each tree node: if the node's division information $(j_{best}, k_{best})$ belongs to this participant's features, it descends into the left or right subtree according to the feature and its value and continues the prediction, setting the flag bits of all leaf nodes of the subtree not entered to 0; if the division information $(j_{best}, k_{best})$ does not belong to this participant, the prediction continues along both the left and right subtrees of the node until leaves are reached; the flag bits of the leaf nodes the sample is attributed to are set to 1. Finally each participant $i$ obtains the tree prediction flag bits of all leaf nodes $\sigma$, $\sigma = 1, 2, \dots, \Delta$, concatenated into the vector $index_i$; at the same time it concatenates the $\Delta$ leaf weight shards $\{w_\sigma\}_i$ in the same order into the result vector $\{v_w\}_i$.
For example, as shown in FIG. 4, for one data sample the three participants can each determine their corresponding flag vectors 1 to 3 locally, and each participant holds a result vector shard $\{v_w\}_i$. The first participant holds the feature-division pairs $(j_1, k_1)$ and $(j_4, k_4)$, participant 2 holds the feature-division pair $(j_2, k_2)$, and participant 3 holds the feature-division pair $(j_3, k_3)$. The three decision trees are jointly equivalent to the decision tree containing the complete division information that would be obtained by training with the data stored on a single machine. The first to third participants each divide the sample according to the information known to them: when a node's division information is theirs they select the left or right subtree, and otherwise they search both subtrees; finally they respectively give the flag vectors $(1, 1, 1, 0, 0)$, $(0, 0, 1, 1, 1)$ and $(0, 1, 1, 0, 0)$ indicating the leaves to which this data sample may belong;

S3.2. Flag vector splitting and propagation:

Each participant $i$ splits $index_i$ by secret sharing into $\{index_i\}_{i'}$ and sends the shards to all participants $i'$, $i' = 1, \dots, i, \dots, N$;

S3.3. All participants compute their respective prediction result shards:

Each participant $i'$ receives the flag vector shards $\{index_i\}_{i'}$ sent by the participants $i$ and computes the bitwise accumulated product of all vector shards, $\{index\}_{i'} = \{index_1\}_{i'} \odot \{index_2\}_{i'} \odot \dots \odot \{index_N\}_{i'}$, then computes the bitwise product of the flag vector shard with its own weight shard, $\{v_{result}\}_{i'} = \{index\}_{i'} \odot \{v_w\}_{i'}$;

S3.4. Merging the prediction result shards:

Each participant $i'$ sums the elements of $\{v_{result}\}_{i'}$ as $\{weight_p\}_{i'} = sum(\{v_{result}\}_{i'})$ and sends the result to the first participant, which receives them and computes $weight_p = \sum_{i'=1}^{N} \{weight_p\}_{i'}$ and then $\hat{y}_p^{(t)} = \hat{y}_p^{(t-1)} + weight_p$, which becomes the prediction of sample $x_p$ after round $t$ ends;

S3.5. Computing the predictions of all samples:

Traverse all $p$ and compute the vector $\hat{y}^{(t)}$ formed by the $t$-th round predictions of all data samples $x_p$.

S4. The training rounds are increased iteratively and the construction of all the decision trees is completed:

Increase the tree count $t = t + 1$ and iterate steps S1 to S3 until all $T$ decision trees have been built.
Step S2.3 is carried out exactly as described in steps S2.3.1 to S2.3.5 above.
Step S2.7 specifically includes:
S2.7.1, each participant i sets the initial partition index list vector col = [1, 2, …, k_max] of the partitions currently taking part in the comparison, records its length as R_col, and initializes the per-feature selected partition index list vector col_selected;
S2.7.2, following the XGBoost algorithm, for the feature positions [j, col[r]] and [j, col[r+1]], where the elements of col are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_col/2⌋ − 1, col[r] denotes the r-th element of the index list col, [j, col[r]] denotes the col[r]-th element of the j-th row of a matrix, and ⌊R_col/2⌋ denotes R_col/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{col} = \left(\frac{leftgain_{up}[j,col[r]]}{leftgain_{down}[j,col[r]]} + \frac{rightgain_{up}[j,col[r]]}{rightgain_{down}[j,col[r]]}\right) - \left(\frac{leftgain_{up}[j,col[r+1]]}{leftgain_{down}[j,col[r+1]]} + \frac{rightgain_{up}[j,col[r+1]]}{rightgain_{down}[j,col[r+1]]}\right)$$
let:

nominator1_col = leftgain_up[j,col[r]] * leftgain_down[j,col[r+1]] − leftgain_up[j,col[r+1]] * leftgain_down[j,col[r]]
nominator2_col = rightgain_up[j,col[r]] * rightgain_down[j,col[r+1]] − rightgain_up[j,col[r+1]] * rightgain_down[j,col[r]]
denominator1_col = leftgain_down[j,col[r]] * leftgain_down[j,col[r+1]]
denominator2_col = rightgain_down[j,col[r]] * rightgain_down[j,col[r+1]]

then:

$$\Delta gain_{col} = \frac{nominator1_{col}}{denominator1_{col}} + \frac{nominator2_{col}}{denominator2_{col}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments of all partition positions of feature j:

{nominator1_col}_i = {leftgain_up}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]] − {leftgain_up}_i[j,col[r+1]] ⊙ {leftgain_down}_i[j,col[r]]
{nominator2_col}_i = {rightgain_up}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]] − {rightgain_up}_i[j,col[r+1]] ⊙ {rightgain_down}_i[j,col[r]]
{denominator1_col}_i = {leftgain_down}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]]
{denominator2_col}_i = {rightgain_down}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]]

where ⊙ denotes secret-sharing multiplication;
S2.7.3, each participant i sends its co-located calculation result fragment vectors {nominator1_col}_i, {nominator2_col}_i, {denominator1_col}_i and {denominator2_col}_i to the coordinator; the coordinator collects the vectors and calculates

$$col\_shared\_value = \frac{\sum_i \{nominator1_{col}\}_i}{\sum_i \{denominator1_{col}\}_i} + \frac{\sum_i \{nominator2_{col}\}_i}{\sum_i \{denominator2_{col}\}_i}$$

where the operation Σ_i{v}_i performed on the fragment vectors {v}_i sums the co-located elements of all fragments into one vector, and the division of two vectors v1 and v2, written v1\v2, is performed between co-located elements, i.e. (v1\v2)[r] = v1[r] / v2[r];
S2.7.4, initialize an empty list new_col and judge the r-th element of col_shared_value in turn, r = 1, …, ⌊R_col/2⌋: if it is non-negative, add col[2r−1] to new_col, otherwise add col[2r]; after the traversal ends, if the length of col is odd, add the last element of col to new_col; the coordinator then broadcasts new_col to all participants, and each participant sets col = new_col;
S2.7.5, while the length of col is greater than 1, iterate steps S2.7.2 to S2.7.4 until the length of col becomes 1, take out the only element col[0] in col, and record col_selected[j] = col[0];
S2.7.6, traverse all features j and iterate steps S2.7.1 to S2.7.5 to obtain the selected partition position of each feature, combining them into the complete feature partition index list vector col_selected; set the initial feature index list vector currently taking part in the comparison, row = [1, 2, …, num_feature], and record its length as R_row;
S2.7.7, following the XGBoost algorithm, for the feature positions [row[r], col_selected[row[r]]] and [row[r+1], col_selected[row[r+1]]], where the elements of row are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_row/2⌋ − 1, row[r] denotes the r-th element of the index list row, col_selected[row[r]] denotes the element of col_selected at index position row[r], and ⌊R_row/2⌋ denotes R_row/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{row} = \left(\frac{leftgain_{up}[row[r],col_{sel}[row[r]]]}{leftgain_{down}[row[r],col_{sel}[row[r]]]} + \frac{rightgain_{up}[row[r],col_{sel}[row[r]]]}{rightgain_{down}[row[r],col_{sel}[row[r]]]}\right) - \left(\frac{leftgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{leftgain_{down}[row[r+1],col_{sel}[row[r+1]]]} + \frac{rightgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{rightgain_{down}[row[r+1],col_{sel}[row[r+1]]]}\right)$$

where col_sel abbreviates col_selected;
let:

nominator1_row = leftgain_up[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]] − leftgain_up[row[r+1],col_selected[row[r+1]]] * leftgain_down[row[r],col_selected[row[r]]]
nominator2_row = rightgain_up[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]] − rightgain_up[row[r+1],col_selected[row[r+1]]] * rightgain_down[row[r],col_selected[row[r]]]
denominator1_row = leftgain_down[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]]
denominator2_row = rightgain_down[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]]

then:

$$\Delta gain_{row} = \frac{nominator1_{row}}{denominator1_{row}} + \frac{nominator2_{row}}{denominator2_{row}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments between the selected partition positions of all features [1, 2, …, num_feature]:

{nominator1_row}_i = {leftgain_up}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]] − {leftgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {leftgain_down}_i[row[r],col_selected[row[r]]]
{nominator2_row}_i = {rightgain_up}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]] − {rightgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {rightgain_down}_i[row[r],col_selected[row[r]]]
{denominator1_row}_i = {leftgain_down}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]]
{denominator2_row}_i = {rightgain_down}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]]
S2.7.8, each participant i sends its co-located calculation result fragment vectors {nominator1_row}_i, {nominator2_row}_i, {denominator1_row}_i and {denominator2_row}_i to the coordinator; the coordinator collects the vectors and calculates

$$row\_shared\_value = \frac{\sum_i \{nominator1_{row}\}_i}{\sum_i \{denominator1_{row}\}_i} + \frac{\sum_i \{nominator2_{row}\}_i}{\sum_i \{denominator2_{row}\}_i}$$
S2.7.9, initialize an empty list new_row and traverse row_shared_value, judging its r-th element in turn, r = 1, …, ⌊R_row/2⌋: if it is non-negative, add row[2r−1] to new_row, otherwise add row[2r]; after the traversal ends, if the length of row is odd, add the last element of row to new_row; the coordinator then broadcasts new_row to all participants, and each participant sets row = new_row;
S2.7.10, while the length of row is greater than 1, iterate steps S2.7.7 to S2.7.9 until the length of row becomes 1, take out the only element row[0] in row, record j_best = row[0] and k_best = col_selected[j_best], and broadcast them to all participants, thereby determining the selected optimal feature number j_best and the optimal partition position k_best of that feature.
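The pairwise elimination of steps S2.7.2 to S2.7.10 is essentially a tournament argmax in which the coordinator only ever learns sign information about gain differences, never the gains themselves. The following toy simulation (hypothetical names, 0-based indexing, plaintext gains kept only for checking, and the numerator/denominator fragments collapsed into a shared difference) shows the halving reduction for one list:

```python
import numpy as np

def split(x, n):
    """Additive secret sharing: n fragments that sum to x."""
    s = [np.random.randn(*np.shape(x)) for _ in range(n - 1)]
    return s + [np.asarray(x, float) - sum(s)]

rng = np.random.default_rng(1)
N = 3
gains = rng.uniform(0, 5, size=9)   # hypothetical split gains per partition

def coordinator_round(col):
    """One S2.7.2-to-S2.7.4 round: halve col by pairwise comparison."""
    new_col = []
    for r in range(len(col) // 2):
        a, b = col[2 * r], col[2 * r + 1]
        # participants send fragments of gain[a] - gain[b]; the coordinator
        # reconstructs only this difference, never the gains themselves
        diff_frags = split(gains[a] - gains[b], N)
        new_col.append(a if sum(diff_frags) >= 0 else b)
    if len(col) % 2 == 1:
        new_col.append(col[-1])     # odd leftover survives to the next round
    return new_col

col = list(range(len(gains)))
while len(col) > 1:
    col = coordinator_round(col)

print(col[0], int(np.argmax(gains)))  # tournament winner equals the argmax
```

The maximum-gain position wins every pair it appears in, so it survives each halving round; the same reduction is run first over partitions (col) and then over features (row).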
Step S2.8 specifically includes:
S2.8.1, for the given maximum splitting gain feature j_best and value interval k_best, the splitting gain expression in the XGBoost algorithm is:

$$Gain = \frac{leftgain_{up}[j_{best},k_{best}]}{leftgain_{down}[j_{best},k_{best}]} + \frac{rightgain_{up}[j_{best},k_{best}]}{rightgain_{down}[j_{best},k_{best}]} - \frac{gain_{up}}{gain_{down}}$$

each participant i calculates its own splitting gain numerator fragment {nominator}_i over the common denominator:

{nominator}_i = {leftgain_up}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i + {rightgain_up}_i[j_best,k_best] ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {gain_down}_i − {gain_up}_i ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best]

and calculates its own splitting gain denominator fragment {denominator}_i:

{denominator}_i = {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i
S2.8.2, the remaining participants send their splitting gain numerator fragments {nominator}_2, …, {nominator}_N to the first participant; the first participant collects {nominator}_2, …, {nominator}_N and calculates

$$nominator = \sum_{i=1}^{N} \{nominator\}_i$$

sets the first participant sign sign_1, and judges the sign by letting:

$$sign_1 = \begin{cases} 1, & nominator > 0 \\ -1, & nominator \le 0 \end{cases}$$
S2.8.3, each participant i sends its splitting gain denominator fragment {denominator}_i to the coordinator; the coordinator collects {denominator}_i, i = 1, …, N, and calculates

$$denominator = \sum_{i=1}^{N} \{denominator\}_i$$

sets the coordinator sign sign_0, and judges the sign by letting:

$$sign_0 = \begin{cases} 1, & denominator > 0 \\ -1, & denominator \le 0 \end{cases}$$
S2.8.4, the first participant sends sign_1 to the coordinator; the coordinator receives it, calculates sign = sign_1 * sign_0, and broadcasts sign to all participants, and all participants take the received value as the currently established symbol variable;
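Step S2.8 splits the sign decision so that the first participant learns only the sign of the gain numerator and the coordinator only the sign of the denominator; their product is the sign of the full gain, which is all the split decision needs. A minimal sketch under those assumptions (hypothetical names, toy totals):

```python
import numpy as np

def split(x, n):
    """Additive secret sharing of a scalar."""
    s = [np.random.randn() for _ in range(n - 1)]
    return s + [float(x) - sum(s)]

rng = np.random.default_rng(2)
N = 3
gain_num = rng.normal()          # hypothetical total splitting gain numerator
gain_den = abs(rng.normal())     # denominator (positive: sums of H + lambda)

nom_frags = split(gain_num, N)   # {nominator}_i, sent to the first participant
den_frags = split(gain_den, N)   # {denominator}_i, sent to the coordinator

# S2.8.2: the first participant reconstructs only the numerator's sign
sign1 = 1 if sum(nom_frags) > 0 else -1
# S2.8.3: the coordinator reconstructs only the denominator's sign
sign0 = 1 if sum(den_frags) > 0 else -1
# S2.8.4: the product is the sign of the full splitting gain
sign = sign1 * sign0
assert sign == (1 if gain_num / gain_den > 0 else -1)
print("split" if sign == 1 else "stop")
```

Neither party learns the gain's magnitude, and neither learns the other's value, yet together they decide whether the node is worth splitting.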
Step S2.12 specifically includes:
S2.12.1, each participant i calculates half of the sum of its second-order gradient sum fragment and regularization term fragment:

$$\{h'\}_i = \frac{\{SH\}_i + \{\lambda\}_i}{2}$$

each participant i also takes its own first-order gradient sum fragment:

{g'}_i = {SG}_i

S2.12.2, each participant i determines the order-of-magnitude digit μ_i of {h'}_i so that:

$$|\{h'\}_i| \le 10^{\mu_i}$$
S2.12.3, all participants send their order-of-magnitude digits; the coordinator receives them, selects the largest digit as μ_m, determines the iteration step size

$$\tau = 10^{-\mu_m}$$

and sends the step size τ and the iteration count iter to all participants;
S2.12.4, each participant i sets a random initial weight fragment {w^(0)}_i and an auxiliary fragment variable {z^(0)}_i with initial value 0; starting from κ = 1, the iteration proceeds according to:

$$\{z^{(\kappa)}\}_i = 2\{h'\}_i \odot \{w^{(\kappa-1)}\}_i + \{g'\}_i$$

$$\{w^{(\kappa)}\}_i = \{w^{(\kappa-1)}\}_i - \tau \{z^{(\kappa)}\}_i$$

setting κ = κ + 1 after each iteration and terminating when κ = iter; in this way the reconstructed weight w = Σ_i{w}_i approaches the leaf weight −SG/(SH + λ) without any participant ever performing the division on plaintext data; after the calculation ends, participant i records the weight fragment

$$\{w\}_i = \{w^{(iter)}\}_i$$

as the corresponding element of its leaf weight fragment vector {v_w}_i used in step S3.3;
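Under the reconstruction above, the leaf weight is obtained by jointly running gradient descent on f(w) = h'w² + g'w, whose minimizer is exactly the XGBoost leaf weight −SG/(SH + λ). The toy sketch below illustrates this under that assumption, with an idealized share multiplication and hypothetical names; it is not the patent's implementation.

```python
import numpy as np

def split(x, n):
    """Additive secret sharing: n fragments that sum to x."""
    s = [np.random.randn(*np.shape(x)) for _ in range(n - 1)]
    return s + [np.asarray(x, float) - sum(s)]

def mul_shares(xs, ys, n):
    """Idealized stand-in for Beaver-style share multiplication."""
    return split(sum(xs) * sum(ys), n)

rng = np.random.default_rng(3)
N, lam = 3, 1.0
SG, SH = -4.2, 7.5                      # leaf gradient sums (plaintext for checking)
g_frags = split(SG, N)                  # {g'}_i = {SG}_i
h_frags = [0.5 * (sh + l)               # {h'}_i = ({SH}_i + {lambda}_i) / 2
           for sh, l in zip(split(SH, N), split(lam, N))]

# S2.12.2/S2.12.3: step size from the largest fragment magnitude digit
mu_m = max(int(np.ceil(np.log10(abs(h) + 1e-12))) for h in h_frags)
tau, iters = 10.0 ** (-mu_m), 200

# S2.12.4: joint gradient descent on f(w) = h' w^2 + g' w
w_frags = split(rng.normal(), N)        # random initial weight fragments
for _ in range(iters):
    hw = mul_shares(h_frags, w_frags, N)            # fragments of h' * w
    z = [2 * hw[i] + g_frags[i] for i in range(N)]  # gradient fragments
    w_frags = [w_frags[i] - tau * z[i] for i in range(N)]

w = sum(w_frags)
print(w, -SG / (SH + lam))              # converges to the XGBoost leaf weight
```

With τ chosen from the largest fragment magnitude, the contraction factor |1 − 2τh'| stays below 1 for this toy data; the patent's exchange of magnitude digits plays the same role of bounding the step size without revealing h' itself.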
In addition, it should be noted that the specific embodiments described in this specification may differ in naming; the contents described above are only illustrations of the structure of the present invention. All equivalent or simple changes made according to the structure, characteristics and principles of the invention are included in the protection scope of the invention. Those skilled in the art may make various modifications or additions to the described embodiments or adopt similar methods without departing from the scope of the invention as defined in the appended claims.

Claims (10)

1. An XGboost prediction model training method for protecting multi-party data privacy, characterized by comprising a plurality of participants and a coordinator, wherein the participant holding labels first calculates first-order and second-order gradient vectors and an indication vector using the current model prediction results and the label values; the remaining participants, assisted by the coordinator, jointly compute and construct a joint decision tree model based on the XGboost algorithm through secret sharing; the participants cooperate with one another to determine the prediction results of the training data on the joint decision tree model; and finally all participants and the coordinator iterate together to complete the construction of a plurality of joint decision tree models, obtaining a complete multi-party prediction model.
2. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 1, wherein the specific steps of training the joint decision tree model are as follows:
S1, the participant holding labels serves as the first participant and sets the initial tree number, the initial depth, the regularization parameter and the maximum depth of the tree to be constructed; the regularization parameter is split through secret sharing and all set parameters are sent to all participants; random non-repeating feature number indexes are generated for each participant according to its number of features; the first participant holding labels calculates the first-order gradient vector and the second-order gradient vector from the current model prediction result vector and the sample label vector, generates an initial all-ones indication vector, splits each of them through secret sharing into the corresponding number of first-order gradient vector fragments, second-order gradient vector fragments and indication vector fragments, and distributes them to all participants;
S2, after each participant i receives its first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment, it calculates its fragment of the first-order gradient sum and its fragment of the second-order gradient sum, and uses secret sharing to calculate the numerator fragments and denominator fragments of the splitting gain of each group under each feature; the maximum splitting gain, the feature and group it belongs to, and whether to partition are determined with the help of the coordinator; when partitioning is performed, if the selected feature belongs to a specific participant, that participant generates the left sub-tree indication vector and the right sub-tree indication vector after partitioning, which respectively indicate the samples in the left and right subsets obtained by dividing the sample set according to the feature and group corresponding to the maximum splitting gain, the left and right subsets corresponding to the left sub-tree and the right sub-tree respectively; the left sub-tree indication vector and the right sub-tree indication vector are split through secret sharing into a plurality of fragments and distributed to the participants; each participant uses the received fragments and its own indication vector fragment to calculate the left sub-tree first-order and second-order gradient vector fragments after the sample set is divided into the left sub-tree, and the right sub-tree first-order and second-order gradient vector fragments after the sample set is divided into the right sub-tree; the left sub-tree is constructed recursively using the left sub-tree first-order gradient vector fragments, second-order gradient vector fragments and left sub-tree indication vector, and the right sub-tree is constructed recursively using the right sub-tree first-order gradient vector fragments, second-order gradient vector fragments and right sub-tree indication vector; a depth increase condition is set for the recursion, and if no partition is performed or the preset maximum depth is reached, each participant calculates its corresponding fragment of the weight of the current leaf node of the decision tree;
S3, for each data sample, each participant uses the partial features it holds to calculate the prediction result of the current joint decision tree model and accumulates it onto the results of the previously built tree models, generating the comprehensive prediction result of the multiple tree models for the data sample;
and S4, increasing the number of tree models and iterating steps S1 to S3 until the target number of joint decision tree models is constructed.
3. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the secret sharing algorithms in steps S1, S2 and S3 comprise secret sharing splitting, secret sharing addition, secret sharing subtraction and secret sharing multiplication.
4. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the step S1 specifically comprises:
S1.1, the first participant sets the initial tree number, the initial depth, the regularization parameter and the maximum depth of the tree to be constructed, splits the regularization parameter through secret sharing and distributes all set parameters to all participants; the coordinator counts the total number of features across the participants, generates an array with that number of elements, shuffles it, and randomly assigns to each participant a number of array elements equal to its feature count, the array elements obtained by different participants not overlapping; each participant establishes a one-to-one mapping map(j) from its shuffled array elements to its own feature numbers and records and stores the mapping locally;
S1.2, each participant calculates the maximum number of feature values among its own sample features and sends it to the coordinator; the coordinator determines the maximum number of feature values over all participants and broadcasts it to all participants;
S1.3, starting from the first participant holding the label data, the participants use the same loss function; the first participant calculates the first-order gradient vector, the second-order gradient vector and the initial all-ones indication vector from the model prediction result vector and the label value vector, with the initial prediction result of each piece of data set to 0, splits each of them through the secret sharing algorithm into multiple first-order gradient vector fragments, second-order gradient vector fragments and indication vector fragments, and distributes them to the corresponding participants.
5. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the step S2 specifically comprises:
S2.1, after each participant i receives its first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment, it calculates the i-th fragment {SG}_i of the first-order gradient sum and the i-th fragment {SH}_i of the second-order gradient sum, where {SG}_i and {SH}_i are the sums of the elements of the first-order gradient vector fragment and of the second-order gradient vector fragment owned by participant i, respectively;
S2.2, in the XGBoost algorithm, for a given tree node with first-order gradient sum SG and second-order gradient sum SH over all data of the node and regularization term λ, the non-splitting gain is expressed as:

$$gain = \frac{SG^2}{SH + \lambda}$$

each participant calculates its own non-splitting gain numerator fragment and non-splitting gain denominator fragment according to the non-splitting gain formula, specifically:

{gain_up}_i = {SG}_i ⊙ {SG}_i
{gain_down}_i = {SH}_i + {λ}_i

where {gain_up}_i is the non-splitting gain numerator fragment, {gain_down}_i is the non-splitting gain denominator fragment, ⊙ denotes secret-sharing multiplication, and {λ}_i is the i-th fragment of the hyperparameter λ;
S2.3, each participant i uses its first-order and second-order gradient vector fragments to calculate the first-order gradient sum fragment matrix {BG}_i and the second-order gradient sum fragment matrix {BH}_i over all value intervals of all its own features;
S2.4, each participant i initializes the left sub-tree splitting gain numerator fragment matrix {leftgain_up}_i, the left sub-tree splitting gain denominator fragment matrix {leftgain_down}_i, the right sub-tree splitting gain numerator fragment matrix {rightgain_up}_i and the right sub-tree splitting gain denominator fragment matrix {rightgain_down}_i;
S2.5, for the feature j, each participant initializes the left sub-tree accumulated first-order gradient fragment variable {gl}_i, the left sub-tree accumulated second-order gradient fragment variable {hl}_i, the right sub-tree accumulated first-order gradient fragment variable {gr}_i and the right sub-tree accumulated second-order gradient fragment variable {hr}_i, all to 0;
S2.6, each participant i traverses the value intervals k and updates the left sub-tree accumulated first-order and second-order gradient fragment variables as:

{gl}_i = {gl}_i + {BG}_i[j,k]
{hl}_i = {hl}_i + {BH}_i[j,k]

where {BG}_i[j,k] and {BH}_i[j,k] denote the [j,k]-th elements of the fragment matrices {BG}_i and {BH}_i respectively,

and updates the right sub-tree accumulated first-order and second-order gradient fragment variables as:

{gr}_i = {SG}_i − {gl}_i
{hr}_i = {SH}_i − {hl}_i
For the XGBoost model, the splitting gain calculation formula used is:

$$Gain = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}$$

where λ is the regularization parameter and G_L, H_L, G_R, H_R are the left- and right-subtree first-order and second-order gradient sums;

each participant directly calculates the left- and right-subtree splitting gain numerator and denominator fragments at the element positions of the corresponding value intervals of the corresponding features and updates them into the matrices:

{leftgain_up}_i[j,k] = {gl}_i ⊙ {gl}_i
{leftgain_down}_i[j,k] = {hl}_i + {λ}_i
{rightgain_up}_i[j,k] = {gr}_i ⊙ {gr}_i
{rightgain_down}_i[j,k] = {hr}_i + {λ}_i

where j is the feature number and k is the value interval number;
S2.7, each participant uses the left- and right-subtree splitting gain numerator and denominator fragment matrices to traverse and calculate the splitting gain differences between the value intervals of every feature j, and the selected feature j_best and value interval k_best corresponding to the maximum splitting gain are determined through comparison by the coordinator;
S2.8, for the maximum splitting gain feature and value interval, each participant uses the left- and right-subtree splitting gain numerator and denominator fragments at that position together with the non-splitting gain numerator and denominator fragments to calculate its total splitting gain numerator fragment, which is sent to the first participant, and its total splitting gain denominator fragment, which is sent to the coordinator; the coordinator sums the denominator fragments and determines their sign, the first participant sums the numerator fragments and determines their sign, and from these two signs the symbol variable corresponding to the final maximum gain is determined;
S2.9, when the symbol variable is 1, the participant owning the maximum splitting gain feature j_best sets an all-zero M-dimensional column vector SL recording which samples fall into the left sub-tree after partitioning by this feature, takes out the value interval (left_{k_best}, right_{k_best}] corresponding to the maximum splitting gain, and sets to 1 the elements of SL at the positions where the value value_{j_best} of sample feature j_best in the sample set satisfies left_{k_best} < value_{j_best} ≤ right_{k_best}; it then sets the M-dimensional vector SR recording which samples fall into the right sub-tree after the partition as SR = 1 − SL, i.e. SL negated; for the N total participants, the left sub-tree indication vector and the right sub-tree indication vector are split through secret sharing into N fragments and distributed to all participants;
S2.10, each participant receives the fragments of the left sub-tree indication vector and the right sub-tree indication vector and recalculates its own left sub-tree indication vector fragment {SL}_i and right sub-tree indication vector fragment {SR}_i:

{SL}_i = {S}_i ⊙ {SL}_i
{SR}_i = {S}_i ⊙ {SR}_i

where ⊙ performs secret-sharing multiplication between co-located elements and yields a vector of the same dimension as {S}_i; each participant calculates its own first-order gradient vector fragments of the samples falling into the left and right sub-trees, {GL}_i and {GR}_i:

{GL}_i = {G}_i ⊙ {SL}_i
{GR}_i = {G}_i ⊙ {SR}_i

and its own second-order gradient vector fragments of the samples falling into the left and right sub-trees, {HL}_i and {HR}_i:

{HL}_i = {H}_i ⊙ {SL}_i
{HR}_i = {H}_i ⊙ {SR}_i

S2.11, each participant i sets {GL}_i, {HL}_i and {SL}_i as the first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment used for constructing the left sub-tree, and sets {GR}_i, {HR}_i and {SR}_i as the first-order gradient vector fragment, second-order gradient vector fragment and indication vector fragment used for constructing the right sub-tree;
S2.12, when the current depth of the tree reaches the set limit or the symbol variable is not 1, the weight value of the leaf node is calculated and the construction of left and right sub-trees stops at the current node;
S2.13, setting a depth-increase condition for the recursion and recursively executing steps S2.1 to S2.12 to construct the XGboost joint decision tree model.
6. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.3 specifically comprises:
S2.3.1, each participant initializes the corresponding-dimension matrix {BG}_i recording the interval first-order gradient sum fragments and the corresponding-dimension matrix {BH}_i recording the interval second-order gradient sum fragments;
S2.3.2, for each owned feature number j, j = 1, 2, …, num_feature, the participant maps the feature number using the feature index of step S1.1, i.e. maps j to its own feature map(j), counts all the partition values owned by the feature and records the number of values;
S2.3.3, the participant sets a multi-dimensional matrix Matrix_index recording which samples fall into each feature partition; for the feature j, all feature values val_k are arranged from small to large, k = 1, …, k_j, and left_k = val_{k−1} with left_1 = −∞ and right_k = val_k are set; k is traversed, the k-th value interval (left_k, right_k] is taken out, an all-zero column vector S' is initialized, and the positions where the value value_map(j) of sample feature map(j) in the participant's sample set satisfies left_k < value_map(j) ≤ right_k are set to 1; Matrix_index[k, :] = S'^T is recorded, and after the partition traversal ends the matrix is split through secret sharing into N fragments {Matrix_index}_i and distributed to the respective participants;
S2.3.4, participant i receives {Matrix_index}_i and, for the j-th feature, traverses k up to the maximum value interval number, calculating the first-order gradient sum fragment and the second-order gradient sum fragment:

{BG}_i[j,k] = sum({Matrix_index}_i[k,:] ⊙ {G}_i)
{BH}_i[j,k] = sum({Matrix_index}_i[k,:] ⊙ {H}_i)

where [k, :] selects all elements of the k-th row of the matrix and sum(v) denotes the sum of the elements of vector v;
S2.3.5, traversing all features and executing steps S2.3.2 to S2.3.4 so that all participants complete the calculation of the first-order gradient sum fragments and the second-order gradient sum fragments.
7. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.7 specifically comprises:
S2.7.1, each participant sets the initial partition index list vector col = [1, 2, …, k_max] of the partitions currently taking part in the comparison, where k_max is the maximum number of feature partitions, records the length of col as R_col, and initializes the per-feature selected partition index list vector col_selected;
S2.7.2, following the XGBoost algorithm, for the feature positions [j, col[r]] and [j, col[r+1]], where the elements of col are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_col/2⌋ − 1, col[r] denotes the r-th element of the index list col, [j, col[r]] denotes the col[r]-th element of the j-th row of a matrix, and ⌊R_col/2⌋ denotes R_col/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{col} = \left(\frac{leftgain_{up}[j,col[r]]}{leftgain_{down}[j,col[r]]} + \frac{rightgain_{up}[j,col[r]]}{rightgain_{down}[j,col[r]]}\right) - \left(\frac{leftgain_{up}[j,col[r+1]]}{leftgain_{down}[j,col[r+1]]} + \frac{rightgain_{up}[j,col[r+1]]}{rightgain_{down}[j,col[r+1]]}\right)$$
let:

nominator1_col = leftgain_up[j,col[r]] * leftgain_down[j,col[r+1]] − leftgain_up[j,col[r+1]] * leftgain_down[j,col[r]]
nominator2_col = rightgain_up[j,col[r]] * rightgain_down[j,col[r+1]] − rightgain_up[j,col[r+1]] * rightgain_down[j,col[r]]
denominator1_col = leftgain_down[j,col[r]] * leftgain_down[j,col[r+1]]
denominator2_col = rightgain_down[j,col[r]] * rightgain_down[j,col[r+1]]

then:

$$\Delta gain_{col} = \frac{nominator1_{col}}{denominator1_{col}} + \frac{nominator2_{col}}{denominator2_{col}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments of all partition positions of feature j:

{nominator1_col}_i = {leftgain_up}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]] − {leftgain_up}_i[j,col[r+1]] ⊙ {leftgain_down}_i[j,col[r]]
{nominator2_col}_i = {rightgain_up}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]] − {rightgain_up}_i[j,col[r+1]] ⊙ {rightgain_down}_i[j,col[r]]
{denominator1_col}_i = {leftgain_down}_i[j,col[r]] ⊙ {leftgain_down}_i[j,col[r+1]]
{denominator2_col}_i = {rightgain_down}_i[j,col[r]] ⊙ {rightgain_down}_i[j,col[r+1]]
S2.7.3, each participant i sends its co-located calculation result fragment vectors {nominator1_col}_i, {nominator2_col}_i, {denominator1_col}_i and {denominator2_col}_i to the coordinator; the coordinator collects the vectors and calculates

$$col\_shared\_value = \frac{\sum_i \{nominator1_{col}\}_i}{\sum_i \{denominator1_{col}\}_i} + \frac{\sum_i \{nominator2_{col}\}_i}{\sum_i \{denominator2_{col}\}_i}$$

where the operation Σ_i{v}_i performed on the fragment vectors {v}_i sums the co-located elements of all fragments into one vector, and the division of two vectors v1 and v2, written v1\v2, is performed between co-located elements, i.e. (v1\v2)[r] = v1[r] / v2[r];
S2.7.4, initialize an empty list new_col and judge the r-th element of col_shared_value in turn, r = 1, …, ⌊R_col/2⌋: if it is non-negative, add col[2r−1] to new_col, otherwise add col[2r]; after the traversal ends, if the length of col is odd, add the last element of col to new_col; the coordinator then broadcasts new_col to all participants, and each participant sets col = new_col;
S2.7.5, while the length of col is greater than 1, iterating steps S2.7.2 to S2.7.4 until the length of col becomes 1, taking out the only element col[0] in col and recording col_selected[j] = col[0];
S2.7.6, traversing all features j and iterating steps S2.7.1 to S2.7.5 to obtain the selected partition position of each feature, combining them into the complete feature partition index list vector col_selected; the initial feature index list vector currently taking part in the comparison is set as row = [1, 2, …, num_feature] and its length is recorded as R_row, where num_feature is the number of all features;
S2.7.7, following the XGBoost algorithm, for the feature positions [row[r], col_selected[row[r]]] and [row[r+1], col_selected[row[r+1]]], where the elements of row are compared in adjacent pairs, r = 1, 3, 5, …, 2⌊R_row/2⌋ − 1, row[r] denotes the r-th element of the index list row, col_selected[row[r]] denotes the element of col_selected at index position row[r], and ⌊R_row/2⌋ denotes R_row/2 rounded down; r is traversed and the splitting gain difference between the two feature positions is calculated as:

$$\Delta gain_{row} = \left(\frac{leftgain_{up}[row[r],col_{sel}[row[r]]]}{leftgain_{down}[row[r],col_{sel}[row[r]]]} + \frac{rightgain_{up}[row[r],col_{sel}[row[r]]]}{rightgain_{down}[row[r],col_{sel}[row[r]]]}\right) - \left(\frac{leftgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{leftgain_{down}[row[r+1],col_{sel}[row[r+1]]]} + \frac{rightgain_{up}[row[r+1],col_{sel}[row[r+1]]]}{rightgain_{down}[row[r+1],col_{sel}[row[r+1]]]}\right)$$

where col_sel abbreviates col_selected;
let:

nominator1_row = leftgain_up[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]] − leftgain_up[row[r+1],col_selected[row[r+1]]] * leftgain_down[row[r],col_selected[row[r]]]
nominator2_row = rightgain_up[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]] − rightgain_up[row[r+1],col_selected[row[r+1]]] * rightgain_down[row[r],col_selected[row[r]]]
denominator1_row = leftgain_down[row[r],col_selected[row[r]]] * leftgain_down[row[r+1],col_selected[row[r+1]]]
denominator2_row = rightgain_down[row[r],col_selected[row[r]]] * rightgain_down[row[r+1],col_selected[row[r+1]]]

then:

$$\Delta gain_{row} = \frac{nominator1_{row}}{denominator1_{row}} + \frac{nominator2_{row}}{denominator2_{row}}$$
For each fragment index i, all participants i use the above formulas to calculate, from the left- and right-subtree splitting gain numerator and denominator fragment matrices of step S2.6, the difference result fragments between the selected partition positions of all features [1, 2, …, num_feature], specifically:

{nominator1_row}_i = {leftgain_up}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]] − {leftgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {leftgain_down}_i[row[r],col_selected[row[r]]]
{nominator2_row}_i = {rightgain_up}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]] − {rightgain_up}_i[row[r+1],col_selected[row[r+1]]] ⊙ {rightgain_down}_i[row[r],col_selected[row[r]]]
{denominator1_row}_i = {leftgain_down}_i[row[r],col_selected[row[r]]] ⊙ {leftgain_down}_i[row[r+1],col_selected[row[r+1]]]
{denominator2_row}_i = {rightgain_down}_i[row[r],col_selected[row[r]]] ⊙ {rightgain_down}_i[row[r+1],col_selected[row[r+1]]]
S2.7.8, each participant i sends its co-located calculation result fragment vectors {nominator1_row}_i, {nominator2_row}_i, {denominator1_row}_i and {denominator2_row}_i to the coordinator; the coordinator collects the vectors and calculates

$$row\_shared\_value = \frac{\sum_i \{nominator1_{row}\}_i}{\sum_i \{denominator1_{row}\}_i} + \frac{\sum_i \{nominator2_{row}\}_i}{\sum_i \{denominator2_{row}\}_i}$$
S2.7.9, initialize an empty list new_row and traverse row_shared_value, judging its r-th element in turn, r = 1, …, ⌊R_row/2⌋: if it is non-negative, add row[2r−1] to new_row, otherwise add row[2r]; after the traversal ends, if the length of row is odd, add the last element of row to new_row; the coordinator then broadcasts new_row to all participants, and each participant sets row = new_row;
S2.7.10, while the length of row is greater than 1, iterating steps S2.7.7 to S2.7.9 until the length of row becomes 1, taking out the only element row[0] in row, recording j_best = row[0] and k_best = col_selected[j_best], and broadcasting them to all participants, thereby determining the selected optimal feature number j_best and the optimal partition position k_best of that feature.
8. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.8 specifically comprises:
S2.8.1, the splitting gain at the maximum splitting gain feature and value interval is calculated, specifically:

$$Gain = \frac{leftgain_{up}[j_{best},k_{best}]}{leftgain_{down}[j_{best},k_{best}]} + \frac{rightgain_{up}[j_{best},k_{best}]}{rightgain_{down}[j_{best},k_{best}]} - \frac{gain_{up}}{gain_{down}}$$

each participant i calculates its own splitting gain numerator fragment over the common denominator:

{nominator}_i = {leftgain_up}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i + {rightgain_up}_i[j_best,k_best] ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {gain_down}_i − {gain_up}_i ⊙ {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best]

and its own splitting gain denominator fragment:

{denominator}_i = {leftgain_down}_i[j_best,k_best] ⊙ {rightgain_down}_i[j_best,k_best] ⊙ {gain_down}_i
S2.8.2, the remaining participants respectively send their splitting gain numerator fragments to the first participant; the first participant collects them, calculates the total numerator, sets the first participant sign and judges the sign by letting:

$$sign_1 = \begin{cases} 1, & nominator > 0 \\ -1, & nominator \le 0 \end{cases}$$

where sign_1 is the first participant sign;
S2.8.3, all participants send their splitting gain denominator fragments to the coordinator; the coordinator collects them, calculates the total denominator, sets the coordinator sign and judges the sign by letting:

$$sign_0 = \begin{cases} 1, & denominator > 0 \\ -1, & denominator \le 0 \end{cases}$$

where sign_0 is the coordinator sign;
S2.8.4, the first participant sends the first participant sign to the coordinator; the coordinator receives it, calculates the total sign sign = sign_1 * sign_0 and broadcasts it to all participants, and all participants take the received value as the currently established symbol variable.
9. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 5, wherein the step S2.12 specifically comprises:
S2.12.1, each participant calculates half of the sum of its second-order gradient sum fragment and regularization term fragment:

$$\{h'\}_i = \frac{\{SH\}_i + \{\lambda\}_i}{2}$$

where {SH}_i is the second-order gradient sum fragment and {λ}_i is the regularization term fragment;

each participant also takes its own first-order gradient sum fragment:

{g'}_i = {SG}_i

where {SG}_i is the first-order gradient sum fragment;

S2.12.2, each participant determines the order-of-magnitude digit μ_i of {h'}_i so that:

$$|\{h'\}_i| \le 10^{\mu_i}$$
S2.12.3, all participants send their order-of-magnitude digits; the coordinator receives them, selects the largest digit as μ_m, determines the iteration step size

$$\tau = 10^{-\mu_m}$$

and sends the step size τ and the iteration count iter to all participants;
S2.12.4, each participant i sets a random initial weight fragment {w^(0)}_i and an auxiliary fragment variable {z^(0)}_i with initial value 0; starting from κ = 1, the iteration proceeds according to:

$$\{z^{(\kappa)}\}_i = 2\{h'\}_i \odot \{w^{(\kappa-1)}\}_i + \{g'\}_i$$

$$\{w^{(\kappa)}\}_i = \{w^{(\kappa-1)}\}_i - \tau \{z^{(\kappa)}\}_i$$

setting κ = κ + 1 after each iteration and terminating when κ = iter; after the calculation ends, participant i records the weight fragment {w}_i = {w^(iter)}_i.
10. The XGboost prediction model training method for protecting multi-party data privacy as claimed in claim 2, wherein the step S3 specifically comprises:
S3.1, for a data sample, each participant uses the partial features it holds to predict leaf nodes according to its local tree model: for each tree node whose partition information it owns, prediction follows that partition information and the flag bits of all leaf nodes of the branch not entered are set to 0; if the partition information is not a feature owned by the participant, prediction proceeds along both the left and right sub-trees of the node until leaf nodes whose attribution its features can determine are found, and their flag bits are set to 1; finally each participant obtains the flag bits of all leaf nodes generated by predicting with the features it holds, splices them into a mark vector following the leaf-node order of the joint decision tree structure, and likewise splices the leaf weights into a result vector in the same order;
S3.2, each participant splits its mark vector through secret sharing and sends the fragments to all participants;
S3.3, each participant receives the mark vector fragments sent by the other participants, calculates the bitwise cumulative product of all the fragment vectors, and calculates the bitwise product of that result and its own weight fragment;
S3.4, each participant sums the elements of the bitwise product result and sends it to the first participant, which receives the fragments and calculates the prediction result;
S3.5, all data samples are traversed and the vector formed by combining the corresponding prediction results is calculated.
CN202011452494.9A 2020-12-12 2020-12-12 XGboost prediction model training method for protecting multi-party data privacy Active CN112700031B (en)
