CN117648992A - Data processing method and device for XGBoost federal learning model training
- Publication number: CN117648992A
- Application number: CN202311436932.6A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06N20/00 Machine learning
- G06F18/232 Pattern recognition; clustering techniques; non-hierarchical techniques
- Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a data processing method and device for XGBoost federal learning model training. The method is applied in a federal learning scenario: multi-party training sample data are obtained and an XGBoost tree model is trained federally on them; the sample data are cluster-compressed based on their data features to obtain a clustering matrix and a clustering index; a sparse matrix is constructed from the clustering matrix, and an aggregation matrix is computed from the clustering index; the sparse matrix is fragmented to obtain a first fragment matrix, and the aggregation matrix is fragmented to obtain a second fragment matrix; matrix multiplication based on data encryption is performed on the first and second fragment matrices, followed by summation of the feature cluster centers falling into the same bucket, to obtain gradient histogram data; and model training is performed according to the gradient histogram data to obtain the target XGBoost tree model. Cluster-compressing the features accelerates the computation on the sparse matrix, reducing the amount of data computed and transmitted during model training and improving efficiency.
Description
Technical Field
The application relates to the field of computers, in particular to a data processing method and device for XGBoost federal learning model training.
Background
With the development of artificial intelligence technology, the concept of "federal learning" was proposed to solve the problem of data silos. Federal learning is essentially a distributed machine learning framework that achieves data sharing and joint modeling while guaranteeing data privacy, security, and legal compliance. Its core idea is that, when multiple data sources participate in model training together, the model is trained jointly by exchanging only intermediate model parameters, without circulating the raw data, which never leaves its local environment. This approach balances data privacy protection against shared data analysis, i.e., a "data available but invisible" mode of data application.
The initiator and the participants in federal learning act as members of the federation: model parameters can be obtained through training without any party handing over its own data, so the problem of private-data leakage is avoided. Because the federal learning process requires large amounts of data for support, and the data are mostly distributed among different data holders, the model must be constructed jointly with those data holders.
XGBoost (Extreme Gradient Boosting) is an extreme gradient boosting tree model, an ensemble machine learning algorithm based on decision trees; owing to its strong predictive capability it is widely used in industry, for example in business scenarios such as advertisement recommendation and financial risk control.
The inventors found that, to guarantee data privacy during multi-party XGBoost model training under federal learning, secure multi-party multiplication is performed directly on the initiator's bucketed feature data matrix; this generates a large data computation cost, is inefficient, and is difficult to carry out on large data samples.
Therefore, how to reduce the data computation overhead of model training while guaranteeing data privacy during machine learning is a problem to be solved by those skilled in the art.
Disclosure of Invention
The main aim of the present application is to provide a data processing method and device for XGBoost federal learning model training, so as to solve the prior-art problem of reducing the data computation overhead of model training while guaranteeing data privacy, thereby reducing that overhead and improving model training efficiency.
To achieve the above object, in a first aspect of the present application, a data processing method for XGBoost federal learning model training is provided, which is applied to a data sharing scenario between at least one initiator and at least one participant, and the data processing method includes:
obtaining sample data to be trained, wherein the sample data to be trained is the sample data of the at least one initiator and the at least one participant;
preprocessing the sample data based on data feature extraction to obtain a first feature matrix and a second feature matrix, wherein the first feature matrix is a matrix for representing the characteristics of the sample data of the initiator, and the second feature matrix is a matrix for representing the corresponding characteristics of the sample data of the participant;
carrying out random pre-clustering compression on each feature in the second feature matrix to obtain a clustering matrix and a clustering index;
performing sparse matrix construction processing according to the clustering matrix to obtain a sparse matrix; performing pre-aggregation calculation on the first feature matrix according to the cluster index to obtain an aggregation matrix;
performing fragmentation processing on the sparse matrix to obtain a first fragmentation matrix, and performing fragmentation processing on the aggregation matrix to obtain a second fragmentation matrix;
performing matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data;
and performing model training on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model.
Further, performing random pre-cluster compression on each feature in the second feature matrix to obtain a cluster matrix and a cluster index, including:
randomly selecting a preset number of feature values from each feature in the second feature matrix as the clustering centers corresponding to the feature, and obtaining feature data to be clustered, wherein the feature data to be clustered are the feature values in the second feature matrix other than the clustering centers;
and carrying out clustering processing on the feature data to be clustered according to the clustering center and a preset clustering rule to obtain a clustering matrix and a clustering index corresponding to the clustering center.
Further, performing sparse matrix construction processing according to the clustering matrix to obtain a sparse matrix includes:
performing bucket division processing on the features in the clustering matrix according to the clustering centers and a preset number of buckets to obtain sub-sparse matrices;
and performing splicing treatment on the sub-sparse matrices to obtain the sparse matrix.
Further, performing a pre-aggregation calculation on the first feature matrix according to the cluster index to obtain an aggregation matrix includes:
performing identification processing on the first feature matrix to obtain a first-order gradient matrix and a second-order gradient matrix;
performing pre-aggregation treatment on the first-order gradient matrix according to the cluster index to obtain a first-order aggregation matrix;
performing pre-aggregation treatment on the second-order gradient matrix according to the cluster index to obtain a second-order aggregation matrix;
and carrying out combination optimization treatment on the first-order aggregation matrix and the second-order aggregation matrix to obtain the aggregation matrix.
Further, performing a combination optimization process on the first-order aggregation matrix and the second-order aggregation matrix to obtain the aggregation matrix, where the step of obtaining the aggregation matrix includes:
performing expansion processing on the first-order aggregation matrix to obtain a first aggregation matrix; performing expansion processing on the second-order aggregation matrix to obtain a second aggregation matrix;
performing gradient combination on the first aggregation matrix and the second aggregation matrix to obtain process aggregation matrix data;
and performing densification treatment on the process aggregation matrix data to obtain the aggregation matrix.
Further, performing matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data includes:
performing point multiplication processing based on the first fragment matrix and the second fragment matrix corresponding to each feature to obtain process gradient histogram data;
carrying out gradient histogram calculation processing based on the sum of the feature cluster centers of the same bucket on the process gradient histogram data to obtain first-order gradient histogram data and second-order gradient histogram data;
and performing splicing processing on the first-order gradient histogram data and the second-order gradient histogram data to obtain the gradient histogram data.
Further, performing model training on the XGBoost tree model according to the gradient histogram data, and obtaining a target XGBoost tree model includes:
performing optimal segmentation point calculation processing of the XGBoost tree model according to the gradient histogram data to obtain optimal segmentation point data, wherein the optimal segmentation point data is used for representing the optimal segmentation point of the XGBoost tree model;
and performing model training processing based on tree structure updating on the XGBoost tree model according to the optimal segmentation point data to obtain the target XGBoost tree model.
Further, after performing model training on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model, the data processing method includes:
obtaining sample data to be predicted, wherein the sample data to be predicted are data on which the at least one initiator and the at least one participant need to perform sample prediction;
and carrying out prediction processing on the sample data to be predicted according to the target XGBoost tree model to obtain prediction result data.
According to a second aspect of the present application, a data processing apparatus for XGBoost federal learning model training is provided, applied in a data sharing scenario between at least one initiator and at least one participant, the data processing apparatus comprising:
the training sample acquisition module is used for acquiring sample data to be trained, wherein the sample data to be trained is the sample data of the at least one initiator and the at least one participant;
the preprocessing module is used for preprocessing the sample data based on data feature extraction to obtain a first feature matrix and a second feature matrix, wherein the first feature matrix is a matrix for representing the characteristics of the sample data of the initiator, and the second feature matrix is a matrix for representing the characteristics of the sample data of the participant;
the pre-clustering compression module is used for carrying out random pre-clustering compression on each feature in the second feature matrix to obtain a clustering matrix and a clustering index;
the matrix calculation module is used for carrying out sparse matrix construction processing according to the clustering matrix to obtain a sparse matrix, and performing pre-aggregation calculation on the first feature matrix according to the cluster index to obtain an aggregation matrix;
the fragmentation module is used for carrying out fragmentation processing on the sparse matrix to obtain a first fragmentation matrix, and carrying out fragmentation processing on the aggregation matrix to obtain a second fragmentation matrix;
the gradient histogram calculation module is used for carrying out matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data;
and the model training module is used for carrying out model training on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model.
According to a third aspect of the present application, a computer readable storage medium is provided, the computer readable storage medium storing computer instructions for causing a computer to perform the above-described data processing method for XGBoost federal learning model training.
According to a fourth aspect of the present application, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the data processing method for XGBoost federal learning model training described above.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
in the present application, in a data sharing scenario between at least one initiator and at least one participant, sample data to be trained are obtained, the sample data to be trained being the sample data of the at least one initiator and the at least one participant; the sample data are preprocessed based on data feature extraction to obtain a first feature matrix and a second feature matrix, the first feature matrix representing the features of the initiator's sample data and the second feature matrix representing the corresponding features of the participant's sample data; random pre-clustering compression is performed on each feature in the second feature matrix to obtain a clustering matrix and a clustering index; sparse matrix construction is performed according to the clustering matrix to obtain a sparse matrix, and pre-aggregation calculation is performed on the first feature matrix according to the cluster index to obtain an aggregation matrix; the sparse matrix is fragmented to obtain a first fragment matrix, and the aggregation matrix is fragmented to obtain a second fragment matrix; matrix multiplication based on data encryption is performed on the first fragment matrix and the second fragment matrix to obtain gradient histogram data; and the XGBoost tree model is trained according to the gradient histogram data to obtain the target XGBoost tree model. In the gradient histogram calculation of federal learning model training, cluster-compressing the features accelerates the computation on the sparse matrix and reduces the amount of data computed and transmitted during training; with the privacy of federal learning data guaranteed, the data computation overhead of model training is reduced and training efficiency is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application; the other features, objects, and advantages of the application will become more apparent from them. The drawings of the illustrative embodiments of the present application and their descriptions serve to explain the present application and are not to be construed as unduly limiting it. In the drawings:
FIG. 1 shows a calculation method for the Host-side histogram fragment state matrix in an existing XGBoost model training process;
FIG. 2 is a flow chart of a data processing method for XGBoost federal learning model training provided herein;
FIG. 3 is a flow chart of a data processing method for XGBoost federal learning model training provided herein;
FIG. 4 is a flow chart of a data processing method for XGBoost federal learning model training provided herein;
FIG. 5 is a flow chart of a data processing method for XGBoost federal learning model training provided herein;
FIG. 6 is a flow chart of a data processing method for XGBoost federal learning model training provided herein;
FIGS. 7a and 7b are schematic flow diagrams of a data processing method for XGBoost federal learning model training according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a data processing device for XGBoost federal learning model training provided in the present application.
Detailed Description
To enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described below in detail with reference to the accompanying drawings. The described embodiments are clearly only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe the embodiments of the present application described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the present application, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "middle", "vertical", "horizontal", "lateral", "longitudinal" and the like indicate an azimuth or a positional relationship based on that shown in the drawings. These terms are used primarily to better describe the present application and its embodiments and are not intended to limit the indicated device, element or component to a particular orientation or to be constructed and operated in a particular orientation.
Also, some of the terms described above may be used to indicate other meanings in addition to orientation or positional relationships, for example, the term "upper" may also be used to indicate some sort of attachment or connection in some cases. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art as appropriate.
Furthermore, the terms "mounted," "configured," "provided," "connected," "coupled," and "sleeved" are to be construed broadly. For example, "connected" may be in a fixed connection, a removable connection, or a unitary construction; may be a mechanical connection, or an electrical connection; may be directly connected, or indirectly connected through intervening media, or may be in internal communication between two devices, elements, or components. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art as the case may be.
Federal learning is a distributed computing framework; for the parties to cooperate in jointly completing a computing task, the participants need to transmit intermediate results of the computation under the scheduling of a multi-party security protocol.
Secure multi-party computation (MPC) is one of the ways to achieve privacy protection in federal learning; compared with semi-homomorphic encryption and differential privacy, it has the computational advantages of high performance and high accuracy. On the premise of realizing the value of data while ensuring the data never leaves its owner, secure multi-party computation can create value in many fields, such as financial risk control, medical research, and advertising promotion. In the current environment where data security grows ever more important, secure multi-party computation will become one of the indispensable underlying technologies for data interaction.
The Guest party is the initiator (coordinator) and owner of the labels, while the Host party is a computation participant; the main data interaction parts of Guest and Host are shown in the red box. Both Guest and Host are data owners, and the data objects of the two parties carry features of different dimensions.
Therefore, in a federal learning system, a large amount of data communication is needed between the participants to exchange model update information. For example, for a data set of 400,000 samples with 600 features each, about 255 GB of traffic is generated during histogram calculation; FIG. 1 shows the calculation of a Host histogram fragment state matrix in an existing XGBoost model training process, for the construction of just one histogram. Data interaction of this magnitude makes efficient model training on large samples difficult, produces a large transmission volume during training, and incurs a large data computation cost. Taking the same example of 400,000 training samples with 600 features each, with each feature divided into 50 buckets, the whole process must execute 24,000,000,000 multiplication operations and 23,999,940,000 addition operations, so model training is time-consuming and inefficient.
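As a plausibility check, the operation counts quoted above can be reproduced under the assumption that each histogram entry is computed as a full inner product of the gradient vector with a one-hot bucket indicator over all samples (a minimal sketch; this per-bucket layout is an assumption for illustration, not taken from the patent):

```python
# Sanity check of the operation counts quoted above, assuming each of the
# 2 gradient histograms (g and h) is computed per (feature, bucket) pair
# as an inner product over all 400,000 samples.
n, f, b, grads = 400_000, 600, 50, 2
muls = n * f * b * grads          # 24,000,000,000 multiplications
adds = (n - 1) * f * b * grads    # 23,999,940,000 additions
print(muls, adds)
```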
In an alternative embodiment of the present application, a data processing method for XGBoost federal learning model training is provided. During the calculation of the Host histogram in model training, the features are cluster-compressed to accelerate the computation on the sparse matrix, reducing the amount of data computed and transmitted during training; with the privacy of federal learning data guaranteed, the data computation overhead of model training is reduced and training efficiency is improved. The data processing method for XGBoost federal learning model training is applied in a data sharing scenario between at least one initiator and at least one participant. FIG. 2 is a flowchart of the data processing method for XGBoost federal learning model training provided in the present application; as shown in FIG. 2, the method includes the following steps:
S101: acquiring sample data to be trained;
the sample data to be trained are the sample data of at least one initiator and at least one participant;
The XGBoost federal learning model is applied in a data sharing scenario between an initiator and participants, the sample data being the respective sample data of the initiator and the participants. For example, when applied to prediction scenarios such as patient claim settlement and insurance risk control, the insurance data owned by the insurance company (the Guest party) and the patient data owned by the hospital (the Host party) are used, and the model trained on the intersection of the two parties' user data is applied in the patient claim-settlement and insurance risk-control scenarios. When applied to scenarios such as financial risk control, a federal learning XGBoost tree model is constructed from the user data held by each bank and applied in the financial risk-control scenario. When applied to advertisement recommendation, the user data of advertisers, advertisement delivery platforms, and the like are obtained to construct the federal learning XGBoost tree model, which is then applied in the advertisement delivery scenario to make predictions there.
S102: preprocessing sample data based on data feature extraction to obtain a first feature matrix and a second feature matrix;
The first feature matrix is a matrix representing the features of the initiator's sample data, and the second feature matrix is a matrix representing the corresponding features of the participant's sample data. The intersection feature data sets of the initiator and the participant are obtained respectively, giving the first feature matrix corresponding to the initiator and the second feature matrix corresponding to the participant. In an alternative embodiment, the first feature matrix includes a first-order gradient feature matrix and a second-order gradient feature matrix, which this embodiment does not limit.
S103: carrying out random pre-clustering compression processing on each feature in the second feature matrix to obtain a clustering matrix and a clustering index;
the feature may be either a continuous feature or a discrete feature, and the present embodiment is not limited to this, and the continuous feature is exemplified in this embodiment.
In another alternative embodiment of the present application, a data processing method for XGBoost federal learning model training is provided, and fig. 3 is a data processing method for XGBoost federal learning model training provided in the present application, as shown in fig. 3, and the method includes the following steps:
S201: randomly selecting a preset number of feature values from each feature in the second feature matrix to serve as the clustering centers corresponding to that feature, and obtaining feature data to be clustered;
A plurality of feature values are randomly selected within the corresponding feature, or random numbers are generated, to serve as the feature's clustering centers; the feature data to be clustered are the feature values in the second feature matrix other than the clustering centers.
S202: and carrying out clustering processing on the feature data to be clustered according to the clustering center and a preset clustering rule to obtain a clustering matrix and a clustering index corresponding to the clustering center.
The remaining feature values of the second feature matrix are clustered onto the clustering centers based on a preset clustering rule, giving a clustering matrix and a clustering index corresponding to each clustering center. This is performed for every feature: several feature values are randomly selected from each feature as the clustering centers of that feature. If there are 100 features, random clustering-center selection and pre-clustering are performed for all 100 features, the operation on each feature being an independent event.
In one specific example, k cluster centers are randomly selected in turn for each of the f features of the participant (Host), and the sample feature values are gathered onto the nearest cluster centers. For example, the first feature contains [10.5, 12.34, 2.66, 9.5, ..., 10.13], and the randomly selected cluster centers include 9.5, 11, 2.66, etc. The clustering result can be expressed as {9.5: [3, ...], 11: [0, 1, ...], ..., 2.66: [...]}; the cluster index sent to the Guest, i.e., the content actually transmitted for this feature, is {0: [3, ...], 1: [0, 1, ...], ..., k-1: [2, ...]}. That is, the actual data are hidden and only the positional order is retained.
Because the pre-clustering centers are randomly selected, and the number of random cluster centers is far greater than the number of buckets, i.e., random data serve as the pre-clustering centers of each feature, the index sent to the Guest does not expose the feature distribution of the cluster centers: even holding the cluster index, one cannot infer the distribution of the data features, which retain complete randomness. The transmitted index content has no actual meaning to the Guest side, which learns neither the meaning of the features nor the actual binning result of the clustering, so data security is effectively guaranteed.
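A minimal plaintext sketch of this random pre-clustering step follows. The nearest-center assignment rule and all function and variable names are assumptions for illustration; the patent only requires some preset clustering rule:

```python
import numpy as np

def random_precluster(feature_col, k, rng):
    # randomly pick k values of this feature as cluster centers
    centers = rng.choice(feature_col, size=k, replace=False)
    # assumed preset clustering rule: assign every sample to the
    # nearest center by absolute distance
    assign = np.abs(feature_col[:, None] - centers[None, :]).argmin(axis=1)
    # cluster index keyed by center position only, hiding the center values
    index = {c: np.flatnonzero(assign == c).tolist() for c in range(k)}
    return centers, index

rng = np.random.default_rng(42)              # seed synchronized by the Guest
col = np.array([10.5, 12.34, 2.66, 9.5, 10.13])
centers, index = random_precluster(col, k=3, rng=rng)
# only the positional index {0: [...], 1: [...], 2: [...]} leaves the Host
```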
S104: performing sparse matrix construction processing according to the clustering matrix to obtain a sparse matrix; performing pre-aggregation calculation on the first feature matrix according to the cluster index to obtain an aggregation matrix;
As an alternative implementation of this embodiment, another alternative embodiment of the present application provides a data processing method for XGBoost federal learning model training. FIG. 4 is a flowchart of the data processing method for XGBoost federal learning model training provided in the present application; as shown in FIG. 4, the method includes the following steps:
S301: performing bucket division processing on the features in the clustering matrix according to the clustering centers and the preset number of buckets to obtain sub-sparse matrices.
S302: splicing the sub-sparse matrices to obtain a sparse matrix.
All sub-sparse matrices are spliced to obtain the sparse matrix. Optionally, the Host side constructs the sparse matrix based on the clustered feature content, to represent the bucketing of the cluster centers, i.e., the cluster-center data of different features fall into different buckets; e.g., in the specific example above, 9.5, 11, and 2.66 are placed into buckets. The final representation is a 0/1 sparse matrix hist whose shape for a single feature is (k, b); after the results of all Host features are combined, the shape is (k, f*b).
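The construction of this 0/1 matrix can be sketched as follows; the precomputed bucket boundaries and all names are illustrative assumptions:

```python
import numpy as np

def build_sparse_hist(centers_per_feature, edges_per_feature, b):
    # one 0/1 block of shape (k, b) per feature: row i marks the bucket
    # that the i-th cluster center of this feature falls into
    blocks = []
    for centers, edges in zip(centers_per_feature, edges_per_feature):
        k = len(centers)
        bucket = np.clip(np.searchsorted(edges, centers), 0, b - 1)
        block = np.zeros((k, b))
        block[np.arange(k), bucket] = 1.0
        blocks.append(block)
    # splicing all sub-sparse matrices: shape (k, f*b)
    return np.concatenate(blocks, axis=1)

edges = [np.array([5.0, 10.0])]              # b-1 boundaries for b = 3 buckets
hist = build_sparse_hist([np.array([9.5, 11.0, 2.66])], edges, b=3)
# rows: 9.5 -> bucket 1, 11 -> bucket 2, 2.66 -> bucket 0
```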
As an alternative implementation of this embodiment, FIG. 5 is a flowchart of a data processing method for XGBoost federal learning model training provided in the present application; as shown in FIG. 5, the method includes the following steps:
S401: performing identification processing on the first feature matrix to obtain a first-order gradient matrix and a second-order gradient matrix;
S402: performing pre-aggregation treatment on the first-order gradient matrix according to the cluster index to obtain a first-order aggregation matrix;
S403: performing pre-aggregation treatment on the second-order gradient matrix according to the cluster index to obtain a second-order aggregation matrix;
Optionally, according to the cluster index sent by the Host, the Guest pre-computes the cluster aggregation of the first-order gradient feature matrix (first-order gradient g) and the second-order gradient feature matrix (second-order gradient h). From the index {0: [3, ...], 1: [0, 1, ...], ..., k-1: [2, ...]}, a first-order aggregation matrix [g3+..., g0+g1+..., ..., g2+...] (first-order gradient aggregation result clu_g) and a second-order aggregation matrix [h3+..., h0+h1+..., ..., h2+...] (second-order gradient aggregation result clu_h) are generated. All features of the Host side thus yield a first-order gradient result and a second-order gradient result of shape (f, k), where f is the number of features and k the number of cluster centers. For convenience in the subsequent computation, the result is transposed to shape (k, f).
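A plaintext sketch of this Guest-side pre-aggregation; the dict-based cluster index and all names are assumptions carried over from the earlier sketch:

```python
import numpy as np

def pre_aggregate(g, h, cluster_indexes):
    # cluster_indexes: one positional index per Host feature,
    # e.g. {0: [3, ...], 1: [0, 1, ...], ..., k-1: [2, ...]}
    f = len(cluster_indexes)
    k = len(cluster_indexes[0])
    clu_g = np.zeros((f, k))
    clu_h = np.zeros((f, k))
    for j, index in enumerate(cluster_indexes):
        for c, sample_ids in index.items():
            clu_g[j, c] = g[sample_ids].sum()   # e.g. g0+g1+... for center 1
            clu_h[j, c] = h[sample_ids].sum()
    # transpose for the subsequent computation: shape (k, f)
    return clu_g.T, clu_h.T
```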
S404: and carrying out combination optimization treatment on the first-order aggregation matrix and the second-order aggregation matrix to obtain an aggregation matrix.
As an alternative implementation manner of this embodiment, there is provided a data processing method for XGBoost federal learning model training, including:
performing expansion processing on the first-order aggregation matrix to obtain a first aggregation matrix, and performing expansion processing on the second-order aggregation matrix to obtain a second aggregation matrix. Optionally, the Guest copies the first-order aggregation matrix (first-order gradient aggregation result clu_g) and the second-order aggregation matrix (second-order gradient aggregation result clu_h) b times each, expanding clu_g and clu_h to shape (k, f*b).
performing gradient combination on the first aggregation matrix and the second aggregation matrix to obtain process aggregation matrix data, and performing densification treatment on the process aggregation matrix data to obtain the aggregation matrix. Through gradient packing, the original first-order gradient and second-order gradient are combined into a single column before the histogram is computed: the first-order gradient corresponding to the first aggregation matrix and the second-order gradient corresponding to the second aggregation matrix are merged into one number, densification is applied, and each number is stored in a fixed number of bits, calculated as follows.
First the maximum possible ranges after summing the first-order gradient g and the second-order gradient h are calculated:
g_imax = n_i * (g_max + g_off) * 2^r
h_imax = n_i * h_max * 2^r
The bit widths are then derived from these maxima:
b_g = BitLength(g_imax)
b_h = BitLength(h_imax)
In the above formulas, n_i is the number of sample instances, and g_max and h_max denote the maximum first-order and second-order gradient values, respectively. Because the first-order gradient takes values in [-1, 1] and the second-order gradient in [0, 1], the first-order gradient needs an offset g_off to convert negative numbers into positive ones; this value is computed as abs(min(g)). The gradient range factor r may be set to 53, the number of mantissa bits actually used in a float.
The g and h of a single sample can thus be spliced into one object gh, and the final bit widths estimated as follows. The range of the first-order gradient g is [-1, 1], and the range of the second-order gradient h is [0, 1]. At 64-bit length, if lossless floating-point precision is required, the fractional part is represented with 2^53. Assuming 100 million training samples, the bit width of one value is estimated as:
g_imax = 100000000 * (1 + 1) * 2^53 = 1801439850948198400000000
h_imax = 100000000 * 1 * 2^53 = 900719925474099200000000
b_g = BitLength(g_imax) = 81, i.e., the first-order gradient requires 81 bits;
b_h = BitLength(h_imax) = 80, i.e., the second-order gradient requires 80 bits;
so the combined sum is contained in a value range of 161 bits.
To splice the first-order gradient g and the second-order gradient h, gh is computed as gh = g * 2^r + h, where r is the expansion factor.
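The bit-width bookkeeping and the packing itself can be sketched as follows. Pure-Python integers are used, the function names are illustrative, and the shift amount here uses b_h where the text writes the generic expansion factor r:

```python
import math

def packing_layout(n_samples, g_max=1, g_off=1, h_max=1, r=53):
    # maximum possible sums after fixed-point scaling by 2**r
    g_imax = n_samples * (g_max + g_off) * 2 ** r
    h_imax = n_samples * h_max * 2 ** r
    b_g = math.ceil(math.log2(g_imax))   # 81 bits for 1e8 samples
    b_h = math.ceil(math.log2(h_imax))   # 80 bits for 1e8 samples
    return b_g, b_h

def pack(g_fix, h_fix, shift):
    # splice the fixed-point gradients into one integer: gh = g * 2**shift + h
    # g_fix is assumed non-negative after the g_off offset
    return g_fix * (1 << shift) + h_fix

def unpack(gh, shift):
    return gh >> shift, gh & ((1 << shift) - 1)

b_g, b_h = packing_layout(100_000_000)   # (81, 80): 161 bits combined
gh = pack(g_fix=12345, h_fix=678, shift=b_h)
assert unpack(gh, b_h) == (12345, 678)
```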
In the embodiment of the application, in multi-party secure computation scenarios that must maintain high-precision computation, for example where the original fragment random numbers are 128 bits long, this reduces the traffic and the number of communication rounds, thereby improving model training efficiency and model performance.
S105: fragmenting the sparse matrix to obtain a first fragment matrix, and fragmenting the aggregation matrix to obtain a second fragment matrix;
Optionally, the sparse matrix is fragmented to obtain the first fragment matrix, which is secret-shared, and the aggregation matrix is fragmented to obtain the second fragment matrix, which is likewise secret-shared.
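A minimal sketch of additive 2-out-of-2 secret sharing for this fragmentation step. The small ring modulus and integer-encoded inputs are illustrative assumptions; production MPC stacks use 64-bit or 128-bit rings and Beaver triples for the subsequent multiplications:

```python
import numpy as np

MOD = 2 ** 31          # illustrative ring size; real systems use 2**64 or more

def fragment(mat, rng):
    # additive 2-out-of-2 sharing over Z_MOD: share1 is uniform random,
    # share2 = mat - share1 (mod MOD); either share alone reveals nothing
    share1 = rng.integers(0, MOD, size=mat.shape, dtype=np.int64)
    share2 = np.mod(mat - share1, MOD)
    return share1, share2

rng = np.random.default_rng(0)
mat = np.arange(6, dtype=np.int64).reshape(2, 3)   # integer-encoded matrix
s1, s2 = fragment(mat, rng)
assert np.array_equal(np.mod(s1 + s2, MOD), mat)
```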
S106: performing matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data;
As an alternative implementation of this embodiment, FIG. 6 is a flowchart of a data processing method for XGBoost federal learning model training provided in the present application; as shown in FIG. 6, the method includes the following steps:
S501: performing point multiplication processing based on the first fragment matrix and the second fragment matrix corresponding to each feature to obtain process gradient histogram data;
S502: carrying out gradient histogram calculation processing based on the sum of the feature cluster centers of the same bucket on the process gradient histogram data to obtain first-order gradient histogram data and second-order gradient histogram data;
S503: performing splicing processing on the first-order gradient histogram data and the second-order gradient histogram data to obtain the gradient histogram data.
Matrix multiplication based on data encryption is performed on the first fragment matrix and the second fragment matrix of each feature, and the feature cluster centers in the same bucket are summed, yielding the gradient histogram data.
Optionally, the clu_g and clu_h fragment contents of the Guest (the second fragment matrix) and the hist fragment contents of the Host (the first fragment matrix) undergo MPC dot multiplication; the result has shape (k, f*b).
MPC summation of the same-bucket cluster contents is then performed on the dot-product result of each feature, i.e., summation over the first dimension of the (k, f*b) result, giving a first-order gradient histogram and a second-order gradient histogram of shape (1, f*b), where b is the number of buckets. Finally, the first-order and second-order gradient histograms are spliced along the first dimension to obtain the multi-party secure computation result, i.e., the final Host-side histogram of shape (2, f*b), which is synchronized to the initiator.
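A plaintext equivalent of steps S501 to S503 follows. In the protocol both operands are secret-shared fragments and the products and sums run under MPC; the expansion here repeats each feature's columns b times so they line up with that feature's bucket block, which is an assumption about the layout:

```python
import numpy as np

def host_histogram(clu_g, clu_h, hist01, b):
    # expand (k, f) -> (k, f*b): each feature's column is repeated b times
    # so it lines up with that feature's b bucket columns in hist01
    g_exp = np.repeat(clu_g, b, axis=1)
    h_exp = np.repeat(clu_h, b, axis=1)
    # point multiplication, then summation of same-bucket cluster centers
    g_hist = (g_exp * hist01).sum(axis=0)        # shape (f*b,)
    h_hist = (h_exp * hist01).sum(axis=0)
    # splice first- and second-order histograms: shape (2, f*b)
    return np.stack([g_hist, h_hist])
```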
S107: and performing model training on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model.
In another optional embodiment of the present application, a data processing method for XGBoost federal learning model training is provided, in which model training is performed on the XGBoost tree model according to the gradient histogram data to obtain the target XGBoost tree model, comprising the following steps:
performing optimal segmentation point calculation processing on the XGBoost tree model according to the gradient histogram data to obtain optimal segmentation point data, wherein the optimal segmentation point data represent the optimal segmentation point of the XGBoost tree model; and performing model training processing based on tree structure updating on the XGBoost tree model according to the optimal segmentation point data to obtain the target XGBoost tree model.
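For concreteness, the optimal segmentation point can be read off the synchronized gradient histograms with the standard XGBoost gain formula. The patent only names this step, so the following is a generic sketch; lam and gamma are the usual regularization parameters:

```python
def best_split(g_hist, h_hist, lam=1.0, gamma=0.0):
    # scan the bucket boundaries of one feature's histograms and return
    # the boundary with the highest standard XGBoost split gain
    G, H = sum(g_hist), sum(h_hist)
    best_gain, best_bucket = 0.0, None
    g_l = h_l = 0.0
    for i in range(len(g_hist) - 1):
        g_l += g_hist[i]
        h_l += h_hist[i]
        g_r, h_r = G - g_l, H - h_l
        gain = 0.5 * (g_l ** 2 / (h_l + lam)
                      + g_r ** 2 / (h_r + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_bucket = gain, i
    return best_bucket, best_gain
```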
In one example, FIGS. 7a and 7b are schematic flow diagrams of the data processing method for XGBoost federal learning model training provided in this embodiment. As shown in FIG. 7a, the Guest initiator obtains the Guest-side feature data set according to the sample uid, receives the basic feature information of the Host side, generates a random seed and synchronizes it to the Host, and initializes the predicted value p to 0. It then judges whether the number of constructed trees has reached the specified count; if not, it randomly samples training samples and training features, computes the first-order gradient and the second-order gradient, and judges whether the tree-construction stop condition is reached. If not, the Guest initializes its feature histogram hist, computes the bucket boundary values of the Guest-side feature data, and computes the Guest local histogram g_hist; at the same time it receives the feature cluster index and the fragment <hist1> from the Host, computes the aggregated first-order gradient clu_g and second-order gradient clu_h of the different features according to the feature cluster index, slices clu_g into (<clu_g1>, <clu_g2>) and clu_h into (<clu_h1>, <clu_h2>), and sends <clu_g2> and <clu_h2> to the Host. MPC matrix point multiplication of <clu_g1> and <clu_h1> with <hist1> yields <sum_g1> and <sum_h1> of shape (k, f*b), where k is the number of cluster centers, f the number of features, and b the number of buckets. MPC summation of the same-bucket clusters for each feature of <sum_g1> and <sum_h1> produces new <sum_g1> and <sum_h1> of shape (1, f*b). The Guest exchanges fragments with the Host, receives <sum_g2> and <sum_h2>, reconstructs and splices the first-order and second-order results into the Host-side histogram of shape (2, f*b), and thereby holds the complete histogram contents of both Guest and Host. It then computes the optimal split point of the node to be split, assigns values to nodes that reach the split-stop condition, sends the splitting information to the Host, updates the tree structure, and sends the next level of node information to the Host; finally, the new tree is used to predict the original data and update the value of p.
Correspondingly, as shown in FIG. 7b, the Host obtains the Host-side feature data set according to the sample uid, sends its basic feature information to the Guest, and receives the Guest's random seed. It judges whether the number of constructed trees has reached the specified count; if not, it randomly samples training samples and training features and judges whether the tree-construction stop condition is reached. If not, the Host randomly selects cluster centers for each feature, clusters the sample feature values onto the cluster centers of the corresponding features, and sends the feature cluster index to the Guest. It initializes the Host feature histogram hist of shape (k, f*b), where k is the number of cluster centers, f the number of features, and b the number of buckets, slices it into (<hist1>, <hist2>), and sends <hist1> to the Guest. It then receives <clu_g2> and <clu_h2> from the Guest, performs MPC matrix point multiplication with <hist2> to obtain <sum_g2> and <sum_h2>, performs MPC summation of the same-bucket clusters for each feature, and exchanges the resulting fragments with the Guest. Finally, it receives the node-splitting information sent by the Guest and updates the tree structure with the next level of node information.
In another optional embodiment of the present application, a data processing method for XGBoost federal learning model training is provided, in which, after model training is performed on the XGBoost tree model according to the gradient histogram data to obtain the target XGBoost tree model, the method further includes: obtaining sample data to be predicted, wherein the sample data to be predicted are data on which the at least one initiator and the at least one participant need to perform sample prediction; and performing prediction processing on the sample data to be predicted according to the target XGBoost tree model to obtain prediction result data.
For example, the target XGBoost tree model is used as a patient claim-settlement prediction model: sample data to be predicted are obtained and predicted with the target XGBoost tree model to produce a patient claim-settlement prediction result. Likewise, the target XGBoost tree model used as a loan-default prediction model yields a loan-default prediction result, and used as an advertisement-promotion prediction model yields an advertisement click-through-rate prediction result.
In another alternative embodiment of the present application, a data processing apparatus for XGBoost federal learning model training is provided, where the data processing apparatus is applied in a data sharing scenario between at least one initiator and at least one participant, and fig. 8 is a schematic diagram of a data processing apparatus for XGBoost federal learning model training provided in the present application, as shown in fig. 8, where the apparatus includes:
a training sample acquiring module 81, configured to acquire sample data to be trained, where the sample data to be trained is sample data of at least one initiator and at least one participant;
the preprocessing module 82 is configured to perform preprocessing based on data feature extraction on the sample data to obtain a first feature matrix and a second feature matrix, where the first feature matrix is a matrix for representing features of the sample data of the initiator, and the second feature matrix is a matrix for representing corresponding features of the sample data of the participant;
the pre-clustering compression module 83 is configured to perform random pre-clustering compression on each feature in the second feature matrix to obtain a cluster matrix and a cluster index;
the matrix calculation module 84 is configured to perform sparse matrix construction according to the clustering matrix to obtain a sparse matrix; performing pre-aggregation calculation on the first feature matrix according to the cluster index to obtain an aggregation matrix;
the fragmentation module 85 is configured to perform fragmentation processing on the sparse matrix to obtain a first fragment matrix, and perform fragmentation processing on the aggregation matrix to obtain a second fragment matrix;
the gradient histogram calculation module 86 is configured to perform matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data;
the model training module 87 is configured to perform model training on the XGBoost tree model according to the gradient histogram data, so as to obtain a target XGBoost tree model.
The specific manner in which the operations of the units in the above embodiments are performed has been described in detail in the embodiments related to the method, and will not be described in detail here.
In summary, in the present application, in a data sharing scenario between at least one initiator and at least one participant, sample data to be trained are obtained, the sample data to be trained being the sample data of the at least one initiator and the at least one participant; the sample data are preprocessed based on data feature extraction to obtain a first feature matrix and a second feature matrix, the first feature matrix representing the features of the initiator's sample data and the second feature matrix representing the corresponding features of the participant's sample data; random pre-clustering compression is performed on each feature in the second feature matrix to obtain a clustering matrix and a clustering index; sparse matrix construction is performed according to the clustering matrix to obtain a sparse matrix, and pre-aggregation calculation is performed on the first feature matrix according to the cluster index to obtain an aggregation matrix; the sparse matrix is fragmented to obtain a first fragment matrix, and the aggregation matrix is fragmented to obtain a second fragment matrix; matrix multiplication based on data encryption is performed on the first fragment matrix and the second fragment matrix to obtain gradient histogram data; and the XGBoost tree model is trained according to the gradient histogram data to obtain the target XGBoost tree model. In the gradient histogram calculation of federal learning model training, cluster-compressing the features accelerates the computation on the sparse matrix and reduces the amount of data computed and transmitted during training; with the privacy of federal learning data guaranteed, the data computation overhead of model training is reduced and training efficiency is improved.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
It will be apparent to those skilled in the art that the elements or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
Claims (10)
1. A data processing method for XGBoost federal learning model training, applied in a data sharing scenario between at least one initiator and at least one participant, the data processing method comprising:
obtaining sample data to be trained, wherein the sample data to be trained is the sample data of the at least one initiator and the at least one participant;
preprocessing the sample data based on data feature extraction to obtain a first feature matrix and a second feature matrix, wherein the first feature matrix is a matrix for representing the characteristics of the sample data of the initiator, and the second feature matrix is a matrix for representing the corresponding characteristics of the sample data of the participant;
carrying out random pre-clustering compression on each feature in the second feature matrix to obtain a clustering matrix and a clustering index;
performing sparse matrix construction processing according to the clustering matrix to obtain a sparse matrix, and performing pre-aggregation calculation on the first feature matrix according to the clustering index to obtain an aggregation matrix;
performing fragmentation processing on the sparse matrix to obtain a first fragmentation matrix, and performing fragmentation processing on the aggregation matrix to obtain a second fragmentation matrix;
performing matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data;
and performing model training on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model.
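The fragmentation step of claim 1 can be read as additive secret sharing: each matrix is split into random shares that individually look like noise but sum back to the plaintext, so neither party alone learns the other's data. A minimal sketch, assuming arithmetic in the ring of integers modulo 2^32 (the claim does not fix a ring, so this is an illustrative choice):

```python
import numpy as np

MOD = 2**32
rng = np.random.default_rng(42)

def share(mat):
    """Split an integer matrix into two additive shares modulo MOD."""
    s0 = rng.integers(0, MOD, size=mat.shape, dtype=np.uint64)
    s1 = (mat.astype(np.uint64) - s0) % MOD  # uint64 wrap-around is harmless: 2**32 divides 2**64
    return s0, s1

def reveal(s0, s1):
    """Recombine two shares into the plaintext matrix."""
    return (s0 + s1) % MOD

m = np.arange(6, dtype=np.uint64).reshape(2, 3)
a, b = share(m)
assert np.array_equal(reveal(a, b), m)       # the shares reconstruct the original
```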
2. The data processing method according to claim 1, wherein performing random pre-cluster compression on each feature in the second feature matrix to obtain a cluster matrix and a cluster index includes:
randomly selecting a preset number of feature values from each feature in the second feature matrix as the clustering centers corresponding to the feature, and obtaining feature data to be clustered, wherein the feature data to be clustered are the feature values in the second feature matrix other than the clustering centers;
and carrying out clustering processing on the feature data to be clustered according to the clustering centers and a preset clustering rule to obtain a clustering matrix and a clustering index corresponding to the clustering centers.
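A minimal sketch of claim 2's pre-clustering for one feature column, assuming nearest-center assignment as the "preset clustering rule" and k=4 as the "preset number" (both are illustrative choices, and `pre_cluster_feature` is a hypothetical helper name):

```python
import numpy as np

rng = np.random.default_rng(7)

def pre_cluster_feature(col, k=4):
    # randomly drawn feature values serve as the clustering centers
    centers = rng.choice(np.unique(col), size=k, replace=False)
    # clustering index: position of the nearest center for each sample value
    idx = np.abs(col[:, None] - centers[None, :]).argmin(axis=1)
    return centers, idx

col = rng.normal(size=1000)                # one feature column of the second feature matrix
centers, idx = pre_cluster_feature(col)
print(centers.round(2), np.bincount(idx))  # center values and cluster sizes
```

Stacking the per-feature center rows would give the clustering matrix, and the per-feature index vectors form the clustering index.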
3. The data processing method according to claim 1, wherein performing sparse matrix construction processing according to the cluster matrix, obtaining a sparse matrix includes:
performing bucketing processing on the features in the clustering matrix according to the clustering centers and a preset number of buckets to obtain sub-sparse matrices;
and performing splicing processing on the sub-sparse matrices to obtain the sparse matrix.
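A sketch of claim 3's construction under two assumptions: each feature's clustering centers are assigned to equal-width buckets, and the resulting one-hot sub-sparse matrices are spliced column-wise. scipy's CSR format stands in for whatever sparse representation the application actually uses:

```python
import numpy as np
from scipy import sparse

def bucketize_centers(centers, n_buckets=8):
    """Map each clustering center to an equal-width bucket id."""
    edges = np.linspace(centers.min(), centers.max(), n_buckets + 1)[1:-1]
    return np.searchsorted(edges, centers)

def build_sparse(center_rows, n_buckets=8):
    blocks = []
    for centers in center_rows:                       # one row of centers per feature
        ids = bucketize_centers(centers, n_buckets)
        blocks.append(sparse.csr_matrix(np.eye(n_buckets)[ids]))  # sub-sparse matrix
    return sparse.hstack(blocks).tocsr()              # splice into one sparse matrix

S = build_sparse([np.array([0.1, 0.9, 0.4]), np.array([-2.0, 0.0, 2.0])])
print(S.shape)                                        # (3 centers, 2 features x 8 buckets)
```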
4. The data processing method according to claim 1, wherein performing a pre-aggregation calculation on the first feature matrix according to the clustering index to obtain an aggregation matrix comprises:
performing identification processing on the first feature matrix to obtain a first-order gradient matrix and a second-order gradient matrix;
performing pre-aggregation processing on the first-order gradient matrix according to the clustering index to obtain a first-order aggregation matrix;
performing pre-aggregation processing on the second-order gradient matrix according to the clustering index to obtain a second-order aggregation matrix;
and performing combination optimization processing on the first-order aggregation matrix and the second-order aggregation matrix to obtain the aggregation matrix.
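A minimal sketch of the pre-aggregation in claim 4: per-sample first-order (g) and second-order (h) gradients are summed into one slot per clustering center via the clustering index, so the n_samples-long vectors shrink to k-long ones before any encrypted computation. `np.bincount` is an assumed implementation detail, not taken from the application:

```python
import numpy as np

def pre_aggregate(g, h, idx, k):
    g_agg = np.bincount(idx, weights=g, minlength=k)  # first-order aggregation (as a vector)
    h_agg = np.bincount(idx, weights=h, minlength=k)  # second-order aggregation (as a vector)
    return g_agg, h_agg

g = np.array([0.2, -0.4, 0.1, 0.3])   # first-order gradients (initiator side)
h = np.array([1.0, 0.9, 1.1, 1.0])    # second-order gradients
idx = np.array([0, 2, 0, 1])          # clustering index from the pre-clustering step
print(pre_aggregate(g, h, idx, k=3))  # slot sums: g -> [0.3, 0.3, -0.4]
```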
5. The data processing method according to claim 4, wherein performing combination optimization processing on the first-order aggregation matrix and the second-order aggregation matrix to obtain the aggregation matrix includes:
performing expansion processing on the first-order aggregation matrix to obtain a first aggregation matrix; performing expansion processing on the second-order aggregation matrix to obtain a second aggregation matrix;
performing gradient combination on the first aggregation matrix and the second aggregation matrix to obtain process aggregation matrix data;
and performing densification processing on the process aggregation matrix data to obtain the aggregation matrix.
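A sketch of claim 5's combine-and-densify step, assuming column interleaving as the gradient combination scheme (so a single secure matrix product later yields both histograms side by side) and a contiguous-array cast as the densification; both choices are illustrative:

```python
import numpy as np

def combine(g_agg, h_agg):
    stacked = np.stack([g_agg, h_agg], axis=-1)   # pair the expanded g and h entries per slot
    merged = stacked.reshape(g_agg.shape[0], -1)  # interleave columns: g0, h0, g1, h1, ...
    return np.ascontiguousarray(merged)           # densification: one dense contiguous block

g_agg = np.array([[0.3], [0.3], [-0.4]])          # k x 1 first aggregation matrix
h_agg = np.array([[2.1], [1.0], [0.9]])           # k x 1 second aggregation matrix
print(combine(g_agg, h_agg))                      # k x 2 combined aggregation matrix
```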
6. The data processing method according to claim 1, wherein performing matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data includes:
performing point multiplication processing based on the first fragment matrix and the second fragment matrix corresponding to each feature to obtain process gradient histogram data;
carrying out gradient histogram calculation processing based on the sum of feature clustering centers of the same bucket on the process gradient histogram data to obtain first-order gradient histogram data and second-order gradient histogram data;
and performing splicing processing on the first-order gradient histogram data and the second-order gradient histogram data to obtain the gradient histogram data.
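A full secure product as in claim 6 needs a multiplication protocol on the shares (e.g. Beaver triples); the sketch below only verifies the underlying algebra, namely that the four cross products of two parties' additive shares sum to the plaintext product S.T @ G, whose columns are the first- and second-order gradient histograms. Real-valued shares are used purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def share(mat):
    """Toy additive sharing over the reals."""
    r = rng.normal(size=mat.shape)
    return r, mat - r

S = np.eye(3)[np.array([0, 1, 0, 2])]          # one-hot bucket matrix, 4 samples x 3 buckets
G = np.column_stack([[0.5, -1.0, 0.25, 0.75],  # first-order gradients
                     [1.0, 0.9, 1.1, 1.0]])    # second-order gradients
S0, S1 = share(S)
G0, G1 = share(G)
prod = S0.T @ G0 + S0.T @ G1 + S1.T @ G0 + S1.T @ G1
assert np.allclose(prod, S.T @ G)              # buckets x [g-histogram, h-histogram]
```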
7. The data processing method according to claim 1, wherein performing model training on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model includes:
performing optimal segmentation point calculation processing of the XGBoost tree model according to the gradient histogram data to obtain optimal segmentation point data, wherein the optimal segmentation point data is used for representing the optimal segmentation point of the XGBoost tree model;
and performing model training processing based on tree structure updating on the XGBoost tree model according to the optimal segmentation point data to obtain the target XGBoost tree model.
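The optimal segmentation point search in claim 7 can be carried out entirely on the histogram. A minimal sketch using the standard XGBoost split gain, with the global 1/2 factor and gamma penalty omitted and lmb as the usual L2 regularizer (these defaults are assumptions, not taken from the application):

```python
import numpy as np

def best_split(g_hist, h_hist, lmb=1.0):
    GL, HL = np.cumsum(g_hist)[:-1], np.cumsum(h_hist)[:-1]  # left-side totals per boundary
    G, H = g_hist.sum(), h_hist.sum()
    GR, HR = G - GL, H - HL                                  # right-side totals per boundary
    # gain = G_L^2/(H_L+lmb) + G_R^2/(H_R+lmb) - (G_L+G_R)^2/(H_L+H_R+lmb)
    gain = GL**2 / (HL + lmb) + GR**2 / (HR + lmb) - G**2 / (H + lmb)
    return int(np.argmax(gain)), float(gain.max())

g_hist = np.array([0.75, -1.0, 0.75])  # first-order gradient histogram
h_hist = np.array([2.1, 0.9, 1.0])     # second-order gradient histogram
print(best_split(g_hist, h_hist))      # (best bucket boundary, its gain)
```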
8. The data processing method according to claim 1, wherein after model training is performed on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model, the data processing method further comprises:
obtaining sample data to be predicted, wherein the sample data to be predicted is the data on which the at least one initiator and the at least one participant need to perform sample prediction;
and carrying out prediction processing on the sample data to be predicted according to the target XGBoost tree model to obtain prediction result data.
9. A data processing apparatus for XGBoost federal learning model training, for use in a data sharing scenario between at least one initiator and at least one participant, the data processing apparatus comprising:
the training sample acquisition module is used for acquiring sample data to be trained, wherein the sample data to be trained is the sample data of the at least one initiator and the at least one participant;
the preprocessing module is used for preprocessing the sample data based on data feature extraction to obtain a first feature matrix and a second feature matrix, wherein the first feature matrix is a matrix for representing the characteristics of the sample data of the initiator, and the second feature matrix is a matrix for representing the corresponding characteristics of the sample data of the participant;
the pre-clustering compression module is used for carrying out random pre-clustering compression on each feature in the second feature matrix to obtain a clustering matrix and a clustering index;
the matrix calculation module is used for carrying out sparse matrix construction processing according to the clustering matrix to obtain a sparse matrix, and performing pre-aggregation calculation on the first feature matrix according to the clustering index to obtain an aggregation matrix;
the fragmentation module is used for carrying out fragmentation processing on the sparse matrix to obtain a first fragmentation matrix, and carrying out fragmentation processing on the aggregation matrix to obtain a second fragmentation matrix;
the gradient histogram calculation module is used for carrying out matrix multiplication processing based on data encryption on the first fragment matrix and the second fragment matrix to obtain gradient histogram data;
and the model training module is used for carrying out model training on the XGBoost tree model according to the gradient histogram data to obtain a target XGBoost tree model.
10. A computer readable storage medium, wherein the computer readable storage medium stores computer instructions for causing a computer to perform the data processing method for XGBoost federal learning model training according to any one of claims 1 to 8.
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2023103483648 | 2023-04-04 | | |
| CN202310348364.8A (CN116127495A) | 2023-04-04 | 2023-04-04 | Training method, system, equipment and medium for multiparty safety calculation and learning model |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN117648992A | 2024-03-05 |
Family ID: 86299299
Family Applications (2)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310348364.8A (CN116127495A, pending) | Training method, system, equipment and medium for multiparty safety calculation and learning model | 2023-04-04 | 2023-04-04 |
| CN202311436932.6A (CN117648992A, pending) | Data processing method and device for XGBoost federal learning model training | 2023-04-04 | 2023-10-31 |
Country Status (1)

| Country | Publications |
|---|---|
| CN | CN116127495A, CN117648992A |
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN116127495A | 2023-05-16 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |