CN114429190A - Model construction method based on federated learning, credit evaluation method and device - Google Patents

Model construction method based on federated learning, credit evaluation method and device

Info

Publication number
CN114429190A
Authority
CN
China
Prior art keywords
model
participant
data
sample data
node
Prior art date
Legal status
Pending
Application number
CN202210106216.0A
Other languages
Chinese (zh)
Inventor
卞阳
蔡晓娟
陈立峰
邢旭
张翔
张伟奇
Current Assignee
Shanghai Fudata Technology Co ltd
Original Assignee
Shanghai Fudata Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Fudata Technology Co ltd filed Critical Shanghai Fudata Technology Co ltd
Priority to CN202210106216.0A
Publication of CN114429190A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/2178 Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a model construction method based on federated learning, a credit evaluation method, and a corresponding apparatus. The method comprises the following steps: each participant acquires its own sample data and an initialization model; each participant trains the initialization model with its own sample data to obtain a trained target model, where the target model comprises the local model trained by each participant. During each participant's model training, the input data format and/or output data format of a preset training step is kept consistent across participants, the preset training step being a step in which data is exchanged among the multiple participants. Because the data input format and data output format of every data-interaction step are agreed in advance during joint modeling, each participant needs to purchase only one joint modeling product, which avoids wasting resources.

Description

Model construction method based on federated learning, credit evaluation method and device
Technical Field
The application relates to the technical field of data security, in particular to a model construction method based on federated learning, a credit evaluation method, and a corresponding apparatus.
Background
Federated learning is a machine learning framework that can effectively help multiple organizations use data and build machine learning models jointly while meeting requirements for user privacy protection and data security.
Currently, when multiple participants model jointly, they are required to purchase the same federated modeling product. For example, if participant A builds one model with participant B and a different model with participant C, then participants B and C must purchase the same federated modeling product as participant A. Alternatively, participant A must purchase the federated modeling products used by participants B and C; when those two products differ, participant A must purchase two federated modeling products.
Therefore, when a participant builds separate models with different other participants, it must purchase multiple federated modeling products, which wastes resources.
Disclosure of Invention
The embodiments of the present application aim to provide a model construction method based on federated learning, a credit evaluation method, and a corresponding apparatus, which reduce the number of joint modeling products a participant must purchase and avoid wasting resources.
In a first aspect, an embodiment of the present application provides a model construction method based on federated learning, applied to multiple participants in a federated learning alliance. The method comprises: each participant acquires its own sample data and an initialization model; each participant trains the initialization model with its own sample data to obtain a trained target model, where the target model comprises the local model each participant has finished training; during each participant's model training, the input data format and/or output data format of a preset training step is kept consistent, the preset training step being a step in which data is exchanged among the multiple participants.
Because the data input format and data output format of every data-interaction step are agreed in advance during joint modeling, the participants can model jointly even when they use different joint modeling products, since the formats remain consistent. As a result, each participant needs to purchase only one joint modeling product, which avoids wasting resources.
In any embodiment, the initialization model is an initialized tree model, and each participant trains it with its own sample data as follows. Each participant builds each tree in its local tree model by repeating the steps below until the stop-building condition is met, obtaining the local tree model: a feature extraction step: extract features from the sample data to obtain sample features; a split information calculation step: calculate the split information of a node in the local tree model from the sample features; a leaf node judgment step: judge whether the node is a leaf node according to the split information and/or the sample data corresponding to the node; a node splitting step: if the node is not a leaf node, split it according to the split information to obtain several child nodes corresponding to the node and synchronize them to the other participants, continuing until every child node obtained by splitting is a leaf node, which yields one tree; a tree model judgment step: if the local tree model meets the stop-training condition, the local tree model is obtained; otherwise, repeat the feature extraction step through the tree model judgment step until the stop-training condition is met.
In this method, the data input format and data output format are specified in advance for whichever of the feature extraction, split information calculation, leaf node judgment, node splitting, and tree model judgment steps involves data interaction among the participants, so joint modeling can still proceed even though the participants use different joint modeling products.
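As an illustration of this convention, the following is a minimal sketch, not code from the patent; all names (Histogram, SplitInfo, find_best_split) and the placeholder scoring are assumptions. The point is that two differently implemented modeling products stay interoperable as long as the interaction step consumes and produces the agreed formats:

```python
from typing import Dict, List, Tuple

# Agreed input format: {feature_index: {split_point_index: [statistic_1, ..., statistic_m]}}
Histogram = Dict[int, Dict[int, List[float]]]

# Agreed output format: (best_feature, best_split_point, best_split_value)
SplitInfo = Tuple[int, int, float]

def find_best_split(merged_histogram: Histogram) -> SplitInfo:
    """Each vendor may implement the search differently internally, but it
    must accept the agreed Histogram dict and return the agreed tuple."""
    best = (-1, -1, float("-inf"))
    for feature, bins in merged_histogram.items():
        for split_point, stats in bins.items():
            gain = sum(stats)  # placeholder scoring; real products use Gini or gain
            if gain > best[2]:
                best = (feature, split_point, gain)
    return best

# Two participants using different products can still interoperate:
histogram_a: Histogram = {0: {0: [1.2, 0.4], 1: [0.7, 0.1]}}
print(find_best_split(histogram_a))  # -> (0, 0, 1.6...)
```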
In any embodiment, the federated learning is horizontal federated learning, and after a node is judged to be a leaf node according to the split information, the method further comprises a first prediction value calculation step: all participants jointly calculate the leaf node's prediction value from the sample data corresponding to the leaf node.
In any embodiment, the federated learning is vertical federated learning, and when a participant is the integrator, after a node is judged to be a leaf node according to the split information, the method further comprises a second prediction value calculation step: the integrator calculates the leaf node's prediction value from the sample data corresponding to the leaf node.
In any embodiment, after the sample data is acquired, the method further comprises a data preprocessing step: preprocess the sample data. Because each participant preprocesses its sample data before using it for model construction, invalid data can be removed, the quality of the sample data entering model construction improves, and the performance of the resulting model improves in turn.
In any embodiment, the federated learning is horizontal federated learning, and the data input format and data output format of the data preprocessing step are agreed in advance by all participants. For horizontal federated learning, this lets the participants align their sample data, which facilitates the subsequent model construction.
In a second aspect, an embodiment of the present application provides a credit evaluation method, applied to multiple participants in a federated learning alliance that have modeled with the model construction method of the first aspect to obtain a target model. The method comprises: the multiple participants each receive a credit evaluation request; acquire the sample data of the corresponding user according to the credit evaluation request; predict on the corresponding sample data with the local model to obtain a local credit evaluation result; and obtain a target credit evaluation result from the local credit evaluation results of the multiple participants.
In a third aspect, an embodiment of the present application provides a model construction apparatus based on federated learning, applied to multiple participants in a federated learning alliance, comprising: an acquisition module for acquiring each participant's sample data and initialization model, where the initialization model comprises nodes; and a model construction module for training the initialization model with the respective sample data to obtain a trained target model, where the target model comprises the local model each participant has trained. During each participant's model training, the input data format and/or output data format of the preset training step is kept consistent, the preset training step being a step in which data is exchanged among the multiple participants.
In a fourth aspect, an embodiment of the present application provides a credit evaluation apparatus, applied to multiple participants in a federated learning alliance that have modeled with the model construction method of the first aspect to obtain a target model, comprising: a request receiving module for the multiple participants to each receive a credit evaluation request; a sample acquisition module for acquiring the sample data of the corresponding user according to the credit evaluation request; a prediction module for predicting on the corresponding sample data with the local model to obtain a local credit evaluation result; and a result integration module for obtaining a target credit evaluation result from the local credit evaluation results of the multiple participants.
In a fifth aspect, an embodiment of the present application provides an electronic device comprising a processor, a memory, and a bus, where the processor and the memory communicate with each other through the bus; the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the method of the first or second aspect.
In a sixth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of the first or second aspect.
Additional features and advantages of the present application will be set forth in the description that follows, will in part be obvious from the description, or may be learned by practicing the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description, the claims, and the appended drawings.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting its scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic flow chart of the federated learning model construction method provided in an embodiment of the present application;
Fig. 2 is a schematic flow chart of the model construction method based on vertical federated learning provided in an embodiment of the present application;
Fig. 3 is a schematic flow chart of the model construction method based on horizontal federated learning provided in an embodiment of the present application;
Fig. 4 is a schematic flow chart of the credit evaluation method provided in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of the model construction apparatus provided in an embodiment of the present application;
Fig. 6 is a schematic structural diagram of the credit evaluation apparatus provided in an embodiment of the present application;
Fig. 7 is a schematic structural diagram of the electronic device provided in an embodiment of the present application.
Detailed Description
To facilitate understanding of the embodiments of the present application, the related concepts are described first:
federal machine Learning (Federal machine Learning/Federal Learning), also known as Federal Learning, Joint Learning, Federal Learning. The Federal machine learning is a machine learning framework, and can effectively help a plurality of organizations to perform data use and machine learning modeling under the condition of meeting the requirements of user privacy protection and data safety. The federated learning is used as a distributed machine learning paradigm, the data island problem can be effectively solved, participators can jointly model on the basis of not sharing data, the data island can be technically broken, and AI cooperation is realized.
Federated learning has three major components: the data sources, the federated learning system, and the users. Under the federated learning system, each data source preprocesses its data, the model is built and trained jointly, and the output is fed back to the users.
Horizontal federated learning: when two data sets overlap heavily in user features but only slightly in users, the data sets are split horizontally (along the user dimension), and the records whose features are the same but whose users are not entirely the same are taken out for training. This method is called horizontal federated learning.
Vertical federated learning: when two data sets overlap heavily in users but only slightly in user features, the data sets are split vertically (along the feature dimension), and the records whose users are the same but whose features are not entirely the same are taken out for training. This method is called vertical federated learning.
Tree model: includes decision trees, random forests, and so on. Taking the decision tree as an example, it is a supervised learning algorithm (with a predefined target variable) mainly used for classification problems; its input and output variables may be discrete or continuous. In a decision tree, the data set or sample is divided into two or more subsets according to the most discriminative input variable.
Local tree model: the trained tree model each participant stores locally.
Target tree model: the set of all participants' local tree models is called the target tree model.
Split feature: each feature of the samples corresponding to a node can serve as a split feature. Each split feature may contain at least one split point.
Split point: a candidate position within a split feature at which the node's samples can be divided; a split feature may contain one or more split points.
Split value: the score of a split point, for example a minimum Gini index, a maximum information gain, or a maximum information gain ratio.
To find the best split, traverse every split point of every feature of the node's samples, calculate each split value, and search for the optimal split value. The split feature corresponding to the optimal split value is the optimal split feature, and the split point corresponding to the optimal split feature is the optimal split point.
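The split-value measures named above can be sketched as follows; this is a hedged example, and the function names and toy labels are illustrative, not from the patent:

```python
from collections import Counter
import math

def gini(labels):
    """Gini impurity of a node: lower is purer."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, left_labels, right_labels):
    """Entropy reduction achieved by splitting the parent into two children."""
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) \
             + (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted

labels = [1, 1, 0, 0, 1, 0]
print(gini(labels))                                     # 0.5 for a balanced node
print(information_gain(labels, [1, 1, 1], [0, 0, 0]))   # 1.0: a perfect split
```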
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of the federated learning model construction method provided in an embodiment of the present application. As shown in Fig. 1, the method is applied to multiple participants in a federated learning alliance. Participants A and B are both electronic devices, which may be terminals (for example desktop computers, notebook computers, tablet computers, or smartphones) or servers. In practice, the federated learning alliance may include more participants; the embodiment of the present application does not limit their number. The method comprises the following steps:
Step 101: each participant acquires its own sample data and an initialization model, where the initialization model comprises nodes;
Step 102: each participant trains the initialization model with its own sample data to obtain a trained target model, where the target model comprises the local model trained by each participant;
during each participant's model training, the input data format and/or output data format of the preset training step is kept consistent, the preset training step being a step in which data is exchanged among the multiple participants.
In step 101, when training starts, participant A acquires its locally stored sample data and initialization model, and participant B does the same.
The sample data is the data basis for building the model; participants A and B each store their own sample data locally, and the two stores differ. Take a bank as participant A and a mobile operator as participant B. The bank stores each user's identity information and banking information, for example: name, ID card number, deposit amount, and whether the user has a loan. The mobile operator stores each user's identity information and mobile data, for example: name, ID card number, call duration with other users, and monthly data usage. Whether the user has a loan can serve as the sample label, and the other fields serve as sample features. Each user's information is one sample. Since the sample data held by A and B differ, for ease of distinction A's sample data is called the A sample data and its initial model the A initialization model, and B's sample data is called the B sample data and its initial model the B initialization model.
In step 102, participant A trains the A initialization model with the A sample data, and participant B trains the B initialization model with the B sample data. During training there are steps in which A and B exchange data; for those steps, to keep the data consistent and recognizable to the other side, the data input format and data output format are agreed in advance. After training, each participant obtains its local model, and the set of local models forms the target model.
Because the training process differs across model types, the embodiments of the present application do not limit the specific training procedure or the type of the trained model. Federated learning includes vertical and horizontal federated learning, and the same model trained under different types of federated learning follows different procedures. The following embodiments therefore describe vertical and horizontal federated learning separately.
Because the data input format and data output format of every data-interaction step are agreed in advance during joint modeling, the participants can model jointly even when they use different joint modeling products, since the formats remain consistent. As a result, each participant needs to purchase only one joint modeling product, which avoids wasting resources.
(1) Vertical federated learning
In vertical federated learning, assume the A sample data stored by participant A contains the label values; participant A is then also referred to as the integrator.
Fig. 2 is a schematic flow chart of the model construction method based on vertical federated learning provided in an embodiment of the present application. As shown in Fig. 2, the embodiment takes the construction of a tree model as an example and comprises:
step 201: acquiring data; the participator A acquires sample data A, and the participator B acquires sample data B. The A sample data is stored in the A participant in advance, and the B sample data is stored in the B participant in advance.
Step 202: sample the sample features. Participant A performs column sampling and row sampling on the A sample data to obtain the A sample features, and participant B does the same on the B sample data to obtain the B sample features. Column sampling methods include random sampling, EFB sampling, PCA dimensionality-reduction sampling, and so on; row sampling methods include GOSS sampling, random sampling, and so on. After A and B complete row sampling, the sampled data may be aligned to keep each retained sample's feature set complete. For example: both participants store sample data for users 1 through 5; in participant A each user has features C and D, and in participant B each user has features E and F. If sampling A's data retains users 1, 2, and 3, then to keep the complete features of those users, participant B must retain the features of users 1, 2, and 3 as well. This is called alignment. The format of the row-sampled features obtained by A and B must be consistent.
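A minimal sketch of this alignment, assuming toy user IDs and feature names (all illustrative):

```python
# Participant A's and B's local stores, keyed by user ID.
a_data = {"user1": {"C": 1.0, "D": 0.2},
          "user2": {"C": 0.3, "D": 0.8},
          "user3": {"C": 0.5, "D": 0.1},
          "user4": {"C": 0.9, "D": 0.7},
          "user5": {"C": 0.4, "D": 0.6}}
b_data = {"user%d" % i: {"E": 0.1 * i, "F": 0.2 * i} for i in range(1, 6)}

sampled_ids = ["user1", "user2", "user3"]              # result of A's row sampling

# B aligns to the same users so every retained sample keeps C, D, E, and F.
b_aligned = {uid: b_data[uid] for uid in sampled_ids}
assert set(b_aligned) == set(sampled_ids)
```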
Step 203: initialize the tree model. Participant A initializes its tree model to obtain the A initialized tree model, which contains the initialized nodes (one node or another number); participant B likewise obtains the B initialized tree model.
Step 204: calculate histogram information for the nodes' sample data. Participant A calculates the histogram information of the sample data corresponding to nodes in the A model (initially the A initialization model), and participant B does the same for the B model. Histogram information is calculated only for nodes that have not yet been split and are not leaf nodes.
Step 205: calculate the optimal split information of the node. After both participants have calculated their histogram information, participant B sends its histograms to participant A, and A combines its own histograms with B's to calculate the optimal split information. A and B are therefore required to express the histogram information in the pre-agreed data input format and to agree on the format of the output.
The optimal split information comprises the optimal split feature, the optimal split point, and the optimal split value. Traverse every split point of every feature of the node's sample data, calculate each split point's split value from the node's histogram information, and search for the optimal split value. The feature corresponding to the optimal split value is the optimal split feature, and the split point corresponding to the optimal split feature is the optimal split point.
For example: agree that the histogram information is input as a dictionary (it could also be an array, a list, and so on) keyed by each split point of each feature, that the node's optimal split information (comprising the optimal split value, optimal split feature, and optimal split point) is calculated from it, and that the optimal split information is output as a fixed tuple (it could also be a dictionary, an array, and so on).
Participant A's histogram information takes the form: {feature_0: {bin_0: [data_1, ..., data_m], ..., bin_j: [data_1, ..., data_m]}, ..., feature_i: {bin_0: [data_1, ..., data_m], ..., bin_j: [data_1, ..., data_m]}}.
Participant B's histogram information takes the same form.
Here feature_i denotes the i-th feature held by the party itself, and bin_j denotes the j-th split point of that feature.
In addition, to keep the data secure, participant B must encrypt the node data before sending it to participant A, and the encryption scheme can be agreed in advance by both parties. For example: when the participants agree on homomorphic encryption and the tree model algorithm is XGBoost, data_1 is the sum of first-order gradients and data_2 is the sum of second-order gradients within the bin; when the participants agree on differential privacy and the tree model algorithm is LightGBM, data_1 is the split value of the j-th split point of the i-th feature. When the participants agree on different tree model algorithms, the number of data items and their meanings differ.
All participants must agree on the same encryption module. (1) With homomorphic encryption: first, before initializing the tree model, participant A calculates the encrypted sample residuals and sends them to participant B; second, A calculates its own histogram information from the plaintext residuals, while B calculates ciphertext histogram information from the encrypted residuals; third, B inputs its histogram information to A, and A calculates the node's optimal split information; finally, A synchronizes the calculated optimal split information to B. (2) With differential privacy: first, before initializing the tree model, A adds noise to the sample residuals and sends them to B; second, B calculates its histogram information from the noised residuals; third, B sends the histogram information to A, and A calculates the node's optimal split information; finally, A synchronizes the optimal split information to B.
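The homomorphic branch of this flow can be sketched with the python-paillier library as a stand-in; the patent does not name a library, so the package choice, key size, and bin assignment below are assumptions:

```python
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

residuals = [0.4, -0.2, 0.7, 0.1]                            # A's plaintext residuals
enc_residuals = [public_key.encrypt(r) for r in residuals]   # sent to B

# B assigns its samples to bins by its own feature values and sums in ciphertext,
# never seeing the plaintext residuals:
bins = {0: [0, 2], 1: [1, 3]}                                # bin -> sample indices on B's side
enc_bin_sums = {b: sum(enc_residuals[i] for i in idx) for b, idx in bins.items()}

# A decrypts the per-bin sums to build the histogram statistics:
bin_sums = {b: private_key.decrypt(s) for b, s in enc_bin_sums.items()}
print(bin_sums)  # {0: ~1.1, 1: ~-0.1}
```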
The internal logic each participant uses to calculate histogram information need not be identical; only the method for calculating each feature's split value at each split point must be consistent. For example, the feature binning may be equal-frequency or equal-width, and other processing details may differ.
Step 206: synchronize the optimal split information. The optimal split information comprises the optimal split feature, optimal split point, and optimal split value. After participant A calculates it in step 205, A synchronizes it to participant B.
Step 207: judge whether the current node is a leaf node. Participant A judges whether the current node is a leaf node according to the optimal split information and/or the current node's sample information; if it is a leaf node, go to step 210, otherwise go to step 208. Participant B judges from the received optimal split information; if it is a leaf node, go to step 211, otherwise go to step 208. The format of the judgment output of A and B is agreed in advance, i.e. it is emitted in a uniform format: it may be a Boolean using True and False, or 0 and 1, or a binary or similar representation. The current node may be judged a leaf node by at least one of the following conditions: all sample labels at the current node belong to the same class; the number of samples at the current node is less than a first preset threshold; the layer of the tree where the current node sits has reached the preset tree depth; or the optimal split value of the current node is greater than a preset threshold.
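A sketch of this leaf judgment with the agreed Boolean output; the threshold names and values are assumptions, and the last check follows the text's phrasing, which fits a minimum-Gini split value:

```python
def is_leaf(labels, depth, best_split_value,
            min_samples=5, max_depth=6, gini_threshold=0.45):
    if len(set(labels)) == 1:              # all labels at the node are one class
        return True
    if len(labels) < min_samples:          # too few samples to keep splitting
        return True
    if depth >= max_depth:                 # node sits at the preset tree depth
        return True
    if best_split_value > gini_threshold:  # even the best split scores too high
        return True                        # (split value taken as minimum Gini)
    return False

print(is_leaf([1, 1, 1], depth=2, best_split_value=0.3))    # True: pure node
print(is_leaf([1, 0] * 10, depth=2, best_split_value=0.3))  # False: keep splitting
```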
Step 208: split the node. Participant A splits the current node according to the obtained optimal split information, and participant B splits the current node into child nodes according to the same information.
Step 209: judge whether a child node is a leaf node. Participant A judges whether each node produced by the split in step 208 is a leaf node and synchronizes the result to participant B. If A judges a node to be a leaf node, go to step 210; otherwise go to step 204. If B receives from A the message that the split node is a leaf node, go to step 211; otherwise go to step 204.
Step 210: calculate the leaf node's prediction value. Participant A calculates the prediction value of the leaf node from the sample data corresponding to it. Since participant B neither enters the prediction value calculation nor outputs the value, only A performs this calculation, and no input/output rules need to be agreed for it.
Step 211: stop splitting. When participants A and B have judged the current node to be a leaf node, splitting stops and the construction of the tree is complete.
Step 212: judge whether the tree model stop condition is reached. Participant A judges whether the stop-building condition is satisfied; if so, A obtains the A tree model (i.e. its local tree model), otherwise step 202 is executed until the condition is met. A synchronizes the judgment result to participant B; if the condition is satisfied, B obtains the B tree model (i.e. its local tree model) after receiving the result, otherwise B executes step 202 until the stop-building condition is met. In vertical federated learning, B does not need to make this judgment; only A judges, so no input rule needs to be agreed, and the output is emitted in a uniform format, for example a Boolean {True, False} indicating whether the stop condition is reached, which is synchronized to B. The result may also be expressed in the formats mentioned in step 207.
The condition for stopping building the tree model includes at least one of the following:
the depth of the tree model is larger than a first preset value;
the number of samples corresponding to each leaf node in the tree model is smaller than a second preset value;
the label values of the sample data corresponding to each leaf node in the tree model are all the same value.
The first preset value and the second preset value may be set according to actual conditions; the embodiment of the present application does not specifically limit them.
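A minimal sketch of this stop-building check, which in vertical federated learning only participant A (the integrator) evaluates before synchronizing the Boolean result to B; the parameter names and values are assumptions:

```python
def should_stop(depth, leaf_sample_counts, leaf_label_sets,
                max_depth=6, min_leaf_samples=5):
    if depth > max_depth:                                      # first preset value
        return True
    if all(n < min_leaf_samples for n in leaf_sample_counts):  # second preset value
        return True
    if all(len(s) == 1 for s in leaf_label_sets):              # every leaf is pure
        return True
    return False

print(should_stop(depth=3, leaf_sample_counts=[12, 8],
                  leaf_label_sets=[{0, 1}, {1}]))              # False -> keep building
```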
For vertical federated learning, the data input format and data output format of every step that exchanges data among the participants are agreed in advance during joint modeling, so the participants can model jointly even when they use different joint modeling products, since the formats are consistent. Each participant therefore needs to purchase only one joint modeling product, which avoids wasting resources.
On the basis of the above embodiment, a data preprocessing step may follow step 201. Its purpose is to clean and deduplicate the sample data so that the data entering model construction is of good quality, which improves the performance of the resulting model.
For vertical federated learning, the preprocessing method each participant uses may differ; only some participants may preprocess their sample data, or none may.
(2) Horizontal federated learning
In horizontal federated learning, both the A sample data stored by participant A and the B sample data stored by participant B contain label values, and one participant can be designated at random as the integrator.
Fig. 3 is a schematic flow chart of the model construction method based on horizontal federated learning provided in an embodiment of the present application. As shown in Fig. 3, the embodiment's tree model construction comprises:
step 301: acquiring data; the participator A acquires sample data A, and the participator B acquires sample data B. The A sample data is stored in the A participant in advance, and the B sample data is stored in the B participant in advance.
Step 302: sample the sample features. Participant A performs column sampling and row sampling on the A sample data to obtain the A sample features, and participant B does the same to obtain the B sample features. Column sampling methods include random sampling, EFB sampling, PCA dimensionality-reduction sampling, and so on; row sampling methods include GOSS sampling, random sampling, and so on. After column sampling, participants A and B align the sampled sample features; to keep the sample data complete, the formats of the column-sampled features obtained by A and B must be consistent.
Step 303: initialize the tree model. Participant A initializes its tree model to obtain the A initialized tree model, which contains the initialized nodes (one node or another number); participant B likewise obtains the B initialized tree model.
Step 304: calculate histogram information for the nodes' sample data. Participant A calculates the histogram information of the sample data of nodes in the A model (initially the A initialization model), and participant B does the same for the B model. Histogram information is calculated only for nodes that have not yet been split and are not leaf nodes.
Step 305: calculate the optimal split information of the node. After both participants have calculated their histogram information, participant B sends its histograms to participant A, and A combines its own histograms with B's to calculate the optimal split information. A and B are therefore required to express the histogram information in the pre-agreed data input format and to agree on the format of the output.
For example: agree that the histogram information is input as a dictionary (it could also be an array, a list, and so on) keyed by each split point of each feature, that the node's optimal split information (comprising the optimal split value, optimal split feature, and optimal split point) is calculated from it, and that the optimal split information is output as a fixed tuple (it could also be a dictionary, an array, and so on).
Participant A's histogram information takes the form: {feature_0: {bin_0: [data_1, ..., data_m], ..., bin_j: [data_1, ..., data_m]}, ..., feature_i: {bin_0: [data_1, ..., data_m], ..., bin_j: [data_1, ..., data_m]}}.
Participant B's histogram information takes the same form.
Here feature_i denotes the i-th feature held by the party itself, and bin_j denotes the j-th split point of that feature.
To keep the data secure, after participant B calculates the histogram information of its own sample data, it must encrypt the histogram information and send the ciphertext to participant A; the encryption scheme can be agreed in advance by both parties, for example homomorphic encryption, differential privacy, or MPC. When calculating the split information, each participant can choose the calculation method matching the tree model: when the algorithm is XGBoost, data_1 is the sum of first-order gradients and data_2 is the sum of second-order gradients within the bin; when the algorithm is LightGBM, data_1 is the split value of the j-th split point of the i-th feature. When the participants agree on different tree model algorithms, the number of data items and their meanings differ.
Step 306: synchronize the optimal split information. The optimal split information comprises the optimal split feature, optimal split point, and optimal split value. After participant A calculates it in step 305, A synchronizes it to participant B.
Step 307: judge whether the current node is a leaf node. Participant A judges whether the current node is a leaf node according to the optimal split information; if it is a leaf node, go to step 310, otherwise go to step 308. Participant B judges likewise from the obtained optimal split information; if it is a leaf node, go to step 310, otherwise go to step 308. The format of the judgment output of A and B is agreed in advance, i.e. emitted in a uniform format: a Boolean using True and False, or 0 and 1, or a binary or similar representation. The leaf judgment conditions are the same as in the vertical federated learning embodiment above and are not repeated here.
Step 308: split the node. If the optimal split point lies on participant A, A splits the current node according to the obtained optimal split information and synchronizes the split result to participant B, so that B performs the corresponding split; if the optimal split point lies on participant B, B splits the current node into child nodes according to the obtained optimal split information and synchronizes the result to A, so that A performs the corresponding split. If both A and B hold the optimal split point, each splits by its own optimal split point and synchronizes the result to the other afterwards.
Step 309: judge whether a child node is a leaf node. Participants A and B jointly judge whether a node produced by the split in step 308 is a leaf node. Specifically: B sends the node information of the node in the B model to A; A judges from both parties' data and sends the result to B. If the result is a leaf node, both A and B execute step 310; otherwise both execute step 304. The format of the node information A and B use for the judgment must be agreed in advance to be the same, and the format of the output must be specified. For example: define the input as a dictionary containing the child node's sample count, the set of the child node's sample label values, and the depth of the tree model the child node belongs to; output whether it is a leaf node as a Boolean {True, False}.
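A sketch of the agreed message format for this joint judgment, with key names assumed to match the fields listed above (sample count, label value set, tree depth):

```python
node_info_b = {"sample_count": 4, "label_values": {0, 1}, "tree_depth": 3}  # sent by B
node_info_a = {"sample_count": 7, "label_values": {1}, "tree_depth": 3}     # A's own

def judge_leaf(info_a, info_b, min_samples=5, max_depth=6):
    """A merges both sides' node information and emits the agreed Boolean."""
    total = info_a["sample_count"] + info_b["sample_count"]
    labels = info_a["label_values"] | info_b["label_values"]
    depth = max(info_a["tree_depth"], info_b["tree_depth"])
    return total < min_samples or len(labels) == 1 or depth >= max_depth

print(judge_leaf(node_info_a, node_info_b))  # False: 11 samples, mixed labels, depth 3
```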
Step 310: calculate the leaf node's prediction value. Participants A and B jointly calculate the leaf node's prediction value from the sample data corresponding to the leaf node. Specifically: B sends the sample data corresponding to the leaf node to A; A combines its own leaf node sample data with B's, calculates the prediction value, and synchronizes it to B. Because the calculation combines both parties' leaf samples, the format of the leaf node sample-count data of A and B must be agreed in advance to be consistent, and the data format of the calculation result A outputs must be standardized. For example: under the agreed convention, when a decision tree is chosen, each participant supplies the per-class sample counts and the total sample count of the leaf node as a dictionary (an array or list is also possible); the leaf node's prediction value is calculated and output as a numeric type, which may by agreement be a probability or a class.
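A sketch of the agreed leaf-prediction exchange for the decision-tree case described above; the class names and counts are illustrative assumptions, and the output here is the positive-class probability:

```python
from collections import Counter

counts_a = {"positive": 6, "negative": 2}   # A's samples in this leaf
counts_b = {"positive": 3, "negative": 1}   # B's samples, same agreed keys

def leaf_prediction(*per_party_counts):
    """Merge per-party class counts and output a numeric prediction value."""
    total = Counter()
    for counts in per_party_counts:
        total.update(counts)
    return total["positive"] / sum(total.values())

print(leaf_prediction(counts_a, counts_b))  # 9 / 12 = 0.75
```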
Step 311: stop splitting. When participants A and B have judged the current node to be a leaf node, splitting stops and the construction of the tree is complete.
Step 312: judge whether the tree model stop condition is reached. Participant A receives the model information of the B tree model from participant B, judges whether the stop-building condition is satisfied by combining it with the model information of the A tree model, and sends the judgment result to B. If the stop condition is met, A obtains the A tree model (i.e. its local tree model) and B obtains the B tree model (i.e. its local tree model); if not, both A and B execute step 302 until the stop-building condition is met. Since the judgment needs the model information of both tree models, the data input and output formats must be unified. For example: each participant supplies its tree model information as a dictionary containing information such as the current number of trees and the calculated loss function parameters; the output is emitted in a uniform format, a Boolean {True, False} indicating whether the stop condition is reached, and is synchronized to B.
The condition for stopping building the tree model includes at least one of the following:
the depth of the tree model is larger than a first preset value;
the number of samples corresponding to each leaf node in the tree model is smaller than a second preset value;
the sample labels of the sample data corresponding to each leaf node in the tree model are of the same type.
The first preset value and the second preset value may be set according to actual conditions; the embodiment of the present application does not specifically limit them.
For horizontal federated learning, the data input format and data output format of every step that exchanges data among the participants are agreed in advance during joint modeling, so the participants can model jointly even when they use different joint modeling products, since the formats are consistent. Each participant therefore needs to purchase only one joint modeling product, which avoids wasting resources.
On the basis of the above embodiment, a data preprocessing step may follow step 301, whose purpose is to clean and deduplicate the sample data. Moreover, the preprocessing methods the participants adopt must be consistent, and the format of the preprocessed sample data must be agreed in advance. The sample data entering model construction is then of good quality, which improves the performance of the resulting model.
Based on the models obtained by the model construction methods of the above embodiments, an embodiment of the present application provides a credit evaluation method. As shown in Fig. 4, the method comprises:
step 401: the multiple participants respectively receive credit granting evaluation requests; and the party A receives the credit evaluation request and sends the credit evaluation request to the party B.
Step 402: acquire the sample data of the corresponding user according to the credit evaluation request. Participant A acquires its locally stored sample data of the user based on the request, and after receiving the request, participant B acquires its locally stored sample data of the same user. The A sample data of the user held by participant A includes: name, ID card number, deposit amount, whether the user has a loan, and so on. The B sample data of the user held by participant B includes: name, ID card number, call duration with other users, average data usage, and so on. Each user corresponds to one sample.
Step 403: predict on the corresponding sample data with the local model to obtain a local credit evaluation result. Participant A inputs the A sample data into the A model to obtain the A credit evaluation result output by the A model; participant B inputs the B sample data into the B model to obtain the B credit evaluation result output by the B model. The A model and the B model together constitute the complete target model, which may be constructed by the model construction methods of the above embodiments.
Step 404: obtain the target credit evaluation result from the local credit evaluation results of the multiple participants. Participant B sends its local credit evaluation result to participant A, and A integrates its own result with B's to obtain the target credit evaluation result. The integration differs across target model types. For example, in a binary-classification decision tree model the prediction results returned by the participants are summed and then passed through a sigmoid, while a deep learning model usually requires a softmax. Alternatively, the target result can be obtained as the intersection of A's and B's local results. Specifically: participant A contains two leaf nodes, leaf node 1 with prediction value 0.82 (creditworthy with probability 0.82) and leaf node 2 with prediction value 0.12 (creditworthy with probability 0.12); participant B contains two leaf nodes, leaf node 3 with prediction value 0.78 and leaf node 4 with prediction value 0.25. It is agreed that a prediction value greater than 0.5 means the user is creditworthy and otherwise not. For a user to be evaluated, participant A's prediction places the user in leaf node 1, i.e. creditworthy, and participant B's prediction places the user in leaf node 3; taking the intersection of the results of leaf node 1 and leaf node 3 yields the target credit evaluation result: the user is creditworthy.
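A minimal sketch of the sum-then-sigmoid integration mentioned above for a binary-classification tree model; the raw scores are invented for illustration, and the 0.5 cut-off follows the example in the text:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

score_a = 1.2    # A's local model output for the user (assumed value)
score_b = -0.3   # B's local model output, received by A (assumed value)

probability = sigmoid(score_a + score_b)
print(round(probability, 3))                               # ~0.711
print("creditworthy" if probability > 0.5 else "not creditworthy")
```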
The above embodiment is only one example of a specific application scenario; in practice, target models built from different sample data can serve the corresponding applications.
Fig. 5 is a schematic structural diagram of a model building apparatus provided in an embodiment of the present application, where the apparatus may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus corresponds to the above-mentioned embodiment of the method of fig. 1 and can perform the various steps involved in that method embodiment; for the specific functions of the apparatus, reference may be made to the description above, and a detailed description is appropriately omitted here to avoid redundancy. The device comprises: an obtaining module 501 and a model building module 502, wherein:
the obtaining module 501 is configured to obtain respective sample data and an initialization model, wherein the initialization model comprises nodes;
the model construction module 502 is configured to perform model training on the initialization model by using respective sample data to obtain a trained target model, where the target model includes a self model after training of each participant;
in the model training process of each participant, the input data format and/or the output data format of the preset training step are consistent, and the preset training step refers to a step of data interaction among the multiple participants.
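To make the agreed-format idea concrete, the following is a minimal Python sketch of a pre-agreed message shape for one interaction step; the field names are assumptions rather than the actual wire format of this application:

    from dataclasses import dataclass, asdict

    @dataclass
    class SplitSyncMessage:
        tree_index: int        # which tree is currently being built
        node_id: int           # the node that was just split
        child_node_ids: list   # sub-nodes produced by the split
        sender: str            # identifier of the emitting participant

    def serialize(msg: SplitSyncMessage) -> dict:
        # Every participant emits and consumes this same shape, so products
        # from different vendors can interoperate in the preset training step.
        return asdict(msg)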
On the basis of the foregoing embodiment, the initialization model is an initialization tree model, and the model building module 502 is specifically configured to:
each participant builds each tree in the self tree model by the following steps until the condition of stopping building is met, so as to obtain the self tree model (a minimal sketch follows this list):
a feature extraction step: performing feature extraction on the sample data to obtain sample features;
a splitting information calculation step: calculating the splitting information of the nodes in the self tree model according to the sample features;
leaf node judging step: judging whether the node is a leaf node or not according to the splitting information and/or the sample data corresponding to the node;
node segmentation step: if the node is not a leaf node, segmenting the node according to the splitting information to obtain a plurality of sub-nodes corresponding to the node, synchronizing the plurality of sub-nodes to other participants until the sub-nodes obtained after segmentation are leaf nodes, and obtaining a tree model;
a tree model judging step: if the current tree model meets the condition of stopping training, obtaining the self tree model; otherwise, continuing to execute the steps from the feature extraction step to the tree model judging step until the condition of stopping training is met.
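A minimal sketch of the per-tree building loop formed by the steps above, from one party's view; the helper functions are hypothetical stand-ins for the corresponding steps, not interfaces defined by this application:

    def build_own_tree(samples, extract_features, best_split, is_leaf,
                       split_node, sync_to_other_parties):
        root = {"samples": samples, "children": []}
        frontier = [root]
        while frontier:
            node = frontier.pop()
            feats = extract_features(node["samples"])   # feature extraction step
            split = best_split(feats)                   # splitting information calculation step
            if is_leaf(split, node["samples"]):         # leaf node judging step
                continue
            children = split_node(node, split)          # node segmentation step
            sync_to_other_parties(children)             # synchronize sub-nodes to other participants
            frontier.extend(children)
        return root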
On the basis of the above embodiment, the federal learning is horizontal federal learning, and the apparatus further includes a first prediction module configured to: jointly calculate, by all the participants, the predicted value of the leaf node according to the sample data corresponding to the leaf node.
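As a hedged illustration, the joint calculation could reduce to aggregating per-party label statistics for the samples that fall in the leaf; a real deployment would perform the aggregation under a privacy-preserving protocol, whereas the sketch below shows it in the clear:

    def joint_leaf_value(per_party_label_sums, per_party_counts):
        total = sum(per_party_label_sums)   # aggregated across all participants
        count = sum(per_party_counts)
        return total / max(count, 1)        # e.g., frequency of the trusted label

    # joint_leaf_value([41, 37], [50, 45]) -> 0.8210..., the leaf's predicted value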
On the basis of the above embodiment, the federal learning is longitudinal federal learning, and when the participating party is the integrating party, the apparatus further includes a second prediction module, configured to calculate the predicted value of the leaf node by the integrating party according to the sample data corresponding to the leaf node.
On the basis of the above embodiment, the apparatus further includes a preprocessing module configured to: and preprocessing the sample data.
On the basis of the above embodiment, the federal learning is horizontal federal learning, and the data input format and the data output format of the data preprocessing step are agreed in advance by the participants.
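A sketch of such a pre-agreed preprocessing contract, assuming illustrative column names (not prescribed by this application): each participant's preprocessing step consumes and produces the same column layout, so either end can be served by a different vendor's product.

    AGREED_INPUT_COLUMNS = {"id_number", "deposit_amount", "has_loan"}
    AGREED_OUTPUT_COLUMNS = {"id_hash", "deposit_amount_norm", "has_loan"}

    def check_format(rows, expected_columns):
        # Reject any record whose columns deviate from the agreed layout.
        for row in rows:
            if set(row) != expected_columns:
                raise ValueError(f"format mismatch: {sorted(row)}")
        return rows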
Fig. 6 is a schematic structural diagram of a credit assessment apparatus provided in an embodiment of the present application, where the apparatus may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus corresponds to the above-mentioned embodiment of the method of fig. 4 and can perform the various steps involved in that method embodiment; for the specific functions of the apparatus, reference may be made to the description above, and a detailed description is appropriately omitted here to avoid redundancy. The device comprises: a request receiving module 601, a sample obtaining module 602, a prediction module 603, and a result integration module 604, wherein:
the request receiving module 601 is configured for the multiple participants to respectively receive credit granting evaluation requests; the sample obtaining module 602 is configured to obtain sample data of a corresponding user according to the credit granting evaluation request; the prediction module 603 is configured to predict the corresponding sample data based on the own-party model to obtain an own-party credit granting evaluation result; and the result integration module 604 is configured to obtain a target credit granting evaluation result based on the own-party credit granting evaluation results respectively corresponding to the multiple participants.
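A minimal sketch of how these modules chain together on a single participant; the class, store, and request fields below are assumptions for illustration:

    class CreditEvaluator:
        def __init__(self, own_model, sample_store):
            self.own_model = own_model        # this party's trained sub-model
            self.sample_store = sample_store  # local mapping: user id -> feature vector

        def handle_request(self, request):    # request receiving module
            sample = self.sample_store[request["user_id"]]    # sample obtaining module
            own_result = self.own_model.predict([sample])[0]  # prediction module
            return own_result  # passed on to the result integration module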
Fig. 7 is a schematic structural diagram of an entity of an electronic device provided in an embodiment of the present application. As shown in fig. 7, the electronic device includes: a processor (processor)701, a memory (memory)702, and a bus 703; wherein:
the processor 701 and the memory 702 complete communication with each other through the bus 703;
the processor 701 is configured to call the program instructions in the memory 702 to execute the methods provided by the above-mentioned method embodiments, for example, including: each participant respectively acquires respective sample data and an initialization model; each participant performs model training on the initialization model by using respective sample data to obtain a trained target model, wherein the target model comprises a self model trained by each participant; in the model training process of each participant, the input data format and/or the output data format of the preset training step are consistent, and the preset training step refers to a step of data interaction among the multiple participants. Or
The multiple participants respectively receive credit granting evaluation requests; acquiring sample data of a corresponding user according to the credit assessment request; predicting corresponding sample data based on a self-party model to obtain a self-party credit assessment result; and obtaining a target credit granting evaluation result based on the own credit granting evaluation results respectively corresponding to the multiple participants.
The processor 701 may be an integrated circuit chip having signal processing capabilities. The Processor 701 may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The Memory 702 may include, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), and the like.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example, including: each participant respectively acquires respective sample data and an initialization model; each participant performs model training on the initialization model by using respective sample data to obtain a trained target model, wherein the target model comprises a self model after each participant completes training; in the model training process of each participant, the input data format and/or the output data format of the preset training step are consistent, and the preset training step refers to a step of data interaction among the multiple participants. Or
The multiple participants respectively receive credit granting evaluation requests; acquiring sample data of a corresponding user according to the credit assessment request; predicting corresponding sample data based on a self-party model to obtain a self-party credit assessment result; and obtaining a target credit granting evaluation result based on the own credit granting evaluation results respectively corresponding to the multiple participants.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: each participant respectively acquires respective sample data and an initialization model; each participant performs model training on the initialization model by using respective sample data to obtain a trained target model, wherein the target model comprises a self model after each participant completes training; in the model training process of each participant, the input data format and/or the output data format of the preset training step are consistent, and the preset training step refers to a step of data interaction among the multiple participants. Or
The multiple participants respectively receive credit granting evaluation requests; obtaining sample data of a corresponding user according to the credit granting evaluation request; predicting corresponding sample data based on a self-party model to obtain a self-party credit assessment result; and obtaining a target credit granting evaluation result based on the own credit granting evaluation results respectively corresponding to the multiple participants.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only one kind of logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical, or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A model construction method based on federal learning, applied to a plurality of participants in a federal learning alliance, the method comprising the following steps:
each participant respectively acquires respective sample data and an initialization model;
each participant performs model training on the initialization model by using respective sample data to obtain a trained target model, wherein the target model comprises a self model after each participant completes training;
in the model training process of each participant, the input data format and/or the output data format of the preset training step are consistent, and the preset training step refers to a step of data interaction among the multiple participants.
2. The method of claim 1, wherein the initialization model is an initialization tree model, and each of the participating parties performs model training on the initialization model using respective sample data, comprising:
each participant builds each tree in the self tree model by the following steps until the condition of stopping building is met, and the self tree model is obtained:
a feature extraction step: performing feature extraction on the sample data to obtain sample features;
a splitting information calculation step: calculating the splitting information of the nodes in the self tree model according to the sample features;
leaf node judging step: judging whether the node is a leaf node or not according to the splitting information and/or sample data corresponding to the node;
node segmentation step: if the node is not a leaf node, segmenting the node according to the splitting information to obtain a plurality of sub-nodes corresponding to the node, synchronizing the plurality of sub-nodes to other participants until the sub-nodes obtained after segmentation are leaf nodes, and obtaining a tree model;
a tree model judging step: judging whether the current tree model meets the condition of stopping training; if so, obtaining the self tree model; otherwise, continuing to execute the steps from the feature extraction step to the tree model judging step until the condition of stopping training is met.
3. The method of claim 2, wherein the federated learning is a horizontal federated learning, and wherein after determining the node is a leaf node based on the split information, the method further comprises:
a first predicted value calculating step: and all the participators jointly calculate the predicted value of the leaf node according to the sample data corresponding to the leaf node.
4. The method of claim 2, wherein the federated learning is longitudinal federated learning, and when a participating party is an integrating party, after determining that the node is a leaf node from the split information, the method further comprises:
a second predicted value calculating step: and the integration party calculates the predicted value of the leaf node according to the sample data corresponding to the leaf node.
5. The method of claim 1, wherein after obtaining sample data, the method further comprises:
a data preprocessing step: and preprocessing the sample data.
6. The method according to claim 5, wherein the federal learning is horizontal federal learning, and the data input format and the data output format of the data preprocessing step are agreed in advance by the participants.
7. A credit granting evaluation method, applied to multiple participants in a federal learning alliance, wherein the multiple participants perform modeling by using the federal learning model-based construction method as claimed in any one of claims 1 to 6 to obtain a target model, the method comprising:
the multiple participants respectively receive credit assessment requests;
acquiring sample data of a corresponding user according to the credit assessment request;
predicting corresponding sample data based on a self-party model to obtain a self-party credit assessment result;
and obtaining a target credit granting evaluation result based on the own credit granting evaluation results respectively corresponding to the multiple participants.
8. A model building device based on federal learning is characterized in that the model building device is applied to a plurality of participants in the federal learning alliance and comprises the following components:
the acquisition module is used for acquiring respective sample data and initialization models; wherein the initialization model comprises nodes;
the model construction module is used for carrying out model training on the initialization model by utilizing respective sample data to obtain a trained target model, and the target model comprises a self model after training of each participant;
in the model training process of each participant, the input data format and/or the output data format of the preset training step are consistent, and the preset training step refers to a step of data interaction among the multiple participants.
9. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any one of claims 1-6 or claim 7.
10. A non-transitory computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-6 or claim 7.
CN202210106216.0A 2022-01-28 2022-01-28 Model construction method based on federal learning, credit granting evaluation method and device Pending CN114429190A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210106216.0A CN114429190A (en) 2022-01-28 2022-01-28 Model construction method based on federal learning, credit granting evaluation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210106216.0A CN114429190A (en) 2022-01-28 2022-01-28 Model construction method based on federal learning, credit granting evaluation method and device

Publications (1)

Publication Number Publication Date
CN114429190A true CN114429190A (en) 2022-05-03

Family

ID=81312745

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210106216.0A Pending CN114429190A (en) 2022-01-28 2022-01-28 Model construction method based on federal learning, credit granting evaluation method and device

Country Status (1)

Country Link
CN (1) CN114429190A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116614273A (en) * 2023-05-23 2023-08-18 国网江苏省电力有限公司信息通信分公司 Federal learning data sharing model in peer-to-peer network based on CP-ABE and construction method thereof
CN116614273B (en) * 2023-05-23 2024-03-19 国网江苏省电力有限公司信息通信分公司 Federal learning data sharing system and model construction method in peer-to-peer network based on CP-ABE


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination