WO2022110721A1 - Joint risk assessment method and related equipment based on client classification and aggregation - Google Patents

Joint risk assessment method and related equipment based on client classification and aggregation

Info

Publication number
WO2022110721A1
WO2022110721A1 PCT/CN2021/096655 CN2021096655W WO2022110721A1 WO 2022110721 A1 WO2022110721 A1 WO 2022110721A1 CN 2021096655 W CN2021096655 W CN 2021096655W WO 2022110721 A1 WO2022110721 A1 WO 2022110721A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
model parameters
initial
final
center point
Prior art date
Application number
PCT/CN2021/096655
Other languages
English (en)
French (fr)
Inventor
李泽远
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022110721A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • This application relates to the field of artificial intelligence technology and the field of financial technology, in particular to a joint risk assessment method and related equipment based on client classification and aggregation.
  • The main business of banks is lending.
  • Banks can conduct credit checks on borrowers. Credit risk assessment is a method used to calculate the creditworthiness of individuals, businesses or organizations; it helps banks assess whether a loan applicant will default and with what probability.
  • Banks and financial institutions mainly evaluate credit risk levels based on loan amounts and related subjective factors. This approach is reactive rather than predictive, and its results are not ideal. More accurate quantitative prediction models therefore need to be developed, and machine learning models, as tools for extracting analysis and insight from large amounts of historical data, are believed to be well suited to risk assessment.
  • Federated learning is a brand-new distributed machine learning framework, designed to meet the requirements of multi-party user privacy protection, data security and government regulations while performing data collaboration and machine learning modeling, so as to solve the problem of "data silos".
  • However, federated learning techniques face statistical challenges in application.
  • The iterative optimization of machine learning and deep learning models relies on stochastic gradient descent (SGD); for the stochastic gradient obtained in SGD to be an unbiased estimate of the full gradient, the training data should satisfy the assumption of being independent and identically distributed (IID).
  • The aggregation algorithm FedAvg, which is widely used in federated learning, is also based on the IID assumption, and simply averages the parameters when aggregating them. Studies have shown that when highly skewed non-IID data is used for training with the existing FedAvg algorithm, the accuracy of the resulting neural network model drops significantly, and the drop can even exceed 50%.
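  • As a brief illustration of the IID requirement described above, the unbiasedness of the stochastic gradient, and the way it can fail on a single client, may be written as follows. The notation is an assumption introduced here and is not part of the original text: $F$ is the global objective over all $n$ samples, $\ell_i$ is the loss on sample $i$, $\mathcal{D}_k$ is the local data set of client $k$ with $n_k$ samples, and $F_k$ is the objective restricted to that client.

```latex
F(w) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(w), \qquad
\mathbb{E}_{i \sim \mathrm{Unif}\{1,\dots,n\}}\left[\nabla \ell_i(w)\right] = \nabla F(w)

F_k(w) = \frac{1}{n_k}\sum_{i \in \mathcal{D}_k} \ell_i(w), \qquad
\mathbb{E}_{i \sim \mathrm{Unif}(\mathcal{D}_k)}\left[\nabla \ell_i(w)\right] = \nabla F_k(w) \neq \nabla F(w) \ \text{in general}
```

  • When the local data sets are non-IID, a gradient computed on one client is an unbiased estimate of that client's own objective only, which is one way to read the accuracy loss reported for FedAvg on skewed data.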
  • the purpose of this application is to provide a joint risk assessment method and related equipment based on client classification and aggregation, aiming to solve the problem of low prediction accuracy of the existing risk assessment method applying federated learning.
  • an embodiment of the present application provides a joint risk assessment method based on client classification and aggregation, including:
  • the server side randomly selects s clients from a plurality of clients as initial cluster center points, and calculates the dataset centroid of the local dataset of each of the initial cluster center points;
  • the server clusters the clients according to the distance between the dataset centroid of each client's local dataset and the dataset centroid of each initial cluster center point, and iteratively updates the clustering result to determine s final cluster center points and the clients under each cluster;
  • the final cluster center point reads the initial model for credit risk assessment from the server side, and sends the read initial model to the client of the corresponding cluster;
  • the client trains the initial model based on the local data set to obtain initial model parameters, and sends the initial model parameters to the final cluster center point of the corresponding cluster;
  • the local data set includes basic information and credit information;
  • the final cluster center point aggregates the received initial model parameters of each of the clients under the corresponding cluster to obtain intermediate model parameters of the corresponding cluster;
  • Each of the final cluster center points uploads the intermediate model parameters of the corresponding cluster to the server side;
  • the server side aggregates the intermediate model parameters of each cluster to obtain final model parameters, and uses the final model parameters to update the initial model to obtain a global model for credit risk assessment.
  • an embodiment of the present application provides a joint risk assessment system based on client classification and aggregation, including a server and multiple clients; wherein the server includes an initial selection unit, a final selection unit and a model update unit; the multiple clients include s final cluster center points, and each of the final cluster center points includes a reading and issuing unit, an initial model parameter aggregation unit and an uploading unit; and each of the clients includes a model training unit;
  • the initial selection unit is used to randomly select s from a plurality of clients as the initial cluster center points, and calculate the data set centroid of the local data set of each of the initial cluster center points;
  • the final selection unit is used to cluster the clients according to the distance between the data set centroid of the local data set of each of the clients and the data set centroid of each initial cluster center point, and to iteratively update the clustering result to determine the s final cluster center points and the clients under each cluster;
  • the reading and issuing unit is used to read the initial model for credit risk assessment from the server side, and send the read initial model to the client of the corresponding cluster;
  • a model training unit for training the initial model based on a local data set to obtain initial model parameters, and sending the initial model parameters to the final cluster center point of the corresponding cluster;
  • the local data set includes basic information and credit information;
  • an initial model parameter aggregation unit configured to aggregate the received initial model parameters of each of the clients under the corresponding cluster to obtain intermediate model parameters of the corresponding cluster
  • an uploading unit used for uploading the intermediate model parameters of the corresponding clusters to the server side
  • a model updating unit configured to aggregate the intermediate model parameters of each cluster to obtain final model parameters, and use the final model parameters to update the initial model to obtain a global model for credit risk assessment.
  • an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
  • clustering the clients according to the distance between the dataset centroid of the local dataset of each client and the dataset centroid of the initial cluster center point, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster;
  • causing the final cluster center point to read the initial model for credit risk assessment from the server, and to deliver the read initial model to the clients of the corresponding cluster;
  • causing the client to train the initial model based on the local data set to obtain initial model parameters, and to send the initial model parameters to the final cluster center point of the corresponding cluster, the local data set including basic information and credit information;
  • causing the received initial model parameters of the clients under the corresponding cluster to be aggregated to obtain the intermediate model parameters of the corresponding cluster;
  • causing each of the final cluster center points to upload the intermediate model parameters of the corresponding cluster to the server side;
  • aggregating the intermediate model parameters of each cluster to obtain final model parameters, and updating the initial model by using the final model parameters to obtain a global model for credit risk assessment.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when executed by a processor, the computer program causes the processor to perform the following steps:
  • clustering the clients according to the distance between the dataset centroid of the local dataset of each client and the dataset centroid of the initial cluster center point, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster;
  • causing the final cluster center point to read the initial model for credit risk assessment from the server, and to deliver the read initial model to the clients of the corresponding cluster;
  • causing the client to train the initial model based on the local data set to obtain initial model parameters, and to send the initial model parameters to the final cluster center point of the corresponding cluster, the local data set including basic information and credit information;
  • causing the received initial model parameters of the clients under the corresponding cluster to be aggregated to obtain the intermediate model parameters of the corresponding cluster;
  • causing each of the final cluster center points to upload the intermediate model parameters of the corresponding cluster to the server side;
  • aggregating the intermediate model parameters of each cluster to obtain final model parameters, and updating the initial model by using the final model parameters to obtain a global model for credit risk assessment.
  • the embodiment of the present application provides a joint risk assessment method and related equipment based on client classification and aggregation.
  • after clustering, clients of the same type have relatively similar local data, which better satisfies the assumption of independent and identical distribution; federated learning within the same class of clients can therefore achieve better results, and the final model has higher prediction accuracy.
  • FIG. 1 is a schematic flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of the present application
  • FIG. 2 is a schematic sub-flow diagram of a joint risk assessment method based on client classification and aggregation provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of another sub-flow of a joint risk assessment method based on client classification and aggregation provided by an embodiment of the present application;
  • FIG. 4 is a schematic diagram of another sub-flow of a joint risk assessment method based on client classification and aggregation provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of another sub-flow of a joint risk assessment method based on client classification and aggregation provided by an embodiment of the present application;
  • FIG. 6 is a schematic block diagram of a joint risk assessment system based on client classification and aggregation provided by an embodiment of the present application
  • FIG. 7 is a schematic block diagram of subunits of a joint risk assessment system based on client classification and aggregation provided by an embodiment of the present application;
  • FIG. 8 is a schematic block diagram of another subunit of a joint risk assessment system based on client classification and aggregation provided by an embodiment of the present application;
  • FIG. 9 is a schematic block diagram of another subunit of a joint risk assessment system based on client classification and aggregation provided by an embodiment of the present application.
  • FIG. 10 is a schematic block diagram of another subunit of a joint risk assessment system based on client classification and aggregation provided by an embodiment of the present application;
  • FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of the present application, which includes:
  • the server randomly selects s from a plurality of clients as initial cluster center points, and calculates the data set centroid of the local data set of each of the initial cluster center points;
  • the clients are divided into s clusters whose local data sets are similar (where s is a value preset according to the training task and the number of clients participating in the training).
  • s clients are randomly selected from multiple clients as initial cluster center points, for example, s clients are selected from x clients as initial cluster center points.
  • the multiple clients mentioned here are all clients that need to participate in the training. These clients can be randomly selected by the server, or can be selected according to certain rules, such as the size of the data set and so on.
  • the server side can determine the initialization model scheme according to the training task and give the initialization parameters, and then according to the specified number of clients participating in this training (i.e., the aforementioned x clients).
  • Conventionally, the server delivers the initialized original model to a randomly selected fixed number of clients, that is, to the aforementioned x clients; in the embodiment of the present application, by contrast, the clients are first classified and aggregated, center points are selected, and the original model is sent to the selected center points.
  • the dataset centroid is calculated as follows: assuming the local data set contains N data points, all two-dimensional, with coordinates (x1, y1), (x2, y2), ..., (xN, yN), the centroid is obtained by averaging each dimension over the data points, specifically $(\bar{x}, \bar{y}) = \left(\frac{1}{N}\sum_{i=1}^{N} x_i,\ \frac{1}{N}\sum_{i=1}^{N} y_i\right)$. The dataset centroid is thus the average of the N data points.
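  • The centroid calculation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation; the function name `dataset_centroid` and the sample points are hypothetical, and the data set is assumed to be a non-empty list of numeric points of equal dimension.

```python
from typing import List, Sequence


def dataset_centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Return the per-dimension mean of a non-empty list of equal-length points."""
    if not points:
        raise ValueError("local data set must contain at least one point")
    n = len(points)
    dims = len(points[0])
    # Average each coordinate over all N data points.
    return [sum(p[d] for p in points) / n for d in range(dims)]


# Hypothetical two-dimensional local data set of one client.
local_data = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
centroid = dataset_centroid(local_data)
print(centroid)  # [3.0, 5.0]
```

  • The same function works unchanged for data with more than two dimensions, since it averages every coordinate independently.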
  • the server clusters the clients according to the distance between the data set centroid of the local data set of each client and the data set centroid of each initial cluster center point, and iteratively updates the clustering result to determine the s final cluster center points and the clients under each cluster;
  • initial clustering is performed according to the distance between the dataset centroid of each client's local dataset and the dataset centroid of each initial cluster center point, and the result is then iteratively updated to determine the s final cluster center points and the clients under each cluster, that is, the cluster to which each client belongs.
  • the step S102 includes:
  • the dataset centroid for each of the client's local datasets can be calculated in the same way as the dataset centroid of the initial cluster center point. Since the initial cluster center point is also a client, but its dataset centroid has been calculated in the previous steps, there is no need to repeat the calculation here.
  • each of the clients mentioned here refers to all clients other than the initial cluster center points; that is, there is no need to calculate the distance between an initial cluster center point and itself, or the distance between different initial cluster center points.
  • the distance can be calculated as the Euclidean distance. Taking two-dimensional data as an example, with $(x_i, y_i)$ the dataset centroid of client i and $(x_j, y_j)$ the dataset centroid of initial cluster center point j, the Euclidean distance from client i to initial cluster center point j is $d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$.
  • the above formula takes two-dimensional data as an example, and can be expanded according to the above formula for multi-dimensional data.
  • each client can be clustered.
  • the specific method is to compare the distances between a client and all initial cluster center points, select the initial cluster center point with the smallest distance, and place the client in that center point's cluster, because the smallest distance between the client and an initial cluster center point indicates that their data are the most similar.
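  • The distance calculation and nearest-center assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the client identifiers, the centroid values and the function names `euclidean` and `assign_to_nearest` are hypothetical, and ties are broken in favour of the center listed first.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(
    centroids: Dict[str, Sequence[float]], centers: List[str]
) -> Dict[str, List[str]]:
    """Place every non-center client in the cluster of its nearest center."""
    clusters = {c: [c] for c in centers}  # each center belongs to its own cluster
    for client, point in centroids.items():
        if client in clusters:
            continue  # no distance is computed for the center points themselves
        nearest = min(centers, key=lambda c: euclidean(point, centroids[c]))
        clusters[nearest].append(client)
    return clusters


# Hypothetical dataset centroids of five clients; "A" and "B" are the initial centers.
centroids = {
    "A": (0.0, 0.0),
    "B": (10.0, 10.0),
    "C": (1.0, 1.0),
    "D": (9.0, 8.0),
    "E": (3.0, 4.0),
}
clusters = assign_to_nearest(centroids, ["A", "B"])
print(clusters)  # {'A': ['A', 'C', 'E'], 'B': ['B', 'D']}
```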
  • the centroid point of each cluster is then calculated; the centroid point represents the average position of the data of the cluster. The distance between each client under the cluster and the centroid point is then calculated, so as to determine a new cluster center point, which serves as the intermediate cluster center point.
  • the step S202 includes:
  • the centroid point is selected as follows: first obtain the dataset centroids of the local data sets of all clients under the corresponding cluster, where "all clients" includes the initial cluster center point. Because the dataset centroid of every client and of every initial cluster center point was already calculated in the preceding steps, the calculated values can be reused directly. Then average the dataset centroids of all clients under the cluster, and take the average as the centroid point of the cluster.
  • For example, cluster A has three clients, A1, A2, and A3, whose local dataset centroids are A1(x1, y1), A2(x2, y2), and A3(x3, y3), respectively; the centroid point of the cluster is then $\left(\frac{x_1 + x_2 + x_3}{3},\ \frac{y_1 + y_2 + y_3}{3}\right)$.
  • the distance between a client and the centroid point is the distance between the client's dataset centroid and the centroid point.
  • the distance calculation method in this step may also be the Euclidean distance calculation method.
  • For example, if A1(x1, y1) has the smallest distance from the centroid point, A1 is the closest to the centroid point and can be selected as the new center point of cluster A; the new center point is used as the intermediate cluster center point.
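  • The reselection of a cluster center described above can be sketched as follows. The coordinates are hypothetical and chosen so that A1 ends up closest to the centroid point, mirroring the example in the text; the function names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid_point(points: Sequence[Sequence[float]]) -> List[float]:
    """Average the dataset centroids of all clients in one cluster."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def reselect_center(members: List[str], centroids: Dict[str, Sequence[float]]) -> str:
    """Return the member whose dataset centroid is closest to the centroid point."""
    target = centroid_point([centroids[m] for m in members])
    return min(members, key=lambda m: euclidean(centroids[m], target))


# Hypothetical dataset centroids for the three clients of cluster A.
centroids = {"A1": (2.0, 2.0), "A2": (1.0, 1.0), "A3": (6.0, 3.0)}
print(centroid_point(list(centroids.values())))  # [3.0, 2.0]
new_center = reselect_center(["A1", "A2", "A3"], centroids)
print(new_center)  # A1
```

  • Because the new center is always an actual client rather than the averaged point itself, it can later act as the node that distributes the model and collects parameters for its cluster.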
  • S203. Calculate the distance between the data set centroid of the local data set of each of the clients and the data set centroid of each intermediate cluster center point, cluster the clients again according to the calculated distances, and iteratively update the clustering result to determine the s final cluster center points and the clients under each cluster.
  • the distance from each client to each intermediate cluster center point can be recalculated in the same manner as in the previous step, and the clients are then re-clustered according to the calculated distances.
  • the new center points of each cluster can be reselected according to the above steps, and used as the intermediate cluster center points, and so on, iteratively updated, so as to determine the s final cluster center points and clients under each cluster.
  • the stopping condition of the iteration may be that the number of iterations reaches a preset value, or that the center points of each intermediate cluster do not change, and the final intermediate cluster center point can be used as the final cluster center point.
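  • The iterative procedure and its two stopping conditions can be put together in the following sketch. It is an illustration under stated assumptions rather than the method as claimed: the names `cluster_clients`, `max_iters` and `initial_centers` are hypothetical, an optional list of initial centers is accepted so that the example is reproducible (otherwise s clients are drawn at random), and ties are broken by list order.

```python
import math
import random
from typing import Dict, List, Optional, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points: Sequence[Point]) -> List[float]:
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def assign(centroids: Dict[str, Point], centers: List[str]) -> Dict[str, List[str]]:
    """Put each client in the cluster of its nearest center; centers stay in their own."""
    clusters = {c: [c] for c in centers}
    for client, point in centroids.items():
        if client not in clusters:
            nearest = min(centers, key=lambda c: euclidean(point, centroids[c]))
            clusters[nearest].append(client)
    return clusters


def closest_member(members: List[str], centroids: Dict[str, Point]) -> str:
    """Client whose dataset centroid is nearest to the cluster's centroid point."""
    target = mean_point([centroids[m] for m in members])
    return min(members, key=lambda m: euclidean(centroids[m], target))


def cluster_clients(
    centroids: Dict[str, Point],
    s: int,
    max_iters: int = 20,
    initial_centers: Optional[List[str]] = None,
    seed: int = 0,
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Iterate assignment and center reselection until the centers stop changing."""
    if initial_centers is None:
        initial_centers = random.Random(seed).sample(sorted(centroids), s)
    centers = sorted(initial_centers)
    for _ in range(max_iters):  # stopping condition 1: preset number of iterations
        clusters = assign(centroids, centers)
        new_centers = sorted(closest_member(m, centroids) for m in clusters.values())
        if new_centers == centers:  # stopping condition 2: centers unchanged
            break
        centers = new_centers
    return centers, assign(centroids, centers)


# Hypothetical dataset centroids forming two well-separated groups.
centroids = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0),
    "d": (10.0, 10.0), "e": (11.0, 10.0), "f": (10.0, 11.0),
}
final_centers, final_clusters = cluster_clients(centroids, s=2, initial_centers=["a", "b"])
print(final_centers)   # ['a', 'd']
print(final_clusters)  # {'a': ['a', 'b', 'c'], 'd': ['d', 'e', 'f']}
```

  • In this example both initial centers fall in the same group, and one round of reselection is enough to move the second center to the other group.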
  • the final cluster center point reads the initial model for credit risk assessment from the server side, and delivers the read initial model to the clients of the corresponding cluster;
  • each final cluster center point reads the initial model from the server side and then sends it to the clients of its cluster, instead of all clients reading the initial model from the server side; this makes it easier to aggregate the results of the corresponding cluster at the final cluster center point and improves the training effect.
  • the client trains the initial model based on a local data set to obtain initial model parameters, and sends the initial model parameters to the final cluster center point of the corresponding cluster;
  • the local data set includes basic information and credit information;
  • each final cluster center point can also participate in training, that is, each final cluster center point is also trained as a client, and the final cluster center point can obtain its own initial model parameters.
  • the local data set includes basic information and credit information.
  • for an individual, the basic information may be name, age, gender, marital status, etc.
  • for an institution, the basic information may be company name, number of companies, time of establishment, annual turnover, annual net profit, etc.
  • the credit information may be accounts, loans, overdue records, and the like.
  • the local dataset may also include other auxiliary information, such as a good or bad reputation score that can be used to judge an individual or institution.
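  • One possible way to represent a single record of such a local data set is sketched below. Every field name here is a hypothetical choice made for illustration; the source only lists the kinds of basic and credit information and does not specify a schema or an encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndividualRecord:
    """Hypothetical record combining basic information and credit information."""
    age: int               # basic information
    is_married: bool       # basic information (marital status, simplified)
    num_accounts: int      # credit information
    num_loans: int         # credit information
    num_overdue: int       # credit information (overdue records)

    def to_features(self) -> List[float]:
        """Encode the record as a numeric point, e.g. for centroid calculation."""
        return [
            float(self.age),
            float(self.is_married),
            float(self.num_accounts),
            float(self.num_loans),
            float(self.num_overdue),
        ]


record = IndividualRecord(age=35, is_married=True, num_accounts=2, num_loans=1, num_overdue=0)
print(record.to_features())  # [35.0, 1.0, 2.0, 1.0, 0.0]
```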
  • the final cluster center point aggregates the received initial model parameters of each of the clients under the corresponding cluster to obtain intermediate model parameters of the corresponding cluster;
  • each final cluster center point will aggregate the initial model parameters of all clients under the cluster (which may also include the final cluster center point itself) to obtain intermediate model parameters of the corresponding cluster.
  • the step S105 includes:
  • the final cluster center point collects the initial model parameters reported by all the clients under the corresponding cluster
  • the final cluster center point first collects the initial model parameters reported by all clients under the cluster (which may also include the final cluster center point itself), and then aggregates them using the FedAvg algorithm, so as to obtain intermediate model parameters that represent the cluster.
  • the step S402 includes:
  • the collected initial model parameters are aggregated using the following formula: $w_{t+1} = \sum_{k=1}^{M} \frac{n_k}{n} w_{t+1}^{k}$, where $w_{t+1}^{k}$ denotes the initial model parameters reported by the k-th client;
  • $n_k$ is the number of samples used by the k-th client for training;
  • $n$ is the total number of samples in the entire federated learning process;
  • $M$ is the number of clients under the corresponding cluster;
  • $w_{t+1}$ is the intermediate model parameter of the corresponding cluster.
  • each client has a weight given by the ratio of $n_k$ to $n$, so that the number of samples of the client is reflected in the model change.
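  • The sample-weighted aggregation at a final cluster center point can be sketched as follows. Model parameters are represented as flat lists of floats, which is an assumption made for brevity. The source defines n as the total number of samples in the entire federated learning process; the optional `n_total` argument (a hypothetical name) lets the caller supply that value, and when it is omitted the sketch normalises by the sum of the supplied sample counts so that the weights add up to 1.

```python
from typing import List, Optional, Sequence


def fedavg(
    params: Sequence[Sequence[float]],
    sample_counts: Sequence[int],
    n_total: Optional[int] = None,
) -> List[float]:
    """Weighted sum of parameter vectors with weights n_k / n."""
    if not params or len(params) != len(sample_counts):
        raise ValueError("need one sample count per parameter vector")
    # Assumption: default n is the number of samples among the aggregated clients.
    n = sum(sample_counts) if n_total is None else n_total
    dims = len(params[0])
    return [
        sum(n_k / n * w[d] for w, n_k in zip(params, sample_counts))
        for d in range(dims)
    ]


# Two hypothetical clients in one cluster, with 100 and 300 training samples.
client_params = [[1.0, 2.0], [3.0, 4.0]]
client_counts = [100, 300]
intermediate = fedavg(client_params, client_counts)
print(intermediate)  # [2.5, 3.5]
```

  • The client with three times as many samples pulls the result three times as strongly, which is the weighting effect described above.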
  • each of the final cluster center points uploads the intermediate model parameters of the corresponding cluster to the server;
  • the server side aggregates the intermediate model parameters of each cluster to obtain final model parameters, and uses the final model parameters to update the initial model to obtain a global model for credit risk assessment.
  • the server will aggregate the intermediate model parameters uploaded by all the final cluster center points, so as to obtain the final model parameters that can reflect the global, and use the final model parameters to update the initial model.
  • the step S107 includes:
  • the server collects all the intermediate model parameters reported by the final cluster center point
  • the server first needs to collect all the intermediate model parameters reported by the final cluster center points, and then also use the FedAvg algorithm to aggregate, so as to obtain the final model parameters that can represent the whole world.
  • the step S502 includes:
  • the collected intermediate model parameters are aggregated using the following formula: $w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} w_{t+1}^{k}$, where $w_{t+1}^{k}$ denotes the intermediate model parameters uploaded by the k-th final cluster center point;
  • $n_k$ is the total number of samples used for training by all clients under the cluster corresponding to the k-th final cluster center point;
  • $n$ is the total number of samples in the entire federated learning process;
  • $K$ is the number of final cluster center points;
  • $w_{t+1}$ is the final model parameter.
  • each cluster has a weight given by the ratio of $n_k$ to $n$. Here $n_k$ differs from $n_k$ in the previous embodiment: it is the total number of samples used for training by all clients under the cluster corresponding to the k-th final cluster center point, while $n$ is the same as in the previous embodiment. In this way the total sample size of each cluster is reflected in the model change.
  • the server uses the aggregated final model parameters to update the initial model, sends the updated global model to the selected final cluster center points again, and repeats the training process until the model meets the set accuracy requirement or the preset maximum number of iterations is reached.
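  • The two levels of aggregation, first at each final cluster center point and then at the server, can be sketched as follows. This is an illustration under one specific assumption that the source does not state: within a cluster the weights are normalised by the cluster's own sample total, and at the server by the overall total. Under that assumption the two-level result equals a single sample-weighted average over all clients. All names and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def weighted_average(params: Sequence[Sequence[float]], counts: Sequence[int]) -> List[float]:
    """Sample-weighted average; weights are counts divided by their sum."""
    n = sum(counts)
    return [
        sum(n_k / n * w[d] for w, n_k in zip(params, counts))
        for d in range(len(params[0]))
    ]


def two_level_aggregate(
    clusters: Dict[str, List[Tuple[Sequence[float], int]]]
) -> List[float]:
    """Aggregate per cluster first, then aggregate the cluster results at the server."""
    intermediate, totals = [], []
    for updates in clusters.values():
        params = [w for w, _ in updates]
        counts = [n_k for _, n_k in updates]
        intermediate.append(weighted_average(params, counts))  # cluster center step
        totals.append(sum(counts))                             # samples in the cluster
    return weighted_average(intermediate, totals)               # server step


# Hypothetical (parameters, sample count) pairs grouped by cluster.
clusters = {
    "c1": [([1.0, 2.0], 100), ([3.0, 4.0], 300)],
    "c2": [([5.0, 6.0], 600)],
}
final_params = two_level_aggregate(clusters)

# For comparison: one flat weighted average over all three clients.
flat = weighted_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [100, 300, 600])
print(all(math.isclose(a, b) for a, b in zip(final_params, flat)))  # True
```

  • In a full training loop, the server would load `final_params` into the model, send the updated model back to the final cluster center points, and repeat until the accuracy requirement or the maximum number of rounds is reached.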
  • the application scenario of the embodiment of the present application may be: multiple banks need to jointly evaluate customer credit risk, with the aim of using machine learning to construct a global model for credit risk evaluation, so that a credit risk assessment grade is given automatically after the relevant characteristic information of an individual or institution is input.
  • multiple banks will select an organizer (ie server, or server side), responsible for the establishment, initialization and subsequent parameter aggregation of the original model, and other participating banks are called clients.
  • For the training of the bank risk assessment model, the server initializes the initial model (that is, gives the initial parameters of the model, such as the specific machine learning model selected, the model structure, and hyperparameters such as the number of training rounds) and subsequently aggregates the parameters; the other participants, which contribute their own local data to the training, are clients.
  • the embodiment of this application uses federated learning to carry out joint risk control while ensuring data privacy among multiple subordinate institutions of a bank, effectively solving the "data island" problem in which the subordinate institutions of the bank cannot share data for model training due to financial supervision, data privacy and other issues.
  • the embodiment of the present application optimizes the training process of federated learning and the aggregation method of model parameters, so that nodes with similar local data distributions are clustered together for joint training and iterative updating, and finally the model results of the different clusters are aggregated.
  • This design mitigates the accuracy drop that the heterogeneity of each participant's local data (large category deviation, non-IID, etc.) causes in model training, and helps to improve the quality of risk control models trained with federated learning.
  • FIG. 6 is a schematic block diagram of a joint risk assessment system based on client classification and aggregation provided by an embodiment of the present application.
  • the joint risk assessment system 600 based on client classification and aggregation includes a server 610 and a plurality of clients 620; wherein the server 610 includes an initial selection unit 601, a final selection unit 602 and a model update unit 607; the multiple clients 620 include s final cluster center points, and each of the final cluster center points includes a reading and issuing unit 603, an initial model parameter aggregation unit 605 and an uploading unit 606; and each of the clients includes a model training unit 604;
  • the initial selection unit 601 is used to randomly select s from a plurality of clients as the initial cluster center points, and calculate the data set centroid of the local data set of each of the initial cluster center points;
  • the final selection unit 602 is configured to cluster the clients according to the distance between the data set centroid of the local data set of each of the clients and the data set centroid of each initial cluster center point, and to iteratively update the clustering result to determine the s final cluster center points and the clients under each cluster;
  • the reading and issuing unit 603 is used for reading the initial model for credit risk assessment from the server side, and issuing the read initial model to the client of the corresponding cluster;
  • a model training unit 604 configured to train the initial model based on a local data set to obtain initial model parameters, and send the initial model parameters to the final cluster center point of the corresponding cluster;
  • the local data set includes basic information and credit information;
  • the initial model parameter aggregation unit 605 is configured to aggregate the received initial model parameters of each of the clients under the corresponding cluster to obtain intermediate model parameters of the corresponding cluster;
  • An uploading unit 606, configured to upload the intermediate model parameters of the corresponding clusters to the server side;
  • the model updating unit 607 is configured to aggregate the intermediate model parameters of each cluster to obtain final model parameters, and use the final model parameters to update the initial model to obtain a global model for credit risk assessment.
  • the final selection unit 602 includes:
  • the dividing unit 701 is configured to calculate the distance between the data set centroid of the local data set of each of the clients and the data set centroid of each of the initial cluster center points, and, according to the calculated distances, to divide the clients into the clusters corresponding to the initial cluster center points;
  • the reselection unit 702 is used to calculate, for each initial cluster center point, the centroid point of the corresponding cluster, to calculate the distance between every client under the corresponding cluster and the centroid point, and, according to the calculated distances, to reselect the cluster center point of the corresponding cluster as the intermediate cluster center point;
  • the iterative unit 703 is configured to calculate the distance between the data set centroid of the local data set of each of the clients and the data set centroid of each of the intermediate cluster center points, to cluster the clients again according to the calculated distances, and to iteratively update the clustering result to determine s final cluster center points and the clients under each cluster.
  • the reselection unit 702 includes:
  • the centroid point calculation unit 801 is configured to, for each initial cluster center point, calculate the data set centroids of the local data sets of all clients under the corresponding cluster, and to average these data set centroids to obtain the centroid point of the cluster;
  • the intermediate cluster center point selection unit 802 is configured to calculate the distances between all clients under the corresponding cluster and the centroid point, and use the client corresponding to the smallest distance as the intermediate cluster center point.
  • the initial model parameter aggregation unit 605 includes:
  • the first collection unit 901 is used for the final cluster center point to collect the initial model parameters reported by all the clients under the corresponding cluster;
  • the first aggregation unit 902 is configured to use the FedAvg algorithm to aggregate the collected initial model parameters to obtain intermediate model parameters corresponding to the clusters.
  • the first aggregation unit 902 includes:
  • the first aggregation subunit is configured to aggregate the collected initial model parameters using the following formula:
  • $$w_{t+1} = \sum_{k=1}^{M} \frac{n_k}{n}\, w_{t+1}^{k}$$
  • where $n_k$ is the number of samples used for training by the k-th client, $n$ is the total number of samples in the entire federated learning process, $M$ is the number of clients under the corresponding cluster, $w_{t+1}^{k}$ is the initial model parameters obtained by the k-th client in the current round of training, and $w_{t+1}$ is the intermediate model parameters of the corresponding cluster.
  • the model updating unit 607 includes:
  • the second collection unit 1001 is configured to collect the intermediate model parameters reported by all the final cluster center points;
  • the second aggregation unit 1002 is configured to use the FedAvg algorithm to aggregate the collected intermediate model parameters to obtain final model parameters;
  • the updating unit 1003 is configured to update the initial model by using the final model parameters to obtain a global model.
  • the second aggregation unit 1002 includes:
  • the second aggregation subunit is configured to aggregate the collected intermediate model parameters using the following formula:
  • $$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$$
  • where $n_k$ is the total number of samples used for training by all clients under the cluster of the k-th final cluster center point, $n$ is the total number of samples in the entire federated learning process, $K$ is the number of final cluster center points, $w_{t+1}^{k}$ is the intermediate model parameters obtained by the k-th final cluster center point in the current round of training, and $w_{t+1}$ is the final model parameters.
  • with the system provided by the embodiments of the present application, clients of the same type after clustering have relatively similar local data and satisfy the independent and identically distributed assumption better than before clustering; federated learning performed among clients of the same type therefore achieves a better training effect, and the prediction accuracy of the final model is higher.
  • FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device 1100 is a server, and the server may be an independent server or a server cluster composed of multiple servers.
  • the computer device 1100 includes a processor 1102 , a memory and a network interface 1105 connected through a system bus 1101 , wherein the memory may include a non-volatile storage medium 1103 and an internal memory 1104 .
  • the nonvolatile storage medium 1103 can store an operating system 11031 and a computer program 11032 .
  • the computer program 11032 when executed, can cause the processor 1102 to perform a joint risk assessment method based on client classification aggregation.
  • the processor 1102 is used to provide computing and control capabilities to support the operation of the entire computer device 1100 .
  • the internal memory 1104 provides an environment for running the computer program 11032 in the non-volatile storage medium 1103.
  • when executed by the processor 1102, the computer program 11032 causes the processor 1102 to perform the joint risk assessment method based on client classification and aggregation.
  • the network interface 1105 is used for network communication, such as providing transmission of data information.
  • those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of the part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 1100 to which the solution is applied;
  • a specific computer device 1100 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • the processor 1102 is configured to run the computer program 11032 stored in the memory, so as to realize the following functions: randomly selecting s clients from a plurality of clients as initial cluster center points, and calculating the data set centroid of the local data set of each initial cluster center point; clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read the initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side; and aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
  • in the computer device 1100 provided in the embodiments of the present application, the processor 1102 is configured to run the computer program 11032 stored in the memory, so as to implement the steps included in each embodiment of the above-mentioned joint risk assessment method based on client classification and aggregation.
  • a computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, the following steps are implemented: randomly selecting s clients from a plurality of clients as initial cluster center points, and calculating the data set centroid of the local data set of each initial cluster center point; clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read the initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side; and aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Finance (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

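A joint risk assessment method based on client classification and aggregation, and related devices, relating to the field of artificial intelligence and applicable to bank risk assessment systems. The method includes: a server side clusters the clients and determines s final cluster center points; each final cluster center point reads an initial model from the server side and issues it to the clients of its cluster; each client trains the initial model on its local data set to obtain initial model parameters and sends them to the final cluster center point; each final cluster center point aggregates the initial model parameters of its clients to obtain intermediate model parameters and uploads them; the server side aggregates the intermediate model parameters of all clusters to obtain final model parameters and updates the model, yielding a global model for credit risk assessment. Because federated learning is performed among clients of the same class, the training effect is better and the prediction accuracy of the final model is higher.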

Description

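Joint risk assessment method based on client classification and aggregation, and related devices
This application claims priority to Chinese patent application No. 202011327614.2, filed with the China Patent Office on November 24, 2020 and entitled "Joint risk assessment method based on client classification and aggregation, and related devices", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the fields of artificial intelligence and financial technology, and in particular to a joint risk assessment method based on client classification and aggregation, and related devices.
Background
The main business of a bank is lending. To reduce the credit risk of loans, a bank can run credit checks on borrowers, and credit risk assessment is a method for calculating the creditworthiness of an individual, enterprise, or organization. It helps the bank assess whether a loan applicant will default and with what probability. In traditional approaches, banks and financial institutions assess credit risk levels mainly from loan amounts and related subjective factors; such approaches are reactive rather than predictive, and the resulting models perform poorly. More accurate quantitative prediction models are therefore needed, and machine learning models, as tools for extracting analysis and insight from large volumes of historical data, are considered well suited to risk assessment.
However, machine learning models such as neural networks require a training center to train the model on a large amount of data; the more complete the variety of features in the data, the better the trained model and the more accurate its credit evaluation. A common situation is that a person with extremely high loan risk has very poor credit at bank A and cannot pass its loan review, yet has a good credit record at bank B. Because the banks do not share data, this person obtains a loan from bank B, which then suffers a loss. To avoid risk misjudgment caused by the incomplete view of any single bank's data, it is necessary to use the data of several banks jointly. For banks, however, financial regulation and the protection of sensitive user data mean that the data of a local branch cannot leave its premises; requiring banks to share data with one another, or to send all data to a central server for machine learning training, is neither possible nor realistic. As a result, each bank becomes a "data island".
Federated learning is a new distributed machine learning framework that aims to enable data collaboration and machine learning modeling while satisfying the requirements of multi-party user privacy protection, data security, and government regulation, thereby solving the "data island" problem.
However, federated learning faces statistical challenges in practice. The iterative optimization of machine learning and deep learning models relies on stochastic gradient descent (SGD), but to ensure that the stochastic gradient taken in SGD is an unbiased estimate of the full gradient, the training data should satisfy the independent and identically distributed (IID) assumption. FedAvg, the aggregation and optimization algorithm now widely used in federated learning, is also based on the IID assumption and simply averages parameters during aggregation. Studies have shown that when training uses highly skewed non-IID data with the existing iterative optimization algorithm FedAvg, the accuracy of the resulting neural network model drops significantly, in some cases by more than 50%.
The inventors realized that, since it is unrealistic in practice to require the local data of every participant to be independent and identically distributed, it is necessary to improve the training process and the parameter aggregation method of federated learning so as to minimize the impact of data heterogeneity among the participating clients on the trained model.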
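Summary
The purpose of this application is to provide a joint risk assessment method based on client classification and aggregation, and related devices, in order to solve the problem that existing risk assessment methods applying federated learning have low prediction accuracy.
In a first aspect, an embodiment of this application provides a joint risk assessment method based on client classification and aggregation, including:
the server side randomly selects s clients from a plurality of clients as initial cluster center points, and computes the data set centroid of the local data set of each initial cluster center point;
the server side clusters the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updates the clustering result to determine s final cluster center points and the clients under each cluster;
each final cluster center point reads an initial model for credit risk assessment from the server side and issues the read initial model to the clients of its cluster;
each client trains the initial model on its local data set to obtain initial model parameters and sends them to the final cluster center point of its cluster; the local data set includes basic information and credit information;
each final cluster center point aggregates the initial model parameters received from the clients under its cluster to obtain the intermediate model parameters of that cluster;
each final cluster center point uploads the intermediate model parameters of its cluster to the server side;
the server side aggregates the intermediate model parameters of all clusters to obtain final model parameters, and updates the initial model with the final model parameters to obtain a global model for credit risk assessment.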
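In a second aspect, an embodiment of this application provides a joint risk assessment system based on client classification and aggregation, including a server side and a plurality of clients; the server side includes an initial selection unit, a final selection unit, and a model updating unit; the plurality of clients include s final cluster center points, each final cluster center point including a reading and issuing unit, an initial model parameter aggregation unit, and an uploading unit; and each client includes a model training unit;
the initial selection unit is configured to randomly select s clients from a plurality of clients as initial cluster center points, and to compute the data set centroid of the local data set of each initial cluster center point;
the final selection unit is configured to cluster the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and to iteratively update the clustering result to determine s final cluster center points and the clients under each cluster;
the reading and issuing unit is configured to read an initial model for credit risk assessment from the server side and to issue the read initial model to the clients of the corresponding cluster;
the model training unit is configured to train the initial model on the local data set to obtain initial model parameters and to send them to the final cluster center point of the corresponding cluster; the local data set includes basic information and credit information;
the initial model parameter aggregation unit is configured to aggregate the initial model parameters received from the clients under the corresponding cluster to obtain the intermediate model parameters of that cluster;
the uploading unit is configured to upload the intermediate model parameters of the corresponding cluster to the server side;
the model updating unit is configured to aggregate the intermediate model parameters of all clusters to obtain final model parameters, and to update the initial model with the final model parameters to obtain a global model for credit risk assessment.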
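In a third aspect, an embodiment of this application provides a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
randomly selecting s clients from a plurality of clients as initial cluster center points, and computing the data set centroid of the local data set of each initial cluster center point;
clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read an initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; and causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side;
aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
randomly selecting s clients from a plurality of clients as initial cluster center points, and computing the data set centroid of the local data set of each initial cluster center point;
clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read an initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; and causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side;
aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
The embodiments of this application provide a joint risk assessment method based on client classification and aggregation, and related devices. With the method provided, clients of the same class after clustering have relatively similar local data and satisfy the independent and identically distributed assumption better than before clustering; federated learning performed among clients of the same class therefore trains more effectively, and the final model predicts more accurately.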
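Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of this application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of this application;
FIG. 2 is a schematic sub-flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of this application;
FIG. 3 is another schematic sub-flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of this application;
FIG. 4 is another schematic sub-flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of this application;
FIG. 5 is another schematic sub-flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of this application;
FIG. 6 is a schematic block diagram of a joint risk assessment system based on client classification and aggregation provided by an embodiment of this application;
FIG. 7 is a schematic block diagram of subunits of a joint risk assessment system based on client classification and aggregation provided by an embodiment of this application;
FIG. 8 is a schematic block diagram of other subunits of a joint risk assessment system based on client classification and aggregation provided by an embodiment of this application;
FIG. 9 is a schematic block diagram of other subunits of a joint risk assessment system based on client classification and aggregation provided by an embodiment of this application;
FIG. 10 is a schematic block diagram of other subunits of a joint risk assessment system based on client classification and aggregation provided by an embodiment of this application;
FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of this application.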
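Detailed Description
Referring to FIG. 1, which is a schematic flowchart of a joint risk assessment method based on client classification and aggregation provided by an embodiment of this application, the method includes:
S101. The server side randomly selects s clients from a plurality of clients as initial cluster center points, and computes the data set centroid of the local data set of each initial cluster center point;
In this embodiment, the clients are divided into s clusters with similar local data sets according to the features and distribution of each client's local data (where s is a value preset according to the training task and the number of clients participating in training).
Specifically, s clients are first randomly selected from a plurality of clients as initial cluster center points, for example s clients out of x clients. The plurality of clients mentioned here are all clients that need to participate in training; they may be selected by the server side at random or according to some rule, for example by data set size. The server side may determine the model initialization scheme and the initialization parameters according to the training task, and then proceeds with the specified number of clients participating in this round of training (the x clients mentioned above). In the prior art, the server side issues the initialized original model to a fixed number of randomly selected clients, that is, to the x clients; in the embodiments of this application, the clients are first classified and aggregated and center points are selected, and the original model is issued to the selected center points.
After the s initial cluster center points are selected, the data set centroid of the local data set of each initial cluster center point is computed.
The data set centroid is computed as follows:
Suppose the local data set of an initial cluster center point has N data points, all two-dimensional, with coordinates (x1, y1), (x2, y2), …. The mean of these two-dimensional data points gives the data set centroid, namely
$$(\bar{x},\ \bar{y}) = \left(\frac{1}{N}\sum_{i=1}^{N} x_i,\ \frac{1}{N}\sum_{i=1}^{N} y_i\right)$$ (Figure PCTCN2021096655-appb-000001)
The data set centroid is thus simply the mean of the N data points.
For data of other dimensions, the data set centroid can be computed in the same way.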
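The centroid computation described above can be sketched as follows. This is a minimal illustration in Python, assuming each record of a local data set has already been encoded as a numeric vector of fixed length; the function name `dataset_centroid` and the sample values are hypothetical and do not come from the application.

```python
from typing import List, Sequence


def dataset_centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Return the coordinate-wise mean of a non-empty list of equal-length points."""
    if not points:
        raise ValueError("the local data set must contain at least one point")
    dim = len(points[0])
    if any(len(p) != dim for p in points):
        raise ValueError("all data points must have the same dimension")
    n = len(points)
    # Average each coordinate separately; works for any dimension, not only 2-D.
    return [sum(p[d] for p in points) / n for d in range(dim)]


# Hypothetical two-dimensional local data set with N = 3 points.
local_data = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
centroid = dataset_centroid(local_data)
print(centroid)
```

Because the function averages coordinate by coordinate, the same call covers the higher-dimensional case mentioned in the text.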
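S102. The server side clusters the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updates the clustering result to determine s final cluster center points and the clients under each cluster;
In this step, initial clustering is performed according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and the result is then updated iteratively to determine the s final cluster center points as well as the clients under each cluster, that is, the cluster to which each client belongs.
In an embodiment, as shown in FIG. 2, step S102 includes:
S201. Compute the distance between the data set centroid of each client's local data set and the data set centroid of each initial cluster center point, and, according to the computed distances, assign each client to the cluster of the corresponding initial cluster center point;
The data set centroid of each client's local data set can be computed in the same way as that of an initial cluster center point. An initial cluster center point is itself a client, but its data set centroid was already computed in the preceding step and need not be recomputed here.
Then the distance between the data set centroid of each client (the data set centroid referred to herein is always that of the local data set) and the data set centroid of each initial cluster center point is computed; this is the distance between the client and the initial cluster center point. Note that "each client" here means every client other than the initial cluster center points; the distance from an initial cluster center point to itself and the distances between different initial cluster center points need not be computed.
The distance may be the Euclidean distance. Taking two-dimensional data as an example, the Euclidean distance from client i to initial cluster center point j is:
$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$$ (Figure PCTCN2021096655-appb-000002), where $(x_i, y_i)$ and $(x_j, y_j)$ are the data set centroids of client i and initial cluster center point j.
The formula above uses two-dimensional data as an example; for higher-dimensional data it is extended accordingly.
The clients are clustered according to the computed distances. Specifically, for each client, its distances to all initial cluster center points are compared, the initial cluster center point with the smallest distance is selected, and the client is assigned to that center point's cluster, because the smallest distance means the client's data is most similar to that of the center point.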
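As an illustrative sketch of the assignment in step S201, the following Python code assigns each non-center client to the nearest initial cluster center point by Euclidean distance. The names `euclidean` and `assign_to_centers`, the client identifiers, and the centroid values are hypothetical; the sketch assumes the data set centroids have already been computed and, as one possible convention, records each center point as the first member of its own cluster.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two centroids of the same dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_centers(
    centroids: Dict[str, Sequence[float]], centers: List[str]
) -> Dict[str, List[str]]:
    """Map each center to the list of clients in its cluster (center first)."""
    clusters = {c: [c] for c in centers}
    for client, point in centroids.items():
        if client in clusters:
            continue  # centers are not compared with themselves or each other
        nearest = min(centers, key=lambda c: euclidean(point, centroids[c]))
        clusters[nearest].append(client)
    return clusters


# Hypothetical data set centroids of four clients; A and B are the initial centers.
centroids = {"A": (0.0, 0.0), "B": (10.0, 10.0), "C": (1.0, 1.0), "D": (9.0, 8.0)}
clusters = assign_to_centers(centroids, ["A", "B"])
print(clusters)
```

Ties are broken in favor of the center listed first, which is a choice of this sketch rather than something the text specifies.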
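S202. For each initial cluster center point, compute the centroid point of the corresponding cluster, compute the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselect the cluster center point of the cluster as an intermediate cluster center point;
Because the s initial cluster center points were selected at random, they are not necessarily the true cluster centers, so the cluster center points need to be updated. Specifically, the centroid point of each cluster, which represents the average position of the cluster's data, is computed; the distances between the clients under the cluster and the centroid point are then computed to determine a new cluster center point, which serves as the intermediate cluster center point.
In an embodiment, as shown in FIG. 3, step S202 includes:
S301. For each initial cluster center point, compute the data set centroids of the local data sets of all clients under the corresponding cluster, and average these data set centroids to obtain the centroid point of the cluster;
In this step, the centroid point is obtained as follows. The data set centroids of the local data sets of all clients under the cluster, including the initial cluster center point, are needed; since the data set centroid of every client and of every initial cluster center point was computed in the preceding steps, the computed values can be reused directly. The data set centroids of all these clients are then averaged, and the average is taken as the centroid point of the cluster.
For example, if cluster A has three clients A1, A2, and A3 whose data set centroids are A1(x1, y1), A2(x2, y2), and A3(x3, y3), the centroid point of the cluster is
$$\left(\frac{x_1 + x_2 + x_3}{3},\ \frac{y_1 + y_2 + y_3}{3}\right)$$ (Figure PCTCN2021096655-appb-000003)
S302. Compute the distances between all clients under the cluster and the centroid point, and take the client with the smallest distance as the intermediate cluster center point.
The distance between a client and the centroid point is the distance between the client's data set centroid and the centroid point. The Euclidean distance may again be used in this step.
Taking cluster A above as an example, suppose that among the computed distances A1(x1, y1) has the smallest distance to the centroid point, meaning A1 is closest to it; A1 can then be selected as the new center point of cluster A, and this new center point serves as the intermediate cluster center point.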
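The reselection in steps S301 and S302 can be sketched as below: the centroid point is the mean of the members' data set centroids, and the member nearest to it becomes the intermediate cluster center point. The function name `reselect_center` and the coordinates of A1, A2, and A3 are hypothetical values chosen for illustration; with these particular values the nearest member happens to be A3 rather than A1.

```python
import math
from typing import Dict, List, Sequence, Tuple


def reselect_center(
    members: List[str], centroids: Dict[str, Sequence[float]]
) -> Tuple[str, List[float]]:
    """Return (new center, centroid point) for one cluster."""
    dim = len(centroids[members[0]])
    # Centroid point: average of the data set centroids of all members.
    mean = [sum(centroids[m][d] for m in members) / len(members) for d in range(dim)]

    def dist_to_mean(m: str) -> float:
        return math.sqrt(sum((centroids[m][d] - mean[d]) ** 2 for d in range(dim)))

    # The member closest to the centroid point becomes the intermediate center.
    return min(members, key=dist_to_mean), mean


# Hypothetical cluster A with three clients.
centroids = {"A1": (0.0, 0.0), "A2": (4.0, 0.0), "A3": (2.0, 3.0)}
new_center, mean = reselect_center(["A1", "A2", "A3"], centroids)
print(new_center, mean)
```

Choosing an actual client rather than the mean itself keeps the center point a real participant that can later collect and aggregate model parameters.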
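S203. Compute the distance between the data set centroid of each client's local data set and the data set centroid of each intermediate cluster center point, cluster the clients again according to the computed distances, and iteratively update the clustering result to determine s final cluster center points and the clients under each cluster.
In this step, once the cluster center points have been redetermined (as the intermediate cluster center points), the distance from each client to each intermediate cluster center point is recomputed in the same way as in the preceding steps, and the clients are clustered again according to the computed distances.
For the new clustering result, a new center point can again be selected for each cluster as described above and used as the intermediate cluster center point, and so on; through this iterative updating, the s final cluster center points and the clients under each cluster are determined.
The iteration may stop when the number of iterations reaches a preset value or when the intermediate cluster center points no longer change; the last intermediate cluster center points are then taken as the final cluster center points.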
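Putting steps S201 to S203 together, the iteration with both stopping conditions might look like the following sketch. All names (`cluster_clients`, `initial_centers`, `max_iters`, `seed`) and the sample centroids are assumptions made for illustration; the optional `initial_centers` argument exists only so that the example is reproducible, whereas the text selects the initial center points at random.

```python
import math
import random
from typing import Dict, List, Optional, Sequence

Point = Sequence[float]


def _distance(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _assign(centroids: Dict[str, Point], centers: List[str]) -> Dict[str, List[str]]:
    """Put every non-center client into the cluster of its nearest center."""
    clusters = {c: [c] for c in centers}
    for client, point in centroids.items():
        if client not in clusters:
            nearest = min(centers, key=lambda c: _distance(point, centroids[c]))
            clusters[nearest].append(client)
    return clusters


def _reselect(members: List[str], centroids: Dict[str, Point]) -> str:
    """Return the member closest to the mean of the members' centroids."""
    dim = len(centroids[members[0]])
    mean = [sum(centroids[m][d] for m in members) / len(members) for d in range(dim)]
    return min(members, key=lambda m: _distance(centroids[m], mean))


def cluster_clients(
    centroids: Dict[str, Point],
    s: int,
    initial_centers: Optional[List[str]] = None,
    max_iters: int = 20,
    seed: Optional[int] = None,
) -> Dict[str, List[str]]:
    """Return {final center: members}; stops on stable centers or max_iters."""
    if initial_centers:
        centers = list(initial_centers)
    else:
        centers = random.Random(seed).sample(sorted(centroids), s)
    for _ in range(max_iters):
        clusters = _assign(centroids, centers)
        new_centers = [_reselect(members, centroids) for members in clusters.values()]
        if new_centers == centers:  # center points no longer change
            return clusters
        centers = new_centers
    return _assign(centroids, centers)  # iteration cap reached


centroids = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0),
    "d": (10.0, 10.0), "e": (11.0, 10.0), "f": (10.0, 11.0),
}
result = cluster_clients(centroids, s=2, initial_centers=["a", "b"])
print(result)
```

As with other center-based clustering procedures, the outcome depends on the initial center points, so a different random start may converge to a different grouping.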
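S103. Each final cluster center point reads an initial model for credit risk assessment from the server side and issues the read initial model to the clients of its cluster;
In this step, each final cluster center point reads the initial model from the server side and issues it to the clients of its cluster, instead of all clients reading the initial model from the server side. This makes it convenient for the final cluster center point to aggregate the results of its cluster and improves the training effect.
S104. Each client trains the initial model on its local data set to obtain initial model parameters and sends them to the final cluster center point of its cluster; the local data set includes basic information and credit information;
After receiving the initial model, the clients under each cluster train it on their own local data sets to obtain the corresponding initial model parameters, and report these to the final cluster center point of their cluster so that the final cluster center point can aggregate the data of clients of the same type. In this embodiment, the final cluster center points may also take part in training: each final cluster center point trains as a client and obtains its own initial model parameters.
In the embodiments of this application, the local data set includes basic information and credit information. For an individual, the basic information may be name, age, gender, marital status, and so on; for an institution, it may be company name, number of employees, date of establishment, annual turnover, annual net profit, and so on. The credit information may be accounts, loans, overdue records, and so on. The local data set may also include other auxiliary information, for example a good-credit or poor-credit score that can be used to judge an individual or institution.
S105. Each final cluster center point aggregates the initial model parameters received from the clients under its cluster to obtain the intermediate model parameters of that cluster;
In this step, each final cluster center point aggregates the initial model parameters of all clients under its cluster (which may include the final cluster center point itself) to obtain the intermediate model parameters of the cluster.
In an embodiment, as shown in FIG. 4, step S105 includes:
S401. The final cluster center point collects the initial model parameters reported by all clients under its cluster;
S402. The collected initial model parameters are aggregated using the FedAvg algorithm to obtain the intermediate model parameters of the cluster.
In this embodiment, the final cluster center point first collects the initial model parameters reported by all clients under its cluster (which may include the final cluster center point itself) and then aggregates them with the FedAvg algorithm to obtain intermediate model parameters that represent the cluster.
In an embodiment, step S402 includes:
aggregating the collected initial model parameters using the following formula:
$$w_{t+1} = \sum_{k=1}^{M} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-000004)
where $n_k$ is the number of samples used for training by the k-th client, $n$ is the total number of samples in the entire federated learning process, $M$ is the number of clients under the cluster, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-000005) is the initial model parameters obtained by the k-th client in this round of training, and $w_{t+1}$ is the intermediate model parameters of the cluster.
This embodiment aggregates all the initial model parameters. In the aggregation each client has a weight, given by the ratio of $n_k$ to $n$, so that the client's sample count is reflected in the model update.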
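The weighted aggregation in step S402 can be illustrated with the short sketch below, which treats model parameters as flat lists of floats. The function name `fedavg` and the sample values are hypothetical. The text defines n as the total number of samples in the entire federated learning process; as an assumption of this sketch, when `n` is not supplied it defaults to the sum of the sample counts of the clients being aggregated, so that the weights sum to one.

```python
from typing import List, Optional


def fedavg(
    client_params: List[List[float]],
    sample_counts: List[int],
    n: Optional[int] = None,
) -> List[float]:
    """Weighted average of parameter vectors with weights n_k / n."""
    if len(client_params) != len(sample_counts) or not client_params:
        raise ValueError("need one sample count per non-empty parameter list")
    if n is None:
        n = sum(sample_counts)  # assumption: normalize over the given clients
    dim = len(client_params[0])
    return [
        sum(n_k / n * w[i] for w, n_k in zip(client_params, sample_counts))
        for i in range(dim)
    ]


# Two hypothetical clients in one cluster, trained on 1 and 3 samples.
intermediate = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(intermediate)
```

Passing an explicit `n` reproduces the weighting exactly as written in the formula, at the cost that the weights within a single cluster no longer sum to one.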
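S106. Each final cluster center point uploads the intermediate model parameters of its cluster to the server side;
In this step, after aggregating the initial model parameters of all clients under its cluster, each final cluster center point uploads the resulting intermediate model parameters to the server side.
S107. The server side aggregates the intermediate model parameters of all clusters to obtain final model parameters, and updates the initial model with the final model parameters to obtain a global model for credit risk assessment.
In this step, the server side aggregates the intermediate model parameters uploaded by all final cluster center points to obtain final model parameters that reflect the global picture, and updates the initial model with them.
In an embodiment, as shown in FIG. 5, step S107 includes:
S501. The server side collects the intermediate model parameters reported by all final cluster center points;
S502. The collected intermediate model parameters are aggregated using the FedAvg algorithm to obtain the final model parameters;
S503. The initial model is updated with the final model parameters to obtain the global model.
In this embodiment, the server side first collects the intermediate model parameters reported by all final cluster center points and then likewise aggregates them with the FedAvg algorithm to obtain final model parameters that represent the whole.
In an embodiment, step S502 includes:
aggregating the collected intermediate model parameters using the following formula:
$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-000006)
where $n_k$ is the total number of samples used for training by all clients under the cluster of the k-th final cluster center point, $n$ is the total number of samples in the entire federated learning process, $K$ is the number of final cluster center points, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-000007) is the intermediate model parameters obtained by the k-th final cluster center point in this round of training, and $w_{t+1}$ is the final model parameters.
This embodiment aggregates all the intermediate model parameters. In the aggregation each cluster has a weight, given by the ratio of $n_k$ to $n$, so that the total sample count of the cluster is reflected in the model update. Here $n_k$ differs from the $n_k$ of the preceding embodiment: it is the total number of samples used for training by all clients under the cluster of the k-th final cluster center point; $n$ is the same as in the preceding embodiment.
The server side updates the initial model with the aggregated final model parameters and issues the updated global model to the selected final cluster center points again; the training process is repeated until the model reaches the set accuracy requirement or the preset maximum number of iterations is reached.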
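One round of the two-level aggregation (cluster level at the final cluster center points, then global level at the server side) can be sketched as follows. The names `weighted_average` and `hierarchical_round`, the cluster labels, and the numbers are hypothetical. As an assumption of this sketch, each level normalizes its weights by the sample total of the items it aggregates; under that convention the result equals a single sample-weighted average over all clients.

```python
from typing import Dict, List, Tuple

Params = List[float]


def weighted_average(items: List[Tuple[Params, int]]) -> Tuple[Params, int]:
    """Sample-weighted mean of (parameters, sample count) pairs, plus the total count."""
    total = sum(count for _, count in items)
    dim = len(items[0][0])
    avg = [sum(count / total * params[i] for params, count in items) for i in range(dim)]
    return avg, total


def hierarchical_round(cluster_updates: Dict[str, List[Tuple[Params, int]]]) -> Params:
    """Aggregate per cluster first, then aggregate the cluster results globally."""
    # Cluster level: each final cluster center point aggregates its own clients.
    intermediate = [weighted_average(updates) for updates in cluster_updates.values()]
    # Server level: clusters are weighted by their total sample counts.
    final_params, _ = weighted_average(intermediate)
    return final_params


# Hypothetical round: cluster 1 has two clients, cluster 2 has one.
updates = {
    "center_1": [([1.0], 1), ([3.0], 3)],
    "center_2": [([10.0], 4)],
}
global_params = hierarchical_round(updates)
print(global_params)
```

In a full training loop this round would be repeated, with the updated global model issued to the final cluster center points again, until the accuracy target or the maximum number of iterations is reached.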
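An application scenario of the embodiments of this application may be as follows: several banks need to jointly assess customer credit risk, with the aim of using machine learning to build a global model for credit risk assessment that automatically outputs a credit risk rating once the relevant feature information of an individual or institution is entered. The banks designate an organizer (the server, or server side) responsible for building and initializing the original model and for the subsequent aggregation of parameters; the other participating banks are called clients.
For training the bank risk assessment model, the server initializes the initial model (that is, sets the initial model parameters, for example by selecting the specific machine learning model and determining the model structure and hyperparameters such as the number of training rounds) and performs the subsequent parameter aggregation; the remaining participants, which contribute their own local data to training, are the clients.
The embodiments of this application use federated learning to perform joint risk control across a bank's subordinate institutions while protecting data privacy, effectively solving the "data island" problem in which those institutions cannot share data for model training because of financial regulation, data privacy, and similar concerns. The embodiments also optimize the federated learning training process and the aggregation of model parameters: nodes with similar local data distributions are clustered together for joint training and iterative updating, and the model results of the different clusters are then aggregated. This design reduces the loss of accuracy caused by the heterogeneity of the participants' local data (large class imbalance, data that is not independent and identically distributed, and so on) and helps improve the quality of risk control models trained by federated learning.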
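Referring to FIG. 6, which is a schematic block diagram of a joint risk assessment system based on client classification and aggregation provided by an embodiment of this application, the system 600 includes a server side 610 and a plurality of clients 620; the server side 610 includes an initial selection unit 601, a final selection unit 602, and a model updating unit 607; the plurality of clients 620 include s final cluster center points, each final cluster center point including a reading and issuing unit 603, an initial model parameter aggregation unit 605, and an uploading unit 606; and each client includes a model training unit 604;
the initial selection unit 601 is configured to randomly select s clients from a plurality of clients as initial cluster center points, and to compute the data set centroid of the local data set of each initial cluster center point;
the final selection unit 602 is configured to cluster the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and to iteratively update the clustering result to determine s final cluster center points and the clients under each cluster;
the reading and issuing unit 603 is configured to read an initial model for credit risk assessment from the server side and to issue the read initial model to the clients of the corresponding cluster;
the model training unit 604 is configured to train the initial model on the local data set to obtain initial model parameters and to send them to the final cluster center point of the corresponding cluster; the local data set includes basic information and credit information;
the initial model parameter aggregation unit 605 is configured to aggregate the initial model parameters received from the clients under the corresponding cluster to obtain the intermediate model parameters of that cluster;
the uploading unit 606 is configured to upload the intermediate model parameters of the corresponding cluster to the server side;
the model updating unit 607 is configured to aggregate the intermediate model parameters of all clusters to obtain final model parameters, and to update the initial model with the final model parameters to obtain a global model for credit risk assessment.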
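In an embodiment, as shown in FIG. 7, the final selection unit 602 includes:
a dividing unit 701, configured to compute the distance between the data set centroid of each client's local data set and the data set centroid of each initial cluster center point, and, according to the computed distances, to assign each client to the cluster of the corresponding initial cluster center point;
a reselection unit 702, configured to, for each initial cluster center point, compute the centroid point of the corresponding cluster, compute the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselect the cluster center point of the cluster as an intermediate cluster center point;
an iterative unit 703, configured to compute the distance between the data set centroid of each client's local data set and the data set centroid of each intermediate cluster center point, to cluster the clients again according to the computed distances, and to iteratively update the clustering result to determine s final cluster center points and the clients under each cluster.
In an embodiment, as shown in FIG. 8, the reselection unit 702 includes:
a centroid point calculation unit 801, configured to, for each initial cluster center point, compute the data set centroids of the local data sets of all clients under the corresponding cluster, and to average these data set centroids to obtain the centroid point of the cluster;
an intermediate cluster center point selection unit 802, configured to compute the distances between all clients under the cluster and the centroid point, and to take the client with the smallest distance as the intermediate cluster center point.
In an embodiment, as shown in FIG. 9, the initial model parameter aggregation unit 605 includes:
a first collection unit 901, configured for the final cluster center point to collect the initial model parameters reported by all clients under its cluster;
a first aggregation unit 902, configured to aggregate the collected initial model parameters using the FedAvg algorithm to obtain the intermediate model parameters of the cluster.
In an embodiment, the first aggregation unit 902 includes:
a first aggregation subunit, configured to aggregate the collected initial model parameters using the following formula:
$$w_{t+1} = \sum_{k=1}^{M} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-000008)
where $n_k$ is the number of samples used for training by the k-th client, $n$ is the total number of samples in the entire federated learning process, $M$ is the number of clients under the cluster, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-000009) is the initial model parameters obtained by the k-th client in this round of training, and $w_{t+1}$ is the intermediate model parameters of the cluster.
In an embodiment, as shown in FIG. 10, the model updating unit 607 includes:
a second collection unit 1001, configured to collect the intermediate model parameters reported by all final cluster center points;
a second aggregation unit 1002, configured to aggregate the collected intermediate model parameters using the FedAvg algorithm to obtain the final model parameters;
an updating unit 1003, configured to update the initial model with the final model parameters to obtain the global model.
In an embodiment, the second aggregation unit 1002 includes:
a second aggregation subunit, configured to aggregate the collected intermediate model parameters using the following formula:
$$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-000010)
where $n_k$ is the total number of samples used for training by all clients under the cluster of the k-th final cluster center point, $n$ is the total number of samples in the entire federated learning process, $K$ is the number of final cluster center points, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-000011) is the intermediate model parameters obtained by the k-th final cluster center point in this round of training, and $w_{t+1}$ is the final model parameters.
With the system provided by the embodiments of this application, clients of the same class after clustering have relatively similar local data and satisfy the independent and identically distributed assumption better than before clustering; federated learning performed among clients of the same class therefore trains more effectively, and the final model predicts more accurately.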
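Referring to FIG. 11, FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of this application. The computer device 1100 is a server, which may be an independent server or a server cluster composed of multiple servers.
Referring to FIG. 11, the computer device 1100 includes a processor 1102, a memory, and a network interface 1105 connected through a system bus 1101, where the memory may include a non-volatile storage medium 1103 and an internal memory 1104.
The non-volatile storage medium 1103 can store an operating system 11031 and a computer program 11032. When executed, the computer program 11032 causes the processor 1102 to perform the joint risk assessment method based on client classification and aggregation.
The processor 1102 provides computing and control capabilities and supports the operation of the entire computer device 1100.
The internal memory 1104 provides an environment for running the computer program 11032 stored in the non-volatile storage medium 1103; when executed by the processor 1102, the computer program 11032 causes the processor 1102 to perform the joint risk assessment method based on client classification and aggregation.
The network interface 1105 is used for network communication, such as the transmission of data. A person skilled in the art will understand that the structure shown in FIG. 11 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device 1100 to which the solution is applied; a specific computer device 1100 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 1102 is configured to run the computer program 11032 stored in the memory to implement the following functions: randomly selecting s clients from a plurality of clients as initial cluster center points, and computing the data set centroid of the local data set of each initial cluster center point; clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read an initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side; and aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
In the computer device 1100 provided in the embodiments of this application, the processor 1102 is configured to run the computer program 11032 stored in the memory to implement the steps included in each embodiment of the above joint risk assessment method based on client classification and aggregation.
Another embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. It stores a computer program which, when executed by a processor, implements the following steps: randomly selecting s clients from a plurality of clients as initial cluster center points, and computing the data set centroid of the local data set of each initial cluster center point; clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read an initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side; and aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
In the computer-readable storage medium provided in the embodiments of this application, the computer program, when executed by a processor, implements the steps included in each embodiment of the above joint risk assessment method based on client classification and aggregation.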

Claims (20)

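  1. A joint risk assessment method based on client classification and aggregation, comprising:
    a server side randomly selecting s clients from a plurality of clients as initial cluster center points, and computing the data set centroid of the local data set of each initial cluster center point;
    the server side clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster;
    each final cluster center point reading an initial model for credit risk assessment from the server side and issuing the read initial model to the clients of its cluster;
    each client training the initial model on its local data set to obtain initial model parameters and sending them to the final cluster center point of its cluster, the local data set including basic information and credit information;
    each final cluster center point aggregating the initial model parameters received from the clients under its cluster to obtain the intermediate model parameters of that cluster;
    each final cluster center point uploading the intermediate model parameters of its cluster to the server side;
    the server side aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
  2. The joint risk assessment method based on client classification and aggregation according to claim 1, wherein the server side clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster, comprises:
    computing the distance between the data set centroid of each client's local data set and the data set centroid of each initial cluster center point, and, according to the computed distances, assigning each client to the cluster of the corresponding initial cluster center point;
    for each initial cluster center point, computing the centroid point of the corresponding cluster, computing the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselecting the cluster center point of the cluster as an intermediate cluster center point;
    computing the distance between the data set centroid of each client's local data set and the data set centroid of each intermediate cluster center point, clustering the clients again according to the computed distances, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster.
  3. The joint risk assessment method based on client classification and aggregation according to claim 2, wherein, for each initial cluster center point, computing the centroid point of the corresponding cluster, computing the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselecting the cluster center point of the cluster as an intermediate cluster center point, comprises:
    for each initial cluster center point, computing the data set centroids of the local data sets of all clients under the corresponding cluster, and averaging these data set centroids to obtain the centroid point of the cluster;
    computing the distances between all clients under the cluster and the centroid point, and taking the client with the smallest distance as the intermediate cluster center point.
  4. The joint risk assessment method based on client classification and aggregation according to claim 1, wherein each final cluster center point aggregating the initial model parameters received from the clients under its cluster to obtain the intermediate model parameters of that cluster comprises:
    the final cluster center point collecting the initial model parameters reported by all clients under its cluster;
    aggregating the collected initial model parameters using the FedAvg algorithm to obtain the intermediate model parameters of the cluster.
  5. The joint risk assessment method based on client classification and aggregation according to claim 4, wherein aggregating the collected initial model parameters using the FedAvg algorithm to obtain the intermediate model parameters of the cluster comprises:
    aggregating the collected initial model parameters using the following formula:
    $$w_{t+1} = \sum_{k=1}^{M} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-100001)
    where $n_k$ is the number of samples used for training by the k-th client, $n$ is the total number of samples in the entire federated learning process, $M$ is the number of clients under the cluster, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-100002) is the initial model parameters obtained by the k-th client in this round of training, and $w_{t+1}$ is the intermediate model parameters of the cluster.
  6. The joint risk assessment method based on client classification and aggregation according to claim 1, wherein the server side aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment, comprises:
    the server side collecting the intermediate model parameters reported by all final cluster center points;
    aggregating the collected intermediate model parameters using the FedAvg algorithm to obtain the final model parameters;
    updating the initial model with the final model parameters to obtain the global model.
  7. The joint risk assessment method based on client classification and aggregation according to claim 6, wherein aggregating the collected intermediate model parameters using the FedAvg algorithm to obtain the final model parameters comprises:
    aggregating the collected intermediate model parameters using the following formula:
    $$w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-100003)
    where $n_k$ is the total number of samples used for training by all clients under the cluster of the k-th final cluster center point, $n$ is the total number of samples in the entire federated learning process, $K$ is the number of final cluster center points, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-100004) is the intermediate model parameters obtained by the k-th final cluster center point in this round of training, and $w_{t+1}$ is the final model parameters.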
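  8. A joint risk assessment system based on client classification and aggregation, comprising a server side and a plurality of clients; wherein the server side comprises an initial selection unit, a final selection unit, and a model updating unit; the plurality of clients comprise s final cluster center points, each final cluster center point comprising a reading and issuing unit, an initial model parameter aggregation unit, and an uploading unit; and each client comprises a model training unit;
    the initial selection unit is configured to randomly select s clients from a plurality of clients as initial cluster center points, and to compute the data set centroid of the local data set of each initial cluster center point;
    the final selection unit is configured to cluster the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and to iteratively update the clustering result to determine s final cluster center points and the clients under each cluster;
    the reading and issuing unit is configured to read an initial model for credit risk assessment from the server side and to issue the read initial model to the clients of the corresponding cluster;
    the model training unit is configured to train the initial model on the local data set to obtain initial model parameters and to send them to the final cluster center point of the corresponding cluster, the local data set including basic information and credit information;
    the initial model parameter aggregation unit is configured to aggregate the initial model parameters received from the clients under the corresponding cluster to obtain the intermediate model parameters of that cluster;
    the uploading unit is configured to upload the intermediate model parameters of the corresponding cluster to the server side;
    the model updating unit is configured to aggregate the intermediate model parameters of all clusters to obtain final model parameters, and to update the initial model with the final model parameters to obtain a global model for credit risk assessment.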
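  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    randomly selecting s clients from a plurality of clients as initial cluster center points, and computing the data set centroid of the local data set of each initial cluster center point;
    clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read an initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; and causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side;
    aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
  10. The computer device according to claim 9, wherein the server side clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster, comprises:
    computing the distance between the data set centroid of each client's local data set and the data set centroid of each initial cluster center point, and, according to the computed distances, assigning each client to the cluster of the corresponding initial cluster center point;
    for each initial cluster center point, computing the centroid point of the corresponding cluster, computing the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselecting the cluster center point of the cluster as an intermediate cluster center point;
    computing the distance between the data set centroid of each client's local data set and the data set centroid of each intermediate cluster center point, clustering the clients again according to the computed distances, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster.
  11. The computer device according to claim 10, wherein, for each initial cluster center point, computing the centroid point of the corresponding cluster, computing the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselecting the cluster center point of the cluster as an intermediate cluster center point, comprises:
    for each initial cluster center point, computing the data set centroids of the local data sets of all clients under the corresponding cluster, and averaging these data set centroids to obtain the centroid point of the cluster;
    computing the distances between all clients under the cluster and the centroid point, and taking the client with the smallest distance as the intermediate cluster center point.
  12. The computer device according to claim 9, wherein each final cluster center point aggregating the initial model parameters received from the clients under its cluster to obtain the intermediate model parameters of that cluster comprises:
    the final cluster center point collecting the initial model parameters reported by all clients under its cluster;
    aggregating the collected initial model parameters using the FedAvg algorithm to obtain the intermediate model parameters of the cluster.
  13. The computer device according to claim 12, wherein aggregating the collected initial model parameters using the FedAvg algorithm to obtain the intermediate model parameters of the cluster comprises:
    aggregating the collected initial model parameters using the following formula:
    $$w_{t+1} = \sum_{k=1}^{M} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-100005)
    where $n_k$ is the number of samples used for training by the k-th client, $n$ is the total number of samples in the entire federated learning process, $M$ is the number of clients under the cluster, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-100006) is the initial model parameters obtained by the k-th client in this round of training, and $w_{t+1}$ is the intermediate model parameters of the cluster.
  14. The computer device according to claim 9, wherein the server side aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment, comprises:
    the server side collecting the intermediate model parameters reported by all final cluster center points;
    aggregating the collected intermediate model parameters using the FedAvg algorithm to obtain the final model parameters;
    updating the initial model with the final model parameters to obtain the global model.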
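  15. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following steps:
    randomly selecting s clients from a plurality of clients as initial cluster center points, and computing the data set centroid of the local data set of each initial cluster center point;
    clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster; causing each final cluster center point to read an initial model for credit risk assessment from the server side and issue the read initial model to the clients of its cluster; causing each client to train the initial model on its local data set to obtain initial model parameters and send them to the final cluster center point of its cluster, the local data set including basic information and credit information; causing the initial model parameters received from the clients under each cluster to be aggregated to obtain the intermediate model parameters of that cluster; and causing each final cluster center point to upload the intermediate model parameters of its cluster to the server side;
    aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment.
  16. The computer-readable storage medium according to claim 15, wherein the server side clustering the clients according to the distance between the data set centroid of each client's local data set and the data set centroids of the initial cluster center points, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster, comprises:
    computing the distance between the data set centroid of each client's local data set and the data set centroid of each initial cluster center point, and, according to the computed distances, assigning each client to the cluster of the corresponding initial cluster center point;
    for each initial cluster center point, computing the centroid point of the corresponding cluster, computing the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselecting the cluster center point of the cluster as an intermediate cluster center point;
    computing the distance between the data set centroid of each client's local data set and the data set centroid of each intermediate cluster center point, clustering the clients again according to the computed distances, and iteratively updating the clustering result to determine s final cluster center points and the clients under each cluster.
  17. The computer-readable storage medium according to claim 16, wherein, for each initial cluster center point, computing the centroid point of the corresponding cluster, computing the distances between all clients under the cluster and the centroid point, and, according to the computed distances, reselecting the cluster center point of the cluster as an intermediate cluster center point, comprises:
    for each initial cluster center point, computing the data set centroids of the local data sets of all clients under the corresponding cluster, and averaging these data set centroids to obtain the centroid point of the cluster;
    computing the distances between all clients under the cluster and the centroid point, and taking the client with the smallest distance as the intermediate cluster center point.
  18. The computer-readable storage medium according to claim 15, wherein each final cluster center point aggregating the initial model parameters received from the clients under its cluster to obtain the intermediate model parameters of that cluster comprises:
    the final cluster center point collecting the initial model parameters reported by all clients under its cluster;
    aggregating the collected initial model parameters using the FedAvg algorithm to obtain the intermediate model parameters of the cluster.
  19. The computer-readable storage medium according to claim 18, wherein aggregating the collected initial model parameters using the FedAvg algorithm to obtain the intermediate model parameters of the cluster comprises:
    aggregating the collected initial model parameters using the following formula:
    $$w_{t+1} = \sum_{k=1}^{M} \frac{n_k}{n}\, w_{t+1}^{k}$$ (Figure PCTCN2021096655-appb-100007)
    where $n_k$ is the number of samples used for training by the k-th client, $n$ is the total number of samples in the entire federated learning process, $M$ is the number of clients under the cluster, $w_{t+1}^{k}$ (Figure PCTCN2021096655-appb-100008) is the initial model parameters obtained by the k-th client in this round of training, and $w_{t+1}$ is the intermediate model parameters of the cluster.
  20. The computer-readable storage medium according to claim 15, wherein the server side aggregating the intermediate model parameters of all clusters to obtain final model parameters, and updating the initial model with the final model parameters to obtain a global model for credit risk assessment, comprises:
    the server side collecting the intermediate model parameters reported by all final cluster center points;
    aggregating the collected intermediate model parameters using the FedAvg algorithm to obtain the final model parameters;
    updating the initial model with the final model parameters to obtain the global model.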
PCT/CN2021/096655 2020-11-24 2021-05-28 基于客户端分类聚合的联合风险评估方法及相关设备 WO2022110721A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011327614.2 2020-11-24
CN202011327614.2A CN112465626B (zh) 2020-11-24 2020-11-24 基于客户端分类聚合的联合风险评估方法及相关设备

Publications (1)

Publication Number Publication Date
WO2022110721A1 true WO2022110721A1 (zh) 2022-06-02

Family

ID=74799683

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096655 WO2022110721A1 (zh) 2020-11-24 2021-05-28 基于客户端分类聚合的联合风险评估方法及相关设备

Country Status (2)

Country Link
CN (1) CN112465626B (zh)
WO (1) WO2022110721A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115344050A (zh) * 2022-09-15 2022-11-15 安徽工程大学 一种基于改进的聚类算法堆垛机路径规划方法
CN117313901A (zh) * 2023-11-28 2023-12-29 北京邮电大学 一种基于多任务聚类联邦个性学习的模型训练方法及装置
WO2024001806A1 (zh) * 2022-06-28 2024-01-04 华为技术有限公司 一种基于联邦学习的数据价值评估方法及其相关设备

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465626B (zh) * 2020-11-24 2023-08-29 平安科技(深圳)有限公司 基于客户端分类聚合的联合风险评估方法及相关设备
CN113033820B (zh) * 2021-03-25 2023-05-26 蚂蚁金服(杭州)网络技术有限公司 联邦学习方法、装置以及设备
CN113177561B (zh) * 2021-04-27 2022-06-24 深圳市慧鲤科技有限公司 多用户集合方法及装置、电子设备和存储介质
CN113033712B (zh) * 2021-05-21 2021-09-14 华中科技大学 一种基于联邦学习的多用户协同训练人流统计方法及系统
CN113312667B (zh) * 2021-06-07 2022-09-02 支付宝(杭州)信息技术有限公司 一种风险防控方法、装置及设备
CN113313266B (zh) * 2021-06-15 2023-10-24 厦门大学 基于两阶段聚类的联邦学习模型训练方法和存储设备
CN113344220B (zh) * 2021-06-18 2022-11-11 山东大学 一种联邦学习中基于局部模型梯度的用户筛选方法、系统、设备及存储介质
CN113642737B (zh) * 2021-08-12 2024-03-05 广域铭岛数字科技有限公司 一种基于汽车用户数据的联邦学习方法及系统
CN114095503A (zh) * 2021-10-19 2022-02-25 广西综合交通大数据研究院 一种基于区块链的联邦学习参与节点选择方法
CN116702918A (zh) * 2022-02-28 2023-09-05 华为技术有限公司 一种联邦学习方法以及相关设备
CN115936110B (zh) * 2022-11-18 2024-09-03 重庆邮电大学 一种缓解异构性问题的联邦学习方法
CN116525117B (zh) * 2023-07-04 2023-10-10 之江实验室 一种面向数据分布漂移检测与自适应的临床风险预测系统
CN117150321B (zh) * 2023-10-31 2024-01-30 北京邮电大学 设备信任度评价方法、装置、服务设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374098A1 (en) * 2016-02-19 2018-12-27 Alibaba Group Holding Limited Modeling method and device for machine learning model
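CN110503207A (zh) * 2019-08-28 2019-11-26 深圳前海微众银行股份有限公司 Federated learning credit management method, apparatus, device, and readable storage medium
CN111461874A (zh) * 2020-04-13 2020-07-28 浙江大学 Credit risk control system and method based on a federated mode
CN112465626A (zh) * 2020-11-24 2021-03-09 平安科技(深圳)有限公司 Joint risk assessment method based on client classification and aggregation, and related device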

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150228015A1 (en) * 2014-02-13 2015-08-13 Xerox Corporation Methods and systems for analyzing financial dataset
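CN107305637B (zh) * 2016-04-21 2020-10-16 华为技术有限公司 Data clustering method and apparatus based on the K-Means algorithm
CN109242740A (zh) * 2018-07-18 2019-01-18 平安科技(深圳)有限公司 Identity information risk assessment method and apparatus, computer device, and storage medium
CN109685635A (zh) * 2018-09-11 2019-04-26 深圳平安财富宝投资咨询有限公司 Risk assessment method for financial services, risk control server, and storage medium
CN111639584A (zh) * 2020-05-26 2020-09-08 深圳壹账通智能科技有限公司 Risk identification method and apparatus based on multiple classifiers, and computer device
CN111967910A (zh) * 2020-08-18 2020-11-20 中国银行股份有限公司 User customer group classification method and apparatus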

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SATTLER FELIX, MULLER KLAUS-ROBERT, SAMEK WOJCIECH: "Clustered Federated Learning", 32ND CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2018), 31 December 2018 (2018-12-31), pages 1 - 5, XP055932725, [retrieved on 20220617] *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
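WO2024001806A1 (zh) * 2022-06-28 2024-01-04 华为技术有限公司 Data value evaluation method based on federated learning, and related device
CN115344050A (zh) * 2022-09-15 2022-11-15 安徽工程大学 Stacker path planning method based on an improved clustering algorithm
CN115344050B (zh) * 2022-09-15 2024-04-26 安徽工程大学 Stacker path planning method based on an improved clustering algorithm
CN117313901A (zh) * 2023-11-28 2023-12-29 北京邮电大学 Model training method and apparatus based on multi-task clustered federated personalized learning
CN117313901B (zh) * 2023-11-28 2024-04-02 北京邮电大学 Model training method and apparatus based on multi-task clustered federated personalized learning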

Also Published As

Publication number Publication date
CN112465626B (zh) 2023-08-29
CN112465626A (zh) 2021-03-09

Similar Documents

Publication Publication Date Title
WO2022110721A1 (zh) Joint risk assessment method based on client classification and aggregation, and related device
US11070643B2 (en) Discovering signature of electronic social networks
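TWI712981B (zh) Risk identification model training method, apparatus, and server
WO2015135321A1 (zh) Method and apparatus for mining social relationships based on financial data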
Marqués et al. On the suitability of resampling techniques for the class imbalance problem in credit scoring
WO2021174693A1 (zh) Data analysis method and apparatus, computer system, and readable storage medium
US20140244335A1 (en) Techniques for deriving a social proximity score for use in allocating resources
Sävje et al. Generalized full matching
CN112799708B (zh) Method and system for jointly updating a service model
US20170193066A1 (en) Data mart for machine learning
Xu et al. Trust propagation and trust network evaluation in social networks based on uncertainty theory
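CN112101577B (zh) XGBoost-based cross-sample federated learning and testing method, system, device, and medium
WO2022257731A1 (zh) Method, apparatus, and system for algorithm negotiation for privacy computing
CN109919793B (zh) Activity participation analysis and recommendation method
CN112365007B (zh) Model parameter determination method, apparatus, device, and storage medium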
WO2016122575A1 (en) Product, operating system and topic based recommendations
Mattei et al. Design and analysis of experiments
Tian et al. Synergetic focal loss for imbalanced classification in federated xgboost
Allahbakhsh et al. Harnessing implicit teamwork knowledge to improve quality in crowdsourcing processes
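CN109801162A (zh) Credit rating method fusing social media data with multi-criteria cross-authentication
CN116384502A (zh) Method, apparatus, device, and medium for calculating participant value contribution in federated learning
TW202345042A (zh) Project division-of-labor method
CN115545088A (zh) Model construction method, classification method, apparatus, and electronic device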
Xie et al. Federated Learning With Personalized Differential Privacy Combining Client Selection
TWI720638B (zh) Deposit interest rate negotiation adjustment system and method thereof

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21896231; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 21896231; Country of ref document: EP; Kind code of ref document: A1)