CN114266645A

CN114266645A - Financial loan risk assessment method, device, storage medium and equipment

Info

Publication number: CN114266645A
Application number: CN202111582573.6A
Authority: CN
Inventors: 梁天恺
Original assignee: GRG Banking Equipment Co Ltd
Current assignee: GRG Banking Equipment Co Ltd
Priority date: 2021-12-22
Filing date: 2021-12-22
Publication date: 2022-04-01

Abstract

The invention provides a financial individual loan risk assessment method, device, storage medium and equipment, belonging to the field of risk assessment in artificial intelligence. The method comprises: collecting data from all parties; constructing an individual loan risk assessment model; performing risk assessment; and outputting a risk level, in which users are classified by the individual loan risk assessment model and a decision report is output as the risk assessment result and used as the basis for loan risk assessment. The invention uses federated learning to assess individual credit risk from multi-party data, reducing the bad-debt rate of individual credit and improving the efficiency of individual credit approval while ensuring data security and user privacy.

Description

Financial loan risk assessment method, device, storage medium and equipment

Technical Field

The invention belongs to the field of risk assessment in artificial intelligence, and particularly relates to a financial individual loan risk assessment method, a device, a storage medium and equipment.

Background

The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

The financial industry is now well developed, and pre-loan risk assessment is very important. At present, risk assessment is mostly performed by manual calculation. With the development of artificial intelligence, intelligent loan risk assessment based on learning algorithms has gradually appeared, but the following problems remain.

1) Existing loan risk assessment algorithms (for example, the XGBoost-based assessment algorithm described in "Personal credit risk assessment method for high-dimensional data" by Liaojun et al. (Computer Engineering and Applications, 2020, 56(04): 219-. That is, current association rule analysis algorithms perform single-point analysis and cannot combine data from different regions for joint knowledge learning, so the knowledge learned in regions that lack data is too simple.

2) Moreover, existing centralized association rule analysis algorithms place high demands on computing resources. A centralized algorithm can learn only after all user data of the multiple parties has been transmitted to a central server, so under the broad trend toward data rights confirmation, neither each party's data ownership nor users' data privacy can be guaranteed.

Therefore, there is an urgent need for an efficient assessment scheme that protects data privacy and data security, breaks down data silos, and combines multi-party data for federated learning.

Disclosure of Invention

In order to overcome the disadvantages of the prior art, the present invention provides a financial loan risk assessment method, apparatus, storage medium and device, which can solve the above problems.

The design principle is as follows: using data provided by different parties, and on the premise of guaranteeing privacy protection and data security among the parties, the SecureBoost algorithm is used to break down data silos, combine the data of all parties for federated learning, and perform individual loan risk assessment on a user.

The design purpose is as follows: to guarantee privacy and data security, reduce the bad-debt rate of individual credit, and improve the approval efficiency of individual credit.

The overall scheme is as follows: in order to solve the above problem, the overall design of the present application is as follows.

A financial individual loan risk assessment method based on federated learning comprises the following steps:

S1, collecting data: collecting the data of each party participating in the risk assessment task;

S2, constructing an individual loan risk assessment model: obtaining a federated decision tree from the data of each party through the SecureBoost federated learning algorithm, and training the federated decision tree to serve as the individual loan risk assessment model;

S3, risk assessment: when a user applies for an individual loan, inputting the user's information into the individual loan risk assessment model for risk assessment;

S4, outputting the risk level: classifying the user through the individual loan risk assessment model and outputting a decision report as the risk assessment result, the decision report serving as the basis for loan risk assessment.

Further, the method for constructing the individual loan risk assessment model in step S2 includes:

S21, data preprocessing: aligning the data of each passive party and encrypting and decrypting the data, which specifically comprises:

S211, data alignment: the data of each passive party is aligned using a data alignment algorithm so as to extract the samples common to all parties;

S212, data encryption and decryption: the data is encrypted and decrypted using a homomorphic encryption algorithm, guaranteeing data security and the privacy of each party;

S22, the active party calculates the encrypted gradients and distributes them to the passive parties;

S23, each passive party aggregates the gradient statistics and sends them to the active party;

S24, the active party searches for the optimal split according to the gradient information;

S25, the passive party and the active party cooperate to divide the samples;

S26, outputting the model when the training stop condition is reached: training stops when the number of training iterations of the decision tree exceeds the preset number of iterations or the depth of the decision tree exceeds the preset decision tree depth n/2, where n is the total number of attributes across all data sets; the current decision tree is then taken as the final federated decision tree and output as the individual loan risk assessment model.

Further, the risk assessment of step S3 includes:

S31, the active party queries the passive party associated with the current node: when a new sample is input into the constructed decision tree, the active party, according to the [passive party id, record id] record associated with the node where the sample currently sits, sends the sample data and the record id to the corresponding passive party and asks it for the next direction of the tree search;

S32, the passive party determines the tree search direction: after receiving the data sent by the active party in step S31, the passive party looks up the [record id, feature, threshold] table, compares the corresponding feature value of the sample with the threshold, and sends the resulting branch (that is, whether the feature value is greater than the threshold) back to the active party as the next direction of the tree search;

S33, the active party moves to the corresponding child node: after receiving the tree search direction sent by the passive party, the active party moves to the corresponding child node;

S34, repeating steps S31 to S33 until a leaf node is reached, which yields the classification result, namely the individual credit risk rating of the applicant.
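The query loop of steps S31 to S34 can be sketched as follows. This is a minimal single-process illustration, not the implementation of the filing: the class and function names (`Node`, `PassiveParty`, `predict`), the feature `monthly_income`, and the convention that values above the threshold go to the right child are all assumptions made for the example. In a real deployment the call to `direction` would be a network request, and the passive party would look up the sample's features on its own side.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    """A tree node as held by the active party (no feature or threshold)."""
    party_id: Optional[str] = None   # passive party that owns this split
    record_id: Optional[int] = None  # key into that party's lookup table
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[str] = None      # set only on leaf nodes


class PassiveParty:
    """Holds private features and a record id -> (feature, threshold) table."""

    def __init__(self, lookup: Dict[int, Tuple[str, float]],
                 features: Dict[str, Dict[str, float]]):
        self.lookup = lookup
        self.features = features  # sample id -> {feature name: value}

    def direction(self, sample_id: str, record_id: int) -> str:
        feature, threshold = self.lookup[record_id]
        value = self.features[sample_id][feature]
        # Assumed convention: values above the threshold go right.
        return "right" if value > threshold else "left"


def predict(root: Node, sample_id: str,
            parties: Dict[str, PassiveParty]) -> str:
    node = root
    while node.label is None:                               # S34: until a leaf
        party = parties[node.party_id]                      # S31: ask the owner
        step = party.direction(sample_id, node.record_id)   # S32: get direction
        node = node.right if step == "right" else node.left  # S33: descend
    return node.label


# Hypothetical demo data: one passive party, one split.
bank_b = PassiveParty(
    lookup={0: ("monthly_income", 5000.0)},
    features={"u1": {"monthly_income": 8000.0},
              "u2": {"monthly_income": 3000.0}},
)
tree = Node(party_id="bank_b", record_id=0,
            left=Node(label="high risk"), right=Node(label="low risk"))
parties = {"bank_b": bank_b}
print(predict(tree, "u1", parties))
```

Note that the active party's tree stores only [passive party id, record id] pairs, so it never sees which feature or threshold a node uses.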

The invention also provides a financial individual loan risk assessment device based on federated learning, which comprises: a data processing module for acquiring the common samples owned by all passive parties, aligning the data based on the SecureBoost algorithm and homomorphically encrypting the data; a risk assessment module for constructing a decision tree from the common samples as the individual loan risk assessment model and, on receiving a user's request for a financial individual loan, inputting the user data for iterative computation and obtaining a classification label as the risk assessment result; and an output module for obtaining the risk assessment result of the risk assessment module and the manual assessment conclusion of an assessment expert, and generating a corresponding loan decision suggestion based on the risk assessment result and/or the expert's conclusion.

The invention also provides a computer readable storage medium, wherein the computer readable storage medium stores financial individual loan risk assessment instructions, and the financial individual loan risk assessment instructions are executed by a processor to realize the steps of the financial individual loan risk assessment method.

The invention also provides financial individual loan risk assessment equipment which comprises an input and output unit, a processor, a memory and financial individual loan risk assessment instructions stored on the memory and executable by the processor, wherein the financial individual loan risk assessment instructions are executed by the processor to realize the steps of the financial individual loan risk assessment method.

Compared with the prior art, the invention has the following beneficial effects: the invention uses federated learning to assess individual loan risk from multi-party data. First, the data of each party is aligned to find the common samples owned by the parties. Then a boosted tree is constructed from these samples to establish the credit risk assessment model. Finally, risk assessment is performed with the resulting individual credit risk assessment model, so that the bad-debt rate of individual credit is reduced and approval efficiency is improved while data security and user privacy are ensured.

Drawings

FIG. 1 is a flow chart of the financial loan risk assessment method based on federated learning according to the present invention;

FIG. 2 is a flow chart of the method for establishing an individual loan risk assessment model using the SecureBoost federated learning algorithm;

FIG. 3 is a schematic diagram of the federated learning to which the present invention relates;

FIG. 4 is a flow chart of federated prediction using a trained model;

FIG. 5 is an overall flowchart of the financial loan risk assessment system.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

It should be understood that "system", "device", "unit" and/or "module" as used in this specification are a way of distinguishing different components, elements, parts or assemblies at different levels. However, these words may be replaced by other expressions if those accomplish the same purpose.

As used in this specification and the appended claims, the terms "a," "an," and/or "the" do not refer specifically to the singular and may also include the plural, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements.

Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the preceding or following operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or one or more steps may be removed from them.

First embodiment

A financial individual loan risk assessment method based on federated learning, see fig. 1, the method comprising:

The method for constructing the individual loan risk assessment model in step S2 is described below. SecureBoost is a vertical federated learning algorithm that improves on the XGBoost algorithm; its basic idea is to construct a decision tree from sample data and then use the decision tree to classify newly input data. The construction of the individual loan risk assessment model with the SecureBoost algorithm is shown in FIG. 2 and mainly comprises the following steps: data alignment; calculation of the encrypted gradients by the active party and their distribution to the passive parties; aggregation of the gradient statistics by each passive party and their transmission to the active party; search for the optimal split by the active party according to the gradient information; and cooperative sample division by the passive and active parties.

S21, data preprocessing: aligning the data of each passive party and encrypting and decrypting the data. The specific steps are as follows:

S211, data alignment: the data of each passive party is aligned using a data alignment algorithm so as to extract the samples common to all parties; the SecureBoost algorithm uses the Privacy-Preserving Inter-Database Operations data alignment algorithm.

S212, data encryption and decryption: to ensure data security and protect the privacy of each party, a homomorphic encryption algorithm is used to encrypt and decrypt the data. Compared with other algorithms, homomorphic encryption does not require data to be repeatedly encrypted and decrypted during computation, which improves computational efficiency without additional performance loss.
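The property that step S212 relies on, namely that encrypted values can be added without being decrypted, can be illustrated with a toy additively homomorphic scheme. The text does not name a specific scheme; the sketch below assumes Paillier encryption, which is a common choice for SecureBoost-style training. The function names and the tiny primes are illustrative only and offer no security; real gradients are floating-point and would additionally need a fixed-point encoding.

```python
import math
import secrets


def keygen(p: int, q: int):
    """Build a Paillier key pair from two distinct primes (toy sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n: int, m: int) -> int:
    """Encrypt an integer 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # c = (1 + n)^m * r^n mod n^2
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2


def decrypt(n: int, priv, c: int) -> int:
    lam, mu = priv
    n2 = n * n
    l_value = (pow(c, lam, n2) - 1) // n
    return (l_value * mu) % n


def add_encrypted(n: int, c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


# The active party encrypts two scaled gradient values; a passive party
# adds them without being able to read either one.
n, priv = keygen(1009, 1013)
c_sum = add_encrypted(n, encrypt(n, 250), encrypt(n, 500))
print(decrypt(n, priv, c_sum))
```

Only the holder of the private key (here the active party) can decrypt the sum, which is what lets passive parties aggregate gradients in step S23 without learning them.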

S22, the active party calculates the encrypted gradients and distributes them to the passive parties. Specifically, in step S22, the loss function targeted by the training process of the SecureBoost algorithm is given by formula 1.

In formula 1, i denotes a sample; t denotes the t-th subtree; $x_i$ denotes the feature value of sample i; $y_i$ denotes the true label of sample i; $\hat{y}_i$ denotes the predicted label output by a subtree; $f_t$ denotes the t-th tree model; $g_i$ is the first-order gradient of the loss function; $h_i$ is the second-order gradient of the loss function; Ω is the penalty term of the loss function, here an L2 regularization penalty, which deliberately introduces error to prevent overfitting and improve the generalization ability of the model; λ is the regularization coefficient; w denotes the weights of the leaf nodes of the decision tree.

The first-order and second-order gradients are homomorphically encrypted and shared during training.

Thus $g_i$ and $h_i$ serve as the encrypted gradient information, which the active party sends to all passive parties.
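For reference, the training objective of SecureBoost at round t is usually written in the second-order form below. This is a reconstruction from the symbol definitions above, not a copy of the formula in the filing. Two points are assumptions: the penalty term is taken to be only the L2 penalty on the leaf weights, and n (the number of samples) is a symbol introduced here. The superscript (t-1) marks the prediction accumulated over the preceding subtrees.

```latex
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \Big[\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
    + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t^{2}(x_i) \Big] + \Omega(f_t),
\qquad
\Omega(f_t) = \tfrac{1}{2}\, \lambda\, \lVert w \rVert^{2}

g_i = \partial_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big),
\qquad
h_i = \partial^{2}_{\hat{y}^{(t-1)}}\, l\big(y_i, \hat{y}_i^{(t-1)}\big)
```

Because $g_i$ and $h_i$ depend on the labels $y_i$, which only the active party holds, they are the quantities that must be encrypted before being distributed.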

S23, each passive party aggregates the gradient statistics and sends them to the active party, which comprises:

S231, after receiving the encrypted gradient information sent by the active party, the passive party buckets the samples in the sample space of the current node;

S232, based on the bucketed feature values, the gradient information is summed to aggregate the gradient statistics;

S233, the passive party sends the aggregation result back to the active party.
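Steps S231 to S233 amount to a per-bucket sum over one feature. The sketch below is illustrative only: the function name, the bucket edges and the sample values are assumptions, and plain integers stand in for ciphertexts. With a real homomorphic scheme, the `add` argument would be the scheme's ciphertext addition and `zero` an encryption of 0.

```python
import operator
from bisect import bisect_right
from typing import Callable, Dict, List, Sequence, Tuple


def aggregate_by_bucket(feature_values: Dict[str, float],
                        enc_g: Dict[str, int],
                        enc_h: Dict[str, int],
                        bucket_edges: Sequence[float],
                        add: Callable[[int, int], int] = operator.add,
                        zero: int = 0) -> List[Tuple[int, int]]:
    """Return one (sum of g, sum of h) pair per bucket of a single feature."""
    sums_g = [zero] * (len(bucket_edges) + 1)
    sums_h = [zero] * (len(bucket_edges) + 1)
    for sample_id, value in feature_values.items():
        b = bisect_right(bucket_edges, value)          # S231: bucket index
        sums_g[b] = add(sums_g[b], enc_g[sample_id])   # S232: sum gradients
        sums_h[b] = add(sums_h[b], enc_h[sample_id])
    return list(zip(sums_g, sums_h))                   # S233: sent back


# Hypothetical data held by one passive party for the current node.
income = {"u1": 3000.0, "u2": 8000.0, "u3": 5200.0}
g = {"u1": 3, "u2": -2, "u3": 1}   # stand-ins for encrypted first-order values
h = {"u1": 2, "u2": 2, "u3": 1}    # stand-ins for encrypted second-order values
buckets = aggregate_by_bucket(income, g, h, bucket_edges=[4000.0, 6000.0])
print(buckets)
```

The passive party only ever returns bucket totals, so the active party learns neither the individual feature values nor the bucket edges.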

S24, the active party searches for the optimal split according to the gradient information. Specifically, the optimal split in step S24 comprises:

S241, after receiving the aggregated gradient information from each passive party, the active party first decrypts the encrypted gradient information;

S242, calculating the split score: the scores before and after the split are calculated according to formula 2.

In formula 2, i denotes a sample; $g_i$ is the first-order gradient of the loss function; $h_i$ is the second-order gradient of the loss function; t denotes the t-th subtree; $y_i$ denotes the true label of sample i;

$\hat{y}_i$ denotes the predicted label output by a subtree; λ is the regularization coefficient; I denotes the entire sample set, $I_L$ is the sample set of the left subtree after the split, and $I_R$ is the sample set of the right subtree after the split. For example, if the split is on sex, $I_L$ may be all samples whose sex is female and $I_R$ all samples whose sex is male. Thus $L_{split}$ = 1/2 [information value of the left subtree after the split + information value of the right subtree after the split - information value of all samples without the split], that is, the information gain of the split. $L_{split}$ is therefore the difference between the information values (scores) after and before the split. If $L_{split}$ > 0, the split brings an information gain and the tree is split; otherwise the split does not help classify the labels and is not performed.

If the score after the split is larger than the score before the split, the split is carried out;

S243, when step S242 determines that a split is needed, the active party sends the split threshold to the passive party.
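The scoring in step S242 can be sketched in a few lines. The sketch assumes the usual XGBoost-style information value $(\sum g_i)^2 / (\sum h_i + \lambda)$ for a sample set, which matches the verbal description of $L_{split}$ above but is not copied from formula 2 of the filing. The function names and the input numbers are illustrative.

```python
from typing import List, Optional, Tuple


def score(g_sum: float, h_sum: float, lam: float) -> float:
    """Assumed information value of a sample set: (sum g)^2 / (sum h + lambda)."""
    return g_sum * g_sum / (h_sum + lam)


def best_split(buckets: List[Tuple[float, float]],
               lam: float = 1.0) -> Tuple[Optional[int], float]:
    """Scan the bucket boundaries of one feature and return (index, gain).

    buckets[k] holds the decrypted (sum of g, sum of h) of bucket k.
    A split at index k sends buckets 0..k left and the rest right.
    Returns (None, 0.0) when no boundary gives a positive gain.
    """
    g_all = sum(g for g, _ in buckets)
    h_all = sum(h for _, h in buckets)
    best_k, best_gain = None, 0.0
    g_left = h_left = 0.0
    for k in range(len(buckets) - 1):
        g_left += buckets[k][0]
        h_left += buckets[k][1]
        gain = 0.5 * (score(g_left, h_left, lam)
                      + score(g_all - g_left, h_all - h_left, lam)
                      - score(g_all, h_all, lam))
        if gain > best_gain:   # only splits with L_split > 0 qualify
            best_k, best_gain = k, gain
    return best_k, best_gain


# Decrypted per-bucket sums for one feature (hypothetical numbers).
k, gain = best_split([(3.0, 2.0), (1.0, 1.0), (-2.0, 2.0)], lam=1.0)
print(k, round(gain, 4))
```

In a multi-party setting the active party would run this scan once per feature and per passive party and keep the boundary with the largest gain.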

S25, the passive party and the active party cooperate to divide the samples. Specifically, the sample division comprises: after receiving the threshold sent by the active party, the passive party divides the current sample space, records the split in the form [record id, feature, threshold], and sends the divided sample space I back to the active party; the active party divides the current node according to the information received from the passive party and, according to the source of the record, forms a lookup table in the form [passive party id, record id].
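The bookkeeping of step S25 can be sketched with two small tables, one on each side. All names below (`PassiveSplitter`, `divide_node`, `node_table`) and the demo values are hypothetical; the sketch also assumes that samples with a feature value at or below the threshold form the left partition.

```python
from typing import Dict, Set, Tuple


class PassiveSplitter:
    """Passive-party side: owns the features and the split lookup table."""

    def __init__(self, features: Dict[str, Dict[str, float]]):
        self.features = features
        # record id -> (feature, threshold); never leaves this party
        self.lookup: Dict[int, Tuple[str, float]] = {}

    def split(self, node_samples: Set[str], feature: str,
              threshold: float) -> Tuple[int, Set[str]]:
        record_id = len(self.lookup)
        self.lookup[record_id] = (feature, threshold)
        left = {s for s in node_samples
                if self.features[s][feature] <= threshold}
        return record_id, left  # only sample ids are returned


def divide_node(node_id: int, node_samples: Set[str], party_id: str,
                passive: PassiveSplitter, feature: str, threshold: float,
                node_table: Dict[int, Tuple[str, int]]
                ) -> Tuple[Set[str], Set[str]]:
    """Active-party side: record [party id, record id] and split the node."""
    record_id, left = passive.split(node_samples, feature, threshold)
    node_table[node_id] = (party_id, record_id)
    return left, node_samples - left


bank_b = PassiveSplitter({"u1": {"income": 3000.0},
                          "u2": {"income": 8000.0},
                          "u3": {"income": 5200.0}})
node_table: Dict[int, Tuple[str, int]] = {}
left, right = divide_node(0, {"u1", "u2", "u3"}, "bank_b",
                          bank_b, "income", 4000.0, node_table)
print(sorted(left), sorted(right), node_table)
```

The two tables are exactly what the prediction phase later consults: the active party follows `node_table` to find the owning party, and that party resolves the record id to a feature and threshold.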

At this point, SecureBoost federated learning is finished; the decision tree output in step S26 is the federated decision tree and serves as the final individual credit assessment model.

Fig. 3 shows the federated learning architecture involved. To solve the data silo problem and follow the trend toward data rights confirmation, the invention proposes using federated learning to build the credit risk assessment model. Several passive parties learn cooperatively under the coordination of an active party; the passive parties do not share data with one another, and the model under training is updated only by exchanging encrypted gradients with the active party. The active party thus works on the contents of a "blind box" through the information the passive parties compute, without knowing what the box contains. This is the basic principle behind the construction of the credit risk assessment model in step S2.

The risk assessment in step S3, referring to fig. 4, includes:

Second embodiment

A financial individual loan risk assessment device based on federated learning, the device comprising:

the data processing module is used for acquiring common samples owned by all passive parties, aligning data based on a SecureBoost algorithm and carrying out homomorphic encryption on the data;

the risk evaluation module is used for constructing a decision tree as an individual loan risk evaluation model based on the common sample, inputting user data for iterative computation when receiving a user request of financial individual loan, and acquiring a classification label as a risk evaluation result;

and the output module is used for acquiring the risk evaluation result of the risk evaluation module and the manual evaluation conclusion of the evaluation expert and generating a corresponding loan decision suggestion based on the risk evaluation result and/or the evaluation conclusion of the evaluation expert.

Fig. 5 shows the overall process based on the risk assessment principle of the device. First, the data processing module aligns the data of all parties and finds the common samples they share. Then the risk assessment module constructs a boosted tree from these samples to establish the credit risk assessment model. Finally, risk assessment is performed with the learned individual loan risk assessment model, and the output module presents the assessment results visually. In this way, the bad-debt rate of individual credit is reduced and the approval efficiency of individual credit is improved while data security and user privacy are ensured.

Third embodiment

A computer readable storage medium having stored thereon financial individual loan risk assessment instructions, wherein the financial individual loan risk assessment instructions, when executed by a processor, perform the steps of the financial individual loan risk assessment method of the first embodiment. For the evaluation method, please refer to the detailed description in the previous section, which is not repeated herein.

It will be appreciated by those of ordinary skill in the art that all or a portion of the steps of the various methods of the embodiments described above may be performed by associated hardware as instructed by a program. The program may be stored on a computer readable storage medium, which may include permanent and non-permanent, removable and non-removable media that store information by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal or a carrier wave.

Computer program code required for the operation of various portions of the present application may be written in any one or more programming languages, including object oriented programming languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET and Python, conventional programming languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP and ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, as a stand-alone software package on the user's computer, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or processing device. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or in a cloud computing environment, or offered as a service such as software as a service (SaaS).

Fourth embodiment

A financial individual loan risk assessment apparatus comprising an input output unit (keyboard, mouse, display, printer, etc.), a processor, a memory, and financial individual loan risk assessment instructions stored on the memory and executable by the processor, wherein the financial individual loan risk assessment instructions, when executed by the processor, perform the steps of the financial individual loan risk assessment method of the first embodiment. For details, the method is described in the foregoing section, and is not repeated here.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A financial individual loan risk assessment method based on federated learning, characterized by comprising the following steps:

2. The financial individual loan risk assessment method according to claim 1, wherein the construction method of the individual loan risk assessment model in step S2 comprises:

S21, data preprocessing: aligning the data of each passive party and encrypting and decrypting the data;

S25, the passive party and the active party cooperate to divide the samples;

3. The financial individual loan risk assessment method according to claim 2, wherein the data preprocessing of step S21 includes:

4. The financial individual loan risk assessment method according to claim 2, wherein in step S22, the loss function of the training process objective of the SecureBoost algorithm is:

In formula 1, i denotes a sample; t denotes the t-th subtree; $x_i$ denotes the feature value of sample i; $y_i$ denotes the true label of sample i;

$\hat{y}_i$ denotes the predicted label output by a subtree; $f_t$ denotes the t-th tree model; $g_i$ is the first-order gradient of the loss function; $h_i$ is the second-order gradient of the loss function; Ω is the penalty term of the loss function, which deliberately introduces error to prevent overfitting and improve the generalization ability of the model; λ is the regularization coefficient; w denotes the weights of the leaf nodes of the decision tree;

5. The financial loan risk assessment method according to claim 2, wherein the step S23 of aggregating the gradient statistics and sending them to the active party comprises:

6. The financial loan risk assessment method of claim 2, wherein the optimal split of step S24 includes:

$\hat{y}_i$ denotes the predicted label output by a subtree; λ is the regularization coefficient; I denotes the entire sample set, $I_L$ is the sample set of the left subtree after the split, and $I_R$ is the sample set of the right subtree after the split;

7. The financial individual loan risk assessment method according to claim 2, wherein the sample division of step S25 includes: after receiving the threshold sent by the active party, the passive party divides the current sample space, records the split in the form [record id, feature, threshold], and sends the divided sample space I back to the active party; the active party divides the current node according to the information received from the passive party and, according to the source of the record, forms a lookup table in the form [passive party id, record id].

8. The financial individual loan risk assessment method according to claim 1, wherein the risk assessment of step S3 includes:

9. A financial individual loan risk assessment device based on federated learning, characterized in that the device includes:

10. A computer readable storage medium having stored thereon financial loan risk assessment instructions, wherein the financial loan risk assessment instructions, when executed by a processor, implement the steps of the financial loan risk assessment method according to any of claims 1 to 8.

11. A financial individual loan risk assessment apparatus comprising an input output unit, a processor, a memory, and financial individual loan risk assessment instructions stored on the memory and executable by the processor, wherein the financial individual loan risk assessment instructions, when executed by the processor, implement the steps of the financial individual loan risk assessment method of any of claims 1 to 8.