CN117670509A

CN117670509A - Method and related device for training a default risk prediction model and for default risk prediction

Info

Publication number: CN117670509A
Application number: CN202311540596.XA
Authority: CN
Inventors: 林娜; 吕杨; 蔡智; 邓健
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2024-03-08

Abstract

The application discloses a method and a related device for training a default risk prediction model and for default risk prediction, which can be applied to the field of artificial intelligence or the field of finance. If the default risk prediction model were trained on sample clients described by candidate attribute features whose importance degree is below the first threshold, or by attribute features and derivative features whose significance level is higher than a preset significance level, the accuracy of the default risk probability output by the model would be reduced and the data volume involved in training would increase. Therefore, the default risk prediction model is trained on sample clients described by a plurality of target attribute features and target derivative features, which greatly reduces the data volume in training and improves the accuracy of the default risk probability output by the model. An accurate default risk probability can thus be obtained for the client to be tested, and it can be determined whether to grant a loan to that client.

Description

Method and related device for training a default risk prediction model and for default risk prediction

Technical Field

The present application relates to the field of artificial intelligence, and more particularly, to a method and related device for training a default risk prediction model and for default risk prediction.

Background

Among the many businesses of a bank, the individual housing loan business is characterized by long loan terms and large loan amounts. Once an individual housing loan defaults, the liquidity and safety of the bank's funds are negatively affected. Therefore, identifying customers when developing individual housing loan customers allows risk to be controlled at the source.

At present, whether a customer poses a default risk is determined by the subjective judgment of a customer manager based on his or her own business experience. When business personnel lack experience, the risk identification result is likely to deviate considerably from the real situation.

Disclosure of Invention

In view of the foregoing, the present application provides a method and related device for training a default risk prediction model and for default risk prediction.

In order to achieve the above purpose, the present application provides the following technical solutions:

According to a first aspect of an embodiment of the present disclosure, there is provided a method for training a default risk prediction model, including:

obtaining importance degrees corresponding to a plurality of candidate attribute features respectively, wherein the plurality of candidate attribute features comprise: customer base information, individual housing loan application information, and loan mortgage information;

acquiring, from the plurality of candidate attribute features, attribute features whose importance degree has an absolute value greater than or equal to a first threshold;

taking the attribute values of the attribute features of a sample client as input and the labeled default risk probability of the sample client as the training target, and training to obtain a gradient boosting decision tree model, wherein the gradient boosting decision tree model comprises one or more decision trees, each non-leaf node of a decision tree corresponds to one attribute feature, and the child nodes of a non-leaf node are obtained by splitting on the attribute value of the attribute feature corresponding to that non-leaf node;

for each decision tree, obtaining derivative features corresponding to leaf nodes in the decision tree, wherein the derivative feature corresponding to a leaf node is formed by the attribute features and attribute values traversed on the path from the root node of the decision tree to that leaf node;

taking the attribute values of the attribute features and the attribute values of the derivative features of the sample client as input and the labeled default risk probability of the sample client as the training target, and training to obtain a logistic regression model;

obtaining, from the logistic regression model, the significance level corresponding to each attribute feature and the significance level corresponding to each derivative feature;

acquiring, from the attribute features and the derivative features, target attribute features and target derivative features whose significance level is lower than or equal to a preset significance level;

and taking the attribute values of the target attribute features and the attribute values of the target derivative features of the sample client as input and the labeled default risk probability of the sample client as the training target, and training to obtain a default risk prediction model.

According to a second aspect of embodiments of the present disclosure, there is provided a default risk prediction method, including:

acquiring a data set to be tested of a client to be tested, wherein the data set to be tested comprises attribute values of the target attribute features and attribute values of the target derivative features of the client to be tested;

and inputting the data set to be tested into a default risk prediction model, and obtaining the default risk probability of the client to be tested through the default risk prediction model, wherein the default risk prediction model is trained by the default risk prediction model training method according to the first aspect.

According to a third aspect of embodiments of the present disclosure, there is provided an apparatus for training a default risk prediction model, including:

a first acquisition module, configured to obtain importance degrees corresponding to a plurality of candidate attribute features respectively, wherein the plurality of candidate attribute features comprise: customer base information, individual housing loan application information, and loan mortgage information;

a second acquisition module, configured to acquire, from the plurality of candidate attribute features, attribute features whose importance degree has an absolute value greater than or equal to a first threshold;

a first training module, configured to take the attribute values of the attribute features of a sample client as input and the labeled default risk probability of the sample client as the training target, and train to obtain a gradient boosting decision tree model, wherein the gradient boosting decision tree model comprises one or more decision trees, each non-leaf node of a decision tree corresponds to one attribute feature, and the child nodes of a non-leaf node are obtained by splitting on the attribute value of the attribute feature corresponding to that non-leaf node;

a third acquisition module, configured to obtain, for each decision tree, derivative features corresponding to leaf nodes in the decision tree, wherein the derivative feature corresponding to a leaf node is formed by the attribute features and attribute values traversed on the path from the root node of the decision tree to that leaf node;

a second training module, configured to take the attribute values of the attribute features and the attribute values of the derivative features of the sample client as input and the labeled default risk probability of the sample client as the training target, and train to obtain a logistic regression model;

a fourth acquisition module, configured to obtain, from the logistic regression model, the significance level corresponding to each attribute feature and the significance level corresponding to each derivative feature;

a fifth acquisition module, configured to acquire, from the attribute features and the derivative features, target attribute features and target derivative features whose significance level is lower than or equal to a preset significance level;

and a third training module, configured to take the attribute values of the target attribute features and the attribute values of the target derivative features of the sample client as input and the labeled default risk probability of the sample client as the training target, and train to obtain a default risk prediction model.

According to a fourth aspect of embodiments of the present disclosure, there is provided a default risk prediction apparatus, including:

a sixth acquisition module, configured to acquire a data set to be tested of a client to be tested, wherein the data set to be tested comprises attribute values of the target attribute features and attribute values of the target derivative features of the client to be tested;

and a prediction module, configured to input the data set to be tested into a default risk prediction model and obtain the default risk probability of the client to be tested through the default risk prediction model, wherein the default risk prediction model is trained by the default risk prediction model training method according to any one of claims 1 to 6.

According to a fifth aspect of embodiments of the present disclosure, there is provided a server comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the default risk prediction model training method of the first aspect or the default risk prediction method of the second aspect.

According to the above technical solution, a method for training a default risk prediction model is provided. The target attribute features used to train the default risk prediction model have an importance degree whose absolute value is greater than or equal to the first threshold, and the target attribute features and target derivative features have a significance level lower than or equal to the preset significance level. Training the default risk prediction model on the attribute values of the target attribute features and target derivative features can therefore improve its accuracy. Candidate attribute features whose importance degree is below the first threshold, and attribute features and derivative features whose significance level is higher than the preset significance level, have little or no influence on the default risk probability and are therefore of no benefit for training the default risk prediction model. If the model were trained on sample clients described by such features, the accuracy of the default risk probability output by the model would be reduced and the data volume involved in training would increase. Therefore, training the default risk prediction model on sample clients described by a plurality of target attribute features and target derivative features greatly reduces the data volume in training and improves the accuracy of the default risk probability output by the model.
Because the default risk probability output by the default risk prediction model is accurate, an accurate default risk probability can be obtained for the client to be tested, and it can be determined whether to grant a loan to that client.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings may be obtained according to the provided drawings without inventive effort to a person skilled in the art.

FIG. 1 is a schematic diagram of a hardware architecture according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a method of training a default risk prediction model, according to an exemplary embodiment;

FIG. 3 is a schematic diagram of a trained gradient boosting decision tree model provided in an embodiment of the present application;

FIG. 4 is a flowchart of a default risk prediction method according to an embodiment of the present application;

FIG. 5 is a block diagram of a default risk prediction model training device, shown in accordance with an exemplary embodiment;

FIG. 6 is a block diagram of a default risk prediction device, according to an exemplary embodiment;

FIG. 7 is a block diagram illustrating an apparatus for a server according to an exemplary embodiment.

Detailed Description

The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.

The embodiments of the present application provide a method and related device for training a default risk prediction model and for default risk prediction. Before the technical solutions provided by the embodiments are introduced, the hardware architecture involved in the present application is described.

As shown in fig. 1, a schematic diagram of a hardware architecture according to an embodiment of the present application includes: an electronic device 11, a server 12 and a server 13.

By way of example, the electronic device 11 may be any electronic product that can interact with a user in one or more ways, such as a keyboard, a touch pad, a touch screen, a remote control, a voice interaction or a handwriting device, for example, a mobile phone, a tablet computer, a palm top computer, a personal computer, a wearable device, a smart television, etc.

For example, the electronic device 11 may be a device of a bank counter, or a device held by a loan officer, so as to input, via the electronic device 11, the attribute values of the target attribute features and the attribute values of the target derivative features of the customer under test.

Illustratively, the server 12 stores a pre-built default risk prediction model. The electronic device 11 sends the attribute values of the target attribute features and the attribute values of the target derivative features of the client to be tested to the server 12, so that the server 12 runs the default risk prediction method provided in the embodiments of the present application, thereby obtaining the default risk probability of the client to be tested in order to determine whether to grant a loan to that client.

The server 12 may be a server, a server cluster comprising a plurality of servers, or a cloud computing service center, for example.

The server 13 is illustratively a server that trains the default risk prediction model. For example, the server 13 may perform the default risk prediction model training method; after the server 13 trains or updates the default risk prediction model, the model is synchronized to the server 12.

The server 13 may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center.

The server 12 and the server 13 may be the same server or different servers, for example.

Those skilled in the art will appreciate that the above-described electronic devices and servers are merely examples, and that other existing or future-occurring electronic devices or servers, as applicable to the present disclosure, are intended to be within the scope of the present disclosure and are incorporated herein by reference.

The method for training the default risk prediction model provided in the embodiments of the present application is described below with reference to the above hardware architecture.

FIG. 2 is a flowchart illustrating a method of training a default risk prediction model according to an exemplary embodiment. As shown in FIG. 2, the method is used in the server 12 and includes the following steps S21 to S28.

Step S21: obtaining importance degrees corresponding to a plurality of candidate attribute features respectively, wherein the plurality of candidate attribute features comprise: customer base information, individual housing loan application information, and loan mortgage information.

The three types of candidate attribute features, namely customer basic information, individual housing loan application information, and loan mortgage information, are described below.

In the first case, the customer basic information illustratively includes, but is not limited to: age, gender, marital status, family size, education level, industry of the employer, occupation, position, years of employment, and monthly income. The individual housing loan application information illustratively includes, but is not limited to, one or more of: loan amount, repayment mode, total number of repayment periods, house purchase purpose, and whether the loan is a first-home loan. The loan mortgage information illustratively includes, but is not limited to: the developer brand of the intended building, the enterprise nature of the developer of the intended building, whether the house is a new house, whether the house is a pre-sale (off-plan) house, the market unit price of the house, and the area of the house to be purchased.

The intended building is the building to which the house to be purchased by the customer belongs.

By way of example, the enterprise nature of the developer of the intended building may be private enterprise, state-owned enterprise, or central enterprise. The enterprise nature of the developer affects the probability that the building is left unfinished; if the developer is a state-owned enterprise or a central enterprise, that probability is extremely low. It will be appreciated that if the house to be purchased is a pre-sale house and the probability that the developer leaves the building unfinished is high, the probability of customer default increases.

If the developer's capital chain breaks or its liquidity becomes tight for some reason, the building may be left unfinished, and the customer may then fail to repay on time, resulting in default.

For example, the house purchase purpose is classified into self-occupation, relatives, renting out, and investment. With all other features equal, the default probability when the purpose is investment is higher than when the purpose is renting out, and the default probability when the purpose is renting out is higher than when the purpose is self-occupation/relatives. Customers who buy housing for investment are relatively more prone to default than customers who buy to meet a rigid housing need.

It can be understood that the purchasing power of the same amount of money changes with economic development; for example, the purchasing power of 100,000 yuan in 2010 is greater than that of 100,000 yuan in 2023. To widen the time range over which the trained default risk prediction model can be applied, for example so that it can be applied in both 2023 and 2033, the present application processes the monthly income, the loan amount, and the house market price, as detailed in the second case.

In the second case, the monetary features of the first case are processed: the monthly income included in the customer basic information is replaced by a comparable monthly income; the loan amount included in the individual housing loan application information is replaced by a comparable loan amount; and the house market price included in the loan mortgage information is replaced by a comparable market price.

Here, comparable monthly income = monthly income / consumer price index of the current year (with the set year as the base period); comparable loan amount = loan amount / the consumer price index; comparable market price = house market price / the consumer price index.

Illustratively, the set year may be any historical year. Illustratively, the Consumer Price Index (CPI) is published by the National Bureau of Statistics; it is a relative number reflecting the trend and degree of price changes of the consumer goods and services purchased by urban and rural residents over a given period, and is obtained by aggregating the urban and rural consumer price indices. Illustratively, the "current year" in the above formulas changes over time, but the index is always expressed relative to the set year as the base period.

Processing the monetary features of the first case in this way removes the influence of price changes between years and improves the accuracy of the default risk prediction model across different years.
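The deflation step described above can be sketched as follows. This is a minimal illustration only: the function name `to_comparable` and the CPI figures are hypothetical (they are not real statistics and do not come from the application), and the index is assumed to be expressed as a ratio relative to the set year.

```python
def to_comparable(amount, cpi):
    """Deflate a monetary amount by a CPI expressed relative to the set (base) year."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return amount / cpi


# Hypothetical CPI values with 2010 as the set year (index 1.0); not real statistics.
cpi_by_year = {2010: 1.00, 2023: 1.30}

# Monthly incomes recorded in different years become comparable after deflation.
income_2010 = to_comparable(10_000, cpi_by_year[2010])
income_2023 = to_comparable(13_000, cpi_by_year[2023])
```

Under these assumed index values, a nominal income of 13,000 in 2023 and a nominal income of 10,000 in 2010 map to approximately the same comparable income, which is the effect the second case aims for.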

In the third case, on the basis of the first case or the second case, the plurality of candidate attribute features further includes a loan ratio and an income-repayment ratio, where loan ratio = comparable loan amount / (total house price / consumer price index), and income-repayment ratio = monthly repayment amount / monthly income.

It will be appreciated that the higher the loan ratio and the income-repayment ratio, the greater the repayment pressure and the higher the probability of customer default; the lower the loan ratio and the income-repayment ratio, the smaller the repayment pressure and the lower the probability of customer default.
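As a worked illustration of the two ratios, the following sketch computes them for one hypothetical customer. The function names and all amounts are assumptions chosen for the example, and the CPI is again assumed to be a ratio relative to the set year.

```python
def loan_ratio(comparable_loan_amount, total_house_price, cpi):
    """Loan ratio = comparable loan amount / (total house price / CPI)."""
    return comparable_loan_amount / (total_house_price / cpi)


def income_repayment_ratio(monthly_repayment, monthly_income):
    """Income-repayment ratio = monthly repayment amount / monthly income."""
    return monthly_repayment / monthly_income


# Hypothetical customer: nominal loan 700,000 on a house priced 1,000,000, CPI 1.25.
cpi = 1.25
comparable_loan = 700_000 / cpi
customer_loan_ratio = loan_ratio(comparable_loan, 1_000_000, cpi)
customer_repayment_ratio = income_repayment_ratio(4_000, 10_000)
```

Because both the loan amount and the house price are deflated by the same index, the loan ratio here equals the nominal ratio of 0.7; the deflation matters only when the two amounts are recorded against different index values.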

The importance degree of a candidate attribute feature characterizes whether the candidate attribute feature affects the default risk probability. It can be appreciated that the greater the absolute value of the importance degree, the more likely the candidate attribute feature is to affect the default risk probability; the smaller the absolute value, the less likely it is to do so. If a candidate attribute feature has little or no influence on the default risk probability, it needs to be removed.

Step S22: acquiring, from the plurality of candidate attribute features, attribute features whose importance degree has an absolute value greater than or equal to a first threshold.

Illustratively, the importance degree of a candidate attribute feature is its IV (Information Value). The first threshold may be set according to the actual situation, for example 0.02.
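The application does not spell out how the IV is computed. A minimal sketch using the commonly used weight-of-evidence definition, IV = Σ (p_bad − p_good) · ln(p_bad / p_good) over the bins of a feature, might look as follows. The smoothing constant `eps`, the function names, and the toy data are assumptions for illustration; the feature is assumed to be categorical or already binned.

```python
import math
from collections import Counter


def information_value(values, labels, eps=0.5):
    """IV of a categorical (or pre-binned) feature; label 1 = default, 0 = non-default.

    eps is an assumed additive smoothing constant that avoids log(0) for empty bins.
    """
    bad = Counter(v for v, y in zip(values, labels) if y == 1)
    good = Counter(v for v, y in zip(values, labels) if y == 0)
    bins = set(bad) | set(good)
    total_bad = sum(bad.values()) + eps * len(bins)
    total_good = sum(good.values()) + eps * len(bins)
    iv = 0.0
    for b in bins:
        p_bad = (bad[b] + eps) / total_bad
        p_good = (good[b] + eps) / total_good
        iv += (p_bad - p_good) * math.log(p_bad / p_good)
    return iv


def select_features(columns, labels, threshold=0.02):
    """Keep features whose |IV| is greater than or equal to the first threshold."""
    return [name for name, values in columns.items()
            if abs(information_value(values, labels)) >= threshold]


# Toy data (hypothetical): 3 defaulting and 5 non-defaulting sample clients.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
columns = {
    "house_purpose": ["investment", "investment", "rent",
                      "self", "self", "self", "rent", "self"],
    "constant_flag": ["x"] * 8,  # carries no information, so its IV is 0
}
selected = select_features(columns, labels)
```

In this toy data the constant column is dropped and the house purchase purpose is kept, mirroring the filtering in step S22.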

Step S23: taking the attribute values of the attribute features of the sample clients as input and the labeled default risk probability of the sample clients as the training target, and training to obtain a gradient boosting decision tree model.

The gradient boosting decision tree model comprises one or more decision trees; each non-leaf node of a decision tree corresponds to one attribute feature, and the child nodes of a non-leaf node are obtained by splitting on the attribute value of the attribute feature corresponding to that non-leaf node.

Attribute values of attribute features are illustrated as follows: an attribute value may, for example, be any positive integer greater than 0, such as 20 or 21; the attribute value of the attribute feature "years of employment" is any positive integer such as 1, 2, or 3.

In an alternative implementation, to avoid the over-fitting that would result from training the logistic regression model and the gradient boosting decision tree model on the same data, stratified sampling may be used: the positive-sample client set is divided in a 7:2:1 ratio into a positive-sample test set data_test, a positive-sample training set data_train, and a positive-sample verification set data_verification; likewise, the negative-sample client set is divided in a 7:2:1 ratio into a negative-sample test set data_test, a negative-sample training set data_train, and a negative-sample verification set data_verification.

Then, again using stratified sampling, the positive-sample training set data_train and the negative-sample training set data_train are each divided in a 1:1 ratio to generate a training set data_train_gbdt for the gradient boosting decision tree model and a training set data_train_lr for the logistic regression model.

All sample clients in the positive-sample client set are defaulting clients; all sample clients in the negative-sample client set are non-defaulting clients.
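The two-stage stratified split can be sketched in plain Python as follows. The helper `split_by_ratio`, the seeds, and the toy client identifiers are hypothetical. The sketch also assumes that the largest share of the 7:2:1 split goes to the training set; the text above does not state which share belongs to which subset.

```python
import random


def split_by_ratio(items, ratios, seed=0):
    """Shuffle items and cut them into consecutive parts sized by the given ratios."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    total = sum(ratios)
    parts, start = [], 0
    for i, r in enumerate(ratios):
        # The last part takes the remainder so that no item is lost to rounding.
        end = len(shuffled) if i == len(ratios) - 1 else start + len(shuffled) * r // total
        parts.append(shuffled[start:end])
        start = end
    return parts


# Toy sample clients (hypothetical): label 1 = defaulting, 0 = non-defaulting.
positives = [("P%d" % i, 1) for i in range(20)]
negatives = [("N%d" % i, 0) for i in range(80)]

# Stage 1: split each class separately (stratified), assumed train:test:verification = 7:2:1.
pos_train, pos_test, pos_verification = split_by_ratio(positives, (7, 2, 1), seed=1)
neg_train, neg_test, neg_verification = split_by_ratio(negatives, (7, 2, 1), seed=2)

# Stage 2: split each class's training part 1:1 between the GBDT and the LR model.
pos_gbdt, pos_lr = split_by_ratio(pos_train, (1, 1), seed=3)
neg_gbdt, neg_lr = split_by_ratio(neg_train, (1, 1), seed=4)

data_train_gbdt = pos_gbdt + neg_gbdt
data_train_lr = pos_lr + neg_lr
```

Splitting each class on its own keeps the proportion of defaulting clients the same in every subset, and the 1:1 stage ensures that the two models never see the same training clients.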

For example, the numbers of sample clients in the negative-sample client set and the positive-sample client set may differ greatly, with the negative-sample client set containing too few sample clients. To solve this problem, the negative-sample client set may be sampled with replacement using a Bagging method, thereby obtaining a negative-sample client set that contains more sample clients.
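The sampling-with-replacement step can be sketched as below. Only the bootstrap resampling that Bagging relies on is shown, not a full Bagging ensemble; the function name, the target size, and the client identifiers are assumptions for illustration.

```python
import random


def resample_with_replacement(samples, target_size, seed=0):
    """Draw target_size items from samples with replacement (bootstrap sampling)."""
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in range(target_size)]


# Hypothetical undersized client set enlarged to 50 sample clients.
small_set = ["C%d" % i for i in range(10)]
enlarged_set = resample_with_replacement(small_set, target_size=50, seed=42)
```

Because the draw is with replacement, the enlarged set contains repeated clients; it balances the class sizes but adds no new information.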

Step S24: for each decision tree, obtaining the derivative features corresponding to the leaf nodes in the decision tree.

The derivative feature corresponding to a leaf node is formed by the attribute features and attribute values traversed on the path from the root node to that leaf node.

Illustratively, the gradient boosting decision tree model is a GBDT (Gradient Boosting Decision Tree) model.

The trained gradient boosting decision tree model is described below by way of example.

FIG. 3 is a schematic diagram of a trained gradient boosting decision tree model according to an embodiment of the present application.

The gradient boosting decision tree model shown in FIG. 3 includes two decision trees; FIG. 3 is merely an example and does not limit the number of decision trees that the model includes. Assume that there are 6 attribute features, namely attribute features A₁, A₂, A₃, A₄, A₅, and A₆.

As shown in FIG. 3, each non-leaf node corresponds to one attribute feature. For example, the root node of the decision tree on the left side of FIG. 3 corresponds to attribute feature A₃, and its two child nodes correspond to attribute features A₂ and A₅, respectively. Each leaf node corresponds to one sample client set; as shown in FIG. 3, classification 1 to classification 8 are all leaf nodes, and the sample clients in the sample client set corresponding to any one of classification 1 to classification 8 are either all defaulting sample clients or all non-defaulting sample clients. The classifications differ from one another because they are reached through different branches of the decision trees.

How the child nodes of a non-leaf node are obtained by splitting on the attribute value of its attribute feature is illustrated below.

Based on the attribute value of attribute feature A₃, the sample clients at the root node corresponding to attribute feature A₃ can be divided into two sample client sets (each containing one or more sample clients), and each sample client set corresponds to one child node; this yields the child node corresponding to attribute feature A₂ and the child node corresponding to attribute feature A₅.

The following describes that the derivative feature corresponding to the leaf node is formed by attribute features and attribute values which pass from a father node to the leaf node.

Illustratively, classifications 1 through 8 are each leaf nodes, and classifications 1 through 8 correspond to sample client sets, respectively.

Acquiring classifications belonging to a first type from classifications 1 through 8; the class belonging to the second type is acquired from the classes 1 to 8. The first type is assumed to comprise a class 1 and a class 4, namely, sample clients contained in sample client sets corresponding to the class 1 and the class 4 are all default clients; the second type includes class 6, i.e., the sample clients included in the sample client set corresponding to class 6 are all non-offending clients. Since the risk of surprise of the sample clients corresponding to the first type is consistent and the risk of surprise of the sample clients corresponding to the second type is consistent, derivative features can be obtained by the first type and the second type.

It can be seen in connection with fig. 3 that class 1 belonging to the first type is according to attribute feature a ₃ Attribute value < cut point 1 and attribute feature a ₅ The attribute value of (2) is equal to or greater than the dividing rule of the dividing point, so the derivative characteristics corresponding to the classification 1 are as follows: attribute feature A ₃ Attribute value < cut point 1 and attribute feature a ₅ The attribute value of (2) is more than or equal to the segmentation point; the category 4 belonging to the first type is according to the attribute characteristics a ₃ Attribute value is equal to or greater than division point 1 and attribute feature A ₂ The attribute value < the division rule of the division point 3, so the derivative features corresponding to the classification 4 are as follows: attribute feature A ₃ Attribute value is equal to or greater than division point 1 and attribute feature A ₂ Attribute value < cut point 3.

It can be seen in connection with fig. 3 that the classification 6 belonging to the second type is according to the attribute characteristics a ₁ Attribute value is equal to or greater than segmentation point 4 and attribute feature A ₄ The attribute value of (2) is greater than or equal to the division rule of the division point 5, so the derivative characteristics corresponding to the classification 6 are as follows: attribute feature A ₁ Attribute value is equal to or greater than segmentation point 4 and attribute feature A ₄ The attribute value of (2) is more than or equal to the segmentation point 5.

Illustratively, in this application scenario, the derivative feature may be (age ≤ 24) and (house purchase purpose is own residence/relatives) and (comparable market price > 25663.47).

Step S25: taking the attribute values of the attribute features and the attribute values of the derivative features of the sample clients as input, and taking the labeled default risk probability of the sample clients as the training target, train to obtain a logistic regression model.

It can be understood that a derivative feature is a combination of a plurality of attribute features. The attribute value of a derivative feature is defined as follows: for any derivative feature, if the sample client satisfies the derivative feature, the attribute value of that derivative feature for the sample client is 1; if the sample client does not satisfy the derivative feature, the attribute value is 0.
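Illustratively, the 0/1 attribute value of a derivative feature can be sketched as follows. This is a minimal sketch only: the function name `derived_feature_value`, the field names, and the two sample clients are hypothetical, and the rule merely mirrors the example rule on age, house purchase purpose, and comparable market price given above.

```python
import operator

# Supported comparisons for a single condition of a derivative feature.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq}

def derived_feature_value(client, rule):
    """Return 1 if the client satisfies every condition of the rule, else 0."""
    return int(all(OPS[op](client[name], value) for name, op, value in rule))

# Hypothetical rule: a conjunction of conditions read off one decision tree path.
rule = [("age", "<=", 24),
        ("purchase_purpose", "==", "own residence/relatives"),
        ("comparable_market_price", ">", 25663.47)]

# Hypothetical sample clients, for illustration only.
client_a = {"age": 23, "purchase_purpose": "own residence/relatives",
            "comparable_market_price": 30000.0}
client_b = {"age": 40, "purchase_purpose": "own residence/relatives",
            "comparable_market_price": 30000.0}

print(derived_feature_value(client_a, rule))  # 1
print(derived_feature_value(client_b, rule))  # 0
```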

Step S26: obtaining, from the logistic regression model, the significance level corresponding to each attribute feature and the significance level corresponding to each derivative feature.

Step S27: acquiring, from the attribute features and the derivative features, target attribute features and target derivative features whose significance levels are lower than or equal to a preset significance level.

Illustratively, the logistic regression coefficients and the standard errors respectively corresponding to the attribute features and the derivative features can be obtained from the logistic regression model, for example, by using the function coef() or summary(). The significance level of each attribute feature and each derivative feature can then be calculated from its logistic regression coefficient and its standard error.

Illustratively, the preset significance level may be set according to the actual situation, for example, 0.05.
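Illustratively, one common way of deriving a significance level from a coefficient and its standard error is the two-sided Wald z-test; the text does not state which test is used, so this choice is an assumption. The sketch below uses hypothetical feature names, coefficients, and standard errors, and keeps the features whose p-value is lower than or equal to 0.05.

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value of the Wald z-statistic coef / std_err."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

def select_features(estimates, alpha=0.05):
    """Keep the features whose significance level is <= alpha."""
    return [name for name, (coef, std_err) in estimates.items()
            if wald_p_value(coef, std_err) <= alpha]

# Hypothetical (coefficient, standard error) pairs, for illustration only.
estimates = {
    "age": (0.80, 0.20),               # z = 4.0
    "loan_ratio": (0.10, 0.20),        # z = 0.5
    "derived_rule_1": (-0.90, 0.30),   # z = -3.0
}

selected = select_features(estimates)
print(selected)  # ['age', 'derived_rule_1']
```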

Step S28: taking the attribute values of the target attribute features and the attribute values of the target derivative features of the sample clients as input, and taking the labeled default risk probability of the sample clients as the training target, train to obtain a default risk prediction model.

Illustratively, training the default risk prediction model involves at least one machine learning technique such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

Illustratively, the default risk prediction model may be any one of a neural network model, a logistic regression model, a linear regression model, a support vector machine (SVM), AdaBoost, a boosting tree model, and a Transformer-encoder model.

Illustratively, the neural network model may be any one of a recurrent-neural-network-based model, a convolutional-neural-network-based model, and a Transformer-encoder-based classification model.

Illustratively, the default risk prediction model may be a deep hybrid model of a recurrent-neural-network-based model, a convolutional-neural-network-based model, and a Transformer-encoder-based classification model.

Illustratively, the default risk prediction model may be either an attention-based deep model or a memory-network-based deep model.

Illustratively, the deep-learning-based short text classification model is a recurrent neural network (RNN), a convolutional neural network (CNN), or a variant of a recurrent neural network or a convolutional neural network.

Illustratively, some simple domain adaptations may be made on an already pre-trained model to arrive at a breach risk prediction model.

Illustratively, "simple domain adaptation" includes, but is not limited to, performing secondary pre-training on an already pre-trained model with a large-scale unsupervised domain corpus, and/or compressing an already pre-trained model by means of model distillation.

Illustratively, the process of training the default risk prediction model described above may be supervised learning. Semi-supervised learning may also be performed on the machine learning model. Semi-supervised learning is a learning method that combines supervised learning with unsupervised learning: it uses a large amount of unlabeled data together with labeled data to perform pattern recognition tasks.

The embodiment of the application provides a default risk prediction model training method. The importance degrees of the target attribute features of the sample clients used to train the default risk prediction model are higher than the first threshold, and the significance levels of the target attribute features and the target derivative features are lower than or equal to the preset significance level, so training the default risk prediction model with the attribute values of the target attribute features and the target derivative features can improve the accuracy of the model. Candidate attribute features whose importance degrees are lower than the first threshold, and attribute features and derivative features whose significance levels are higher than the preset significance level, do not affect the default risk probability and therefore do not benefit the training of the default risk prediction model. If the default risk prediction model were trained with sample clients that include candidate attribute features whose importance degrees are smaller than or equal to the first threshold, or attribute features and derivative features whose significance levels are higher than the preset significance level, the accuracy of the default risk probability output by the model would be reduced and the data volume in the training process would be increased. Therefore, the default risk prediction model is trained with sample clients comprising a plurality of target attribute features and target derivative features, which greatly reduces the data volume in the training process and improves the accuracy of the default risk probability output by the model. Because the default risk probability output by the default risk prediction model is accurate, an accurate default risk probability of the client to be tested can be obtained, and whether to grant a loan to the client to be tested can be determined.

It can be appreciated that step S21 can be implemented in various ways, and the embodiments of the present application provide, but are not limited to, the following method. The implementation of step S21 includes the following steps A11 to A13.

Step A11: acquiring a candidate data set corresponding to the sample client, wherein the candidate data set comprises the attribute values respectively corresponding to the plurality of candidate attribute features.

Illustratively, a candidate attribute feature may be a discrete variable or a continuous variable. For example, if the candidate attribute feature is gender, it is a discrete variable; if the candidate attribute feature is age, it is a continuous variable.

Step A12: taking the candidate data set corresponding to the sample client as input, and taking the labeled default risk probability corresponding to the sample client as the training target, train to obtain an importance degree measurement model, wherein the importance degree measurement model comprises a decision tree, the decision tree comprises non-leaf nodes and leaf nodes, the non-leaf nodes are associated with the candidate attribute features, and the leaf nodes are associated with the labeled default risk probability.

Illustratively, the importance degree measurement model may be any one of an XGBoost model, a GBDT (Gradient Boosting Decision Tree) model, a boosting tree model, and a random forest model.

Illustratively, the decision tree is a CART (Classification And Regression Tree).

Step A13: obtaining the importance degrees respectively corresponding to the candidate attribute features based on the decision tree contained in the importance degree measurement model.

Illustratively, WOE (Weight of Evidence) values and IV (Information Value) values are calculated by means of the decision tree, and the importance degree of a candidate attribute feature is its IV value.
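Illustratively, the text does not give the WOE and IV formulas, so the sketch below assumes the commonly used definitions: for each bin, WOE = ln(share of non-defaulting clients in the bin / share of defaulting clients in the bin), and IV is the sum over bins of (non-defaulting share - defaulting share) × WOE. The bins are assumed to come from the split points of the decision tree; the counts are hypothetical.

```python
import math

def woe_iv(bins):
    """Compute WOE per bin and the total IV of one candidate attribute feature.

    bins: list of (non_default_count, default_count), one pair per bin.
    Assumes every count is positive so that the logarithm is defined.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / total_good
        bad_share = bad / total_bad
        woe = math.log(good_share / bad_share)
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv

# Hypothetical counts for three bins of one candidate attribute feature.
bins = [(400, 20), (300, 30), (100, 50)]
woes, iv = woe_iv(bins)
print([round(w, 3) for w in woes], round(iv, 3))
```

Under these assumptions, a candidate attribute feature would be retained when its IV reaches the first threshold.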

In an alternative implementation, step S28 can be implemented in various ways, and the embodiment of the present application provides, but is not limited to, a method including steps B11 to B14.

Step B11: obtaining a plurality of candidate default risk prediction models, wherein each candidate default risk prediction model comprises a first number of first machine learning models and one second machine learning model; the input of the first number of first machine learning models is the input of the candidate default risk prediction model, the outputs of the first number of first machine learning models are the input of the second machine learning model, the output of the second machine learning model is the output of the candidate default risk prediction model, and the first number of first machine learning models are of different types.

Among the plurality of candidate default risk prediction models having the same second machine learning model, the first number of first machine learning models are different; among the plurality of candidate default risk prediction models having the same first machine learning models, the second machine learning models are of different types.

The plurality of candidate default risk prediction models is described below by way of example.

Assume that the maximum number of first machine learning models included in a candidate default risk prediction model is 5. The types of the 5 first machine learning models are respectively: an extreme gradient boosting (XGBoost) model, an adaptive boosting (AdaBoost) model, a support vector machine (SVM) model, a random forest (RF) model, and a classification and regression tree (CART) model. Illustratively, the plurality of second machine learning models are respectively: an extreme gradient boosting (XGBoost) model, an adaptive boosting (AdaBoost) model, and a logistic regression model.

Assuming that the minimum number of first machine learning models included in a candidate default risk prediction model is 3, then 3 first candidate default risk prediction models, 15 second candidate default risk prediction models, and 30 third candidate default risk prediction models can be obtained, for a total of 48 candidate default risk prediction models.

A first candidate default risk prediction model comprises 5 first machine learning models and one second machine learning model; the 3 first candidate default risk prediction models comprise the same 5 first machine learning models and differ in their second machine learning models. A second candidate default risk prediction model comprises 4 first machine learning models and one second machine learning model. A third candidate default risk prediction model comprises 3 first machine learning models and one second machine learning model.
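Illustratively, the count of 48 candidates can be reproduced by enumerating every subset of 5, 4, or 3 first machine learning model types and pairing it with each of the 3 second machine learning model types. The function name and the type labels below are illustrative only.

```python
from itertools import combinations

first_level_types = ["XGBoost", "AdaBoost", "SVM", "RF", "CART"]
second_level_types = ["XGBoost", "AdaBoost", "LogisticRegression"]

def enumerate_candidates(first_types, second_types, min_size=3):
    """Yield (subset of first-level types, second-level type) pairs."""
    for size in range(len(first_types), min_size - 1, -1):
        for subset in combinations(first_types, size):
            for second in second_types:
                yield subset, second

candidates = list(enumerate_candidates(first_level_types, second_level_types))

# Count the candidates by the number of first-level models they contain.
counts = {}
for subset, _ in candidates:
    counts[len(subset)] = counts.get(len(subset), 0) + 1

print(counts, len(candidates))  # {5: 3, 4: 15, 3: 30} 48
```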

The following steps B12 to B13 are performed for each of the candidate default risk prediction models.

Step B12: for each first machine learning model, taking the attribute values of the target attribute features and the attribute values of the target derivative features of the sample clients as input, and taking the labeled default risk probability of the sample clients as the training target, train to obtain the first machine learning model.

Step B13: taking the default risk probabilities of the sample clients respectively output by the first number of first machine learning models as the input of the second machine learning model, and taking the labeled default risk probability of the sample clients as the training target, train to obtain the second machine learning model, so as to obtain the candidate default risk prediction model.

Step B14: determining the optimal candidate default risk prediction model among the plurality of candidate default risk prediction models as the default risk prediction model.
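Illustratively, the two-level structure of steps B12 and B13 can be sketched in plain Python as follows. The first-level learners here are toy one-feature threshold models and the second-level learner is a small gradient-descent logistic regression; they merely stand in for the XGBoost, AdaBoost, SVM, RF, and CART models named above. All function names and the data are hypothetical. The sketch follows the text literally by training the second-level model on first-level outputs for the same sample clients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_stump_fitter(feature_index):
    """Toy first-level learner: default rate above/below the feature mean."""
    def fit(X, y):
        threshold = sum(row[feature_index] for row in X) / len(X)
        high = [t for row, t in zip(X, y) if row[feature_index] >= threshold]
        low = [t for row, t in zip(X, y) if row[feature_index] < threshold]
        p_high = sum(high) / len(high) if high else 0.5
        p_low = sum(low) / len(low) if low else 0.5
        return lambda row: p_high if row[feature_index] >= threshold else p_low
    return fit

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Toy second-level learner: logistic regression by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for row, target in zip(X, y):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return lambda row: sigmoid(b + sum(wi * xi for wi, xi in zip(w, row)))

def fit_stacked(first_fitters, second_fitter, X, y):
    first_models = [fit(X, y) for fit in first_fitters]       # step B12
    meta_X = [[m(row) for m in first_models] for row in X]    # first-level outputs
    second_model = second_fitter(meta_X, y)                   # step B13
    return lambda row: second_model([m(row) for m in first_models])

# Hypothetical sample clients: two features, label 1 = default.
X = [[1, 10], [2, 9], [3, 8], [7, 3], [8, 2], [9, 1]]
y = [0, 0, 0, 1, 1, 1]
model = fit_stacked([make_stump_fitter(0), make_stump_fitter(1)], fit_logistic, X, y)
print(model([8, 2]), model([1, 10]))
```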

Illustratively, the AUC (Area Under Curve) values, accuracy rates, precision rates, and recall rates of the 48 candidate default risk prediction models described above may be calculated using the positive sample test set data_test and the negative sample test set data_test, and the optimal default risk prediction model is determined based on these values.
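Illustratively, the AUC value can be computed as the fraction of (positive, negative) test pairs in which the positive sample client receives the higher predicted default risk probability, with ties counted as one half. This pairwise formulation is one standard way of computing AUC and is an assumption here; the scores below are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted default risk probabilities on the two test sets.
pos_scores = [0.9, 0.8, 0.4, 0.7]
neg_scores = [0.1, 0.6, 0.2, 0.3, 0.05, 0.15]
auc_value = auc(pos_scores, neg_scores)
print(auc_value)
```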

Here, precision rate = number 1 / number 2, where number 1 is the number of sample clients that belong to the positive sample test set data_test and whose predicted default risk probability is higher than or equal to a preset threshold, and number 2 is the total number of sample clients in the positive sample test set data_test and the negative sample test set data_test whose predicted default risk probability is higher than or equal to the preset threshold. Recall rate = number 1 / number 3, where number 3 is the number of sample clients contained in the positive sample test set data_test. Misjudgment rate = number 4 / number 5, where number 4 is the sum of "the number of sample clients that belong to the positive sample test set data_test and whose predicted default risk probability is lower than the preset threshold" and "the number of sample clients that belong to the negative sample test set data_test and whose predicted default risk probability is higher than or equal to the preset threshold", and number 5 is the sum of the number of sample clients contained in the positive sample test set data_test and the number of sample clients contained in the negative sample test set data_test.

Illustratively, the preset threshold may be set according to the actual situation.
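Illustratively, the three ratios defined above can be computed as in the following sketch; the comments map each variable to "number 1" through "number 5". The function name, the scores, and the threshold of 0.5 are hypothetical.

```python
def classification_metrics(pos_scores, neg_scores, threshold):
    """Return (precision rate, recall rate, misjudgment rate).

    pos_scores: predicted default risk probabilities of the positive test set.
    neg_scores: predicted default risk probabilities of the negative test set.
    """
    number_1 = sum(s >= threshold for s in pos_scores)          # flagged positives
    flagged_negatives = sum(s >= threshold for s in neg_scores)
    number_2 = number_1 + flagged_negatives                     # all flagged clients
    number_3 = len(pos_scores)                                  # all positives
    number_4 = (number_3 - number_1) + flagged_negatives        # all misjudged clients
    number_5 = len(pos_scores) + len(neg_scores)                # all test clients
    precision = number_1 / number_2 if number_2 else 0.0
    recall = number_1 / number_3
    misjudgment = number_4 / number_5
    return precision, recall, misjudgment

# Hypothetical scores; the threshold is an illustrative choice.
pos_scores = [0.9, 0.8, 0.4, 0.7]
neg_scores = [0.1, 0.6, 0.2, 0.3, 0.05, 0.15]
precision, recall, misjudgment = classification_metrics(pos_scores, neg_scores, 0.5)
print(precision, recall, misjudgment)  # 0.75 0.75 0.2
```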

Illustratively, the plurality of candidate default risk prediction models are verified using the positive sample verification set data_verification and the negative sample verification set data_verification to obtain the AUC (Area Under Curve) values, accuracy rates, precision rates, and recall rates respectively corresponding to the plurality of candidate default risk prediction models. Illustratively, in a specific embodiment, the first machine learning models in the optimal candidate default risk prediction model are an AdaBoost model, an XGBoost model, and an SVM model, and the second machine learning model is a logistic regression model; the AUC value of the optimal candidate default risk prediction model is 0.837, the accuracy is higher than 0.85, and the accuracy is higher than 0.85.

In an alternative implementation, the number of sample clients contained in the positive sample client set may differ greatly from the number of sample clients contained in the negative sample client set. To avoid this problem, 4 balanced data sets may be generated by an undersampling method, an oversampling method, a method combining oversampling with undersampling, and an artificial data synthesis method; the 4 balanced data sets are respectively data_balance_under, data_balance_over, data_balance_both, and data_rose. The logistic regression model mentioned in step S25 is trained separately using the 4 balanced data sets. The specific steps include C01 to C11.

Step C01: obtaining a positive sample client set and a negative sample client set, wherein the positive sample client set comprises the attribute values of the attribute features and the attribute values of the derivative features of defaulting sample clients, and the negative sample client set comprises the attribute values of the attribute features and the attribute values of the derivative features of non-defaulting sample clients.

Step C02: from the positive and negative sample client sets, a first set of clients is obtained that contains a greater number of sample clients, and a second set of clients is obtained that contains a lesser number of sample clients.

Step C03: undersampling the first client set to obtain a third client set, wherein the absolute value of the difference between the number of sample clients contained in the third client set and the number of sample clients contained in the second client set is smaller than or equal to a preset first value.

The embodiment of the application refers to the attribute values of the attribute features of one sample client and the attribute values of the derivative features as sample data.

Undersampling mainly processes the first client set, balancing the data set by reducing the number of sample data in the first client set. This approach is preferable when the sample size of the data set is large, and it also reduces computation time and storage overhead by reducing the training sample size. Its significant drawback is that, because a large amount of sample data has to be deleted, important information in the sample data contained in the first client set may be lost.

There are two main types of undersampling methods: random undersampling and informative undersampling. Random undersampling randomly deletes sample data contained in the first client set until the data set is balanced. Informative undersampling deletes sample data contained in the first client set according to a predetermined criterion.

The third client set and the second client set are combined, and a logistic regression model is trained on the combined data set.

The balanced data set data_balance_under includes a third set of clients and a second set of clients.
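Illustratively, random undersampling can be sketched as follows. The sketch assumes that the first client set is the non-defaulting one and that the preset first value is 0, so the third client set has exactly as many sample clients as the second client set; the record layout and the function name are hypothetical.

```python
import random

def random_undersample(first_set, target_size, seed=0):
    """Randomly keep target_size sample clients of the larger set, without replacement."""
    rng = random.Random(seed)
    return rng.sample(first_set, target_size)

# Hypothetical sample data: 10 non-defaulting and 3 defaulting sample clients.
first_set = [{"client_id": i, "default": 0} for i in range(10)]
second_set = [{"client_id": 100 + i, "default": 1} for i in range(3)]

third_set = random_undersample(first_set, len(second_set))
data_balance_under = third_set + second_set
print(len(third_set), len(data_balance_under))  # 3 6
```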

Step C04: oversampling the second client set to obtain a fourth client set, wherein the absolute value of the difference between the number of sample clients contained in the fourth client set and the number of sample clients contained in the first client set is smaller than or equal to the preset first value.

Oversampling processes the second client set, balancing the data by repeating the sample data contained in the second client set to obtain the fourth client set. Its advantage is that no information is lost. Its disadvantage is that adding repeated samples of the second client set may cause overfitting, that is, a very high fitting accuracy is obtained on the training set while the predictive performance on the test set is poor; the computation time and storage overhead are also greatly increased.

Similar to undersampling, oversampling also falls into two categories: random oversampling and informative oversampling. Random oversampling randomly selects sample data contained in the second client set and repeats them; informative oversampling generates sample data for the second client set according to certain criteria.

The balanced data set data_balance_over includes a fourth set of clients and a first set of clients.
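Illustratively, random oversampling can be sketched as follows. The sketch assumes that the preset first value is 0, so the second client set is repeated with replacement until it matches the size of the first client set; the record layout and the function name are hypothetical.

```python
import random

def random_oversample(second_set, target_size, seed=0):
    """Repeat randomly chosen sample clients of the smaller set up to target_size."""
    rng = random.Random(seed)
    extra = [rng.choice(second_set) for _ in range(target_size - len(second_set))]
    return second_set + extra

# Hypothetical sample data: 10 non-defaulting and 3 defaulting sample clients.
first_set = [{"client_id": i, "default": 0} for i in range(10)]
second_set = [{"client_id": 100 + i, "default": 1} for i in range(3)]

fourth_set = random_oversample(second_set, len(first_set))
data_balance_over = fourth_set + first_set
print(len(fourth_set), len(data_balance_over))  # 10 20
```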

Step C05: undersampling the first customer set to obtain a fifth customer set; performing oversampling processing on the second client set to obtain a sixth client set; the absolute value of the difference between the number of sample clients contained in the fifth client set and the number of sample clients contained in the sixth client set is less than or equal to the preset first value.

To prevent overfitting and to avoid losing too much information from the sample data contained in the first client set, the oversampling method and the undersampling method can be combined: the second client set is oversampled with replacement to obtain the sixth client set, and the first client set is undersampled with replacement to obtain the fifth client set.

The balanced data set data_balance_both includes a fifth client set and a sixth client set.

Step C06: performing artificial data synthesis on the second client set to obtain a seventh client set, wherein the absolute value of the difference between the number of sample clients contained in the seventh client set and the number of sample clients contained in the first client set is smaller than or equal to the preset first value.

The artificial data synthesis method solves the problem of imbalance between data classes by generating data rather than repeating the original sample data, and is essentially also an oversampling technique. SMOTE (Synthetic Minority Oversampling Technique) is an effective and commonly used method of this kind; it generates, in the feature space, new data similar to the sample data contained in the second client set. Its basic principle is to measure similarity by Euclidean distance and to generate artificial sample data in the feature space: (1) calculate the distances between sample data and determine the neighbors of each sample; (2) generate a uniform random number between 0 and 1 and multiply it by the distance to obtain a value; (3) add the value generated in step (2) to the feature vector of the sample data to obtain new sample data. This process is equivalent to randomly selecting a point on the line connecting two sample data as the new sample data.

The balanced data set data_rose includes a seventh set of clients and a first set of clients.
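Illustratively, the three SMOTE steps described above can be sketched as follows. The sketch interprets "multiplying by the distance" as multiplying by the difference vector between a sample and one of its nearest neighbors, which matches the description of picking a point on the connecting line; the function name, the choice k = 2, and the feature vectors are hypothetical.

```python
import math
import random

def smote(second_set, n_new, k=2, seed=0):
    """Generate n_new synthetic feature vectors from the smaller client set."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(second_set)
        # (1) Euclidean distances to the other samples; keep the k nearest.
        others = [p for p in second_set if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # (2) Uniform random number in [0, 1) scaling the difference vector.
        gap = rng.random()
        # (3) Add the scaled difference to the base feature vector.
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical feature vectors: 8 samples in the first set, 4 in the second.
first_set = [[float(i), float(i % 3)] for i in range(8)]
second_set = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]]

synthetic = smote(second_set, len(first_set) - len(second_set))
seventh_set = second_set + synthetic
data_rose = seventh_set + first_set
print(len(seventh_set), len(data_rose))  # 8 16
```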

Step C07: taking the third client set and the second client set as input, and taking the labeled default risk probability of the sample clients contained in the third client set and the second client set as the training target, train to obtain a first logistic regression model.

Step C08: taking the fourth client set and the first client set as input, and taking the labeled default risk probability of the sample clients contained in the fourth client set and the first client set as the training target, train to obtain a second logistic regression model.

Step C09: taking the fifth client set and the sixth client set as input, and taking the labeled default risk probability of the sample clients contained in the fifth client set and the sixth client set as the training target, train to obtain a third logistic regression model.

Step C10: taking the seventh client set and the first client set as input, and taking the labeled default risk probability of the sample clients contained in the seventh client set and the first client set as the training target, train to obtain a fourth logistic regression model.

Step C11: determining the optimal model among the first logistic regression model, the second logistic regression model, the third logistic regression model, and the fourth logistic regression model as the logistic regression model.

Illustratively, the misjudgment rate, recall rate, precision rate, and AUC value are obtained for each of the first logistic regression model, the second logistic regression model, the third logistic regression model, and the fourth logistic regression model, and the optimal logistic regression model is selected based on the misjudgment rates, recall rates, precision rates, and AUC values respectively corresponding to the four models.

Fig. 4 is a flowchart of a default risk prediction method provided by an embodiment of the present application; the method includes steps S41 to S42.

Step S41: acquiring a data set to be tested of a client to be tested, wherein the data set to be tested comprises the attribute values of the target attribute features and the attribute values of the target derivative features of the client to be tested.

Step S42: inputting the data set to be tested into a default risk prediction model, and obtaining the default risk probability of the client to be tested through the default risk prediction model, wherein the default risk prediction model is trained by any of the default risk prediction model training methods described above.

The methods are described in detail in the embodiments disclosed above. The methods of the present application can be implemented by various forms of apparatus, so the present application also discloses apparatuses, specific embodiments of which are given below.

FIG. 5 is a block diagram illustrating a default risk prediction model training apparatus, according to an exemplary embodiment. Referring to fig. 5, the apparatus includes: a first obtaining module 51, a second obtaining module 52, a first training module 53, a third obtaining module 54, a second training module 55, a fourth obtaining module 56, a fifth obtaining module 57, and a third training module 58, wherein:

the first obtaining module 51 is configured to obtain importance degrees corresponding to a plurality of candidate attribute features, where the plurality of candidate attribute features include: customer base information, individual housing loan application information, and loan mortgage information;

a second obtaining module 52, configured to obtain, from a plurality of the candidate attribute features, attribute features whose absolute values of the importance degrees are greater than or equal to a first threshold;

a first training module 53, configured to take the attribute values of the attribute features of the sample clients as input, take the labeled default risk probability of the sample clients as the training target, and train to obtain a gradient boosting decision tree model, where the gradient boosting decision tree model includes one or more decision trees, each non-leaf node included in a decision tree corresponds to an attribute feature, and the plurality of child nodes corresponding to a non-leaf node are divided based on attribute values of the attribute feature corresponding to that non-leaf node;

the third obtaining module 54 is configured to obtain, for each decision tree, derivative features corresponding to leaf nodes in the decision tree, where the derivative features corresponding to the leaf nodes are attribute features and attribute values that the decision tree passes from a parent node to the leaf nodes;

the second training module 55 is configured to take, as input, an attribute value of the attribute feature and an attribute value of the derivative feature of the sample client, and take a labeling default risk probability of the sample client as a training target, and train to obtain a logistic regression model;

a fourth obtaining module 56, configured to obtain, from the logistic regression model, a significance level corresponding to the attribute feature and a significance level corresponding to the derivative feature;

a fifth obtaining module 57, configured to obtain, from the attribute features and the derivative features, target attribute features and target derivative features whose significance levels are lower than or equal to a preset significance level;

and a third training module 58, configured to train to obtain a default risk prediction model by taking the attribute value of the target attribute feature and the attribute value of the target derivative feature of the sample client as inputs and the labeling default risk probability of the sample client as a training target.

In an alternative implementation, the third training module includes:

a first obtaining unit, configured to obtain a plurality of candidate default risk prediction models, where each candidate default risk prediction model includes a first number of first machine learning models and one second machine learning model, the input of the first number of first machine learning models is the input of the candidate default risk prediction model, the outputs of the first number of first machine learning models are the input of the second machine learning model, the output of the second machine learning model is the output of the candidate default risk prediction model, and the first number of first machine learning models are of different types;

wherein, among the plurality of candidate default risk prediction models having the same second machine learning model, the first number of first machine learning models are different; among the plurality of candidate default risk prediction models having the same first machine learning models, the second machine learning models are of different types;

for each of the candidate breach risk prediction models, performing the following operations:

the first training unit is used for taking the attribute value of the target attribute characteristic and the attribute value of the target derivative characteristic of the sample client as input and the labeling default risk probability of the sample client as a training target for each first machine learning model, and training to obtain the first machine learning model;

the second training unit is used for taking the default risk probabilities of the sample clients respectively output by the first number of first machine learning models as the input of the second machine learning model, taking the labeled default risk probability of the sample clients as the training target, and training to obtain the second machine learning model, so as to obtain the candidate default risk prediction model;

and the first determining unit is used for determining the optimal candidate default risk prediction model in the plurality of candidate default risk prediction models as the default risk prediction model.

In an alternative implementation, the customer base information includes comparable monthly income, the individual housing loan application information includes comparable loan amount, and the loan mortgage information includes comparable market price.

In an alternative implementation, the plurality of candidate attribute features further includes a loan ratio and an income repayment ratio, where loan ratio = comparable loan amount / (total house price / resident consumer price index), and income repayment ratio = monthly repayment amount / monthly income.
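Illustratively, the two ratios can be computed as in the following sketch. The function names and the input values are hypothetical, and expressing the resident consumer price index as a ratio relative to a base period (for example 1.25) is an assumption; the text does not state its scale.

```python
def loan_ratio(comparable_loan_amount, total_house_price, consumer_price_index):
    """Loan ratio = comparable loan amount / (total house price / price index)."""
    return comparable_loan_amount / (total_house_price / consumer_price_index)

def income_repayment_ratio(monthly_repayment_amount, monthly_income):
    """Income repayment ratio = monthly repayment amount / monthly income."""
    return monthly_repayment_amount / monthly_income

# Hypothetical values, for illustration only.
print(loan_ratio(700000.0, 1250000.0, 1.25))      # 0.7
print(income_repayment_ratio(5000.0, 20000.0))    # 0.25
```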

In an alternative implementation, the second training module includes:

a second obtaining unit, configured to obtain a positive sample client set and a negative sample client set, where the positive sample client set includes the attribute values of the attribute features and the attribute values of the derivative features of defaulting sample clients, and the negative sample client set includes the attribute values of the attribute features and the attribute values of the derivative features of non-defaulting sample clients;

A third obtaining unit, configured to obtain a first client set with a larger number of included sample clients from the positive sample client set and the negative sample client set, and obtain a second client set with a smaller number of included sample clients;

a fourth obtaining unit, configured to perform undersampling processing on the first client set to obtain a third client set, where an absolute value of a difference between a number of sample clients included in the third client set and a number of sample clients included in the second client set is less than or equal to a preset first value;

a fifth obtaining unit, configured to perform oversampling processing on the second client set to obtain a fourth client set, where an absolute value of a difference between a number of sample clients included in the fourth client set and a number of sample clients included in the first client set is less than or equal to the preset first value;

a sixth obtaining unit, configured to perform undersampling processing on the first client set to obtain a fifth client set; performing oversampling processing on the second client set to obtain a sixth client set; the absolute value of the difference between the number of sample clients contained in the fifth client set and the number of sample clients contained in the sixth client set is less than or equal to the preset first value;

a seventh obtaining unit, configured to perform artificial data synthesis processing on the second client set to obtain a seventh client set, where an absolute value of a difference between a number of sample clients included in the seventh client set and a number of sample clients included in the first client set is less than or equal to the preset first value;

the third training unit is used for taking the third client set and the second client set as input, taking the labeling default risk probability of the sample clients contained in the third client set and the second client set as a training target, and training to obtain a first logistic regression model;

the fourth training unit is used for taking the fourth client set and the first client set as input, taking the labeling default risk probability of the sample clients contained in the fourth client set and the first client set as a training target, and training to obtain a second logistic regression model;

the fifth training unit is used for taking the fifth client set and the sixth client set as input, taking the labeling default risk probability of the sample clients contained in the fifth client set and the sixth client set as a training target, and training to obtain a third logistic regression model;

the sixth training unit is used for taking the seventh client set and the first client set as input, taking the labeling default risk probability of the sample clients contained in the seventh client set and the first client set as a training target, and training to obtain a fourth logistic regression model;

and the second determining unit is used for determining the optimal model in the first logistic regression model, the second logistic regression model, the third logistic regression model and the fourth logistic regression model as the logistic regression model.
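The four ways of balancing the first (larger) and second (smaller) client sets described above can be sketched as follows. This is an illustrative sketch only: the random sampling, the pairwise interpolation used as a simplified stand-in for artificial data synthesis (no nearest-neighbour search, unlike SMOTE), the midpoint size used for the combined scheme, and all names are assumptions. In every scheme shown, the two resulting sets have equal size, which satisfies any non-negative preset first value.

```python
import random

def undersample(larger_set, target_size, rng):
    """Randomly keep target_size records of the larger set, without replacement."""
    return rng.sample(larger_set, target_size)

def oversample(smaller_set, target_size, rng):
    """Keep every record of the smaller set and add random duplicates up to target_size."""
    extra = [rng.choice(smaller_set) for _ in range(target_size - len(smaller_set))]
    return list(smaller_set) + extra

def synthesize(smaller_set, target_size, rng):
    """Add new records interpolated between random pairs of existing records."""
    out = list(smaller_set)
    while len(out) < target_size:
        a, b = rng.sample(smaller_set, 2)
        t = rng.random()
        out.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return out

rng = random.Random(0)
first_set = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]  # larger set
second_set = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(10)]  # smaller set

third_set = undersample(first_set, len(second_set), rng)   # trained together with second_set
fourth_set = oversample(second_set, len(first_set), rng)   # trained together with first_set
middle = (len(first_set) + len(second_set)) // 2           # assumed common target size
fifth_set = undersample(first_set, middle, rng)            # trained together with sixth_set
sixth_set = oversample(second_set, middle, rng)
seventh_set = synthesize(second_set, len(first_set), rng)  # trained together with first_set
```

Each pair of sets would then be used to train one logistic regression model, giving the four models from which the optimal one is chosen.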

In an alternative implementation, the first obtaining module includes:

an eighth obtaining unit, configured to obtain a candidate data set corresponding to a sample client, where the candidate data set includes attribute values corresponding to the plurality of candidate attribute features respectively;

a seventh training unit, configured to take a candidate data set corresponding to the sample client as input, take a labeling default risk probability corresponding to the sample client as a training target, and train to obtain an importance degree measurement model, where the importance degree measurement model includes a decision tree, and the decision tree includes a non-leaf node and a leaf node, the non-leaf node is associated with the candidate attribute feature, and the leaf node is associated with the labeling default risk probability;

and a ninth obtaining unit, configured to obtain importance degrees corresponding to the plurality of candidate attribute features respectively based on a decision tree included in the importance degree measurement model.

The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments, and will not be described in detail herein.
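One possible way of deriving importance degrees from a decision tree is to sum, for each candidate attribute feature, the weighted Gini impurity decrease of every non-leaf node that splits on that feature. The sketch below follows that approach under stated assumptions: labels are taken as 0 (no default) or 1 (default), the tree depth is capped at 3, and the data, the threshold value of 0.1, and all names are hypothetical. The application does not specify the impurity measure, so Gini impurity is an assumption.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best binary split, or None."""
    n, parent, best = len(rows), gini(labels), None
    for f in range(len(rows[0])):
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or gain > best[0]:
                best = (gain, f, thr)
    return best

def accumulate_importance(rows, labels, total, importance, depth=0, max_depth=3):
    """Grow the tree recursively and add each split's weighted gain to its feature."""
    split = best_split(rows, labels)
    if depth >= max_depth or split is None or split[0] <= 0.0:
        return  # leaf node: no further split
    gain, f, thr = split
    importance[f] += gain * len(rows) / total
    for keep_left in (True, False):
        part = [(r, y) for r, y in zip(rows, labels) if (r[f] <= thr) == keep_left]
        accumulate_importance([r for r, _ in part], [y for _, y in part],
                              total, importance, depth + 1, max_depth)

# Hypothetical candidate data set: feature 0 separates the labels, feature 1 does not.
rows = [(0.2, 5.0), (0.3, 1.0), (0.4, 4.0), (0.7, 2.0), (0.8, 5.0), (0.9, 1.0)]
labels = [0, 0, 0, 1, 1, 1]

importance = [0.0, 0.0]
accumulate_importance(rows, labels, len(rows), importance)

first_threshold = 0.1  # assumed value
selected = [f for f, v in enumerate(importance) if v > first_threshold]
```

Features whose importance degree exceeds the first threshold are kept as target attribute features; in this toy data only feature 0 is kept.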

FIG. 6 is a block diagram illustrating a breach risk prediction device, according to an example embodiment. Referring to FIG. 6, the apparatus includes: a sixth obtaining module 61 and a prediction module 62, wherein:

a sixth obtaining module 61, configured to obtain a to-be-measured data set of a to-be-measured client, where the to-be-measured data set includes an attribute value of a target attribute feature of the to-be-measured client and an attribute value of a target derivative feature;

the prediction module 62 is configured to input the to-be-measured data set into a default risk prediction model and obtain the default risk probability of the to-be-measured client through the default risk prediction model, wherein the default risk prediction model is trained by using any one of the default risk prediction model training methods described above.

The server includes, but is not limited to: a processor 71, a memory 72, a network interface 73, an I/O controller 74, and a communication bus 75.

It should be noted that the structure shown in FIG. 7 does not constitute a limitation on the server; as will be understood by those skilled in the art, the server may include more or fewer components than those shown in FIG. 7, may combine some components, or may arrange the components differently.

The following describes the respective constituent elements of the server in detail with reference to FIG. 7:

the processor 71 is a control center of the server, connects various parts of the entire server using various interfaces and lines, and performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 72, and calling data stored in the memory 72, thereby performing overall monitoring of the server. Processor 71 may include one or more processing units; by way of example, the processor 71 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 71.

The processor 71 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.

The memory 72 may include memory such as a random-access memory (RAM) 721 and a read-only memory (ROM) 722, and may further include a mass storage device 723, such as at least one disk memory. Of course, the server may also include hardware required for other services.

The memory 72 is used for storing instructions executable by the processor 71. The above-described processor 71 has a function of executing the default risk prediction model training method or a function of executing the default risk prediction method.

A wired or wireless network interface 73 is configured to connect the server to a network.

The processor 71, the memory 72, the network interface 73, and the I/O controller 74 may be interconnected by the communication bus 75, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on.

In an exemplary embodiment, the server may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above-described breach risk prediction model training method or breach risk prediction method.

In an exemplary embodiment, the disclosed embodiments provide a storage medium including instructions, such as the memory 72 including instructions, executable by the processor 71 of the server to perform the above-described method. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

In an exemplary embodiment, a computer readable storage medium is also provided, which is directly loadable into an internal memory of a computer, such as the memory 72 described above, and contains software code that, when loaded and executed by the computer, implements the above-described breach risk prediction model training method or breach risk prediction method.

In an exemplary embodiment, a computer program product is also provided, which is directly loadable into an internal memory of a computer, for example, a memory contained in the server, and contains software code; after being loaded and executed by the computer, the computer program product can implement the above-mentioned breach risk prediction model training method or breach risk prediction method.

It should be noted that the method for training the breach risk prediction model and predicting breach risk and the related device provided by the invention can be used in the artificial intelligence field or the financial field. The foregoing is merely exemplary, and is not intended to limit the application fields of the method for training the breach risk prediction model and the breach risk prediction method and the related apparatus provided by the present invention.

The features described in the respective embodiments in the present specification may be replaced with each other or combined with each other. The description of the device or system embodiments is relatively brief because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.

It is further noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in Random Access Memory (RAM), internal memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for training a breach risk prediction model, comprising:

2. The method according to claim 1, wherein the step of training to obtain the default risk prediction model includes the steps of:

obtaining a plurality of candidate default risk prediction models, wherein each candidate default risk prediction model comprises a first number of first machine learning models and a second machine learning model, the inputs of the first number of first machine learning models are the inputs of the candidate default risk prediction model, the outputs of the first number of first machine learning models are the inputs of the second machine learning model, the output of the second machine learning model is the output of the candidate default risk prediction model, and the types of the first number of first machine learning models are different from one another;

for each first machine learning model, taking the attribute values of the target attribute features and the attribute values of the target derivative features of a sample client as input, taking the labeling default risk probability of the sample client as a training target, and training to obtain the first machine learning model;

taking the default risk probabilities of the sample clients respectively output by the first number of first machine learning models as the input of the second machine learning model, taking the labeling default risk probabilities of the sample clients as training targets, and training to obtain the second machine learning model so as to obtain the candidate default risk prediction model;

and determining an optimal candidate default risk prediction model in a plurality of candidate default risk prediction models as the default risk prediction model.

3. The method of training a breach risk prediction model of claim 1 or 2, wherein the customer base information includes a comparable monthly income, the individual housing loan application information includes a comparable loan amount, and the loan mortgage information includes a comparable market unit price;

4. The method of claim 3, wherein the plurality of candidate attribute features further comprises a loan ratio and a repayment-to-income ratio, where the loan ratio = the comparable loan amount / (the total house price / the consumer price index), and the repayment-to-income ratio = the monthly repayment amount / the monthly income.

5. The method according to claim 1, wherein the step of training to obtain a logistic regression model using the attribute values of the attribute features and the attribute values of the derivative features of the sample clients as inputs and the labeling default risk probabilities of the sample clients as training targets comprises:

acquiring a positive sample client set and a negative sample client set, wherein the positive sample client set comprises attribute values of the attribute features and attribute values of the derivative features of the sample clients that have defaulted, and the negative sample client set comprises attribute values of the attribute features and attribute values of the derivative features of the sample clients that have not defaulted;

determining, from the positive sample client set and the negative sample client set, the set containing the larger number of sample clients as a first client set and the set containing the smaller number of sample clients as a second client set;

undersampling the first client set to obtain a third client set, wherein the absolute value of the difference between the number of sample clients contained in the third client set and the number of sample clients contained in the second client set is smaller than or equal to a preset first value;

oversampling the second client set to obtain a fourth client set, wherein the absolute value of the difference between the number of sample clients contained in the fourth client set and the number of sample clients contained in the first client set is smaller than or equal to the preset first value;

undersampling the first client set to obtain a fifth client set; performing oversampling processing on the second client set to obtain a sixth client set; the absolute value of the difference between the number of sample clients contained in the fifth client set and the number of sample clients contained in the sixth client set is less than or equal to the preset first value;

performing artificial data synthesis processing on the second client set to obtain a seventh client set, wherein the absolute value of the difference between the number of sample clients contained in the seventh client set and the number of sample clients contained in the first client set is smaller than or equal to the preset first value;

taking the third client set and the second client set as input, taking the labeling default risk probability of the sample clients contained in the third client set and the second client set as training targets, and training to obtain a first logistic regression model;

taking the fourth client set and the first client set as input, taking the labeling default risk probability of the sample clients contained in the fourth client set and the first client set as training targets, and training to obtain a second logistic regression model;

taking the fifth client set and the sixth client set as input, taking the labeling default risk probability of the sample clients contained in the fifth client set and the sixth client set as training targets, and training to obtain a third logistic regression model;

taking the seventh client set and the first client set as input, taking the labeling default risk probability of the sample clients contained in the seventh client set and the first client set as training targets, and training to obtain a fourth logistic regression model;

and determining the optimal model in the first logistic regression model, the second logistic regression model, the third logistic regression model and the fourth logistic regression model as the logistic regression model.
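By way of a non-limiting illustration, the optimal model among several trained candidates may be chosen by scoring each candidate on held-out sample clients. The sketch below uses the area under the ROC curve (AUC) as the selection criterion; the text above does not state which criterion defines "optimal", so the AUC criterion, the toy predict functions, and all names are assumptions.

```python
def auc(scores, labels):
    """Probability that a random defaulting client is scored above a random non-defaulting one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_best(candidates, X_val, y_val):
    """candidates maps a model name to a function returning a default risk probability."""
    scored = {name: auc([f(x) for x in X_val], y_val) for name, f in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored

# Toy stand-ins for trained logistic regression models (hypothetical).
candidates = {
    "first": lambda x: x[0],
    "second": lambda x: 1.0 - x[0],
    "third": lambda x: 0.5,
}
X_val = [[0.1], [0.4], [0.6], [0.9]]
y_val = [0, 0, 1, 1]

best, scored = select_best(candidates, X_val, y_val)
```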

6. The method for training a breach risk prediction model according to claim 1, wherein the step of obtaining importance levels respectively corresponding to a plurality of candidate attribute features includes:

acquiring a candidate data set corresponding to a sample client, wherein the candidate data set comprises attribute values respectively corresponding to the plurality of candidate attribute features;

taking a candidate data set corresponding to the sample client as input, taking the labeling default risk probability corresponding to the sample client as a training target, and training to obtain an importance degree measurement model, wherein the importance degree measurement model comprises a decision tree, the decision tree comprises non-leaf nodes and leaf nodes, the non-leaf nodes are associated with the candidate attribute characteristics, and the leaf nodes are associated with the labeling default risk probability;

and obtaining the importance degrees respectively corresponding to the candidate attribute features based on the decision tree contained in the importance degree measurement model.

7. A method of breach risk prediction comprising:

inputting the data set to be measured into a default risk prediction model, and obtaining the default risk probability of the customer to be measured through the default risk prediction model, wherein the default risk prediction model is trained by adopting the default risk prediction model training method according to any one of claims 1 to 6.

8. An apparatus for training a breach risk prediction model, comprising:

9. An apparatus for predicting risk of breach, comprising:

10. A server, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the breach risk prediction model training method of any of claims 1-6, or to implement the breach risk prediction method of claim 7.