CN115880056A - Training method of bad asset recovery rate prediction model and recovery rate prediction method - Google Patents

Publication number
CN115880056A
Authority
CN
China
Prior art keywords
information
model
loan
double
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211620585.8A
Other languages
Chinese (zh)
Inventor
傅莉莉
朱富荣
林宜领
庄佳和
巫小兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202211620585.8A priority Critical patent/CN115880056A/en
Publication of CN115880056A publication Critical patent/CN115880056A/en
Pending legal-status Critical Current

Landscapes

  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The invention provides a training method for a bad asset recovery rate prediction model and a recovery rate prediction method, and relates to the field of information processing. The method comprises the following steps: constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources; performing feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample; inputting the feature vector of the personal loan sample into a double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model; and training the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determining the trained double-layer nested model as the bad asset recovery rate prediction model. With this training method, the LightGBM binary classification model and the LightGBM regression model in the double-layer nested model can effectively handle unevenly distributed data samples, and the prediction accuracy of the model is improved, particularly when the data follow a bimodal distribution.

Description

Training method of bad asset recovery rate prediction model and recovery rate prediction method
Technical Field
The disclosure relates to the field of information processing, and in particular relates to a training method of a bad asset recovery rate prediction model and a recovery rate prediction method.
Background
The bad asset recovery rate represents how much of a bad asset the bank has recovered. Based on the information available at the acceptance time point of each bad asset, the recovery rate of the bad asset after a future period can be predicted, so as to provide reference information for bad asset valuation. In practice, however, bad asset recovery rates are generally not evenly distributed: the data are usually concentrated in one or several intervals and may even follow a bimodal distribution. Therefore, how to adapt to unevenly distributed samples when predicting the bad asset recovery rate has become an urgent problem to be solved.
Disclosure of Invention
The invention provides a training method for a bad asset recovery rate prediction model and a recovery rate prediction method, which are used for adapting to unevenly distributed samples when predicting the bad asset recovery rate.
According to a first aspect of the embodiments of the present disclosure, there is provided a method for training a bad asset recovery rate prediction model, including:
constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources;
performing feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample;
inputting the feature vector of the personal loan sample into a double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model; the double-layer nested model is a double-layer model in which a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model are nested;
and training the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determining the trained double-layer nested model as the bad asset recovery rate prediction model.
According to a second aspect of the embodiments of the present disclosure, there is provided a method for predicting a bad asset recovery rate, including:
constructing a loan information wide table of a target object according to comprehensive information of the target object from multiple data sources;
performing feature extraction on the loan information wide table of the target object to obtain a feature vector of the target object;
inputting the feature vector into a preset bad asset recovery rate prediction model to obtain a bad asset recovery rate predicted value of the target object;
wherein the bad asset recovery rate prediction model is a model trained based on the training method of the first aspect.
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for a bad asset recovery rate prediction model, including:
a construction module, used for constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources;
an extraction module, used for performing feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample;
a prediction module, used for inputting the feature vector of the personal loan sample into a double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model; the double-layer nested model is a double-layer model in which a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model are nested;
and a training module, used for training the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determining the trained double-layer nested model as the bad asset recovery rate prediction model.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a bad asset recovery rate prediction apparatus, including:
a construction module, used for constructing a loan information wide table of a target object according to comprehensive information of the target object from multiple data sources;
an extraction module, used for performing feature extraction on the loan information wide table of the target object to obtain a feature vector of the target object;
a prediction module, used for inputting the feature vector into a preset bad asset recovery rate prediction model to obtain a bad asset recovery rate predicted value of the target object;
wherein the bad asset recovery rate prediction model is a model trained based on the training method of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic apparatus including: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of the first aspect or to implement the method of the second aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions for implementing the method of the first aspect or the method of the second aspect when executed by a processor.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, characterized in that it comprises a computer program, which when executed by a processor, implements the method of the aforementioned first aspect, or implements the method of the aforementioned second aspect.
According to the training method of the bad asset recovery rate prediction model, the LightGBM binary classification model and the LightGBM regression model in the double-layer nested model can effectively handle unevenly distributed data samples. In particular, when the data show an obvious bimodal distribution, both the prediction accuracy of the trained model and the training speed of the bad asset recovery rate prediction model are improved. In addition, both the LightGBM binary classification model and the LightGBM regression model rely on the LightGBM algorithm, which enables the model to process massive data.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow chart illustrating a method of training a bad asset recovery rate prediction model, according to an exemplary embodiment.
FIG. 2 is a flow chart illustrating another method of training a bad asset recovery rate prediction model, according to an exemplary embodiment.
FIG. 3 is a flow chart illustrating yet another method of training a bad asset recovery rate prediction model, according to an exemplary embodiment.
FIG. 4 is a flow chart illustrating a method for constructing a loan information wide table of a personal loan sample, according to an exemplary embodiment.
FIG. 5 is a flow chart illustrating a method for predicting a bad asset recovery rate, according to an exemplary embodiment.
FIG. 6 is a block diagram of a training apparatus for a bad asset recovery rate prediction model, according to an exemplary embodiment.
FIG. 7 is a schematic structural diagram of a bad asset recovery rate prediction apparatus, according to an exemplary embodiment.
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the technical solution of the present disclosure, the acquisition, collection, updating, analysis, processing, use, transmission, and storage of the personal information of the users involved all comply with the relevant laws and regulations; the personal information is used for lawful purposes and does not violate public order and good customs. Necessary measures are taken to protect users' personal information, so as to prevent illegal access to users' personal information data and to safeguard users' personal information security, network security, and national security.
It should be noted that, in practice, bad asset recovery rates are usually not evenly distributed. For example, for one type of loan bad asset, about 38% of the data samples are approximately fully recovered (i.e., the recovery rate is greater than 99%) and about 10% of the data samples are approximately not recovered at all (i.e., the recovery rate is less than 1%), so the distribution is bimodal, with many extreme values at 0 and 1 ("0" corresponds to "zero recovery" of the debt and "1" corresponds to "full recovery" of the debt). Predicting the bad asset recovery rate therefore faces the problem of unevenly distributed samples. The training method of the bad asset recovery rate prediction model and the recovery rate prediction method of the present disclosure can improve the accuracy of the recovery rate predicted value when the samples are unevenly distributed.
FIG. 1 is a flow chart illustrating a method of training a bad asset recovery rate prediction model according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps.
Step 101, constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources.
Optionally, in some embodiments of the present disclosure, the comprehensive information of the personal loan sample from multiple data sources may include several of the following: customer-related information, amount contract information, borrowing contract information, guarantee and/or mortgage contract information, resource information, loan account information, transfer-in/transfer-out information, and risk classification information of the personal loan sample.
Step 102, performing feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample.
In some embodiments of the present disclosure, the fields in the loan information wide table may be converted into value-class fields, and feature extraction may then be performed to obtain the feature vector of the personal loan sample.
Step 103, inputting the feature vector of the personal loan sample into the double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model. The double-layer nested model is a double-layer model in which a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model are nested.
In some embodiments of the present disclosure, the feature vector of the personal loan sample is input into the double-layer nested model, and the LightGBM binary classification model in the double-layer nested model classifies the feature vector, i.e., determines whether the personal loan sample is fully recovered (recovery rate greater than 99%), to obtain the classification result output by the LightGBM binary classification model. When the classification result is the first classification category, which indicates that the personal loan sample is fully recovered, the recovery rate corresponding to the first classification category (for example, 100%) is used as the recovery rate predicted value output by the double-layer nested model. When the classification result is the second classification category (i.e., the personal loan sample is not fully recovered), the feature vectors corresponding to the second classification category are selected from the feature vectors of the personal loan samples, and the LightGBM regression model in the double-layer nested model performs regression prediction on them to obtain the recovery rate predicted value output by the double-layer nested model.
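The classify-then-regress flow described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the two callables `classify_full` and `regress_rate` are hypothetical stand-ins for the trained LightGBM binary classification model and LightGBM regression model, and the 0.5 probability threshold and the clamping of the regression output to [0, 1] are assumptions that do not appear in the description.

```python
from typing import Callable, List, Sequence

FULL_RECOVERY = 1.0  # recovery rate assigned to the first classification category (100%)


def nested_predict(
    features: Sequence[Sequence[float]],
    classify_full: Callable[[Sequence[float]], float],
    regress_rate: Callable[[Sequence[float]], float],
    threshold: float = 0.5,  # assumed cut-off on the classifier's probability
) -> List[float]:
    """Layer 1 decides 'fully recovered or not'; layer 2 regresses the rest."""
    predictions = []
    for row in features:
        if classify_full(row) >= threshold:
            # First category: fully recovered, so output the fixed rate.
            predictions.append(FULL_RECOVERY)
        else:
            # Second category: regression prediction, clamped to [0, 1] (assumption).
            predictions.append(min(max(regress_rate(row), 0.0), 1.0))
    return predictions


# Toy stand-ins for the two trained models (hypothetical, for illustration only).
def toy_classifier(row: Sequence[float]) -> float:
    return 0.9 if row[0] > 10 else 0.1


def toy_regressor(row: Sequence[float]) -> float:
    return 0.05 * row[0]


rates = nested_predict([[12.0], [4.0], [0.0]], toy_classifier, toy_regressor)
```

Passing the two layers as callables keeps the routing logic independent of any particular model library.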
Step 104, training the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determining the trained double-layer nested model as the bad asset recovery rate prediction model.
It should be noted that the real label information of the personal loan sample is the recovery rate calculated from the comprehensive information of the personal loan sample. The comprehensive information of the personal loan sample may include the loan account information, transfer-in/transfer-out information, and risk classification information of the personal loan sample. Optionally, in some embodiments of the present disclosure, a loss function of the double-layer nested model may be calculated based on the recovery rate predicted value and the real label information, and the double-layer nested model is trained according to the loss function.
According to the training method of the bad asset recovery rate prediction model, the LightGBM binary classification model and the LightGBM regression model in the double-layer nested model can effectively handle unevenly distributed data samples. In particular, when the data show an obvious bimodal distribution, both the prediction accuracy of the trained model and the training speed of the bad asset recovery rate prediction model are improved. In addition, both the LightGBM binary classification model and the LightGBM regression model rely on the LightGBM algorithm, which enables the model to process massive data.
FIG. 2 is a flow chart illustrating another method of training a bad asset recovery rate prediction model according to an exemplary embodiment. As shown in FIG. 2, the method includes the following steps.
Step 201, constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources.
Step 202, identifying the fields in the loan information wide table to obtain the field type of each field in the loan information wide table.
Optionally, in some embodiments of the present disclosure, the field types may include a long string field, a date-time class field, a type class field, and the like.
Step 203, processing at least some of the fields in the loan information wide table based on the field type, so that all fields in the loan information wide table become value-class fields.
In some embodiments of the present disclosure, when the field type is a long string field, the length of each long string in that field of the loan information wide table is extracted, and the extracted length is used as the numerical information of the corresponding long string field.
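The long-string rule amounts to replacing each string by its length. A minimal sketch follows; the function name is hypothetical, and mapping a missing value to length 0 is an assumption, since the description does not say how empty cells are treated here.

```python
from typing import List, Optional


def long_string_lengths(values: List[Optional[str]]) -> List[int]:
    """Replace each long string by its character count (None -> 0, an assumption)."""
    return [len(value) if value is not None else 0 for value in values]


lengths = long_string_lengths(["No. 8 Example Road, Unit 3", "", None])
```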
In some embodiments of the present disclosure, when the field type is a first date-time class field, the date and time information of that field in the loan information wide table is discretely extracted, and the extracted information is used as the numerical information corresponding to the first date-time class field. The first date-time class field is a date-time class field in a standard format. Discrete extraction yields "year", "month", "day", and "timestamp" information, each of which is a numerical quantity. Optionally, the extraction may use pandas, calling the year, month, day, and timestamp extraction functions that correspond to the standard date-time format. For such time-based variables, the hour, minute, and second information may be discarded. The timestamp obtained by date discretization may be represented by the value at noon of that day (12:00). After a standard-format date-time field is extracted, the extracted value fields need to be marked with their type so that they are not confused with other value fields.
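A minimal sketch of this discretization is shown below. It uses the standard-library `datetime` module rather than pandas so that it stays self-contained. The function name, the `field` prefix used to mark the extracted columns, and the choice of UTC for the noon timestamp are all assumptions made for illustration.

```python
from datetime import datetime, time, timezone
from typing import Dict


def split_datetime(value: datetime, field: str) -> Dict[str, int]:
    """Split a standard-format datetime into year, month, day, and a noon timestamp."""
    # Hour, minute, and second are discarded; the timestamp is pinned to 12:00 (UTC assumed).
    noon = datetime.combine(value.date(), time(12, 0), tzinfo=timezone.utc)
    # The field prefix marks where each extracted numeric column came from.
    return {
        f"{field}_year": value.year,
        f"{field}_month": value.month,
        f"{field}_day": value.day,
        f"{field}_timestamp": int(noon.timestamp()),
    }


parts = split_datetime(datetime(2022, 12, 16, 9, 30, 15), "sign_date")
```

Pinning the timestamp to noon makes two records from the same calendar day compare equal regardless of their time of day.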
In some embodiments of the present disclosure, when the field type is a second date-time class field, the date and time information of that field in the loan information wide table is discretely extracted, and the extracted information is used as the numerical information corresponding to the second date-time class field. The second date-time class field is an 8-character string date field. Discrete extraction yields "year", "month", "day", and "timestamp" information, each of which is a numerical quantity. Optionally, the extraction may use pandas: the year, month, and day in the string date are identified with a regular expression, and the time module is then called to return the timestamp. The timestamp obtained by date discretization may be represented by the value at noon of that day (12:00). After an 8-character string date field is extracted, the extracted value fields need to be marked with their type so that they are not confused with other value fields. After the discrete extraction of dates is completed, the original 8-character string date field can be deleted.
In some embodiments of the present disclosure, when the field type is a third date-time class field, the date and time information of that field in the loan information wide table is discretely extracted, and the extracted information is used as the numerical information corresponding to the third date-time class field. The third date-time class field is a 10-character string date field. Discrete extraction yields "year", "month", "day", and "timestamp" information, each of which is a numerical quantity. Optionally, the extraction may use pandas: the year, month, and day in the string date are identified with a regular expression, and the time module is then called to return the timestamp. The timestamp obtained by date discretization may be represented by the value at noon of that day (12:00). After a 10-character string date field is extracted, the extracted value fields need to be marked with their type so that they are not confused with other value fields. After the discrete extraction of dates is completed, the original 10-character string date field can be deleted.
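The regular-expression handling of the 8-character and 10-character string dates can be sketched as follows. The concrete layouts (`YYYYMMDD`, and `YYYY-MM-DD` with `-`, `/`, or `.` as separator) are assumptions, because the description only gives the string lengths. The sketch also swaps in `calendar.timegm` in place of the time module so that the noon timestamp is computed in UTC and does not depend on the local time zone.

```python
import calendar
import re
from typing import Dict

# Assumed layouts: "YYYYMMDD" (8 characters) and "YYYY-MM-DD" (10 characters).
DATE_8 = re.compile(r"^(\d{4})(\d{2})(\d{2})$")
DATE_10 = re.compile(r"^(\d{4})[-/.](\d{2})[-/.](\d{2})$")


def split_string_date(text: str, field: str) -> Dict[str, int]:
    """Extract year, month, day, and a noon timestamp from a string date."""
    match = DATE_8.match(text) or DATE_10.match(text)
    if match is None:
        raise ValueError(f"unrecognised date string: {text!r}")
    year, month, day = (int(group) for group in match.groups())
    # Timestamp for 12:00:00 of that day, interpreted as UTC (assumption).
    timestamp = calendar.timegm((year, month, day, 12, 0, 0))
    return {
        f"{field}_year": year,
        f"{field}_month": month,
        f"{field}_day": day,
        f"{field}_timestamp": timestamp,
    }


from_8 = split_string_date("20221216", "due_date")
from_10 = split_string_date("2022-12-16", "due_date")
```

Once the four numeric columns exist, the original string column can be dropped from the wide table, as the text describes.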
In some embodiments of the present disclosure, when the field type is a type class field, the values of that field in the loan information wide table are encoded, and the encoded values are used as the numerical information of the corresponding type class field. Optionally, the type class field may be numerically encoded using one-hot encoding, where each encoded value is an integer. It should be noted that the encoding needs to reserve "0" for null values, so non-null values are encoded as positive integers greater than 0. In addition, it is necessary to check whether the number of codes is excessive. As an example, 300 may be used as the upper limit on the number of codes: once a field identified as a type class field contains more than 300 distinct values, this is reported and further manual processing is required.
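The integer coding rule above (0 reserved for null, positive integers for non-null values, at most 300 distinct codes) can be sketched as follows. The function name is hypothetical, and raising an exception is just one assumed way to "report" a field that needs manual processing. Note that assigning one integer per category in this way resembles what is often called label encoding.

```python
from typing import Dict, List, Optional, Tuple

MAX_CODES = 300  # upper limit on distinct values, taken from the example in the text


def encode_type_field(values: List[Optional[str]]) -> Tuple[List[int], Dict[str, int]]:
    """Encode categories as positive integers, reserving 0 for null values."""
    codes: Dict[str, int] = {}
    encoded: List[int] = []
    for value in values:
        if value is None:
            encoded.append(0)  # 0 is reserved for null
            continue
        if value not in codes:
            if len(codes) >= MAX_CODES:
                # More than MAX_CODES distinct values: flag for manual processing.
                raise ValueError("too many distinct values; manual processing required")
            codes[value] = len(codes) + 1  # first-seen order, starting at 1
        encoded.append(codes[value])
    return encoded, codes


encoded, mapping = encode_type_field(["mortgage", None, "pledge", "mortgage"])
```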
Optionally, in some embodiments of the present disclosure, missing values in a continuous value-class field may be filled with the mean of all values in that field.
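Mean filling can be sketched in a few lines. The function name is hypothetical, missing values are represented as `None`, and leaving an all-missing column unchanged is an assumption.

```python
from typing import List, Optional


def fill_with_mean(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill missing entries with the mean of the observed entries."""
    present = [value for value in values if value is not None]
    if not present:
        return list(values)  # nothing observed: leave the column unchanged (assumption)
    mean = sum(present) / len(present)
    return [mean if value is None else value for value in values]


filled = fill_with_mean([1.0, None, 3.0])
```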
Step 204, performing feature extraction on the processed loan information wide table to obtain a feature vector of the personal loan sample.
Step 205, inputting the feature vector of the personal loan sample into the double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model. The double-layer nested model is a double-layer model in which a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model are nested.
Step 206, training the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determining the trained double-layer nested model as the bad asset recovery rate prediction model.
It should be noted that, in the embodiments of the present disclosure, step 201, step 205, and step 206 may be implemented in any manner described in the embodiments of the present disclosure, which is not limited herein.
According to the training method of the bad asset recovery rate prediction model, identifying the fields in the loan information wide table and processing at least some of them based on the field type improves the quality of the feature vectors extracted from the personal loan samples, and thereby the accuracy of the prediction results. The LightGBM binary classification model and the LightGBM regression model in the double-layer nested model can effectively handle unevenly distributed data samples. In particular, when the data show an obvious bimodal distribution, both the prediction accuracy of the trained model and the training speed of the bad asset recovery rate prediction model are improved. In addition, both the LightGBM binary classification model and the LightGBM regression model rely on the LightGBM algorithm, which enables the model to process massive data.
FIG. 3 is a flow chart illustrating yet another method of training a bad asset recovery rate prediction model according to an exemplary embodiment. As shown in FIG. 3, the method includes the following steps.
Step 301, constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources.
Step 302, performing feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample.
Step 303, inputting the feature vector of the personal loan sample into the double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model. The double-layer nested model is a double-layer model in which a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model are nested.
Step 304, acquiring the real label information of the personal loan sample.
As one possible implementation, the loan account information, transfer-in/transfer-out information, and risk classification information of the personal loan sample may be obtained. The transfer-in/transfer-out information of bad contracts is then obtained from the loan account information, the transfer-in/transfer-out information, and the risk classification information. The recovery rate of the personal loan sample is calculated from the earliest transfer information and the latest transfer information in the bad contract transfer-in/transfer-out information, and this recovery rate is used as the real label information of the personal loan sample. For example, if the earliest transfer information is the balance at the earliest transfer and the latest transfer information is the balance at the latest transfer, the recovery rate of the personal loan sample is (balance at earliest transfer - balance at latest transfer) / balance at earliest transfer.
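The label formula above can be written as a small function. The function and parameter names are hypothetical, and rejecting a non-positive earliest balance is an assumption added to avoid division by zero.

```python
def recovery_rate(earliest_balance: float, latest_balance: float) -> float:
    """Recovery rate = (earliest balance - latest balance) / earliest balance."""
    if earliest_balance <= 0:
        raise ValueError("earliest balance must be positive")
    return (earliest_balance - latest_balance) / earliest_balance


partial = recovery_rate(1000.0, 250.0)  # 750 of the original 1000 was recovered
full = recovery_rate(1000.0, 0.0)       # nothing outstanding: full recovery
none = recovery_rate(1000.0, 1000.0)    # balance unchanged: zero recovery
```

The three calls correspond to a partially recovered sample and to the two peaks of the bimodal distribution (full recovery and zero recovery).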
Step 305, calculating a loss function of the double-layer nested model according to the recovery rate predicted value and the real label information.
Step 306, calculating a residual of the double-layer nested model based on the loss function, fitting a new decision tree to the residual to train the double-layer nested model, and determining the trained double-layer nested model as the bad asset recovery rate prediction model.
It should be noted that the gradient boosting tree algorithm uses the negative gradient of the loss function as an approximation of the residual in the boosting tree algorithm. The negative gradient of the loss function for the i-th sample in round t is:
$$r_{ti} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{t-1}(x)}$$
Different loss functions result in different negative gradients. If the squared loss function is chosen:
$$L(y_i, f(x_i)) = \frac{1}{2}\left(y_i - f(x_i)\right)^2$$
the negative gradient is then:
$$r_{ti} = y_i - f_{t-1}(x_i)$$
In this case, the negative gradient of the gradient boosting decision tree (GBDT) is the residual of the double-layer nested model. A new decision tree is fitted to this residual to train the double-layer nested model, and the trained double-layer nested model is determined as the bad asset recovery rate prediction model.
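To make the residual-fitting idea concrete, the following toy sketch runs gradient boosting with squared loss on a single feature, using depth-1 trees (stumps) as the weak learners. It is not LightGBM, which uses histogram-based, leaf-wise tree growth; the function names, the learning rate, the number of rounds, and the sample data are all assumptions chosen for illustration. The sketch assumes the feature has at least two distinct values.

```python
from typing import List, Tuple


def negative_gradient(y: List[float], pred: List[float]) -> List[float]:
    """For squared loss 1/2 * (y - f)^2 the negative gradient is the residual y - f."""
    return [yi - pi for yi, pi in zip(y, pred)]


def fit_stump(x: List[float], r: List[float]) -> Tuple[float, float, float]:
    """Find the split 'x <= threshold' whose two leaf means best fit the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # the largest value would leave the right leaf empty
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((ri - left_mean) ** 2 for ri in left)
        error += sum((ri - right_mean) ** 2 for ri in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(x: List[float], y: List[float], rounds: int = 20, rate: float = 0.5) -> List[float]:
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(rounds):
        residual = negative_gradient(y, pred)
        threshold, left_mean, right_mean = fit_stump(x, residual)
        pred = [p + rate * (left_mean if xi <= threshold else right_mean)
                for xi, p in zip(x, pred)]
    return pred


def mse(y: List[float], pred: List[float]) -> float:
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.0, 0.1, 0.4, 0.5, 0.9, 1.0]
baseline_error = mse(y, [sum(y) / len(y)] * len(y))
boosted_error = mse(y, boost(x, y))
```

Each round adds a tree that predicts the current residual, so the training error of the boosted prediction falls below that of the constant baseline.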
It should be noted that, in the embodiment of the present disclosure, steps 301 to 303 may be implemented in any manner in various embodiments of the present disclosure, and the present disclosure does not limit this.
According to the training method of the bad asset recovery rate prediction model, the LightGBM binary classification model and the LightGBM regression model in the double-layer nested model can effectively handle unevenly distributed data samples. The loss function is calculated from the recovery rate predicted value and the real label information, the residual of the double-layer nested model is then calculated, and a new decision tree is fitted to train the double-layer nested model, which further improves the accuracy of model prediction. In addition, both the LightGBM binary classification model and the LightGBM regression model rely on the LightGBM algorithm, which enables the model to process massive data.
Fig. 4 is a flow diagram illustrating a method for constructing a wide table of loan information for a sample of a personal loan, according to an exemplary embodiment, which may include, but is not limited to, the following steps, as shown in fig. 4.
Step 401, obtaining client related information of the personal loan sample, and obtaining the amount contract information and the borrowing contract information of the personal loan sample.
Optionally, in some embodiments of the present disclosure, the customer-related information of the personal loan sample may include recent customer basic information, personal supplementary information, customer financial information, customer contact location information, and customer interpersonal relationship information. It should be noted that the client financial information is the client financial information after discrete processing.
It should be noted that the borrowing contract information may include basic borrowing contract information and conditional borrowing contract information.
Step 402, obtaining master-slave contract information of the personal loan sample according to the guarantee and/or mortgage contract information of the personal loan sample.
It is noted that, in some embodiments of the present disclosure, the guarantee contract information may include a guarantee contract and/or a guaranty contract. The mortgage contract information may include a pledge contract and/or a mortgage contract.
It is further noted that the master-slave contract information includes the relationship between the guarantee and/or mortgage contract information and the contract type.
Step 403, acquiring resource information of the personal loan sample, and acquiring loan account information, transfer-in/out information and risk classification information.
Optionally, in some embodiments of the present disclosure, the resource information of the personal loan sample may be used as a main table and associated separately with the rights information, the house information, and the vehicle information, so that the resource dimension reflects the social resources owned by the customer and thereby indicates the customer's ability to repay the contract.
It should be noted that principal and loan information can be obtained from the loan account information; the number of transfers in and out can be calculated from the transfer-in/out information, and the balances at the earliest transfer-in time point and the latest transfer-out time point of each contract can be found; bad contracts can be screened based on the risk classification information, in which contracts are classified into five grades.
Step 404, constructing a loan information wide table of the personal loan sample according to the customer-related information, the credit line contract information, the borrowing contract information, the master-slave contract information, the resource information, the loan account information, the transfer-in/out information and the risk classification information.
According to the embodiment of the disclosure, the loan information wide table of the personal loan sample is constructed from comprehensive information about the personal loan sample drawn from multiple data sources, which expands the information dimensions of the personal loan sample.
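As an illustration of steps 401 to 404, the following sketch joins several source tables onto a main contract table to form one wide row per contract. The table layouts, the key names (`contract_id`, `customer_id`) and the helper `left_join` are hypothetical and chosen only for this example; the disclosure does not specify a schema or a join strategy.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def left_join(main: List[Row], other: List[Row], key: str, prefix: str) -> List[Row]:
    """Attach the columns of `other` to each row of `main` that shares `key`.

    Rows of `main` without a match keep all their data and receive None
    for the added columns, so no sample is lost while widening the table.
    """
    lookup = {row[key]: row for row in other}
    columns = sorted({c for row in other for c in row if c != key})
    joined = []
    for row in main:
        match = lookup.get(row[key], {})
        extra = {f"{prefix}_{c}": match.get(c) for c in columns}
        joined.append({**row, **extra})
    return joined


# Hypothetical source tables standing in for the multiple data sources.
contracts = [
    {"contract_id": "C1", "customer_id": "U1", "principal": 1000.0},
    {"contract_id": "C2", "customer_id": "U2", "principal": 500.0},
]
customers = [{"customer_id": "U1", "age": 35}]
risk = [
    {"contract_id": "C1", "grade": 4},
    {"contract_id": "C2", "grade": 5},
]

wide_table = left_join(contracts, customers, "customer_id", "cust")
wide_table = left_join(wide_table, risk, "contract_id", "risk")
```

Each further source (master-slave contracts, resources, transfer-in/out records) would be attached the same way, one join per source.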
Fig. 5 is a flowchart of a bad asset recovery rate prediction method according to an exemplary embodiment. As shown in Fig. 5, the method includes the following steps.
Step 501, constructing a loan information wide table of the target object according to comprehensive information about the target object from multiple data sources.
Wherein the target object is a bad asset.
Step 502, performing feature extraction on the loan information wide table of the target object to obtain a feature vector of the target object.
Step 503, inputting the feature vector into a preset bad asset recovery rate prediction model to obtain a bad asset recovery rate predicted value of the target object.
The bad asset recovery rate prediction model is a model trained by the training method of any one of the above embodiments, and is not described herein again.
According to the bad asset recovery rate prediction method, the bad asset recovery rate of the target object is predicted based on the trained bad asset recovery rate prediction model, which can improve the accuracy of the recovery rate prediction result and can effectively handle unevenly distributed data samples.
In order to implement the above embodiments, the present disclosure provides a training apparatus for a bad asset recovery rate prediction model. FIG. 6 is a block diagram of a training apparatus for a bad asset recovery prediction model, according to an exemplary embodiment. As shown in fig. 6, the training device of the bad asset recovery rate prediction model includes: a construction module 601, an extraction module 602, a prediction module 603, and a training module 604.
The construction module 601 is configured to construct a loan information wide table of the personal loan sample according to comprehensive information of the personal loan sample from multiple data sources.
In some embodiments of the present disclosure, the construction module 601 is specifically configured to: obtain customer-related information of a personal loan sample, and obtain credit line contract information and borrowing contract information of the personal loan sample; obtain master-slave contract information of the personal loan sample according to the guarantee and/or mortgage contract information of the personal loan sample; acquire resource information of the personal loan sample, and acquire loan account information, transfer-in/out information and risk classification information; and construct a loan information wide table of the personal loan sample according to the customer-related information, the credit line contract information, the borrowing contract information, the master-slave contract information, the resource information, the loan account information, the transfer-in/out information and the risk classification information.
The extraction module 602 is configured to perform feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample.
In some embodiments of the present disclosure, the extraction module 602 is specifically configured to: identify the fields in the loan information wide table to obtain the field type of each field; process at least some of the fields in the loan information wide table based on the field types, so that all fields in the loan information wide table become numerical fields; and perform feature extraction on the processed loan information wide table to obtain a feature vector of the personal loan sample.
In some embodiments of the present disclosure, the extraction module 602 is specifically configured to: in response to the field type being a long character string field, extract the length of each long character string in that field in the loan information wide table, and use the extracted length as the numerical information of the corresponding long character string field; and/or, in response to the field type being a first date-time field, discretely extract the date and time information of the first date-time field in the loan information wide table, and use the extracted information as the numerical information corresponding to the first date-time field, wherein the first date-time field is a date-time field in a standard format; and/or, in response to the field type being a second date-time field, discretely extract the date and time information of the second date-time field in the loan information wide table, and use the extracted information as the numerical information corresponding to the second date-time field, wherein the second date-time field is an 8-character string date field; and/or, in response to the field type being a third date-time field, discretely extract the date and time information of the third date-time field in the loan information wide table, and use the extracted information as the numerical information corresponding to the third date-time field, wherein the third date-time field is a 10-character string date field; and/or, in response to the field type being a category field, numerically encode the category field in the loan information wide table, and use the encoded value as the numerical information of the corresponding category field.
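The field conversions listed above can be sketched as follows. The concrete formats are assumptions made for illustration: the standard date-time format is taken to be `YYYY-MM-DD HH:MM:SS`, and the two string date fields are taken to look like `YYYYMMDD` (8 characters) and `YYYY-MM-DD` (10 characters). The function names are hypothetical, and the category encoding shown (integer codes in order of first appearance) is only one possible numerical encoding.

```python
from datetime import datetime
from typing import Dict, List

# Assumed formats; the disclosure does not spell them out.
STANDARD_FORMAT = "%Y-%m-%d %H:%M:%S"
EIGHT_CHAR_FORMAT = "%Y%m%d"
TEN_CHAR_FORMAT = "%Y-%m-%d"


def string_length(value: str) -> int:
    """Long character string field: keep only the length of the string."""
    return len(value)


def split_date(value: str, fmt: str) -> Dict[str, int]:
    """Date-time field: extract discrete numeric components."""
    parsed = datetime.strptime(value, fmt)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "hour": parsed.hour,
    }


def encode_categories(values: List[str]) -> List[int]:
    """Category field: map each distinct value to an integer code."""
    codes: Dict[str, int] = {}
    return [codes.setdefault(v, len(codes)) for v in values]
```

For example, `split_date("20221215", EIGHT_CHAR_FORMAT)` yields the year, month and day as integers, with the hour defaulting to 0 because the string carries no time part.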
The prediction module 603 is configured to input the feature vector of the personal loan sample into a double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model; the double-layer nested model is a double-layer model formed by nesting a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model.
In some embodiments of the present disclosure, the prediction module 603 is specifically configured to: input the feature vector of the personal loan sample into the double-layer nested model; classify the feature vector of the personal loan sample based on the LightGBM binary classification model in the double-layer nested model to obtain a classification result output by the LightGBM binary classification model; in response to the classification result being a first classification category, take the recovery rate corresponding to the first classification category as the recovery rate predicted value output by the double-layer nested model, wherein the first classification category indicates that the personal loan sample is fully recovered; or, in response to the classification result being a second classification category, obtain the feature vector corresponding to the second classification category from the feature vectors of the personal loan sample, and perform regression prediction on that feature vector based on the LightGBM regression model in the double-layer nested model to obtain the recovery rate predicted value output by the double-layer nested model.
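The routing logic of the double-layer nested model can be sketched as below. To keep the example self-contained, the two layers are plain Python callables standing in for the trained models; the function name `nested_predict`, the 0.5 decision threshold and the value 1.0 used as the recovery rate of the fully recovered category are assumptions for illustration, not values stated in the disclosure.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]


def nested_predict(
    samples: List[Features],
    classify: Callable[[Features], float],
    regress: Callable[[Features], float],
    full_recovery_rate: float = 1.0,  # assumed rate for the first category
    threshold: float = 0.5,           # assumed decision threshold
) -> List[float]:
    """First layer decides 'fully recovered'; second layer handles the rest."""
    predictions = []
    for features in samples:
        if classify(features) >= threshold:
            # First classification category: output the fixed recovery rate.
            predictions.append(full_recovery_rate)
        else:
            # Second classification category: defer to the regression model.
            predictions.append(regress(features))
    return predictions


def demo_classifier(features: Features) -> float:
    """Stand-in for the binary classifier: probability of full recovery."""
    return 0.9 if features[0] > 10 else 0.1


def demo_regressor(features: Features) -> float:
    """Stand-in for the regression model: partial recovery rate."""
    return 0.05 * features[0]
```

With the LightGBM Python package, the stand-ins would typically be replaced by the positive-class probability from `LGBMClassifier.predict_proba` and the output of `LGBMRegressor.predict`.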
The training module 604 is configured to train the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determine the trained double-layer nested model as the bad asset recovery rate prediction model.
In some embodiments of the present disclosure, the training module 604 is specifically configured to: obtain real label information of the personal loan sample; calculate a loss function of the double-layer nested model according to the recovery rate predicted value and the real label information; and calculate the residual of the double-layer nested model based on the loss function, and fit a new decision tree to the residual so as to train the double-layer nested model.
In some embodiments of the present disclosure, the training module 604 is specifically configured to: obtain loan account information, transfer-in/out information, and risk classification information of the personal loan sample; obtain bad contract transfer-in/out information according to the loan account information, the transfer-in/out information and the risk classification information; and calculate the recovery rate of the personal loan sample according to the earliest transfer-in information and the latest transfer-out information in the bad contract transfer-in/out information, and take the recovery rate as the real label information of the personal loan sample.
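The label computation can be sketched as follows. The disclosure states only that the recovery rate is calculated from the earliest transfer-in and latest transfer-out information; the formula used here, (balance at earliest transfer-in minus balance at latest transfer-out) divided by the balance at earliest transfer-in, is an assumption for illustration, as are the record layout `Transfer` and the function name `recovery_labels`.

```python
from typing import Dict, List, NamedTuple


class Transfer(NamedTuple):
    contract_id: str
    date: str        # ISO date string, so string order equals time order
    direction: str   # "in" or "out" (hypothetical encoding)
    balance: float   # outstanding balance at that time point


def recovery_labels(events: List[Transfer]) -> Dict[str, float]:
    """Compute one recovery-rate label per bad contract (assumed formula)."""
    by_contract: Dict[str, List[Transfer]] = {}
    for event in events:
        by_contract.setdefault(event.contract_id, []).append(event)

    labels: Dict[str, float] = {}
    for contract_id, contract_events in by_contract.items():
        ins = [e for e in contract_events if e.direction == "in"]
        outs = [e for e in contract_events if e.direction == "out"]
        if not ins or not outs:
            continue  # a label needs both a transfer-in and a transfer-out
        first_in = min(ins, key=lambda e: e.date)
        last_out = max(outs, key=lambda e: e.date)
        if first_in.balance <= 0:
            continue  # avoid dividing by a zero or negative balance
        labels[contract_id] = (first_in.balance - last_out.balance) / first_in.balance
    return labels
```

Under this assumed formula, a contract that enters with a balance of 1000 and leaves with 250 outstanding receives a label of 0.75.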
With respect to the apparatus in the above embodiments, the specific manner in which the respective modules perform operations has been described in detail in the embodiments related to the method, and will not be elaborated herein.
According to the training apparatus of the bad asset recovery rate prediction model, the LightGBM binary classification model and the LightGBM regression model in the double-layer nested model can effectively handle unevenly distributed data samples; especially when the data are clearly bimodally distributed, the prediction accuracy of the trained model is improved, and the training speed of the bad asset recovery rate prediction model is improved. In addition, both the LightGBM binary classification model and the LightGBM regression model rely on the LightGBM algorithm, which enables the model to process massive data.
In order to implement the above embodiments, the present disclosure further provides a bad asset recovery rate prediction apparatus. Fig. 7 is a schematic structural diagram of a bad asset recovery rate prediction apparatus according to an exemplary embodiment. As shown in Fig. 7, the bad asset recovery rate prediction apparatus includes: a construction module 701, an extraction module 702 and a prediction module 703.
The construction module 701 is configured to construct a loan information wide table of the target object according to comprehensive information about the target object from multiple data sources.
The extraction module 702 is configured to perform feature extraction on the loan information wide table of the target object to obtain a feature vector of the target object.
The prediction module 703 is configured to input the feature vector into a preset bad asset recovery rate prediction model to obtain a predicted value of the bad asset recovery rate of the target object.
The bad asset recovery rate prediction model is a model obtained by training based on any one of the above embodiments of the related training method, and is not described herein again.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the bad asset recovery rate prediction apparatus, the bad asset recovery rate of the target object is predicted based on the trained bad asset recovery rate prediction model, which can improve the accuracy of the recovery rate prediction result and can effectively handle unevenly distributed data samples.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 8, the electronic device may include: a transceiver 801, a processor 802, a memory 803.
The processor 802 executes computer-executable instructions stored in the memory, causing the processor 802 to perform the aspects of the embodiments described above. The processor 802 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
A memory 803 is coupled to the processor 802 via the system bus and communicates with the processor, and the memory 803 is used for storing computer program instructions.
The transceiver 801 may be used to obtain the task to be executed and the configuration information of the task to be executed.
The system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus. The transceiver is used to enable communication between the database access device and other computers (e.g., clients, read-write libraries, and read-only libraries). The memory may include random access memory (RAM) and may also include non-volatile memory.
The electronic device provided by the embodiment of the present disclosure may be the terminal device of the above embodiment.
The embodiment of the present disclosure further provides a chip for running instructions, and the chip is used to execute the technical solution of the training method of the bad asset recovery rate prediction model or of the bad asset recovery rate prediction method in the above embodiments.
The embodiment of the present disclosure further provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions run on a computer, the computer is caused to execute the technical solution of the training method of the bad asset recovery rate prediction model or of the bad asset recovery rate prediction method in the foregoing embodiments.
The embodiment of the present disclosure further provides a computer program product, which includes a computer program stored in a computer-readable storage medium. At least one processor can read the computer program from the computer-readable storage medium, and when executing the computer program, the at least one processor can implement the technical solution of the training method of the bad asset recovery rate prediction model or of the bad asset recovery rate prediction method in the foregoing embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A training method of a bad asset recovery rate prediction model is characterized by comprising the following steps:
constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources;
performing feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample;
inputting the feature vector of the personal loan sample into a double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model; the double-layer nested model is a double-layer model formed by nesting a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model;
and training the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determining the trained double-layer nested model as the bad asset recovery rate prediction model.
2. The method of claim 1, wherein said extracting features from said loan information wide table to obtain a feature vector for said personal loan sample, comprises:
identifying fields in the loan information wide table to obtain the field types of the fields in the loan information wide table;
processing at least some of the fields in the loan information wide table based on the field types, so that all fields in the loan information wide table become numerical fields;
and performing feature extraction on the processed loan information wide table to obtain a feature vector of the personal loan sample.
3. The method of claim 2, wherein said processing at least a portion of the fields in the wide table of loan information based on the field type comprises:
in response to the field type being a long character string field, extracting the length of each long character string in the long character string field in the loan information wide table, and taking the extracted length of each long character string as numerical value information of the corresponding long character string field;
and/or discretely extracting date and time information of the first date and time field in the loan information wide table in response to the field type being the first date and time field, and taking the extracted information as numerical information corresponding to the first date and time field; the first date and time class field is a date and time class field with a standard format;
and/or, in response to the field type being a second date-time field, discretely extracting date and time information of the second date-time field in the loan information wide table, and taking the extracted information as numerical information corresponding to the second date-time field; wherein the second date-time field is an 8-character string date field;
and/or, in response to the field type being a third date-time field, discretely extracting date and time information of the third date-time field in the loan information wide table, and taking the extracted information as numerical information corresponding to the third date-time field; wherein the third date-time field is a 10-character string date field;
and/or, in response to the field type being a category field, numerically encoding the category field in the loan information wide table, and taking the encoded value as the numerical information of the corresponding category field.
4. The method of claim 1, wherein the inputting the feature vector of the personal loan sample into a double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model comprises:
inputting the feature vector of the personal loan sample into the double-layer nested model;
classifying the feature vector of the personal loan sample based on the LightGBM binary classification model in the double-layer nested model to obtain a classification result output by the LightGBM binary classification model;
in response to the classification result being a first classification category, taking the recovery rate corresponding to the first classification category as the recovery rate predicted value output by the double-layer nested model; wherein the first classification category is used to indicate that the personal loan sample is fully recovered;
or, in response to the classification result being a second classification category, obtaining a feature vector corresponding to the second classification category from the feature vectors of the personal loan sample, and performing regression prediction on the feature vector corresponding to the second classification category based on the LightGBM regression model in the double-layer nested model to obtain the recovery rate predicted value output by the double-layer nested model.
5. The method of claim 1, wherein the training of the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample comprises:
acquiring real label information of the personal loan sample;
calculating a loss function of the double-layer nested model according to the recovery rate predicted value and the real label information;
and calculating residual errors of the double-layer nested model based on the loss function, and fitting a new decision tree based on the residual errors so as to realize the training of the double-layer nested model.
6. The method of claim 5, wherein said acquiring the real label information of the personal loan sample comprises:
obtaining loan account information, transfer-in/out information and risk classification information of the personal loan sample;
acquiring bad contract transfer-in/out information according to the loan account information, the transfer-in/out information and the risk classification information;
and calculating the recovery rate of the personal loan sample according to the earliest transfer-in information and the latest transfer-out information in the bad contract transfer-in/out information, and taking the recovery rate as the real label information of the personal loan sample.
7. The method according to any one of claims 1 to 6, wherein the constructing a loan information broad table of the personal loan sample based on the comprehensive information of the personal loan sample from multiple data sources comprises:
obtaining customer-related information of a personal loan sample, and obtaining credit line contract information and borrowing contract information of the personal loan sample;
acquiring master-slave contract information of the personal loan sample according to the guarantee and/or mortgage contract information of the personal loan sample;
acquiring resource information of the personal loan sample, and acquiring loan account information, transfer-in and transfer-out information and risk classification information;
and constructing a loan information wide table of the personal loan sample according to the customer-related information, the credit line contract information, the borrowing contract information, the master-slave contract information, the resource information, the loan account information, the transfer-in/out information and the risk classification information.
8. A bad asset recovery rate prediction method, comprising:
constructing a loan information wide table of a target object according to comprehensive information of the target object from multiple data sources;
performing feature extraction on the loan information wide table of the target object to obtain a feature vector of the target object;
inputting the feature vector into a preset bad asset recovery rate prediction model to obtain a bad asset recovery rate predicted value of the target object;
wherein the bad asset recovery prediction model is a model trained based on the training method according to any one of claims 1 to 7.
9. A training apparatus for a bad asset recovery rate prediction model, comprising:
the construction module is used for constructing a loan information wide table of a personal loan sample according to comprehensive information of the personal loan sample from multiple data sources;
the extraction module is used for performing feature extraction on the loan information wide table to obtain a feature vector of the personal loan sample;
the prediction module is used for inputting the feature vector of the personal loan sample into a double-layer nested model to obtain a recovery rate predicted value output by the double-layer nested model; the double-layer nested model is a double-layer model formed by nesting a gradient boosting decision tree LightGBM binary classification model and a LightGBM regression model;
and the training module is used for training the double-layer nested model according to the recovery rate predicted value and the real label information of the personal loan sample, and determining the trained double-layer nested model as the bad asset recovery rate prediction model.
10. A bad asset recovery rate prediction apparatus, comprising:
the construction module is used for constructing a loan information wide table of a target object according to comprehensive information of the target object from multiple data sources;
the extraction module is used for performing feature extraction on the loan information wide table of the target object to obtain a feature vector of the target object;
the prediction module is used for inputting the feature vector into a preset bad asset recovery rate prediction model to obtain a bad asset recovery rate predicted value of the target object;
wherein the bad asset recovery prediction model is a model trained based on the training method according to any one of claims 1 to 7.
11. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any one of claims 1-7 or to implement the method of claim 8.
12. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, are configured to implement the method of any one of claims 1-7, or the method of claim 8.
13. A computer program product, comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7, or implements the method of claim 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211620585.8A CN115880056A (en) 2022-12-15 2022-12-15 Training method of bad asset recovery rate prediction model and recovery rate prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211620585.8A CN115880056A (en) 2022-12-15 2022-12-15 Training method of bad asset recovery rate prediction model and recovery rate prediction method

Publications (1)

Publication Number Publication Date
CN115880056A true CN115880056A (en) 2023-03-31

Family

ID=85755035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211620585.8A Pending CN115880056A (en) 2022-12-15 2022-12-15 Training method of bad asset recovery rate prediction model and recovery rate prediction method

Country Status (1)

Country Link
CN (1) CN115880056A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination