CN118071486A - Method and device for determining default probability, storage medium and electronic device - Google Patents

Method and device for determining default probability, storage medium and electronic device

Info

Publication number
CN118071486A
CN118071486A
Authority
CN
China
Prior art keywords
variables, sub, object data, model, variable
Prior art date
Legal status: Pending
Application number
CN202311790509.6A
Other languages
Chinese (zh)
Inventor
刘律康
胡光琪
张东朔
姜蕴璐
刘清泉
Current Assignee
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202311790509.6A priority Critical patent/CN118071486A/en
Publication of CN118071486A publication Critical patent/CN118071486A/en


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a method and a device for determining default probability, a storage medium and an electronic device, wherein the method comprises the following steps: collecting object data of a plurality of objects, wherein each object data comprises a plurality of variables; establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object; and inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.

Description

Method and device for determining default probability, storage medium and electronic device
Technical Field
The application relates to the field of intelligent monitoring, in particular to a method and a device for determining default probability, a storage medium and an electronic device.
Background
Default prediction with big-data machine learning models is a common approach in the credit risk field. However, the dependent variable of a machine learning model requires a sufficient data volume and clearly defined samples, and samples of urban investment bond defaults are missing (no substantive default has occurred so far), so default cannot be used directly as the target value. A general model therefore selects similar risk events and identifiers as substitute dependent variables for model training, such as the credit spread at issuance (the difference between the bond's coupon rate and the risk-free interest rate) or customers below a certain rating identified as "bad customers". The former provides a solution for credit risk ordering but suffers from the problem of "risk being defined too early" (a bond with poor credit at issuance cannot account for its credit changes at critical points in time during its life, especially before maturity); the latter relies more on expert experience, is rather subjective, and lacks data support, so the sample size remains very limited.
Aiming at the problems in the related art that the method for predicting the potential credit risk of urban investment enterprises is too simple, the data range is relatively narrow, and the default probability of urban investment enterprises cannot be accurately estimated, no effective solution has yet been proposed.
Accordingly, there is a need for improvements in the related art to overcome the drawbacks of the related art.
Disclosure of Invention
The embodiments of the application provide a method and a device for determining a default probability, a storage medium and an electronic device, which at least solve the problems that in the prior art the method for predicting the potential credit risk of urban investment enterprises is too simple, the data range is relatively narrow, and the default probability of urban investment enterprises cannot be accurately estimated.
According to an aspect of the embodiment of the application, there is provided a method for determining a default probability, including: collecting object data of a plurality of objects, wherein each object data comprises a plurality of variables; establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object; and inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.
Further, before collecting object data of the plurality of objects, the method further comprises: adding object labels to the plurality of objects according to class violation rules, wherein the object labels comprise: a class violation object, a normal object, wherein the class violation object is an object that meets the class violation rules.
Further, after collecting object data of the plurality of objects, the method further comprises: performing data cleaning on the plurality of object data according to a preset cleaning rule to obtain a plurality of first object data, wherein the preset cleaning rule is used for cleaning abnormal variables in the plurality of variables; performing variable derivation on the plurality of first object data according to a variable derivation rule to obtain a plurality of second object data, wherein the number of variables contained in the first object data is smaller than that of variables contained in the second object data; and carrying out variable binning on the plurality of second object data, screening out a plurality of qualified variables contained in the plurality of second object data according to a binning result, and establishing a plurality of sub-models according to the categories of the plurality of qualified variables.
Further, variable binning the plurality of second object data includes: sorting the first variables in the plurality of second object data; calculating a target value of each of the sorted first variables, wherein the target value indicates the difference between a first proportion and a second proportion among a plurality of second variables ordered before the first variable, the first proportion indicating the ratio of a first number of second variables whose object labels are class-default objects among the plurality of second variables to a second number of first variables whose object labels are class-default objects in the plurality of second object data, and the second proportion indicating the ratio of a third number of second variables whose object labels are normal objects among the plurality of second variables to a fourth number of first variables whose object labels are normal objects in the plurality of second object data; determining, among the sorted first variables, the target first variable corresponding to the target value with the largest absolute value among the plurality of target values as a cut point, and dividing the sorted first variables into two bins according to the cut point; and continuing to perform variable binning on the two resulting bins until a preset number of bins is obtained.
Further, screening out the plurality of qualified variables contained in the plurality of second object data according to the binning result includes: calculating an evidence weight value for each bin of the plurality of bins by: WOEi = ln((Bi/B)/(Gi/G)), wherein WOEi is the evidence weight value, Bi is the number of first variables in the ith bin whose object labels are class-default objects, Gi is the number of first variables in the ith bin whose object labels are normal objects, B is the second number, and G is the fourth number; and screening the first variables according to the evidence weight values of each first variable in the second object data to obtain the qualified variables.
Further, establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the method comprises the following steps: dividing the qualified variables into a plurality of categories according to a preset classification rule, wherein the categories are in one-to-one correspondence with the sub-models; selecting N qualified variables from M qualified variables of a target class to establish a corresponding target sub-model, and determining the N qualified variables as the modeling variables of the target sub-model, wherein M and N are positive integers, and M is greater than or equal to N; determining model output results of the plurality of sub-models, and performing linear processing on the plurality of model output results to obtain a plurality of processed model output results; and establishing the fusion model according to the output results of the plurality of processed models.
Further, after obtaining the default probability of the target object, the method further includes: establishing a scoring model of the plurality of objects; and inputting the default probability into the scoring model for processing to obtain the credit score of the target object.
According to another aspect of the embodiment of the present application, there is also provided a device for determining a probability of default, including: the system comprises an acquisition module, a storage module and a control module, wherein the acquisition module is used for acquiring object data of a plurality of objects, and each object data comprises a plurality of variables; the establishing module is used for establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object; and the processing module is used for inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.
According to yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described method of determining the probability of breach when run.
According to still another aspect of the embodiments of the present application, there is further provided an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the method for determining the probability of breach through the computer program.
According to the application, object data of a plurality of objects is first collected, wherein each object data comprises a plurality of variables; a plurality of sub-models is established according to the categories of the plurality of variables, the modeling variables of each sub-model are determined, and a fusion model is established according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object; target object data of the target object is input into the plurality of sub-models, and the plurality of obtained processing results is input into the fusion model for processing to obtain the default probability of the target object, the target object data comprising at least the modeling variables of the sub-models. With this scheme, the credit level of urban investment enterprises (target objects) is analyzed deeply, objectively and comprehensively from multiple dimensions, the credit qualification of urban investment enterprises is comprehensively evaluated using the fusion model, and the risk identification capability of financial institutions' bond investment and trading business is improved; this solves the problems in the related art that the method for predicting the potential credit risk of urban investment enterprises is too simple, the data range is relatively narrow, and the default probability of urban investment enterprises cannot be accurately estimated.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a computer terminal for a method of determining a default probability according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of determining a default probability according to an embodiment of the present application;
FIG. 3 is a flow chart of a method of determining a default probability according to an embodiment of the present application;
FIG. 4 is a block diagram of a device for determining a default probability according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The method embodiments provided in the embodiments of the present application may be executed in a computer terminal or a similar computing device. Taking operation on a computer terminal as an example, fig. 1 is a block diagram of the hardware structure of a computer terminal for a method of determining a default probability according to an embodiment of the present application. As shown in fig. 1, the computer terminal may include one or more processors 102 (only one is shown in fig. 1; the processor 102 may include, but is not limited to, a microprocessor (Microprocessor Unit, abbreviated MPU) or a programmable logic device (Programmable Logic Device, abbreviated PLD)) and a memory 104 for storing data; in an exemplary embodiment, the computer terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the computer terminal described above. For example, the computer terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration with functions equivalent to or beyond those shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a method for determining a probability of breach in the embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, which corresponds to implementing the method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a method for determining a probability of default is provided, and fig. 2 is a flowchart of a method for determining a probability of default according to an embodiment of the present application, where the flowchart includes the following steps:
step S202, collecting object data of a plurality of objects, wherein each object data comprises a plurality of variables;
Step S204, establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variables of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object;
and S206, inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.
It should be noted that, thanks to the modeling process, when default probability prediction is finally performed on the target object, only the data corresponding to the modeling variables of the sub-models needs to be acquired. This saves the time and labor consumed by collecting unnecessary data, and by precisely locating the variables that influence default probability, the model can predict the target object's default probability more accurately.
Through the above steps, object data of a plurality of objects is first collected, wherein each object data comprises a plurality of variables; a plurality of sub-models is established according to the categories of the plurality of variables, the modeling variables of each sub-model are determined, and a fusion model is established according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object; target object data of the target object is input into the plurality of sub-models, and the plurality of obtained processing results is input into the fusion model for processing to obtain the default probability of the target object, the target object data comprising at least the modeling variables of the sub-models. With this scheme, the credit level of urban investment enterprises (target objects) is analyzed deeply, objectively and comprehensively from multiple dimensions, the credit qualification of urban investment enterprises is comprehensively evaluated using the fusion model, and the risk identification capability of financial institutions' bond investment and trading business is improved; this solves the problems in the related art that the method for predicting the potential credit risk of urban investment enterprises is too simple, the data range is relatively narrow, and the default probability of urban investment enterprises cannot be accurately estimated.
Through data acquisition, model training and probability prediction, a complete method for predicting the potential credit risk of urban investment enterprises is formed.
In an alternative embodiment, before step S202 of collecting the object data of the plurality of objects is executed, the method further comprises: adding object labels to the plurality of objects according to class-violation rules, wherein the object labels comprise: a class-default object and a normal object, the class-default object being an object that meets the class-violation rules.
Before collecting the object data, the objects need to be labelled as either normal objects or class-default objects. Because domestic urban investment debt platforms have had no substantive default cases so far, and considering the diversity and forward-looking nature of how urban investment credit risk manifests itself, three categories of customers, namely those with a "non-standard default", a "low in-bank rating" or a "low market principal rating", are defined as "class-default objects".
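As a minimal illustration of this labelling step, the following sketch assigns the "class-default" or "normal" label from three indicator fields; the field names and thresholds are illustrative assumptions, not the exact rules of the application:

```python
def label_object(obj: dict) -> str:
    """Return 'class_default' if any class-violation rule fires, else 'normal'.

    The three rules mirror the categories named in the text; field names and
    rating scales below are hypothetical placeholders.
    """
    if obj.get("nonstandard_default"):           # rule 1: non-standard default occurred
        return "class_default"
    if obj.get("internal_rating", 10) <= 3:      # rule 2: low in-bank rating (assumed scale)
        return "class_default"
    if obj.get("market_rating", "AAA") in ("A", "BBB", "BB", "B"):  # rule 3: low market rating
        return "class_default"
    return "normal"
```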
Based on the above steps, after collecting object data of a plurality of objects, the method further comprises: performing data cleaning on the plurality of object data according to a preset cleaning rule to obtain a plurality of first object data, wherein the preset cleaning rule is used for cleaning abnormal variables in the plurality of variables; performing variable derivation on the plurality of first object data according to a variable derivation rule to obtain a plurality of second object data, wherein the number of variables contained in the first object data is smaller than that of variables contained in the second object data; and carrying out variable binning on the plurality of second object data, screening out a plurality of qualified variables contained in the plurality of second object data according to a binning result, and establishing a plurality of sub-models according to the categories of the plurality of qualified variables.
After the object data is collected, it needs to be cleaned. Specifically, variables with a missing rate greater than or equal to 70%, with numerous errors (mainly logic errors and abnormal values), or with a concentration greater than or equal to 75% may be deleted (corresponding to the preset cleaning rule), where the concentration of a risk variable A = the count of A's most frequent value / the total number of samples × 100%. Variable derivation is then performed on the basic variables according to the variable derivation rule, i.e. the basic variables are processed into "dynamic" new variables according to business experience: for example, the direction of change (the direction in which the same variable's value changes at different times), the degree of change (the amplitude by which the same variable's value changes at different times), and comparison with a reference value (the degree of deviation calculated by comparing a variable with the reference value); missing values of some variables are also filled in. Variable binning is then performed on the plurality of second object data, a plurality of qualified variables contained therein are screened out according to the binning result, and finally a plurality of sub-models are established according to the categories of the plurality of qualified variables.
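The cleaning rule above (dropping variables with a missing rate of 70% or more, or a concentration of 75% or more) can be sketched in plain Python as follows; the data layout (a dict of equal-length value lists, with None marking a missing value) is an assumption made for illustration:

```python
from collections import Counter

def clean_variables(data, max_missing=0.70, max_concentration=0.75):
    """Drop variables whose missing rate or concentration exceeds the thresholds.

    data: {variable_name: list of values, None = missing}.
    concentration = count of the most frequent value / total sample count.
    """
    cleaned = {}
    n = len(next(iter(data.values())))
    for name, values in data.items():
        missing_rate = sum(v is None for v in values) / n
        observed = [v for v in values if v is not None]
        concentration = Counter(observed).most_common(1)[0][1] / n if observed else 0.0
        if missing_rate < max_missing and concentration < max_concentration:
            cleaned[name] = values
    return cleaned
```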
The above-mentioned reference value may be understood as the mean, median, etc. of the variable within its field, used as a reference standard.
Further, the above step of variable binning the plurality of second object data may be achieved by: sorting the first variables in the plurality of second object data; calculating a target value of each of the sorted first variables, wherein the target value indicates the difference between a first proportion and a second proportion among a plurality of second variables ordered before the first variable, the first proportion indicating the ratio of a first number of second variables whose object labels are class-default objects among the plurality of second variables to a second number of first variables whose object labels are class-default objects in the plurality of second object data, and the second proportion indicating the ratio of a third number of second variables whose object labels are normal objects among the plurality of second variables to a fourth number of first variables whose object labels are normal objects in the plurality of second object data; determining, among the sorted first variables, the target first variable corresponding to the target value with the largest absolute value among the plurality of target values as a cut point, and dividing the sorted first variables into two bins according to the cut point; and continuing to perform variable binning on the two resulting bins until a preset number of bins is obtained.
Because the dimensions of the variables are inconsistent, the variables usually need to be standardized, and the standardized variable values are used for modeling; this improves model fit, speeds up convergence, and improves the model's efficiency and accuracy. Variable standardization and transformation in the application mainly adopt binning and the WOE (Weight of Evidence) method. Variable binning adopts the optimal-KS binning method, which comprises the following steps: the variable values are sorted from small to large; the variable value at which the absolute difference between the cumulative bad-customer proportion (i.e. the first proportion) and the cumulative good-customer proportion (i.e. the second proportion) is largest, namely the variable value corresponding to the KS statistic, is taken as the cut point and denoted D; the data is then split into two parts at D. The same step is repeated recursively on the two resulting bins, further cutting the data around D, until the binning meets the preset conditions.
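A minimal sketch of the optimal-KS binning described above, assuming labels of 1 for class-default objects and 0 for normal objects (both classes present). For simplicity it splits the largest bin at each step and omits any sample-count conditions on the bins:

```python
def best_ks_split(pairs, total_bad, total_good):
    """pairs: (value, label) tuples sorted by value, label 1 = class-default.

    Return the index after which |cum_bad/B - cum_good/G|, i.e. the KS
    statistic, is maximal; a cut after the last point is not allowed.
    """
    cum_bad = cum_good = 0
    best_idx, best_ks = 0, -1.0
    for i, (_, label) in enumerate(pairs[:-1]):
        cum_bad += label
        cum_good += 1 - label
        ks = abs(cum_bad / total_bad - cum_good / total_good)
        if ks > best_ks:
            best_ks, best_idx = ks, i
    return best_idx

def ks_binning(values, labels, n_bins=4):
    """Recursively split the sorted variable at the max-KS cut point
    until the preset number of bins is reached."""
    pairs = sorted(zip(values, labels))
    total_bad = sum(labels)
    total_good = len(labels) - total_bad
    bins = [pairs]
    while len(bins) < n_bins:
        bins.sort(key=len, reverse=True)   # split the largest bin next
        seg = bins.pop(0)
        if len(seg) < 2:                   # nothing left to cut
            bins.append(seg)
            break
        cut = best_ks_split(seg, total_bad, total_good)
        bins += [seg[:cut + 1], seg[cut + 1:]]
    return [[v for v, _ in seg] for seg in sorted(bins)]
```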
It should be noted that the preset conditions may include reaching the preset number of bins, and may further include: the proportion of samples in each bin must be greater than a first threshold, the number of samples in each bin must not be less than a second threshold, and so on. The preset conditions may be regarded as satisfied when any one of them is met, or may be set so that all must be met; the application does not limit this.
Further, screening out the plurality of qualified variables contained in the plurality of second object data according to the binning result includes: calculating an evidence weight value for each bin of the plurality of bins by: WOEi = ln((Bi/B)/(Gi/G)), wherein WOEi is the evidence weight value, Bi is the number of first variables in the ith bin whose object labels are class-default objects, Gi is the number of first variables in the ith bin whose object labels are normal objects, B is the second number, and G is the fourth number; and screening the first variables according to the evidence weight values of each first variable in the second object data to obtain the qualified variables.
The WOE value calculation formula for each bin of a variable is: WOEi = ln((Bi/B)/(Gi/G)), wherein Bi is the number of first variables in the ith bin whose object label is a class-default object, Gi is the number of first variables in the ith bin whose object label is a normal object, B is the total number of first variables whose object label is a class-default object (the second number), and G is the total number of first variables whose object label is a normal object (the fourth number). From this formula, if WOE equals 0, the bad proportion of the bin is consistent with the overall sample; if WOE is less than 0, the bad proportion of the bin is below the overall sample level and the customers in the bin have better qualifications; if WOE is greater than 0, the bad proportion of the bin is above the overall sample level and the customers in the bin have worse qualifications.
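The WOE computation above can be sketched as follows, taking per-bin (class-default count, normal count) pairs as input:

```python
import math

def woe_values(bins):
    """bins: list of (bad_count, good_count) per bin.

    WOE_i = ln((B_i / B) / (G_i / G)), where B and G are the total
    class-default (bad) and normal (good) counts over all bins.
    """
    B = sum(b for b, _ in bins)
    G = sum(g for _, g in bins)
    return [math.log((b / B) / (g / G)) for b, g in bins]
```

Consistent with the text, a bin whose bad proportion matches the overall sample gets WOE = 0, a bin with a lower bad proportion gets a negative WOE, and a bin with a higher bad proportion gets a positive WOE.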
On this basis, validity analysis and business-rationality adjustment are then carried out. Taking the business meaning of each variable into account, the direction of the relationship between the independent variables and the dependent variable is considered, and variables whose results contradict their business meaning are eliminated; only variables whose binned WOE values exhibit a monotonic trend are retained. That is, each variable has its actual business meaning; since the variable is sorted before binning, the WOE values of its bins should be monotonic, and if they are not, the data is problematic. The plurality of qualified variables is thereby finally obtained, so that the sub-models can be established according to the categories of the plurality of qualified variables.
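The monotonic-trend screen can be expressed as a simple check on the per-bin WOE sequence:

```python
def is_monotonic(woes):
    """Keep a variable only if its per-bin WOE values all move in one
    direction (non-decreasing or non-increasing)."""
    diffs = [b - a for a, b in zip(woes, woes[1:])]
    return all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)
```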
Further, the steps are as follows: establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the method comprises the following steps of: dividing the qualified variables into a plurality of categories according to a preset classification rule, wherein the categories are in one-to-one correspondence with the sub-models; selecting N qualified variables from M qualified variables of a target class to establish a corresponding target sub-model, and determining the N qualified variables as the modeling variables of the target sub-model, wherein M and N are positive integers, and M is greater than or equal to N; determining model output results of the plurality of sub-models, and performing linear processing on the plurality of model output results to obtain a plurality of processed model output results; and establishing the fusion model according to the output results of the plurality of processed models.
According to the actual meanings of the different variables, the qualified variables are divided into four types of data: enterprise financial report data, local government finance data, bond market data, and in-bank data. These correspond to five different sub-models: an enterprise finance sub-model, an enterprise operation sub-model, a local government sub-model, a bond market sub-model, and a bank-enterprise relationship sub-model. The five sub-models are trained with the classified qualified variables. As shown in table 1, the initial basic variables corresponding to the sub-models number 53, 24, 57, 28, and 20 respectively; after variable derivation by the variable derivation rule, the candidate variables number 179, 38, 151, 106, and 98 respectively; after binning and screening, 26, 13, 47, 20, and 84 qualified variables are obtained. Taking the enterprise finance sub-model as an example, not every qualified modeling variable produces a gain for model training, so variables that negatively affect training can be removed one by one during the training process; when training finally finishes, the number of variables adopted by the trained model is determined to be 7, i.e., the model has 7 modeling variables.
It should be noted that the multiple sub-models and the fusion model are both logistic regression models.
TABLE 1
Because the training process of the logistic regression model passes through a sigmoid function, the output results (p1-p5) range between 0 and 1; therefore, the probabilities (p1-p5) are restored to linear form (Y1-Y5) according to the logistic regression formula before the next step:
ln(p/(1-p)) = α + β1x1 + β2x2 + … + βnxn = Y;
The processed 5 sub-model output results (Y1-Y5) are then substituted into a new fusion model as variables to obtain 5 new coefficients β, which determine the weight of each sub-model and yield the final fusion model result p_final.
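A minimal sketch of this linearize-then-refit step in Python. The array names, the use of scikit-learn's `LogisticRegression` as the fitting routine, and the synthetic shapes are illustrative assumptions, not part of the application:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p, eps=1e-12):
    """Restore probabilities to linear (log-odds) form: Y = ln(p / (1 - p))."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return np.log(p / (1 - p))

def fit_fusion_model(sub_probs, y):
    """sub_probs: (n_samples, 5) array of p1..p5 from the five sub-models.
    y: 0/1 labels (1 = class default object). Both names are illustrative."""
    Y = logit(sub_probs)           # linearized sub-model outputs Y1..Y5
    fusion = LogisticRegression()  # the fusion model is itself logistic regression
    fusion.fit(Y, y)               # its 5 coefficients beta act as sub-model weights
    return fusion

# p_final for new data: fusion.predict_proba(logit(new_sub_probs))[:, 1]
```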
It should be noted that the model algorithm in the present application mainly adopts logistic regression, a generalized linear regression analysis model commonly used in fields such as data mining, automatic disease diagnosis, and economic prediction. Logistic regression estimates the probability of an event occurring from a given set of independent-variable data, with the dependent variable ranging between 0 and 1. Logistic regression is also commonly used to build credit scorecards.
The multivariate logistic regression formula is as follows:
p = 1 / (1 + e^(-(α + β1x1 + β2x2 + … + βnxn)));
It is equivalent to:
ln(p/(1-p)) = α + β1x1 + β2x2 + … + βnxn;
p is the dependent variable, whose observed value at modeling time is 0 or 1 (i.e., y takes the value 0 or 1); the predicted value of p lies between 0 and 1 and represents the probability that the sample is a bad customer. In the present application, the independent variable xi can be substituted with the variable's WOE value after binning.
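As a quick numeric check of the equivalence between the sigmoid form and the log-odds form above (the coefficients α, β and the WOE-style inputs below are arbitrary illustrative values, not from the application):

```python
import math

def sigmoid_prob(alpha, betas, x):
    """p = 1 / (1 + e^-(alpha + sum(beta_i * x_i))): the logistic regression formula."""
    z = alpha + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative coefficients and WOE-valued inputs.
alpha, betas = -1.2, [0.8, -0.5]
x = [0.3, 1.1]
p = sigmoid_prob(alpha, betas, x)

# Equivalence check: ln(p / (1 - p)) recovers the linear predictor z.
z = alpha + sum(b * xi for b, xi in zip(betas, x))
assert abs(math.log(p / (1 - p)) - z) < 1e-9
```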
Further, after obtaining the default probability of the target object, the method further includes: establishing a scoring model of the plurality of objects; and inputting the default probability into the scoring model for processing to obtain the credit score of the target object.
The basic output of each sub-model and of the final fusion model is a default probability, which can be converted into a client credit score so that the client's rating can be displayed more intuitively. The score calibration process requires 3 parameters: the standard score (P0), the standard odds (Odds0), and the PDO.
Standard score (P0): the model score corresponding to the standard odds (i.e., the model score corresponding to Odds0);
Standard odds (Odds0): the odds corresponding to the standard score;
PDO (Points to Double the Odds): the number of points by which the score decreases when the odds double.
The scorecard score can be defined by a linear expression that represents the score as a linear function of the log-odds, as follows:
Score=A-B ln(Odds);
The score corresponding to Odds0 is set to P0; then the score corresponding to odds of 2·Odds0 is P0 - PDO. Substituting into the formula above solves for A and B:
A=P0+Bln(Odds0);
B=PDO/ln(2);
The final scorecard model score is:
Score = A - B(α + β1x1 + β2x2 + … + βnxn);
The base score of the final model is:
P0 + PDO/ln(2)·(ln(Odds0) - α);
The variable bin score of the model is:
-PDO/ln(2)·βi·WOEij·δij;
where WOEij is the WOE value of the j-th bin of the i-th variable, and δij indicates whether the sample falls into that bin.
Because the bad proportion of the sampled clients differs from the bad proportion of the overall sample, the constant term also needs to be adjusted accordingly, and the adjusted result is rounded off.
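The calibration above (solving A and B from P0, Odds0, and PDO, then scoring from odds) can be sketched as follows; the parameter values in the example are illustrative assumptions, not taken from the application:

```python
import math

def scorecard_params(P0, Odds0, PDO):
    """Solve A and B from the calibration system:
       P0       = A - B * ln(Odds0)
       P0 - PDO = A - B * ln(2 * Odds0)   (doubling the odds costs PDO points)"""
    B = PDO / math.log(2)
    A = P0 + B * math.log(Odds0)
    return A, B

def score_from_odds(odds, A, B):
    """Score = A - B * ln(Odds)."""
    return A - B * math.log(odds)

# Illustrative calibration: standard score 600 at odds 1:50, 20 points to double the odds.
A, B = scorecard_params(P0=600, Odds0=1 / 50, PDO=20)
```

With this calibration, odds of 1/50 map to a score of 600, and doubled odds of 2/50 map to 580, i.e., exactly PDO points lower.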
It will be apparent that the embodiments described above are merely some, but not all, of the embodiments of the application. To better understand the method for determining the default probability, the following explains the above process with reference to specific embodiments; these are not intended to limit the technical solutions of the embodiments of the present application. Specifically:
In an alternative embodiment, the above method for determining the default probability may be implemented in the following manner, as shown in fig. 3, including the following steps:
Step 1, determining an analysis sample:
The samples are labeled as normal issuers (corresponding to the normal objects) and default issuers (corresponding to the class default objects). Because domestic urban investment bond platforms have had no substantial default cases so far, and considering the diversity and forward-looking nature of how urban investment credit risk manifests, the model in the invention defines 3 classes of clients ('non-standard defaults', 'low internal bank ratings', 'low market principal ratings') as 'class default samples'.
In terms of the sample time range, 4556 issuers on urban investment platforms with class defaults or maturities during the period from October 1, 2017 to October 1, 2020 are selected, among which there are 104 'class default samples', consisting respectively of 'historical non-standard defaults', 'domestic principal rating of BBB or below / foreign principal rating of B (S&P, Fitch) / B2 (Moody's) or below', and 'internal bank rating of 12 or below', accounting for 2.28% of the total sample.
An effective model needs predictive lead time, i.e., the model should be able to flag an issuer some time before a default occurs. Given the poor liquidity of bonds, especially bonds at risk of default, the data points used for modeling are taken 3 months (or more) before the default/maturity time.
Step 2: obtain public data from public sources, including financial data, market data, and macroeconomic data; obtain the in-bank data of issuers (namely objects) from the in-bank data warehouse, covering lender transaction details, business-handling records, early-warning records from the internal early-warning system, and so on.
On the data side, based on public data such as financial reports, macroeconomic data, and market data, the bank's internal data, and an analysis of the default characteristics of urban investment credit-risk entities, a total of 182 basic variables in 5 major classes is compiled in the initial stage.
Step 3, quality control of variable data:
Variables with a missing rate of 70% or more, with many errors (mainly logical errors and abnormal values), or with a concentration of 75% or more are deleted, where the concentration of a risk variable A = (number of occurrences of A's most frequent value / total number of samples) × 100%.
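A hedged pandas sketch of this quality-control step; the function and column handling are illustrative, while the thresholds are the ones stated above:

```python
import pandas as pd

def quality_filter(df, max_missing=0.70, max_concentration=0.75):
    """Drop variables with missing rate >= 70% or concentration >= 75%,
    where concentration = (count of the most frequent value) / (total samples)."""
    keep = []
    n = len(df)
    for col in df.columns:
        missing_rate = df[col].isna().mean()
        counts = df[col].value_counts(dropna=True)
        concentration = (counts.iloc[0] / n) if not counts.empty else 1.0
        if missing_rate < max_missing and concentration < max_concentration:
            keep.append(col)
    return df[keep]
```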
Step 4, variable data derivative processing:
The basic variables are processed into 'dynamic' new variables according to business experience: for example, the change direction (the direction in which the same variable changes across values at different times), the change degree (the amplitude by which the same variable changes across values at different times), the comparison with a reference value (the deviation computed by comparing a variable with the reference value), and missing-value filling for some variables. After this step is completed, a total of 572 candidate variables is obtained.
The above reference value can be understood as the mean, median, or similar statistic of the variable in its field, used as a reference standard.
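The derivation step can be illustrated roughly as follows; the column-naming scheme (`_t` for the current value, `_t1` for the prior value) and the helper function are assumptions made for the sketch:

```python
import pandas as pd

def derive_dynamic_variables(df, var, ref):
    """Derive 'dynamic' variables from a base variable observed at two times:
    change direction, change degree, and deviation from a reference value
    (e.g. an industry mean or median). Column names are illustrative."""
    out = pd.DataFrame(index=df.index)
    diff = df[f"{var}_t"] - df[f"{var}_t1"]
    out[f"{var}_chg_dir"] = diff.apply(lambda d: 1 if d > 0 else (-1 if d < 0 else 0))
    out[f"{var}_chg_pct"] = diff / df[f"{var}_t1"].abs()   # change amplitude
    out[f"{var}_dev_ref"] = df[f"{var}_t"] - ref           # deviation from reference
    return out
```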
Step 5, variable sub-box and WOE conversion:
Because the dimensions of the variables are inconsistent, the variables usually need to be standardized; modeling with standardized variable values gives a better model fit, faster convergence, and improved model efficiency and accuracy after conversion.
Referring to industry experience, variable standardization and conversion mainly adopt binning and the WOE (Weight of Evidence) method:
Variable binning adopts the optimal-KS binning method, with the following steps:
1) Sorting the variables from small to large according to the values;
2) Calculate the variable value corresponding to the KS value, i.e., the value at which the absolute difference between the cumulative bad-client proportion and the cumulative good-client proportion is largest, take it as the cut point, and mark it as D; then split the data into two parts. The cumulative bad-client proportion is, taking a given variable value as the boundary, the ratio of the number of records up to that boundary whose object label is a class default object to the total number of bad clients; the cumulative good-client proportion is correspondingly the ratio of the number of records whose object label is a normal object to the total number of good clients;
3) Repeat step 2) recursively, further cutting the data on either side of D, until the variable bins reach the preset threshold conditions.
In the modeling process, the preset conditions for variable binning are: 1. the number of bins does not exceed 7; 2. the proportion of samples in each bin is not less than 1%; 3. the number of samples in each bin is not less than 50.
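The optimal-KS binning procedure with the stopping conditions above can be sketched as follows. This is a simplified recursion; the greedy choice of always splitting the largest bin next is an assumption of the sketch:

```python
import numpy as np

def ks_split(values, labels):
    """Return the cut point D: the variable value where the absolute difference
    between cumulative bad share and cumulative good share (the KS value) is
    largest, after sorting by value. labels: 1 = bad, 0 = good."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    cum_bad = np.cumsum(y == 1) / max((y == 1).sum(), 1)
    cum_good = np.cumsum(y == 0) / max((y == 0).sum(), 1)
    return v[np.argmax(np.abs(cum_bad - cum_good))]

def ks_binning(values, labels, max_bins=7, min_frac=0.01, min_count=50):
    """Recursively cut on D until a stopping condition from the text is hit:
    at most 7 bins, each bin holding >= 1% of samples and >= 50 samples."""
    bins = [(np.asarray(values), np.asarray(labels))]
    min_size = max(min_count, int(min_frac * len(values)))
    while len(bins) < max_bins:
        bins.sort(key=lambda b: -len(b[0]))  # greedily split the largest bin
        v, y = bins[0]
        d = ks_split(v, y)
        left = v <= d
        if left.sum() < min_size or (~left).sum() < min_size:
            break  # this split would violate a bin-size rule
        bins = bins[1:] + [(v[left], y[left]), (v[~left], y[~left])]
    return bins
```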
The WOE value of each bin of a variable is calculated as:
WOEi = ln((bi / B) / (gi / G));
Wherein:
b i = number of bad samples in the ith bin;
g i = number of good samples in the ith bin;
B = total number of bad samples in the sample;
G = total number of good samples in the sample.
From the above formula: if WOE equals 0, the bad proportion of the bin matches the overall sample; if WOE is less than 0, the bad proportion of the bin is below the overall sample level and the bin's clients are better qualified; if WOE is greater than 0, the bad proportion of the bin is above the overall sample level and the bin's clients are worse qualified.
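A small sketch of the WOE computation using the definitions above; encoding labels as 1 = bad (class default object) and 0 = good (normal object) is an assumption of the sketch:

```python
import numpy as np

def woe_per_bin(bin_labels, B, G):
    """WOE_i = ln( (b_i / B) / (g_i / G) ) for one bin, using the quantities
    defined in the text: b_i/g_i = bad/good counts in the bin, B/G = totals."""
    y = np.asarray(bin_labels)
    b_i = (y == 1).sum()  # bad samples in the bin
    g_i = (y == 0).sum()  # good samples in the bin
    return np.log((b_i / B) / (g_i / G))
```

A bin whose bad/good mix matches the overall sample yields WOE = 0; a bin with a higher-than-overall bad share yields WOE > 0, matching the interpretation above.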
On this basis, validity analysis and business rationality adjustment continue. Combining business meaning and considering the direction of the relationship between independent and dependent variables, variables whose results contradict their business meaning are eliminated; only variables whose binned WOE values exhibit a monotonic trend are retained. After this step is completed, 190 qualified variables suitable for modeling are obtained.
Step 6, building a sub-model:
According to the analysis dimensions, 5 sub-models are established: an enterprise finance sub-model, an enterprise operation sub-model, a local government sub-model, a bond market sub-model, and a bank-enterprise relationship sub-model. The modeling variables of each sub-model are shown in table 1. The input to each sub-model is the variables' WOE values after the conversion described above.
Step 7, establishing a fusion model and a credit score card:
Since the training process of the logistic regression model passes through a sigmoid function, the output results (p1-p5) range between 0 and 1; therefore the probabilities (p1-p5) need to be restored to linear form (Y1-Y5) according to the logistic regression formula before the next step:
ln(p/(1-p)) = α + β1x1 + β2x2 + … + βnxn = Y;
The processed 5 sub-model output results (Y1-Y5) are then substituted into a new fusion model as variables to obtain 5 new coefficients β, which determine the weight of each sub-model and yield the final fusion model result p_final.
The basic output of each sub-model and of the final fusion model is a default probability, which is converted into a client credit score so that the client's rating can be displayed more intuitively. The score calibration process requires 3 parameters: the standard score (P0), the standard odds (Odds0), and the PDO.
Standard score (P0): the model score corresponding to the standard odds (i.e., the model score corresponding to Odds0);
Standard odds (Odds0): the odds corresponding to the standard score;
PDO (Points to Double the Odds): the number of points by which the score decreases when the odds double.
The scorecard score can be defined by a linear expression that represents the score as a linear function of the log-odds, as follows:
Score=A-Bln(Odds);
The score corresponding to Odds0 is set to P0; then the score corresponding to odds of 2·Odds0 is P0 - PDO. Substituting into the formula above solves for A and B:
A=P0+Bln(Odds0);
B=PDO/ln(2);
The final scorecard model score is:
Score = A - B(α + β1x1 + β2x2 + … + βnxn);
The base score of the final model is:
P0 + PDO/ln(2)·(ln(Odds0) - α);
The variable bin score of the model is:
-PDO/ln(2)·βi·WOEij·δij;
where WOEij is the WOE value of the j-th bin of the i-th variable, and δij indicates whether the sample falls into that bin.
Because the bad proportion of the sampled clients differs from the bad proportion of the overall sample, the constant term also needs to be adjusted accordingly, and the adjusted result is rounded off.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, though in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods of the various embodiments of the present application.
This embodiment also provides a device for determining the default probability, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a configuration of an apparatus for determining a probability of breach according to an embodiment of the present application, the apparatus including:
a collecting module 42, configured to collect object data of a plurality of objects, wherein each object data comprises a plurality of variables;
a building module 44, configured to establish a plurality of sub-models according to the categories of the plurality of variables, determine the modeling variables of each sub-model, and establish a fusion model according to the plurality of sub-models, where the fusion model is used to evaluate the default probability of the target object;
and a processing module 46, configured to input target object data of the target object into the plurality of sub-models for processing, and input the obtained plurality of processing results into the fusion model for processing, so as to obtain the default probability of the target object, where the target object data includes the modeling variables of the plurality of sub-models.
According to the application, object data of a plurality of objects is collected first, where each object data comprises a plurality of variables; a plurality of sub-models is established according to the categories of the variables, the modeling variables of each sub-model are determined, and a fusion model is established from the sub-models, where the fusion model is used to evaluate the default probability of a target object; the target object data is input into the plurality of sub-models, and the resulting processing results are input into the fusion model for processing to obtain the target object's default probability, where the target object data includes at least these modeling variables. With this scheme, the credit level of urban investment enterprises (target objects) is analyzed deeply, objectively, and comprehensively from multiple dimensions, their credit qualification is comprehensively evaluated by the fusion model, and the risk-identification capability of financial institutions' bond investment and trading business is improved. This solves the problem in the related art that prediction methods for the potential credit risk of urban investment enterprises are overly simple, rely on a relatively narrow data range, and cannot accurately estimate the default probability of urban investment enterprises.
Optionally, the collecting module 42 is further configured to add object labels to the plurality of objects according to a class default rule, where the object labels include a class default object and a normal object, the class default object being an object that meets the class default rule.
Before collecting object data, the objects need to be labeled as normal objects and class default objects. Because domestic urban investment bond platforms have had no substantial default cases so far, and considering the diversity and forward-looking nature of how urban investment credit risk manifests, 3 classes of clients ('non-standard defaults', 'low internal bank ratings', 'low market principal ratings') are defined as 'class default objects'.
Optionally, the collecting module 42 is further configured to perform data cleaning on a plurality of object data according to a preset cleaning rule to obtain a plurality of first object data, where the preset cleaning rule is used to clean abnormal variables in the plurality of variables; performing variable derivation on the plurality of first object data according to a variable derivation rule to obtain a plurality of second object data, wherein the number of variables contained in the first object data is smaller than that of variables contained in the second object data; and carrying out variable binning on the plurality of second object data, screening out a plurality of qualified variables contained in the plurality of second object data according to a binning result, and establishing a plurality of sub-models according to the categories of the plurality of qualified variables.
After the object data is collected, it needs to be cleaned. Specifically, variables with a missing rate of 70% or more, with many errors (mainly logical errors and abnormal values), or with a concentration of 75% or more may be deleted (corresponding to the preset cleaning rule), where the concentration of a risk variable A = (number of occurrences of A's most frequent value / total number of samples) × 100%. Variable derivation is then performed on the basic variables according to the variable derivation rule, i.e., the basic variables are processed into 'dynamic' new variables according to business experience: for example, the change direction (the direction in which the same variable changes across values at different times), the change degree (the amplitude by which the same variable changes across values at different times), the comparison with a reference value (the deviation computed by comparing a variable with the reference value), and missing-value filling for some variables. Variable binning is then performed on the plurality of second object data, the qualified variables contained therein are screened out according to the binning results, and finally the plurality of sub-models is established according to the categories of the qualified variables.
The above reference value can be understood as the mean, median, or similar statistic of the variable in its field, used as a reference standard.
Optionally, the collecting module 42 is further configured to sort the first variables in the plurality of second object data; calculating a target value of each of the ordered first variables, wherein the target value is used for indicating a difference value between a first proportion and a second proportion in a plurality of second variables which are ordered before the first variables, wherein the first proportion is used for indicating a ratio of a first number of the second variables with object labels being similar to default objects in the plurality of second variables to a second number of the first variables with object labels being similar to default objects in the plurality of second object data, and the second proportion is used for indicating a ratio of a third number of the second variables with object labels being normal objects in the plurality of second variables to a fourth number of the first variables with object labels being normal objects in the plurality of second object data; determining a target first variable corresponding to a target value with the largest absolute value among a plurality of target values as a tangent point in the sequenced first variable, and dividing the sequenced first variable into two sub-boxes according to the tangent point; and continuously carrying out variable box division on the obtained two boxes to obtain a plurality of boxes with preset quantity.
Because the dimensions of the variables are inconsistent, the variables usually need to be standardized; modeling with standardized variable values gives a better fit in multivariate regression after conversion, the model converges quickly, and model efficiency and accuracy improve. Variable standardization and conversion in the application mainly adopt binning and the WOE (Weight of Evidence) method. Variable binning adopts the optimal-KS binning method, with the following steps: sort the variables from small to large by value; calculate the value at which the absolute difference between the cumulative bad-client proportion (namely the first proportion) and the cumulative good-client proportion (namely the second proportion) is largest, i.e., the variable value corresponding to the KS value (namely the target value); take it as the cut point, mark it as D, and split the data into two parts. Repeat these steps recursively on the two resulting bins, further cutting the data on either side of D, until the variable bins reach the preset conditions.
It should be noted that the preset conditions may include a preset number of bins, and may further include: the proportion of samples in each bin must be greater than a first threshold, the number of samples in each bin must be not less than a second threshold, and so on. The preset conditions may require only one of these to be satisfied, or all of them; the application does not limit this.
Optionally, the collecting module 42 is further configured to calculate the evidence weight value of each bin according to the formula WOEi = ln((bi / B) / (gi / G)), where WOEi is the evidence weight value, bi is the number of first variables in the i-th bin whose object label is a class default object, gi is the number of first variables in the i-th bin whose object label is a normal object, B is the second number, and G is the fourth number; and to screen the first variables according to the evidence weight values of each first variable in the plurality of second object data to obtain a plurality of qualified variables.
The WOE value of each bin of a variable is calculated as WOEi = ln((bi / B) / (gi / G)), where bi is the number of first variables in the i-th bin whose object label is a class default object, gi is the number of first variables in the i-th bin whose object label is a normal object, B is the total number of first variables whose object label is a class default object (namely the second number), and G is the total number of first variables whose object label is a normal object (namely the fourth number). From this formula: if WOE equals 0, the bad proportion of the bin matches the overall sample; if WOE is less than 0, the bad proportion of the bin is below the overall sample level and the bin's clients are better qualified; if WOE is greater than 0, the bad proportion of the bin is above the overall sample level and the bin's clients are worse qualified.
On this basis, validity analysis and business rationality adjustment continue. Combining the business meaning of each variable and considering the direction of the relationship between independent and dependent variables, variables whose results contradict their business meaning are eliminated; only variables whose binned WOE values exhibit a monotonic trend are retained. That is, each variable carries an actual business meaning, and since the variable is sorted before binning, the WOE values across bins should be monotonic; if they are not, the data is problematic. The multiple qualified variables obtained in this way allow the sub-models to be established according to their categories.
Optionally, the establishing module 44 is further configured to divide the plurality of qualified variables into a plurality of categories according to a preset classification rule, where the plurality of categories are in one-to-one correspondence with the plurality of submodels; selecting N qualified variables from M qualified variables of a target class to establish a corresponding target sub-model, and determining the N qualified variables as the modeling variables of the target sub-model, wherein M and N are positive integers, and M is greater than or equal to N; determining model output results of the plurality of sub-models, and performing linear processing on the plurality of model output results to obtain a plurality of processed model output results; and establishing the fusion model according to the output results of the plurality of processed models.
According to the actual meanings of the different variables, the qualified variables are divided into four types of data: enterprise financial report data, local government finance data, bond market data, and in-bank data. These correspond to five different sub-models: an enterprise finance sub-model, an enterprise operation sub-model, a local government sub-model, a bond market sub-model, and a bank-enterprise relationship sub-model. The five sub-models are trained with the classified qualified variables. As shown in table 1, the initial basic variables corresponding to the sub-models number 53, 24, 57, 28, and 20 respectively; after variable derivation by the variable derivation rule, the candidate variables number 179, 38, 151, 106, and 98 respectively; after binning and screening, 26, 13, 47, 20, and 84 qualified variables are obtained. Taking the enterprise finance sub-model as an example, not every qualified modeling variable produces a gain for model training, so variables that negatively affect training can be removed one by one during the training process; when training finally finishes, the number of variables adopted by the trained model is determined to be 7, i.e., the model has 7 modeling variables.
It should be noted that the multiple sub-models and the fusion model are both logistic regression models.
Because the training process of the logistic regression model passes through a sigmoid function, the output results (p1-p5) range between 0 and 1; therefore, the probabilities (p1-p5) are restored to linear form (Y1-Y5) according to the logistic regression formula before the next step:
ln(p/(1-p)) = α + β1x1 + β2x2 + … + βnxn = Y
The processed 5 sub-model output results (Y1-Y5) are then substituted into a new fusion model as variables to obtain 5 new coefficients β, which determine the weight of each sub-model and yield the final fusion model result p_final.
It should be noted that the model algorithm in the present application mainly adopts logistic regression, a generalized linear regression analysis model commonly used in fields such as data mining, automatic disease diagnosis, and economic prediction. Logistic regression estimates the probability of an event occurring from a given set of independent-variable data, with the dependent variable ranging between 0 and 1. Logistic regression is also commonly used to build credit scorecards.
The multivariate logistic regression formula is as follows:
p = 1 / (1 + e^(-(α + β1x1 + β2x2 + … + βnxn)));
It is equivalent to:
ln(p/(1-p)) = α + β1x1 + β2x2 + … + βnxn;
p is the dependent variable, whose observed value at modeling time is 0 or 1 (i.e., y takes the value 0 or 1); the predicted value of p lies between 0 and 1 and represents the probability that the sample is a bad customer. In the present application, the independent variable xi can be substituted with the variable's WOE value after binning.
Optionally, the establishing module 44 is further configured to establish a scoring model of the plurality of objects; and inputting the default probability into the scoring model for processing to obtain the credit score of the target object.
The basic output of each sub-model and of the final fusion model is a default probability, which can be converted into a client credit score so that the client's rating can be displayed more intuitively. The score calibration process requires 3 parameters: the standard score (P0), the standard odds (Odds0), and the PDO.
Standard score (P0): the model score corresponding to the standard odds (i.e., the model score corresponding to Odds0);
Standard odds (Odds0): the odds corresponding to the standard score;
PDO (Points to Double the Odds): the number of points by which the score decreases when the odds double.
The scorecard score can be defined by a linear expression that represents the score as a linear function of the log-odds, as follows:
Score=A-Bln(Odds);
The score corresponding to Odds0 is set to P0; then the score corresponding to odds of 2·Odds0 is P0 - PDO. Substituting into the formula above solves for A and B:
A=P0+Bln(Odds0);
B=PDO/ln(2);
The final scorecard model score is:
Score = A - B(α + β1x1 + β2x2 + … + βnxn);
The base score of the final model is:
P0 + PDO/ln(2)·(ln(Odds0) - α);
The variable bin score of the model is:
-PDO/ln(2)·βi·WOEij·δij;
where WOEij is the WOE value of the j-th bin of the i-th variable, and δij indicates whether the sample falls into that bin.
Because the bad proportion of the sampled clients differs from the bad proportion of the overall sample, the constant term also needs to be adjusted accordingly, and the adjusted result is rounded off.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
S1, collecting object data of a plurality of objects, wherein each object data comprises a plurality of variables;
S2, establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object;
S3, inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.
In one exemplary embodiment, the computer-readable storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations; they are not repeated here.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
S1, collecting object data of a plurality of objects, wherein each object data comprises a plurality of variables;
S2, establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object;
S3, inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.
In an exemplary embodiment, the electronic apparatus may further include a transmission device connected to the processor, and an input/output device connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and exemplary implementations; they are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented by a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of computing devices, and may be implemented in program code executable by computing devices, so that they may be stored in a storage device and executed by the computing devices. In some cases, the steps shown or described may be performed in an order different from that described herein; alternatively, they may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description covers only the preferred embodiments of the present application and is not intended to limit it; those skilled in the art may make various modifications and variations to the present application. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application shall be included in its protection scope.

Claims (10)

1. A method for determining a probability of breach, comprising:
Collecting object data of a plurality of objects, wherein each object data comprises a plurality of variables; establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object;
And inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.
2. The method of determining a probability of breach according to claim 1, wherein prior to collecting object data for a plurality of objects, the method further comprises:
adding object labels to the plurality of objects according to class violation rules, wherein the object labels comprise: a class violation object, a normal object, wherein the class violation object is an object that meets the class violation rules.
3. The method of determining a probability of breach according to claim 2, wherein after collecting object data of a plurality of objects, the method further comprises:
performing data cleaning on the plurality of object data according to a preset cleaning rule to obtain a plurality of first object data, wherein the preset cleaning rule is used for cleaning abnormal variables in the plurality of variables;
performing variable derivation on the plurality of first object data according to a variable derivation rule to obtain a plurality of second object data, wherein the number of variables contained in the first object data is smaller than that of variables contained in the second object data;
And carrying out variable binning on the plurality of second object data, screening out a plurality of qualified variables contained in the plurality of second object data according to a binning result, and establishing a plurality of sub-models according to the categories of the plurality of qualified variables.
4. The method of determining the probability of breach according to claim 3, wherein variable binning the plurality of second object data comprises:
sorting the first variables in the plurality of second object data;
Calculating a target value of each of the sorted first variables, wherein the target value indicates the difference between a first proportion and a second proportion over the plurality of second variables ordered before the first variable, wherein the first proportion indicates the ratio of a first number of second variables whose object labels are class violation objects among the plurality of second variables to a second number of first variables whose object labels are class violation objects in the plurality of second object data, and the second proportion indicates the ratio of a third number of second variables whose object labels are normal objects among the plurality of second variables to a fourth number of first variables whose object labels are normal objects in the plurality of second object data;
determining a target first variable corresponding to a target value with the largest absolute value among a plurality of target values as a tangent point in the sequenced first variable, and dividing the sequenced first variable into two sub-boxes according to the tangent point;
And continuously carrying out variable box division on the obtained two boxes to obtain a plurality of boxes with preset quantity.
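The cut-point search in claim 4 behaves like a best-KS split: sort the values, accumulate bad and good shares, and cut where their gap is largest. A minimal sketch, assuming labels are 1 for a class-violation object and 0 for a normal object (function name and data are illustrative):

```python
def best_ks_cut(values, labels):
    """Sort samples by value, then find the position where the gap between
    the cumulative bad share and cumulative good share (the KS statistic)
    is largest; that value serves as the cut point for the first split."""
    pairs = sorted(zip(values, labels))
    total_bad = sum(label for _, label in pairs) or 1
    total_good = (len(pairs) - sum(label for _, label in pairs)) or 1
    cum_bad = cum_good = 0
    best_idx, best_gap = 0, -1.0
    for i, (_, label) in enumerate(pairs):
        cum_bad += label
        cum_good += 1 - label
        gap = abs(cum_bad / total_bad - cum_good / total_good)
        if gap > best_gap:
            best_gap, best_idx = gap, i
    return pairs[best_idx][0], best_gap
```

Applying the same search recursively to each resulting bin yields the preset number of bins.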
5. The method of determining the probability of breach according to claim 4, wherein screening out a plurality of qualified variables included in the plurality of second object data according to the binning result comprises:
Calculating an evidence weight value for each bin of the plurality of bins by:
WOE i=ln((B i/B)/(G i/G));
wherein WOE i is the evidence weight value of the ith bin, B i is the number of first variables whose object label is the class violation object in the ith bin, G i is the number of first variables whose object label is the normal object in the ith bin, B is the second number, and G is the fourth number;
and screening the first variables according to the evidence weight values of each first variable in the second object data to obtain qualified variables.
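The evidence weight described in claim 5 is the standard WOE statistic: the log ratio of a bin's share of bad customers to its share of good customers. A minimal sketch (function name and counts are illustrative):

```python
import math

def woe(bad_in_bin, good_in_bin, total_bad, total_good):
    """WOE_i = ln((B_i / B) / (G_i / G)): positive when the bin holds a
    larger share of bad customers than of good customers, negative otherwise."""
    return math.log((bad_in_bin / total_bad) / (good_in_bin / total_good))
```

A bin whose bad and good shares are equal has WOE 0, so variables whose bins all have near-zero WOE carry little discriminating power and can be screened out.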
6. The method of determining the probability of breach according to claim 5, wherein building a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model, and building a fusion model according to the plurality of sub-models, comprises:
dividing the qualified variables into a plurality of categories according to a preset classification rule, wherein the categories are in one-to-one correspondence with the sub-models;
Selecting N qualified variables from M qualified variables of a target class to establish a corresponding target sub-model, and determining the N qualified variables as the modeling variables of the target sub-model, wherein M and N are positive integers, and M is greater than or equal to N;
determining model output results of the plurality of sub-models, and performing linear processing on the plurality of model output results to obtain a plurality of processed model output results;
and establishing the fusion model according to the output results of the plurality of processed models.
7. The method of determining a probability of breach according to claim 1, wherein after obtaining the probability of breach of the target object, the method further comprises:
Establishing a scoring model of the plurality of objects;
And inputting the default probability into the scoring model for processing to obtain the credit score of the target object.
8. A device for determining a probability of breach, comprising:
The system comprises an acquisition module, a storage module and a control module, wherein the acquisition module is used for acquiring object data of a plurality of objects, and each object data comprises a plurality of variables;
The establishing module is used for establishing a plurality of sub-models according to the categories of the plurality of variables, determining the modeling variable of each sub-model and establishing a fusion model according to the plurality of sub-models, wherein the fusion model is used for evaluating the default probability of a target object;
And the processing module is used for inputting target object data of the target object into the plurality of sub-models for processing, and inputting the obtained plurality of processing results into the fusion model for processing to obtain the default probability of the target object, wherein the target object data comprises the modeling variables of the plurality of sub-models.
9. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run performs the method of any of the preceding claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202311790509.6A 2023-12-22 2023-12-22 Method and device for determining default probability, storage medium and electronic device Pending CN118071486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311790509.6A CN118071486A (en) 2023-12-22 2023-12-22 Method and device for determining default probability, storage medium and electronic device


Publications (1)

Publication Number Publication Date
CN118071486A 2024-05-24

Family

ID=91110463


Country Status (1)

Country Link
CN (1) CN118071486A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination