CN117077028A - Method, device, computer equipment and storage medium for constructing risk prediction model - Google Patents
- Publication number
- CN117077028A (application CN202311075823.6A)
- Authority
- CN
- China
- Prior art keywords
- sample
- space
- risk prediction
- subspace
- variable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/2431—Multiple classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
Abstract
The application relates to the technical field of big data analysis, and in particular to a method and a device for constructing a risk prediction model, computer equipment, a storage medium and a computer program product. The method comprises the following steps: obtaining sample data corresponding to each of a plurality of variables related to risk prediction, and binning each set of sample data to obtain a plurality of bins corresponding to each variable; screening, from the plurality of variables, target variables whose risk prediction capability meets the standard; taking, for each target variable, the bin with the highest abnormal proportion of sample objects among the bins corresponding to that variable as the target bin of the variable; determining a sample space containing each target bin, dividing the sample space into a plurality of subspaces, and determining a risk prediction function for the sample objects in each subspace; and constructing a risk prediction model based on the risk prediction functions corresponding to the subspaces. With this method, a risk prediction model can be constructed that is closely combined with the business of a financial institution and predicts risk accurately.
Description
Technical Field
The present application relates to the field of big data analysis technology, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for constructing a risk prediction model.
Background
For financial institutions, accurately predicting the risk state of users (such as the probability of abnormal behavior) is of great significance for managing and supporting business scientifically and reasonably. In the conventional technology, a financial institution directly mines rules related to risk prediction from historical data in order to predict the risk state of a user.
However, in general only a small number of users exhibit abnormal behavior, so the historical data of a financial institution contains little data on users with abnormal behavior and a great deal of data on users without it. If rules related to risk prediction are mined directly from this historical data, the resulting rules tend to be biased toward non-risk events and cannot predict risk effectively. In addition, the risk prediction rules of a financial institution differ from general rules: they must be closely combined with the business of the financial institution so as to avoid risk false alarms, which would reduce the efficiency of business approval.
Therefore, the risk prediction rules mined with the conventional technology can neither be closely combined with financial institution business nor predict risk accurately.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a risk prediction model construction method, apparatus, computer device, computer-readable storage medium, and computer program product that can construct a risk prediction model that accurately predicts risks in close association with financial institution business.
In a first aspect, the present application provides a method for constructing a risk prediction model. The method comprises the following steps:
obtaining sample data corresponding to each of a plurality of variables related to risk prediction, and binning each set of sample data to obtain a plurality of bins corresponding to each variable, the sample data belonging to different sample objects;
screening, from the plurality of variables, target variables whose risk prediction capability meets the standard;
for each target variable, taking the bin with the highest abnormal proportion of sample objects among the plurality of bins corresponding to the target variable as the target bin of the target variable;
determining a sample space containing each target bin, dividing the sample space into a plurality of subspaces, and determining a risk prediction function of the sample objects in each subspace;
constructing a risk prediction model based on the risk prediction functions corresponding to the subspaces, the risk prediction model being used to determine the target subspace to which an object to be predicted belongs and to predict the risk state of the object to be predicted based on the risk prediction function of the target subspace.
In a second aspect, the present application provides a device for constructing a risk prediction model, where the device includes:
a binning module, configured to obtain sample data corresponding to each of a plurality of variables related to risk prediction, and to bin each set of sample data to obtain a plurality of bins corresponding to each variable, the sample data belonging to different sample objects;
a variable screening module, configured to screen, from the plurality of variables, target variables whose risk prediction capability meets the standard;
a bin determining module, configured to take, for each target variable, the bin with the highest abnormal proportion of sample objects among the plurality of bins corresponding to the target variable as the target bin of the target variable;
a space processing module, configured to determine a sample space containing each target bin, divide the sample space into a plurality of subspaces, and determine a risk prediction function of the sample objects in each subspace;
a model construction module, configured to construct a risk prediction model based on the risk prediction functions corresponding to the subspaces, the risk prediction model being used to determine the target subspace to which an object to be predicted belongs and to predict the risk state of the object to be predicted based on the risk prediction function of the target subspace.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the method for constructing the risk prediction model when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of constructing a risk prediction model described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the method of constructing a risk prediction model described above.
In the above method, device, computer equipment, storage medium and computer program product for constructing a risk prediction model, sample data corresponding to each of a plurality of variables related to risk prediction is first acquired; that is, the sample data is acquired in close combination with the business of the financial institution. Each set of sample data is then binned to obtain a plurality of bins corresponding to each variable, which improves the business interpretability of the sample data. Target variables whose risk prediction capability meets the standard are then screened from the plurality of variables, so that variables with poor risk prediction capability do not degrade the prediction effect of the risk prediction model. For each target variable, the bin with the highest abnormal proportion of sample objects among its bins is taken as the target bin, and a sample space containing each target bin is determined; that is, the sample space is built from the high-risk sample data of variables whose risk prediction capability meets the standard. This avoids generating prediction rules biased toward non-risk events when the risk prediction model is built on that sample space, and facilitates effective risk prediction. Further, the sample space is divided into a plurality of subspaces, a risk prediction function of the sample objects in each subspace is determined, and a risk prediction model is constructed based on the risk prediction functions corresponding to the subspaces, so that the resulting model is closely combined with the business of the financial institution and predicts risk accurately.
Drawings
FIG. 1 is an application environment diagram of a method of constructing a risk prediction model in one embodiment;
FIG. 2 is a flow chart of a method of constructing a risk prediction model according to one embodiment;
FIG. 3 is a flow diagram of a binning process in one embodiment;
FIG. 4 is a flow chart of a method of screening target variables in one embodiment;
FIG. 5 is a flow diagram of a method of spatial partitioning in one embodiment;
FIG. 6 is a schematic diagram of a plurality of subspaces in one embodiment;
FIG. 7 is a flow chart of a spatial merging method according to an embodiment;
FIG. 8 is a flow chart of a method of constructing a survival analysis function in one embodiment;
FIG. 9 is a flow chart of a method of constructing a risk prediction function according to one embodiment;
FIG. 10 is a flow chart of a method of constructing a risk prediction model in another embodiment;
FIG. 11 is a block diagram of an apparatus for constructing a risk prediction model in one embodiment;
FIG. 12 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that, the user (sample object) information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by each end user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.
The method for constructing the risk prediction model provided by the embodiments of the application can be applied to the application environment shown in FIG. 1, in which the terminals 102 of a plurality of branches of a financial institution each communicate with a server 104 of the financial institution via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on a cloud or other network server. The server 104 may communicate with the terminals 102 of the branches to obtain sample data corresponding to each of a plurality of variables related to risk prediction, so that the volume of sample data obtained is sufficiently large; the sample data belongs to different sample objects. The server 104 may then bin each set of sample data to obtain a plurality of bins corresponding to each variable, and screen, from the plurality of variables, target variables whose risk prediction capability meets the standard. For each target variable, the server 104 may take the bin with the highest abnormal proportion of sample objects among the bins corresponding to that variable as its target bin, and then determine a sample space containing each target bin. Further, the server 104 may divide the sample space into a plurality of subspaces, determine a risk prediction function of the sample objects in each subspace, and construct a risk prediction model based on the risk prediction functions corresponding to the subspaces. The server 104 may subsequently use the risk prediction model to determine the target subspace to which an object to be predicted belongs and predict the risk state of that object based on the risk prediction function of the target subspace.
The financial institution refers to an institution engaged in the financial service industry. The terminal 102 may be, but is not limited to, various personal computers, notebook computers, smartphones, tablet computers, internet of things devices, etc. The terminal 102 of each branch of the financial institution stores sample data of a plurality of sample objects obtained by that branch when handling historical business. The server 104 of the financial institution may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a method for constructing a risk prediction model is provided. The method is described as applied to the server in FIG. 1 and includes the following steps:
Step 202, obtaining sample data corresponding to each of a plurality of variables related to risk prediction, and binning each set of sample data to obtain a plurality of bins corresponding to each variable; the sample data belongs to different sample objects.
A sample object is specifically a user who has registered with the financial institution and transacted business there. The binning may specifically be feature binning, whose core idea is to divide data into a plurality of bins according to a certain rule, so that the distribution of the data can be better understood during data mining and analysis and the business interpretability of the data is improved.
Taking as an example a financial institution that is qualified for resource storage and resource borrowing, the variables related to risk prediction include, but are not limited to, continuous variables and discrete variables associated with risk prediction. The continuous variables may specifically be: the amount of resources held by the sample object, the amount of resources the sample object has borrowed from the financial institution, the amount of resources the sample object has returned, and so on. The discrete variables may specifically be: the number of resource storage/borrowing accounts the sample object has opened in the financial institution, the number of times the sample object has borrowed resources from the financial institution, the resource return condition (whether the resource was returned on time) for each borrowing by the sample object, registration information (such as gender) of the sample object, and the like.
Optionally, the server may determine a plurality of variables related to risk prediction, collect the sample data corresponding to each of those variables by communicating with the terminals of the branches of the financial institution, and then bin each set of sample data according to a set binning rule to obtain a plurality of bins corresponding to each variable. For each variable, each bin stores sample data that relates to the variable and belongs to different sample objects.
For example, the server may mine association rules between abnormal behaviors and variables based on association rule mining algorithms according to historical data stored in a financial institution, such as sample data corresponding to sample objects where abnormal behaviors occur, thereby determining a plurality of variables having high association with the abnormal behaviors, and regarding the determined variables as variables related to risk prediction.
The server may also acquire and analyze various documents and materials for predicting occurrence probability of abnormal behavior of the user through a network, determine a plurality of variables related to risk prediction, or acquire a plurality of variables set by a worker based on an empirical method in response to an operation instruction of the worker in a financial institution, thereby taking the variables set by the worker as variables related to risk prediction.
Step 204, screening, from the plurality of variables, target variables whose risk prediction capability meets the standard.
A target variable whose risk prediction capability meets the standard is one for which the value and change trend of the corresponding sample data have a large influence on the probability of abnormal behavior of the sample object; such a variable has risk prediction value and is practical for the risk prediction business of the financial institution.
Optionally, the server may, in close combination with the business of the financial institution, evaluate the risk prediction capability of each of the variables according to a set evaluation rule and screen out the target variables whose risk prediction capability meets the standard.
Step 206, for each target variable, taking the bin with the highest abnormal proportion of sample objects among the bins corresponding to the target variable as the target bin of the target variable.
Wherein, for each variable, each bin has an abnormal proportion of the respective corresponding sample object.
Optionally, for each target variable, the server may determine, according to a set abnormal proportion calculation method, an abnormal proportion of a sample object in each bin corresponding to the target variable, and use a bin with a highest abnormal proportion of a sample object in a plurality of bins corresponding to the target variable as a target bin of the target variable, so as to determine target bins corresponding to the target variables.
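Step 206 can be illustrated with a minimal sketch. The "set abnormal proportion calculation method" is not specified at this point, so the sketch assumes the definition the description gives for the bin value: the share of all abnormal sample objects that fall in the bin. The function name `target_bin`, the bin labels and the counts are hypothetical.

```python
def target_bin(bad_counts: dict[str, int]) -> str:
    """Return the label of the bin with the highest abnormal proportion.

    Assumption: abnormal proportion of a bin = abnormal sample objects in
    the bin / abnormal sample objects across all bins of the variable.
    """
    total_bad = sum(bad_counts.values())
    if total_bad == 0:
        raise ValueError("no abnormal sample objects for this variable")
    proportions = {label: count / total_bad for label, count in bad_counts.items()}
    # Ties resolve to the first bin in insertion order.
    return max(proportions, key=proportions.get)


# Hypothetical bins of a "borrowed amount" variable with abnormal counts.
borrowed_amount_bad = {"[0, 1000)": 12, "[1000, 5000)": 30, "[5000, inf)": 8}
chosen = target_bin(borrowed_amount_bad)
```

Under this assumption every bin shares the same denominator, so the target bin is simply the one holding the most abnormal sample objects; a different proportion definition would change the result.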
Step 208, determining a sample space containing each target bin, dividing the sample space into a plurality of subspaces, and determining a risk prediction function of the sample objects in each subspace.
The probabilities of abnormal behavior of sample objects in the same subspace are relatively close. The risk prediction function can be used to predict the probability that a sample object exhibits abnormal behavior and the change trend of the sample object's risk state.
Optionally, the server may retain only the target bins corresponding to the target variables and construct the sample space from those target bins, which reduces the dimensionality of the data and improves the efficiency of the subsequent iterative division and merging of the sample space.
Optionally, the server may cyclically perform space division and space merging on the sample space according to a set space division rule and a set space merging rule until a loop stop condition is reached, obtaining the plurality of subspaces. The server may then determine the risk prediction function of the sample objects in each subspace based on a set rule for fitting risk prediction functions. Retaining only the target bin with the highest bin value and constructing the sample space from it avoids restricting the distribution of the sample data in the sample space.
Step 210, constructing a risk prediction model based on the risk prediction functions corresponding to the subspaces; the risk prediction model is used for determining a target subspace of the object to be predicted, and predicting the risk state of the object to be predicted based on a risk prediction function of the target subspace.
Optionally, the server may construct the risk prediction model based on each subspace and the risk prediction function corresponding to that subspace.
For example, if the risk state of the object to be predicted needs to be predicted, the server may determine a target subspace to which the object to be predicted belongs based on data related to the object to be predicted, and predict the probability that the abnormal behavior of the object to be predicted occurs currently based on a risk prediction function of the target subspace.
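The lookup described in step 210 can be sketched as follows. The description does not specify how a subspace or its risk prediction function is represented, so this sketch assumes each subspace is a set of half-open intervals over variable values paired with a callable; the names `Subspace` and `predict_risk`, the variable `borrowed_amount` and the constant probabilities are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass
class Subspace:
    # Assumption: a subspace is an axis-aligned region, variable -> [low, high).
    bounds: Mapping[str, tuple[float, float]]
    # Assumption: the risk prediction function maps an object's data to a probability.
    risk_function: Callable[[Mapping[str, float]], float]

    def contains(self, obj: Mapping[str, float]) -> bool:
        return all(low <= obj[name] < high for name, (low, high) in self.bounds.items())


def predict_risk(model: list[Subspace], obj: Mapping[str, float]) -> float:
    """Find the target subspace of the object and apply its risk function."""
    for subspace in model:
        if subspace.contains(obj):
            return subspace.risk_function(obj)
    raise LookupError("object does not fall in any subspace")


# Hypothetical two-subspace model with constant risk functions.
model = [
    Subspace({"borrowed_amount": (0.0, 1000.0)}, lambda obj: 0.02),
    Subspace({"borrowed_amount": (1000.0, float("inf"))}, lambda obj: 0.15),
]
risk = predict_risk(model, {"borrowed_amount": 2500.0})
```

Keeping the model as a list of (region, function) pairs mirrors the wording of step 210: prediction is a region lookup followed by a function call.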
In the above method for constructing a risk prediction model, sample data corresponding to each of a plurality of variables related to risk prediction is first acquired; that is, the sample data is acquired in close combination with the business of the financial institution. Each set of sample data is then binned to obtain a plurality of bins corresponding to each variable, which improves the business interpretability of the sample data. Target variables whose risk prediction capability meets the standard are then screened from the plurality of variables, so that variables with poor risk prediction capability do not degrade the prediction effect of the risk prediction model, which reduces risk false alarms. For each target variable, the bin with the highest abnormal proportion of sample objects among its bins is taken as the target bin, and a sample space containing each target bin is determined; that is, the sample space is built from the high-risk sample data of variables whose risk prediction capability meets the standard. This avoids generating prediction rules biased toward non-risk events when the risk prediction model is built on that sample space, and facilitates effective risk prediction. Further, the sample space is divided into a plurality of subspaces, a risk prediction function of the sample objects in each subspace is determined, and a risk prediction model is constructed based on the risk prediction functions corresponding to the subspaces, so that the resulting model is closely combined with the business of the financial institution and predicts risk accurately.
In one embodiment, as shown in FIG. 3, binning each set of sample data to obtain a plurality of bins corresponding to each variable includes:
Step 302, for each variable related to risk prediction, performing preliminary binning on the sample data corresponding to the variable to obtain a plurality of initial bins corresponding to the variable.
Optionally, for each variable related to risk prediction, the server may randomly perform preliminary binning on the sample data corresponding to the variable to obtain a plurality of initial bins corresponding to the variable. Each initial bin stores sample data that relates to the variable and comes from different sample objects.
For example, for a continuous variable, the server may bin the variable by discretizing it. For a discrete variable, the server may convert the variable into a numerical feature and thereby bin it.
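The two cases above can be sketched in a few lines. The description does not fix a discretization scheme, so equal-width intervals are used here purely as an assumption, and discrete values are mapped to integer codes in order of first appearance; the function names and sample values are hypothetical.

```python
from typing import Hashable, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> list[int]:
    """Assign each value to one of n_bins equal-width bins (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land at index n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def encode_categories(values: Sequence[Hashable]) -> list[int]:
    """Map each distinct category to an integer code in order of first appearance."""
    codes: dict[Hashable, int] = {}
    return [codes.setdefault(v, len(codes)) for v in values]


# Hypothetical sample data: held resource amounts and on-time return flags.
balances = [0.0, 250.0, 500.0, 750.0, 1000.0]
on_time = ["yes", "no", "yes", "yes", "no"]
balance_bins = equal_width_bins(balances, n_bins=4)
on_time_codes = encode_categories(on_time)
```

These initial bins are only a starting point; the later steps compute a bin value for each bin and adjust the bin boundaries.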
Step 304, for each initial bin, determining a bin value of the initial bin based on the abnormal proportion and the non-abnormal proportion of the sample objects to which the sample data in the initial bin belongs.
The bin value is specifically the WOE (Weight of Evidence) value of the bin. Each bin has its own abnormal proportion and non-abnormal proportion. For each bin, the abnormal proportion is the proportion of sample objects with abnormal behavior in the bin to all sample objects with abnormal behavior, and the non-abnormal proportion is the proportion of sample objects without abnormal behavior in the bin to all sample objects without abnormal behavior. That is, for each bin, the abnormal proportion is (number of sample objects with abnormal behavior in the bin) / (number of sample objects with abnormal behavior among all sample objects), and the non-abnormal proportion is (number of sample objects without abnormal behavior in the bin) / (number of sample objects without abnormal behavior among all sample objects).
Optionally, for each initial bin, the server may determine the abnormal proportion of the initial bin from the number of sample objects with abnormal behavior among the sample objects to which the sample data in the initial bin belongs and the number of sample objects with abnormal behavior among all sample objects. The server may likewise determine the non-abnormal proportion of the initial bin from the number of sample objects without abnormal behavior among the sample objects to which the sample data in the initial bin belongs and the number of sample objects without abnormal behavior among all sample objects. Further, the server may determine the bin value of the initial bin based on its abnormal proportion and non-abnormal proportion.
For example, assume that, among the sample objects corresponding to the $i$-th bin of a variable, the number with abnormal behavior is $\mathrm{Bad}_i$ and the number without abnormal behavior is $\mathrm{Good}_i$, and that, among all sample objects, the number with abnormal behavior is $\mathrm{Bad}_T$ and the number without abnormal behavior is $\mathrm{Good}_T$. The abnormal proportion of the $i$-th bin is then $\mathrm{Bad}_i/\mathrm{Bad}_T$ and the non-abnormal proportion is $\mathrm{Good}_i/\mathrm{Good}_T$. The server may determine the bin value of the $i$-th bin by formula (1):

$$\mathrm{WOE}_i = \ln\left(\frac{\mathrm{Bad}_i/\mathrm{Bad}_T}{\mathrm{Good}_i/\mathrm{Good}_T}\right) \tag{1}$$

where $\mathrm{WOE}_i$ is the bin value of the $i$-th bin of the variable.
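As an illustration only, the bin value computation can be sketched in Python. The sketch assumes that the bin value is the natural logarithm of the abnormal proportion divided by the non-abnormal proportion, which agrees with the statement that a larger WOE corresponds to a larger abnormal proportion. The function name `woe_per_bin` and the sample counts are hypothetical.

```python
import math

def woe_per_bin(bad_counts, good_counts):
    """Return one WOE value per bin from per-bin abnormal / non-abnormal counts."""
    bad_total = sum(bad_counts)    # Bad_T
    good_total = sum(good_counts)  # Good_T
    values = []
    for bad_i, good_i in zip(bad_counts, good_counts):
        abnormal_share = bad_i / bad_total    # Bad_i / Bad_T
        normal_share = good_i / good_total    # Good_i / Good_T
        # Assumption: both shares are positive; empty cells would need smoothing.
        values.append(math.log(abnormal_share / normal_share))
    return values

# Hypothetical counts for three bins of one variable.
example = woe_per_bin([10, 30, 60], [50, 30, 20])
```

In this toy input the middle bin holds equal shares of both groups, so its WOE is zero, while the last bin has the largest WOE because it concentrates the abnormal sample objects.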
Step 306, adjusting the plurality of initial bins, with the goal of making the bin value of each bin positively correlated with the abnormal proportion of the sample objects to which the sample data in the bin belongs, to obtain the plurality of bins corresponding to the variable.
Optionally, for a variable that has no non-monotonic business interpretation, the server may adjust the plurality of initial bins with the goal of making the bin value of each bin positively correlated with the abnormal proportion of the sample objects to which the sample data in the bin belongs, until the change in the bins corresponding to the variable during adjustment becomes sufficiently small, thereby obtaining the adjusted bins corresponding to the variable.
For example, for variables that have no non-monotonic business interpretation, the server may also check, based on a pre-constructed data set, the monotonicity of the WOE values across the bins of each variable, to ensure that the monotonicity of the bins is not reversed.
For example, for the small number of variables that have a clear non-monotonic business interpretation, the server does not bin the variable with the goal of making the bin value positively correlated with the abnormal proportion of the sample objects to which the sample data in the bin belongs. A variable with a clear non-monotonic business interpretation may specifically be the number of resource storage/borrowing accounts that a sample object has opened at financial institutions. Here, a clear non-monotonic business interpretation means that the probability of abnormal behavior of the sample object does not simply increase as the number of resource storage/borrowing accounts increases, but first increases and then decreases.
In this embodiment, the sample data corresponding to each variable is binned with the goal of making the bin value of each bin positively correlated with the abnormal proportion of the sample objects to which the sample data in the bin belongs. This ensures that the WOE values of the bins of each variable are monotonic and that a larger WOE value corresponds to a larger proportion of sample objects with abnormal behavior in the bin, so that, by the monotonicity of the WOE values, the bin with the highest WOE value can be quickly identified as the bin with the highest abnormal proportion of sample objects.
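One possible way to adjust the initial bins is sketched below. It is an assumption, not the adjustment procedure of this embodiment: adjacent bins are merged greedily until the WOE values are non-decreasing along the ordered bins. The helper names `woe` and `merge_until_monotonic` and the counts are hypothetical, and WOE is again assumed to be the logarithm of the abnormal proportion over the non-abnormal proportion.

```python
import math

def woe(bad, good, bad_total, good_total):
    # Assumed form: ln((Bad_i / Bad_T) / (Good_i / Good_T)).
    return math.log((bad / bad_total) / (good / good_total))

def merge_until_monotonic(bins):
    """bins: list of (bad_count, good_count), ordered by the variable's value.

    Greedily merges the first adjacent pair that breaks a non-decreasing
    WOE order, and repeats until no such pair remains.
    """
    bins = [tuple(b) for b in bins]
    bad_total = sum(b for b, _ in bins)
    good_total = sum(g for _, g in bins)
    while len(bins) > 1:
        values = [woe(b, g, bad_total, good_total) for b, g in bins]
        violation = next(
            (i for i in range(len(values) - 1) if values[i] > values[i + 1]),
            None,
        )
        if violation is None:
            break  # WOE is already monotonic
        i = violation
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins

# Hypothetical initial bins: the third bin breaks the increasing trend.
adjusted = merge_until_monotonic([(5, 50), (20, 30), (10, 40), (40, 10)])
```

For a variable whose risk falls as its value rises, the same sketch could be applied to the bins in reverse order.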
In one embodiment, as shown in FIG. 4, screening, from the plurality of variables, target variables whose risk prediction capability meets the standard includes:
Step 402, for each variable, evaluating the risk prediction capability of the variable based on the respective bin values of the bins corresponding to the variable, to obtain a prediction capability evaluation value of the variable.
The prediction capability evaluation value of a variable is specifically the IV (Information Value) of the variable. The IV is a weighted sum of the WOE bin values of the bins corresponding to the variable; it is an index closely related to WOE and is mainly used to evaluate the predictive capability of a variable so that variables can be screened quickly.
Optionally, for each variable, the server may evaluate the risk prediction capability of the variable based on a weighted sum of the bin values of the bins corresponding to the variable.
For example, the server may determine the prediction capability evaluation value IV of a variable by formula (2):

$$\mathrm{IV} = \sum_{i=1}^{n}\left(\frac{\mathrm{Bad}_i}{\mathrm{Bad}_T}-\frac{\mathrm{Good}_i}{\mathrm{Good}_T}\right)\mathrm{WOE}_i \tag{2}$$

where $n$ is the total number of bins corresponding to the variable, $\mathrm{WOE}_i$ is the bin value of the $i$-th bin of the variable, $\mathrm{Bad}_i$ is the number of sample objects with abnormal behavior in the $i$-th bin of the variable, $\mathrm{Good}_i$ is the number of sample objects without abnormal behavior in the $i$-th bin of the variable, $\mathrm{Bad}_T$ is the number of sample objects with abnormal behavior among all sample objects, and $\mathrm{Good}_T$ is the number of sample objects without abnormal behavior among all sample objects.
Step 404, determining the variables whose prediction capability evaluation values are greater than the evaluation threshold as target variables whose risk prediction capability meets the standard.
The evaluation threshold can be flexibly configured according to actual requirements.
Optionally, the server may determine each variable whose prediction capability evaluation value is greater than the evaluation threshold as a target variable whose risk prediction capability meets the standard, so as to screen out the variables that are useful for the risk prediction business of the financial institution.
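The screening in steps 402 and 404 can be sketched as follows. The sketch assumes that the IV sums, over the bins, the difference between the abnormal proportion and the non-abnormal proportion multiplied by the WOE of the bin, with WOE taken as the logarithm of the ratio of those two proportions. The variable names, counts, and the threshold 0.1 are hypothetical.

```python
import math

def information_value(bad_counts, good_counts):
    """IV of one variable from per-bin abnormal / non-abnormal counts."""
    bad_total, good_total = sum(bad_counts), sum(good_counts)
    iv = 0.0
    for bad_i, good_i in zip(bad_counts, good_counts):
        p_bad = bad_i / bad_total      # abnormal proportion of the bin
        p_good = good_i / good_total   # non-abnormal proportion of the bin
        iv += (p_bad - p_good) * math.log(p_bad / p_good)
    return iv

def screen_variables(binned, threshold):
    """Keep the variables whose IV is greater than the evaluation threshold."""
    return [
        name
        for name, (bad, good) in binned.items()
        if information_value(bad, good) > threshold
    ]

# Hypothetical binned counts for two variables.
binned = {
    "overdue_count": ([10, 30, 60], [50, 30, 20]),
    "noise": ([50, 50], [50, 50]),
}
selected = screen_variables(binned, threshold=0.1)
```

A variable whose bins carry identical shares of both groups has an IV of zero and is dropped, whatever positive threshold is configured.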
In this embodiment, screening the variables has two effects. On the one hand, it ensures that the sample space is constructed from high-risk sample data corresponding to variables with good predictive capability, so that the risk prediction rules mined subsequently all have practical value in the risk prediction scenario and low-frequency, high-risk rules that represent significant risks are not missed; it also prevents variables with poor risk prediction capability from degrading the risk prediction model, which reduces false risk alarms. On the other hand, only the target bins corresponding to the target variables are retained, which reduces the amount of sample data, achieves dimensionality reduction, and improves the efficiency and accuracy of dividing the sample space in subsequent iterations.
In one embodiment, as shown in FIG. 5, dividing the sample space into a plurality of subspaces includes:
Step 502, constructing a survival analysis function for determining the survival rate of the sample objects in a space.
The survival rate of a space characterizes the degree of risk of the sample objects in that space. The higher the survival rate of a subspace, the higher the probability that the sample objects in the subspace exhibit abnormal behavior, and the higher the degree of risk.
Optionally, based on a survival analysis algorithm, the server may introduce an evaluation index that better fits the risk prediction business of the financial institution, namely the survival rate, and construct a survival analysis function for determining the survival rate of the sample objects in a space. A survival analysis algorithm analyzes the survival time of a sample object. In the field of financial technology, the survival time of a sample object may specifically be the length of time for which the sample object continues without abnormal behavior. In this embodiment, a survival analysis function for determining the survival rate of the sample objects in a space can therefore be constructed based on the survival analysis algorithm, so that the probability of abnormal behavior of the sample objects in the space is analyzed in close combination with the business of the financial institution.
Step 504, taking the sample space as the parent space, and spatially dividing the parent space with the goal of maximizing the increase in survival rate after division, to obtain two subspaces.
A PRIM (Patient Rule Induction Method) model may be combined with the survival analysis function of a space to perform the spatial division. The PRIM model can divide a sample space into a plurality of rectangular regions by setting thresholds on continuous variables and categorical variables. In this embodiment, the division of the sample space may be achieved by setting a survival rate increment threshold and a sample object increment threshold for the parent space and iteratively performing "stripping" and "merging" steps. After a space is divided, the increase in survival rate = the sum of the survival rates of the two subspaces obtained by the division minus the survival rate of the space that was divided.
Optionally, the server may first take the sample space as the parent space and randomly generate a plurality of schemes for spatially dividing the parent space. Then, with the goal of maximizing the increase in survival rate after division, the server selects from the randomly generated schemes the one with the largest increase in survival rate, and divides the parent space according to the selected scheme to obtain two subspaces.
Step 506, stripping the subspace with the smaller survival rate from the parent space, retaining the subspace with the larger survival rate, and taking the subspace with the larger survival rate as the new parent space.
Optionally, once the two subspaces are obtained, the server may compare their survival rates, strip the subspace with the smaller survival rate from the parent space, retain the subspace with the larger survival rate, and take the retained subspace as the new parent space.
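A single round of steps 504 and 506 can be sketched as below. This is a simplified illustration under stated assumptions: each candidate scheme is an axis-aligned cut given as a (dimension, threshold) pair supplied by the caller rather than generated randomly, and the survival rate of a space is taken as the sum of survival labels divided by the sum of survival days. The function names and the sample data are hypothetical.

```python
def survival_rate(samples):
    """samples: list of (features, survival_days, survival_label)."""
    total_days = sum(days for _, days, _ in samples)
    if total_days == 0:
        return 0.0
    return sum(label for _, _, label in samples) / total_days

def strip_once(parent, candidates):
    """Apply the cut with the largest survival-rate increase.

    Returns (retained, stripped): the subspace with the larger survival
    rate and the subspace stripped from the parent space.
    """
    base = survival_rate(parent)
    best = None
    for dim, threshold in candidates:
        left = [s for s in parent if s[0][dim] <= threshold]
        right = [s for s in parent if s[0][dim] > threshold]
        if not left or not right:
            continue  # the cut does not actually divide the space
        # Increase = sum of the two subspace rates minus the parent rate.
        gain = survival_rate(left) + survival_rate(right) - base
        if best is None or gain > best[0]:
            best = (gain, left, right)
    if best is None:
        return parent, []
    _, left, right = best
    if survival_rate(left) >= survival_rate(right):
        return left, right
    return right, left

# Hypothetical one-dimensional parent space with four sample objects.
parent = [((1,), 10, 0), ((2,), 10, 0), ((3,), 5, 1), ((4,), 5, 1)]
retained, stripped = strip_once(parent, [(0, 1), (0, 2), (0, 3)])
```

Repeating `strip_once` on the retained subspace yields the sequence of parent spaces described in step 508.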
Step 508, repeating the process of spatially dividing the parent space with the goal of maximizing the increase in survival rate after division; during the loop, if a stripped subspace is adjacent to the parent space and the survival rate difference between the stripped subspace and the parent space is smaller than the difference threshold, merging the stripped subspace with the parent space.
The difference threshold can be configured flexibly according to the actual application scenario. Survival rate difference = the survival rate of the space obtained by merging minus the sum of the survival rates of the two spaces before merging.
Optionally, the server may repeat the process of spatially dividing the parent space with the goal of maximizing the increase in survival rate after division. Each time a space division step is performed during the loop, the server checks whether there is a stripped subspace that is adjacent to the parent space and whose survival rate difference from the parent space is smaller than the difference threshold. If such a stripped subspace exists, step 510 is performed to merge the stripped subspace with the parent space, so that the number of sample objects in the parent space increases; otherwise, step 508 continues.
Step 512, when the loop stop condition is reached, obtaining the divided sample space.
The loop stop condition may specifically be as follows: a maximum number of space divisions is set, and the loop stop condition is determined to be reached when the number of space divisions reaches that maximum. Alternatively, the loop stop condition is also determined to be reached when the change in the subspaces after division or merging is sufficiently small.
Optionally, when the loop stop condition is reached, the server may output the divided sample space, which consists of a plurality of stripped subspaces and the finally retained subspace, each with its own survival rate. Because each space division is performed with the goal of maximizing the increase in survival rate after division, the finally retained subspace is the subspace with the largest survival rate and the highest risk.
For example, taking "the change in the subspaces after division or merging is sufficiently small" as the loop stop condition, the server may preset a survival rate increment threshold and a sample object increment threshold. If, in a given iteration, the increase in survival rate after space division is smaller than the survival rate increment threshold and the increase in the number of sample objects in the parent space after space merging is smaller than the sample object increment threshold, the server may determine that the loop stop condition is reached and output the divided sample space.
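The two stop conditions described above can be combined in a small check, sketched here with hypothetical parameter names and threshold values. It is only one way to read the conditions and is not taken from the embodiment.

```python
def should_stop(divisions_done, max_divisions,
                survival_gain, merged_sample_gain,
                survival_gain_threshold, sample_gain_threshold):
    """Return True when either loop stop condition is met."""
    # Condition 1: the configured maximum number of space divisions is reached.
    if divisions_done >= max_divisions:
        return True
    # Condition 2: both the survival-rate increase after division and the
    # sample-object increase after merging fall below their thresholds.
    return (survival_gain < survival_gain_threshold
            and merged_sample_gain < sample_gain_threshold)

# Hypothetical values: the change in this iteration is already small.
stop_now = should_stop(
    divisions_done=4, max_divisions=20,
    survival_gain=0.001, merged_sample_gain=2,
    survival_gain_threshold=0.005, sample_gain_threshold=10,
)
```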
For example, FIG. 6 provides a schematic diagram of the plurality of subspaces obtained after a sample space is spatially divided. First, the sample space can be regarded as one large rectangular region. For the original sample space, the server may randomly generate at least the four division schemes shown in FIG. 6: $B_1$ and $b_{11-}$, $B_1$ and $b_{11+}$, $B_1$ and $b_{12-}$, and $B_1$ and $b_{12+}$. If, for example, the scheme $B_1$ and $b_{11+}$ maximizes the increase in survival rate, the server may divide the sample space into $B_2$ and $b_1^*$. Because the survival rate of $B_2$ is greater than that of $b_1^*$, the server retains $B_2$ as the new parent space and strips $b_1^*$. The server may then divide $B_2$ into $B_3$ and $b_2^*$ according to the same division rule and, because the survival rate of $B_3$ is greater than that of $b_2^*$, retain $B_3$ as the new parent space and strip $b_2^*$. After multiple rounds of space division and space merging, the divided sample space is obtained. The sample space is divided into the subspaces $b_1^*$, $b_2^*$, $b_3^*$, $b_4^*$, $b_5^*$, $b_6^*$, $b_7^*$, $b_8^*$ and $B_9$, where $b_1^*$ to $b_8^*$ are the stripped subspaces and $B_9$ is the finally retained subspace. The resulting subspaces do not overlap one another and can be pieced together into the original sample space.
In this embodiment, by repeatedly dividing and merging the sample space, the survival rate of each subspace and the number of sample objects covered by each subspace are adjusted continuously. This improves the fit of the PRIM model and balances the survival rate of each subspace against the number of sample objects it covers, that is, balances prediction precision against coverage. A plurality of subspaces with different survival rates is finally obtained, so that sample object groups with different survival rates are divided accurately and risk prediction can be performed accurately and quickly on an object to be predicted based on the subspace to which it belongs.
In one embodiment, as shown in fig. 7, the method further comprises:
Step 702, if there are a plurality of stripped subspaces that are adjacent to the parent space and whose survival rate difference from the parent space is smaller than the difference threshold, determining the number of sample objects in each of those stripped subspaces.
Optionally, each time a space division is completed according to step 508, the server checks whether there is a stripped subspace that is adjacent to the parent space and whose survival rate difference from the parent space is smaller than the difference threshold. If there are a plurality of such stripped subspaces, step 702 is performed to determine the number of sample objects in each of them.
For example, if the check finds only one stripped subspace that is adjacent to the parent space and whose survival rate difference from the parent space is smaller than the difference threshold, the step of merging that stripped subspace with the parent space is performed directly. If the check finds no stripped subspace that is adjacent to the parent space and whose survival rate difference from the parent space is smaller than the difference threshold, the process returns directly to step 508.
Step 704, merging the stripped subspace that contains the largest number of sample objects with the parent space.
Optionally, the server may compare the numbers of sample objects in the stripped subspaces determined in step 702 and take the stripped subspace with the largest number of sample objects as the subspace to be merged. Step 704 is then performed to merge that stripped subspace with the parent space, so that the increase (change) in the number of sample objects in the parent space after merging is maximized.
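The selection in steps 702 and 704 can be sketched as follows. The sketch assumes that adjacency and the survival rate difference have already been computed for each stripped subspace; the class `StrippedSubspace`, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StrippedSubspace:
    name: str
    adjacent_to_parent: bool
    survival_gap: float   # survival rate difference relative to the parent space
    sample_count: int     # number of sample objects in the subspace

def pick_subspace_to_merge(
    stripped: List[StrippedSubspace], gap_threshold: float
) -> Optional[StrippedSubspace]:
    """Among eligible stripped subspaces, pick the one with the most sample objects."""
    eligible = [
        s for s in stripped
        if s.adjacent_to_parent and s.survival_gap < gap_threshold
    ]
    if not eligible:
        return None  # nothing to merge; the division loop simply continues
    return max(eligible, key=lambda s: s.sample_count)

# Hypothetical stripped subspaces after several rounds of division.
candidates = [
    StrippedSubspace("b1", True, 0.002, 120),
    StrippedSubspace("b2", True, 0.001, 340),
    StrippedSubspace("b3", False, 0.0005, 900),  # not adjacent: ineligible
    StrippedSubspace("b4", True, 0.050, 700),    # gap too large: ineligible
]
chosen = pick_subspace_to_merge(candidates, gap_threshold=0.01)
```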
In this embodiment, space merging is performed with the goal of maximizing the increase in the number of sample objects in the parent space. This helps the local search approach the global optimal solution, so that the prediction precision and coverage of each finally output subspace tend toward balance, sample object groups with different survival rates are divided accurately, and risk prediction can be performed accurately and quickly on an object to be predicted based on the subspace to which it belongs.
In one embodiment, as shown in fig. 8, the method further comprises:
Step 802, for each sample object, taking the number of days for which the sample object continuously has no abnormal behavior within a target time period as the survival days of the sample object; the target time period runs from the start time at which the sample object generated a resource transfer record to the end time at which the sample data of the sample object was acquired.
Optionally, for each sample object, the server may take the number of days for which the sample object continuously has no abnormal behavior within the target time period as the survival days of the sample object, that is, its survival time. In other words, in line with the actual business scenario of the financial institution, data on the abnormal behavior of sample objects is combined with the survival analysis algorithm, so that high-quality rules of practical value can be mined that predict the probability of abnormal behavior of a sample object based on the business characteristics of the financial institution.
Step 804, determining the value of the survival label of the sample object by judging whether the sample object has abnormal behavior from the start time to the end time.
Optionally, for each sample object, the server may judge whether the sample object has abnormal behavior from the start time to the end time, and assign a value to the survival label of the sample object according to the result.
For example, if the sample object has abnormal behavior between the start time and the end time, the server may set the survival label of the sample object to 0; if the sample object has no abnormal behavior between the start time and the end time, the server may set the survival label of the sample object to 1.
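Steps 802 and 804 can be sketched as below. The label values follow the example above (0 when abnormal behavior occurred, 1 otherwise). Counting the survival days up to the first abnormal behavior is an assumption made for this sketch, and the function name `survival_record` and the dates are hypothetical.

```python
from datetime import date
from typing import Optional, Tuple

def survival_record(
    start: date, end: date, first_abnormal: Optional[date] = None
) -> Tuple[int, int]:
    """Return (survival_days, survival_label) for one sample object.

    start: time of the first resource transfer record.
    end: time at which the sample data was acquired.
    first_abnormal: date of the first abnormal behavior, if any.
    """
    if first_abnormal is not None and start <= first_abnormal <= end:
        # Abnormal behavior occurred in the target time period: label 0.
        return (first_abnormal - start).days, 0
    # No abnormal behavior in the target time period: label 1.
    return (end - start).days, 1

clean = survival_record(date(2023, 1, 1), date(2023, 1, 31))
flagged = survival_record(date(2023, 1, 1), date(2023, 1, 31), date(2023, 1, 11))
```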
Step 806, constructing a survival analysis function for determining the survival rate of each sample object in the space based on the sum of the survival labels of each sample object in the space and the sum of the survival days of each sample object.
The space may specifically include: original sample space, and mother space and child space generated in the space division process.
Optionally, the server may construct the survival analysis function for determining the survival rate of the sample objects in a space by using the sum of the survival labels of the sample objects in the space as the numerator and the sum of the survival days of the sample objects as the denominator.
For example, for a given space, the survival rate of the sample objects in the space may be calculated by formula (3):

$$f(t,\delta)=\frac{\sum_{i=1}^{m}\delta_i}{\sum_{i=1}^{m}t_i} \tag{3}$$

where $f(t,\delta)$ is the survival rate of the sample objects in the space, $m$ is the total number of sample objects in the space, $\delta_i$ is the survival label of the $i$-th sample object in the space, and $t_i$ is the survival days of the $i$-th sample object in the space.
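The survival analysis function, with the sum of survival labels as the numerator and the sum of survival days as the denominator, can be written as a short Python function. The function name and the sample values are hypothetical, and the guard against a zero denominator is an addition made for the sketch.

```python
def space_survival_rate(labels, days):
    """Sum of survival labels divided by the sum of survival days."""
    if len(labels) != len(days):
        raise ValueError("labels and days must describe the same sample objects")
    total_days = sum(days)
    if total_days == 0:
        raise ValueError("total survival days must be positive")
    return sum(labels) / total_days

# Hypothetical space with three sample objects.
rate = space_survival_rate(labels=[1, 0, 1], days=[30, 10, 60])
```

The same function applies unchanged to the original sample space, to each parent space, and to each subspace produced during division.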
In this embodiment, the risk prediction service of the financial institution is tightly combined, the number of living days and living labels of the sample objects are fused, and a living analysis function is constructed to determine the living rates of the sample objects in different spaces, so that the sample spaces can be spatially divided based on the living rates of the spaces, the sample object groups with different living rates can be divided, the living rates corresponding to each sample object group can be determined, and the risk prediction can be accurately and rapidly performed on the target to be predicted based on the subspace to which the target to be predicted belongs.
In one embodiment, as shown in fig. 9, determining the risk prediction function of the sample object in each subspace separately includes:
Step 902, for each subspace, acquiring the abnormal time points of the sample objects with abnormal behavior in the subspace, and counting the number of sample objects with abnormal behavior at each abnormal time point.
Optionally, for each subspace, the server may acquire, based on the sample data of the sample objects in the subspace, the abnormal time points of the sample objects with abnormal behavior in the subspace, and count the number of sample objects with abnormal behavior at each abnormal time point.
Step 904, fitting the trend of the number of the sample objects with abnormal behaviors along with the time to obtain a risk prediction function of the sample objects in the subspace.
Optionally, the server may determine the number of sample objects in which no abnormal behavior occurs at each abnormal time point based on the number of sample objects in which abnormal behavior occurs at each abnormal time point, and fit a trend of the number of sample objects in which abnormal behavior occurs over time based on the number of sample objects in which abnormal behavior occurs at the abnormal time point and the number of sample objects in which abnormal behavior does not occur, to obtain the risk prediction function of the sample objects in the subspace.
For example, for each subspace, the server may substitute the numbers of sample objects with and without abnormal behavior at each abnormal time point in the subspace into formula (4) and fit the result to obtain the risk prediction function of the subspace. Formula (4) is specifically as follows:

$$h(t)=\frac{f(t)}{S(t)} \tag{4}$$

where $h(t)$ is the probability (risk) that a sample object in the space exhibits abnormal behavior within the time period $[t, t+\Delta t]$, $f(t)$ is the number of sample objects in the space that exhibit abnormal behavior within the time period $[t, t+\Delta t]$, $S(t)$ is the number of sample objects in the space that exhibit no abnormal behavior within the time period $[t, t+\Delta t]$, $t$ is any time point, and $\Delta t$ is a short time interval.
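The counting in steps 902 and 904 can be sketched as follows. The sketch assumes that the risk at each abnormal time point is the ratio of the number of sample objects with abnormal behavior at that point to the number that still have no abnormal behavior; this ratio form is an assumption of the sketch. The function name and the input data are hypothetical.

```python
from collections import Counter

def risk_by_time_point(abnormal_days, total_objects):
    """Return {time point: f(t) / S(t)} for one subspace.

    abnormal_days: the abnormal time point of every sample object that
        showed abnormal behavior (one entry per object).
    total_objects: number of sample objects in the subspace.
    """
    events = Counter(abnormal_days)   # f(t): abnormal objects per time point
    remaining = total_objects         # objects with no abnormal behavior so far
    risk = {}
    for t in sorted(events):
        f_t = events[t]
        remaining -= f_t              # S(t): still without abnormal behavior
        if remaining > 0:
            risk[t] = f_t / remaining
    return risk

# Hypothetical subspace: 10 sample objects, 4 of which showed abnormal behavior.
risk = risk_by_time_point([5, 5, 12, 20], total_objects=10)
```

A curve fitted through these points, for example by regression over the time points, would then serve as the risk prediction function of the subspace.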
For example, after fitting the risk prediction function of a space, the server can determine the probability that the sample objects in the space exhibit abnormal behavior and how that probability changes over time. If the risk state of an object to be predicted needs to be predicted at a certain time point, the server may predict the probability of abnormal behavior of the object to be predicted based on that time point and the risk prediction function of the target subspace to which the object belongs.
For example, at the time point at which risk prediction is performed for the object to be predicted, if the risk prediction function of the target subspace to which the object belongs is a non-decreasing function, that is, the value output by the risk prediction function increases over time, this indicates that the overall credit quality of the sample object group to which the object belongs is trending downward, that is, the object to be predicted has a high probability of abnormal behavior and is in a high-risk state.
For example, if the risk prediction function of the target subspace to which the object to be predicted belongs is a non-increasing function, that is, the value output by the risk prediction function decreases over time, this indicates that the overall credit quality of the sample object group to which the object belongs is trending upward, that is, the object to be predicted has a low probability of abnormal behavior and is in a low-risk state.
For example, if the risk prediction function of the target subspace to which the object to be predicted belongs is constant, that is, the value output by the risk prediction function is basically unchanged with the increase of time, this means that the overall credit quality of the sample object group to which the object to be predicted belongs is relatively stable.
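The three cases above can be sketched as a simple classification of risk function values sampled at increasing time points. The function name, the returned state names, the tolerance, and the extra "mixed" case for functions that both rise and fall are assumptions made for the sketch.

```python
def classify_risk_trend(values, tolerance=1e-9):
    """Classify sampled risk-function outputs, ordered by time."""
    steps = [later - earlier for earlier, later in zip(values, values[1:])]
    if all(abs(step) <= tolerance for step in steps):
        return "stable"       # output basically unchanged over time
    if all(step >= -tolerance for step in steps):
        return "high_risk"    # non-decreasing: credit quality trending down
    if all(step <= tolerance for step in steps):
        return "low_risk"     # non-increasing: credit quality trending up
    return "mixed"            # neither monotone case applies

# Hypothetical samples of a fitted risk prediction function.
rising = classify_risk_trend([0.10, 0.20, 0.20, 0.40])
falling = classify_risk_trend([0.40, 0.30, 0.10])
flat = classify_risk_trend([0.20, 0.20, 0.20])
```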
In this embodiment, introducing a risk prediction function makes it possible to predict the risk trend of any sample object group at any time point. Specifically, in close combination with the risk prediction business of the financial institution, the risk state of an object to be predicted can be predicted accurately and quickly at any time point based on the risk prediction function configured for the subspace to which the object belongs.
In one embodiment, as shown in fig. 10, another method for constructing a risk prediction model is provided, and the method is applied to the application environment shown in fig. 1 for illustration, and mainly includes the following steps:
After obtaining the sample data corresponding to each of the plurality of variables related to risk prediction, the server may execute step 1002 to perform initial binning on the sample data corresponding to each variable, obtaining a plurality of initial bins for each variable. For each initial bin, the server may execute step 1004 to determine the bin value of the initial bin based on the abnormal proportion and non-abnormal proportion of the sample objects to which the sample data in the initial bin belongs. With the goal of making the bin value of each bin positively correlated with the abnormal proportion of the sample objects to which the sample data in the bin belongs, the server may execute step 1006 to adjust the plurality of initial bins, obtaining the plurality of bins corresponding to each variable. For each variable, the server may then execute step 1008 to evaluate the risk prediction capability of the variable based on the respective bin values of its bins, obtaining the prediction capability evaluation value of the variable, and execute step 1010 to determine the variables whose prediction capability evaluation values are greater than the evaluation threshold as target variables whose risk prediction capability meets the standard.
Further, for each target variable, the server may execute step 1012 to take the bin with the highest abnormal proportion of sample objects among the bins corresponding to the target variable as the target bin of the target variable, and then execute step 1014 to determine the sample space containing the target bins.
After the sample space is constructed, the server may execute step 1016 to take the sample space as the parent space and, with the goal of maximizing the increase in survival rate after division, execute step 1018 to spatially divide the parent space into two subspaces. The server may then execute step 1020 to strip the subspace with the smaller survival rate from the parent space and take the subspace with the larger survival rate as the new parent space, and repeat the process of spatially dividing the parent space with the goal of maximizing the increase in survival rate after division. During the loop, if there are a plurality of stripped subspaces that are adjacent to the parent space and whose survival rate difference from the parent space is smaller than the difference threshold, step 1022 is executed to merge the stripped subspace containing the largest number of sample objects with the parent space. The loop continues until the loop stop condition is reached, yielding the divided sample space.
After the division of the sample space is completed, for each subspace, the server may execute step 1024 to obtain an abnormal time point of the sample object having abnormal behavior in the subspace, count the number of the sample objects having abnormal behavior at each abnormal time point, further execute step 1026 to fit the trend of the number of the sample objects having abnormal behavior with time to obtain a risk prediction function of the sample objects in the subspace, and finally execute step 1028 to construct a risk prediction model based on the risk prediction functions corresponding to the subspaces.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages. These sub-steps or stages are not necessarily performed at the same time, but may be performed at different times, and they are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application also provides an apparatus for constructing a risk prediction model, which implements the method for constructing a risk prediction model described above. The solution provided by the apparatus is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more apparatus embodiments provided below, reference may be made to the limitations of the method for constructing a risk prediction model above, which are not repeated here.
In one embodiment, as shown in fig. 11, there is provided a risk prediction model construction apparatus, including: a binning processing module 1102, a variable screening module 1104, a binning determination module 1106, a spatial processing module 1108, and a model building module 1110, wherein:
the binning processing module 1102 is configured to acquire sample data corresponding to each of a plurality of variables related to risk prediction, and to perform binning on each item of sample data to obtain a plurality of bins corresponding to each variable, the sample data belonging to different sample objects;
the variable screening module 1104 is configured to screen, from the plurality of variables, target variables whose risk prediction capability meets the standard;
the binning determination module 1106 is configured to, for each target variable, take the bin with the highest abnormal proportion of sample objects among the plurality of bins corresponding to the target variable as the target bin of the target variable;
the spatial processing module 1108 is configured to determine a sample space containing each target bin, divide the sample space into a plurality of subspaces, and determine a risk prediction function of the sample objects in each subspace;
the model building module 1110 is configured to construct a risk prediction model based on the risk prediction functions corresponding to the subspaces; the risk prediction model is used for determining the target subspace to which an object to be predicted belongs, and predicting the risk state of the object to be predicted based on the risk prediction function of the target subspace.
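The target-bin selection performed by the binning determination module can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes that the "abnormal proportion" of a bin is the fraction of sample objects in that bin that are abnormal, and the function name `select_target_bin` and the sample data are hypothetical.

```python
from collections import defaultdict

def select_target_bin(bin_labels, is_abnormal):
    """Return the bin with the highest abnormal proportion for one variable.

    bin_labels: bin identifier of each sample object for this variable.
    is_abnormal: 1 if the sample object is abnormal, else 0 (same order).
    """
    totals = defaultdict(int)
    abnormal = defaultdict(int)
    for b, y in zip(bin_labels, is_abnormal):
        totals[b] += 1
        abnormal[b] += y
    # Assumed definition: abnormal objects in the bin / all objects in the bin.
    return max(totals, key=lambda b: abnormal[b] / totals[b])

# Hypothetical data: seven sample objects binned on one target variable.
bins = ["low", "low", "mid", "mid", "high", "high", "high"]
flags = [0, 0, 0, 1, 1, 1, 0]
target = select_target_bin(bins, flags)
```

Repeating this for every target variable yields one target bin per variable; the sample space is then the set of sample objects covered by those target bins.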
In the above apparatus for constructing a risk prediction model, sample data corresponding to each of a plurality of variables related to risk prediction is first acquired; that is, the sample data is selected in close connection with the business of the financial institution. Binning is then performed on each item of sample data to obtain a plurality of bins corresponding to each variable, which improves the business interpretability of the sample data. Next, target variables whose risk prediction capability meets the standard are screened from the plurality of variables, so that variables with poor risk prediction capability do not degrade the prediction performance of the risk prediction model, which reduces risk false alarms. For each target variable, the bin with the highest abnormal proportion of sample objects among its bins is taken as the target bin, and a sample space containing the target bins is determined; in other words, the sample space is built from the high-risk sample data of variables whose risk prediction capability meets the standard, which avoids generating prediction rules for non-risk items when the risk prediction model is constructed from that sample space. The sample space is further divided into a plurality of subspaces, and a risk prediction function of the sample objects in each subspace is determined. Finally, the risk prediction model is constructed based on the risk prediction functions corresponding to the subspaces, so that the risk state of an object to be predicted can be predicted more accurately based on the risk prediction function of the subspace to which the object belongs.
In one embodiment, the binning processing module is further configured to: for each variable related to risk prediction, perform preliminary binning on the sample data corresponding to the variable to obtain a plurality of initial bins corresponding to the variable; for each initial bin, determine a bin value of the initial bin based on the abnormal proportion and the non-abnormal proportion of the sample objects to which the sample data in the initial bin belongs; and adjust the plurality of initial bins, with the aim of making the bin values of the bins positively correlated with the abnormal proportions of the sample objects to which the sample data in the bins belongs, to obtain the plurality of bins corresponding to the variable.
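One way to compute a bin value from the abnormal and non-abnormal proportions is a weight-of-evidence (WOE) style log ratio. The sketch below assumes that interpretation; this passage does not name the formula. The function name `bin_values`, the smoothing term `eps`, and the counts are hypothetical, and the bin adjustment step is not shown.

```python
import math

def bin_values(counts, eps=0.5):
    """Map {bin: (n_abnormal, n_normal)} to {bin: WOE-style bin value}.

    Assumed formula: ln(share of all abnormal objects falling in the bin
                        / share of all normal objects falling in the bin).
    """
    total_abn = sum(a for a, _ in counts.values())
    total_norm = sum(n for _, n in counts.values())
    k = len(counts)
    values = {}
    for b, (a, n) in counts.items():
        # eps is a hypothetical smoothing term that avoids log(0) for empty cells.
        abn_share = (a + eps) / (total_abn + eps * k)
        norm_share = (n + eps) / (total_norm + eps * k)
        values[b] = math.log(abn_share / norm_share)
    return values

# Hypothetical initial bins for one variable: (abnormal count, normal count).
woe = bin_values({"low": (2, 48), "mid": (5, 45), "high": (13, 37)})
```

With this orientation of the ratio, a bin with a larger share of abnormal objects receives a larger bin value, which matches the positive correlation the embodiment aims for.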
In one embodiment, the variable screening module is further configured to: for each variable, based on respective bin values of a plurality of bins corresponding to the variable, evaluating the risk prediction capability of the variable to obtain a prediction capability evaluation value of the variable; and determining the variables with the predictive ability evaluation values larger than the evaluation threshold as target variables with the risk predictive ability reaching standards.
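The screening step can be illustrated with an information value (IV) computed from per-bin WOE-style values. This is a sketch under stated assumptions: the passage does not specify IV as the evaluation value, and the threshold of 0.1, the function names, and the variable names are hypothetical.

```python
import math

def information_value(counts, eps=0.5):
    """Assumed evaluation value: sum over bins of (abn_share - norm_share) * WOE."""
    total_abn = sum(a for a, _ in counts.values())
    total_norm = sum(n for _, n in counts.values())
    k = len(counts)
    iv = 0.0
    for a, n in counts.values():
        # eps is a hypothetical smoothing term that avoids log(0).
        abn_share = (a + eps) / (total_abn + eps * k)
        norm_share = (n + eps) / (total_norm + eps * k)
        iv += (abn_share - norm_share) * math.log(abn_share / norm_share)
    return iv

def screen_variables(per_variable_counts, threshold=0.1):
    """Keep variables whose evaluation value exceeds the (hypothetical) threshold."""
    return [name for name, counts in per_variable_counts.items()
            if information_value(counts) > threshold]

# Hypothetical per-bin (abnormal, normal) counts for two candidate variables.
candidates = {
    "overdue_count": {"low": (2, 48), "mid": (5, 45), "high": (13, 37)},
    "weekday": {"a": (7, 43), "b": (6, 44), "c": (7, 43)},
}
selected = screen_variables(candidates)
```

In this example the first variable separates abnormal from normal objects across its bins and passes the threshold, while the second has nearly identical proportions in every bin and is dropped.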
In one embodiment, the spatial processing module is further configured to: constructing a survival analysis function for determining the survival rate of the sample objects in the space; taking the sample space as a mother space, and carrying out space division on the mother space with the aim of maximizing the increase of survival rate after division to obtain two subspaces; stripping the subspace with smaller survival rate from the mother space, reserving the subspace with larger survival rate, and taking the subspace with larger survival rate as a new mother space; cycling the process of space division of the mother space with the goal of maximizing the increase of survival rate after division; in the circulation process, if the stripped subspace is adjacent to the mother space and the survival rate difference between the stripped subspace and the mother space is smaller than the difference threshold value, merging the stripped subspace and the mother space; when the cycle stop condition is reached, a sample space where division is completed is obtained.
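The division loop described above can be sketched as a greedy peeling procedure. The sketch makes several assumptions that are not stated in this passage: splits are axis-aligned thresholds on a single variable, the survival analysis function is the ratio of summed survival tags to summed survival days, and the loop stops when no split increases the rate or a side would hold fewer than `min_size` objects. The merging of adjacent stripped subspaces is omitted, and all names are hypothetical.

```python
def rate(objs):
    """Hypothetical survival analysis function: sum of tags / sum of days."""
    days = sum(o["days"] for o in objs)
    return sum(o["tag"] for o in objs) / days if days else 0.0

def best_split(objs, variables, rate_fn, min_size):
    """Try every threshold on every variable; return (gain, kept, stripped)."""
    best = None
    for var in variables:
        for cut in sorted({o[var] for o in objs})[1:]:
            left = [o for o in objs if o[var] < cut]
            right = [o for o in objs if o[var] >= cut]
            if min(len(left), len(right)) < min_size:
                continue
            # Keep the side with the larger rate, strip the other.
            if rate_fn(left) >= rate_fn(right):
                kept, stripped = left, right
            else:
                kept, stripped = right, left
            gain = rate_fn(kept) - rate_fn(objs)
            if best is None or gain > best[0]:
                best = (gain, kept, stripped)
    return best

def peel(objs, variables, rate_fn=rate, min_size=2):
    """Repeatedly split the mother space; return stripped subspaces plus the final one."""
    mother, stripped_spaces = list(objs), []
    while True:
        found = best_split(mother, variables, rate_fn, min_size)
        if found is None or found[0] <= 0:  # assumed stop condition
            break
        _, mother, stripped = found
        stripped_spaces.append(stripped)
    return stripped_spaces + [mother]

# Hypothetical sample objects with one variable "x".
sample = [{"x": i, "tag": 1 if i >= 6 else 0, "days": 10} for i in range(8)]
subspaces = peel(sample, ["x"])
```

Each iteration keeps the side whose rate is larger as the new mother space, so the final mother space is the last element of the returned list.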
In one embodiment, the apparatus for constructing a risk prediction model further includes a space merging module, configured to: if a plurality of stripped subspaces are adjacent to the mother space and the survival rate difference between each of them and the mother space is smaller than the difference threshold, determine the number of sample objects in each such stripped subspace; and merge the stripped subspace containing the largest number of sample objects with the mother space.
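The merge rule can be sketched as a filter followed by a maximum over sample counts. The record layout (`adjacent`, `rate`, `n_objects`) and the function name `choose_merge` are hypothetical, and how adjacency is determined is outside this sketch.

```python
def choose_merge(mother_rate, stripped, diff_threshold):
    """Pick the stripped subspace to merge back into the mother space, or None.

    stripped: list of dicts with hypothetical keys
      'adjacent'  - whether the subspace borders the mother space,
      'rate'      - survival rate of the subspace,
      'n_objects' - number of sample objects in the subspace.
    """
    candidates = [s for s in stripped
                  if s["adjacent"] and abs(s["rate"] - mother_rate) < diff_threshold]
    if not candidates:
        return None
    # Among qualifying subspaces, merge the one holding the most sample objects.
    return max(candidates, key=lambda s: s["n_objects"])

stripped = [
    {"name": "A", "adjacent": True, "rate": 0.048, "n_objects": 120},
    {"name": "B", "adjacent": True, "rate": 0.052, "n_objects": 300},
    {"name": "C", "adjacent": False, "rate": 0.050, "n_objects": 900},
    {"name": "D", "adjacent": True, "rate": 0.200, "n_objects": 500},
]
chosen = choose_merge(mother_rate=0.05, stripped=stripped, diff_threshold=0.01)
```

Subspace C is excluded because it is not adjacent and D because its rate differs too much, so B is chosen over A on sample count.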
In one embodiment, the apparatus for constructing a risk prediction model further includes a survival analysis function construction module, configured to: for each sample object, take the number of consecutive days in a target time period during which the sample object exhibits no abnormal behavior as the survival days of the sample object, the target time period running from the start time at which the sample object generated a resource transfer record to the end time at which the sample data of the sample object was acquired; determine the value of a survival tag of the sample object according to whether the sample object has exhibited abnormal behavior as of the end time; and construct a survival analysis function for determining the survival rate of the sample objects in a space, based on the sum of the survival tags and the sum of the survival days of the sample objects in the space.
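The survival days, survival tag, and survival analysis function can be sketched as follows. The passage states only that the function is built from the sum of survival tags and the sum of survival days; taking their ratio, setting the tag to 1 when abnormal behavior occurred, and counting survival days up to the first abnormal behavior are all assumptions made for illustration. The function names are hypothetical.

```python
from datetime import date

def survival_record(start, end, first_abnormal=None):
    """Return (survival_days, survival_tag) for one sample object.

    start: date of the first resource transfer record.
    end: date up to which sample data was acquired.
    first_abnormal: date of the first abnormal behavior, if any.
    """
    if first_abnormal is not None and start <= first_abnormal <= end:
        # Assumed: survival ends at the first abnormal behavior, tag = 1.
        return (first_abnormal - start).days, 1
    return (end - start).days, 0

def survival_analysis_value(records):
    """Assumed form of the function: sum of tags divided by sum of survival days."""
    total_days = sum(d for d, _ in records)
    total_tags = sum(t for _, t in records)
    return total_tags / total_days if total_days else 0.0

records = [
    survival_record(date(2023, 1, 1), date(2023, 1, 31)),
    survival_record(date(2023, 1, 1), date(2023, 1, 31), date(2023, 1, 11)),
]
value = survival_analysis_value(records)
```

Under these assumptions the value behaves like an event rate per observed day, so it can be compared across subspaces of different sizes and observation lengths.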
In one embodiment, the model building module is further configured to: for each subspace, acquiring an abnormal time point of a sample object with abnormal behaviors in the subspace, and counting the number of the sample objects with abnormal behaviors at each abnormal time point; fitting the trend of the number of the sample objects with abnormal behaviors along with the time to obtain a risk prediction function of the sample objects in the subspace.
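Counting abnormal sample objects per time point and fitting their trend can be sketched with an ordinary least-squares line. The linear form is an assumption; the passage does not state which function family is fitted. The name `fit_risk_trend` and the use of integer day indices are hypothetical.

```python
from collections import Counter

def fit_risk_trend(abnormal_days):
    """Fit count-versus-time with a least-squares line and return a predictor.

    abnormal_days: for each abnormal sample object in the subspace, the day
    index at which its abnormal behavior occurred (must be non-empty).
    """
    counts = Counter(abnormal_days)          # abnormal objects per time point
    xs = sorted(counts)
    ys = [counts[x] for x in xs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0        # flat line if only one time point
    intercept = mean_y - slope * mean_x
    return lambda t: slope * t + intercept

# Hypothetical subspace: one abnormal object on day 1, two on day 2, three on day 3.
predict = fit_risk_trend([1, 2, 2, 3, 3, 3])
```

The returned callable plays the role of the subspace's risk prediction function in this sketch; a different curve family could be substituted without changing the counting step.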
The modules in the above apparatus for constructing a risk prediction model may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 12. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing construction data of the risk prediction model. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of constructing a risk prediction model.
It will be appreciated by those skilled in the art that the structure shown in FIG. 12 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer-readable storage medium; when executed, the computer program may carry out the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, or a data processing logic unit based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above examples express only several embodiments of the application, and their description is specific and detailed, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art can make several variations and improvements without departing from the concept of the application, all of which fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be determined by the appended claims.
Claims (17)
1. A method of constructing a risk prediction model, the method comprising:
acquiring sample data corresponding to each of a plurality of variables related to risk prediction, and performing binning on each item of sample data to obtain a plurality of bins corresponding to each variable; the sample data belong to different sample objects;
screening, from the plurality of variables, target variables whose risk prediction capability meets the standard;
for each target variable, taking the bin with the highest abnormal proportion of sample objects among the plurality of bins corresponding to the target variable as a target bin of the target variable;
determining a sample space containing each target bin, dividing the sample space into a plurality of subspaces, and respectively determining a risk prediction function of a sample object in each subspace;
constructing a risk prediction model based on the risk prediction functions corresponding to the subspaces; the risk prediction model is used for determining a target subspace to which an object to be predicted belongs and predicting a risk state of the object to be predicted based on a risk prediction function of the target subspace.
2. The method according to claim 1, wherein performing binning on each item of sample data to obtain a plurality of bins corresponding to each variable comprises:
for each variable related to risk prediction, performing preliminary binning on the sample data corresponding to the variable to obtain a plurality of initial bins corresponding to the variable;
for each initial bin, determining a bin value of the initial bin based on an abnormal proportion and a non-abnormal proportion of the sample objects to which the sample data in the initial bin belongs;
adjusting the plurality of initial bins, with the aim of making the bin values of the bins positively correlated with the abnormal proportions of the sample objects to which the sample data in the bins belongs, to obtain the plurality of bins corresponding to the variable.
3. The method of claim 2, wherein screening, from the plurality of variables, target variables whose risk prediction capability meets the standard comprises:
for each variable, evaluating the risk prediction capability of the variable based on respective bin values of a plurality of bins corresponding to the variable to obtain a prediction capability evaluation value of the variable;
and determining the variable with the predictive capability evaluation value larger than an evaluation threshold value as a target variable with the risk predictive capability reaching the standard.
4. The method of claim 1, wherein the dividing the sample space into a plurality of subspaces comprises:
constructing a survival analysis function for determining the survival rate of the sample objects in the space;
taking the sample space as a mother space, and carrying out space division on the mother space with the aim of maximizing the increase of the survival rate after division to obtain two subspaces;
stripping the subspace with smaller survival rate from the mother space, reserving the subspace with larger survival rate, and taking the subspace with larger survival rate as a new mother space;
repeating the process of spatially dividing the mother space with the aim of maximizing the increase in survival rate after division; during the repetition, if a stripped subspace is adjacent to the mother space and the survival rate difference between the stripped subspace and the mother space is smaller than a difference threshold, merging the stripped subspace with the mother space;
when a stop condition of the repetition is reached, obtaining the divided sample space.
5. The method according to claim 4, wherein the method further comprises:
if a plurality of stripped subspaces are adjacent to the mother space and the survival rate difference between each of them and the mother space is smaller than the difference threshold, determining the number of sample objects in each of the stripped subspaces;
and merging the stripped subspace containing the largest number of sample objects with the mother space.
6. The method according to claim 4, wherein the method further comprises:
for each sample object, taking the number of consecutive days in a target time period during which the sample object exhibits no abnormal behavior as the survival days of the sample object; the target time period is the period from the start time at which the sample object generated a resource transfer record to the end time at which the sample data of the sample object was acquired;
determining the value of a survival tag of the sample object according to whether the sample object has exhibited abnormal behavior as of the end time;
the constructing a survival analysis function for determining the survival rate of the sample objects in a space comprises:
constructing the survival analysis function for determining the survival rate of the sample objects in the space based on the sum of the survival tags and the sum of the survival days of the sample objects in the space.
7. The method of claim 1, wherein said separately determining risk prediction functions for sample objects in each of said subspaces comprises:
for each subspace, acquiring an abnormal time point of a sample object with abnormal behavior in the subspace, and counting the number of the sample objects with abnormal behavior at each abnormal time point;
fitting the trend of the number of the sample objects with abnormal behaviors along with the time to obtain a risk prediction function of the sample objects in the subspace.
8. A risk prediction model construction apparatus, the apparatus comprising:
a binning processing module, configured to acquire sample data corresponding to each of a plurality of variables related to risk prediction, and to perform binning on each item of sample data to obtain a plurality of bins corresponding to each variable; the sample data belong to different sample objects;
a variable screening module, configured to screen, from the plurality of variables, target variables whose risk prediction capability meets the standard;
a binning determination module, configured to, for each target variable, take the bin with the highest abnormal proportion of sample objects among the plurality of bins corresponding to the target variable as a target bin of the target variable;
the space processing module is used for determining a sample space containing each target bin, dividing the sample space into a plurality of subspaces and respectively determining a risk prediction function of a sample object in each subspace;
the model construction module is used for constructing a risk prediction model based on the risk prediction functions corresponding to the subspaces; the risk prediction model is used for determining a target subspace to which an object to be predicted belongs and predicting a risk state of the object to be predicted based on a risk prediction function of the target subspace.
9. The apparatus of claim 8, wherein the binning processing module is further configured to: for each variable related to risk prediction, perform preliminary binning on the sample data corresponding to the variable to obtain a plurality of initial bins corresponding to the variable; for each initial bin, determine a bin value of the initial bin based on an abnormal proportion and a non-abnormal proportion of the sample objects to which the sample data in the initial bin belongs; and adjust the plurality of initial bins, with the aim of making the bin values of the bins positively correlated with the abnormal proportions of the sample objects to which the sample data in the bins belongs, to obtain the plurality of bins corresponding to the variable.
10. The apparatus of claim 8, wherein the variable screening module is further to: for each variable, evaluating the risk prediction capability of the variable based on respective bin values of a plurality of bins corresponding to the variable to obtain a prediction capability evaluation value of the variable; and determining the variable with the predictive capability evaluation value larger than an evaluation threshold value as a target variable with the risk predictive capability reaching the standard.
11. The apparatus of claim 8, wherein the spatial processing module is further configured to: constructing a survival analysis function for determining the survival rate of the sample objects in the space; taking the sample space as a mother space, and carrying out space division on the mother space with the aim of maximizing the increase of the survival rate after division to obtain two subspaces; stripping the subspace with smaller survival rate from the mother space, reserving the subspace with larger survival rate, and taking the subspace with larger survival rate as a new mother space; cycling the process of space division of the mother space with the goal of maximizing the increase of survival rate after division; in the cyclic process, if the stripped subspace is adjacent to the mother space and the survival rate difference between the stripped subspace and the mother space is smaller than the difference threshold value, merging the stripped subspace and the mother space; when the cycle stop condition is reached, a sample space where division is completed is obtained.
12. The apparatus of claim 11, wherein the apparatus further comprises:
a space merging module, configured to: if a plurality of stripped subspaces are adjacent to the mother space and the survival rate difference between each of them and the mother space is smaller than the difference threshold, determine the number of sample objects in each of the stripped subspaces; and merge the stripped subspace containing the largest number of sample objects with the mother space.
13. The apparatus of claim 11, wherein the apparatus further comprises:
a survival analysis function construction module, configured to: for each sample object, take the number of consecutive days in a target time period during which the sample object exhibits no abnormal behavior as the survival days of the sample object; the target time period is the period from the start time at which the sample object generated a resource transfer record to the end time at which the sample data of the sample object was acquired; determine the value of a survival tag of the sample object according to whether the sample object has exhibited abnormal behavior as of the end time; and construct a survival analysis function for determining the survival rate of the sample objects in a space based on the sum of the survival tags and the sum of the survival days of the sample objects in the space.
14. The apparatus of claim 8, wherein the model building module is further to: for each subspace, acquiring an abnormal time point of a sample object with abnormal behavior in the subspace, and counting the number of the sample objects with abnormal behavior at each abnormal time point; fitting the trend of the number of the sample objects with abnormal behaviors along with the time to obtain a risk prediction function of the sample objects in the subspace.
15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
16. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
17. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311075823.6A CN117077028A (en) | 2023-08-24 | 2023-08-24 | Method, device, computer equipment and storage medium for constructing risk prediction model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311075823.6A CN117077028A (en) | 2023-08-24 | 2023-08-24 | Method, device, computer equipment and storage medium for constructing risk prediction model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117077028A true CN117077028A (en) | 2023-11-17 |
Family
ID=88709431
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311075823.6A Pending CN117077028A (en) | 2023-08-24 | 2023-08-24 | Method, device, computer equipment and storage medium for constructing risk prediction model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117077028A (en) |
-
2023
- 2023-08-24 CN CN202311075823.6A patent/CN117077028A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115564152A (en) | Carbon emission prediction method and device based on STIRPAT model | |
CN114446019A (en) | Alarm information processing method, device, equipment, storage medium and product | |
CN116894721A (en) | Index prediction method and device and computer equipment | |
CN116191398A (en) | Load prediction method, load prediction device, computer equipment and storage medium | |
CN116861070A (en) | Recommendation model processing method, device, computer equipment and storage medium | |
CN117077028A (en) | Method, device, computer equipment and storage medium for constructing risk prediction model | |
CN111737319B (en) | User cluster prediction method, device, computer equipment and storage medium | |
CN112926803A (en) | Client deposit loss condition prediction method and device based on LSTM network | |
CN111581068A (en) | Terminal workload calculation method and device, storage medium, terminal and cloud service system | |
CN116681203A (en) | Enterprise management consultation method and system based on big data analysis | |
CN118656588A (en) | Data processing method and related device | |
CN117132335A (en) | Method and device for determining resource transaction value in resource contract and computer equipment | |
CN116681164A (en) | Resource information processing method, device, computer equipment and storage medium | |
CN118132091A (en) | Service model processing method, device, computer equipment and storage medium | |
CN117459576A (en) | Data pushing method and device based on edge calculation and computer equipment | |
CN117151884A (en) | Asset management data processing method, device, computer equipment and storage medium | |
CN117828327A (en) | Method and device for constructing safety early warning model of power system and computer equipment | |
CN118764397A (en) | Training method, device, equipment and storage medium of radiation intensity detection model | |
Xiao et al. | Remaining Useful Life Prediction Based on Forward Intensity | |
CN115907969A (en) | Account risk assessment method and device, computer equipment and storage medium | |
CN118446806A (en) | Trusted data processing method, trusted data processing device, computer equipment and storage medium | |
CN117853217A (en) | Financial default rate prediction method, device and equipment for protecting data privacy | |
CN116861273A (en) | Partition parameter determining method, apparatus, computer device and storage medium | |
CN118094411A (en) | Method for identifying abnormal user, electronic device and storage medium | |
CN117575772A (en) | Abnormal user detection method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||