CN112508697A

CN112508697A - Resource recovery risk prediction method and device and electronic equipment

Info

Publication number: CN112508697A
Application number: CN202110161751.1A
Authority: CN
Inventors: 姚聪
Original assignee: Beijing Qiyu Information Technology Co Ltd
Current assignee: Beijing Qiyu Information Technology Co Ltd
Priority date: 2021-02-05
Filing date: 2021-02-05
Publication date: 2021-03-16

Abstract

The invention discloses a resource recovery risk prediction method, a resource recovery risk prediction device and electronic equipment, wherein the method comprises the following steps: acquiring user behavior data as an original sample set, and selecting an original characteristic variable from the original sample set; randomly sampling from the original sample set to obtain a new sample set; generating a binary characteristic variable through a first model based on the new sample set; respectively training a plurality of classification models of different types by adopting characteristic variables; and fusing the plurality of classification models based on the characteristic variables of the training sample set to obtain a prediction model, and inputting the characteristic variables of the test sample set into the prediction model to obtain a risk prediction sequence. Compared with a pure machine learning model, the method can keep a high-precision prediction effect in the resource recovery service, also keeps good stability and generalization, and can effectively prevent and control the resource recovery risk.

Description

Resource recovery risk prediction method and device and electronic equipment

Technical Field

The invention relates to the technical field of computer information processing, in particular to a resource recovery risk prediction method and device, electronic equipment and a computer readable medium.

Background

In internet-based applications, resources often need to be exchanged between different parties. Resources, as used herein, are any available material, information, money, time, and the like. Information resources include computing resources and various types of data resources, and the data resources include various private data in various domains. For money-related resources, also commonly referred to as financial resources, the financial service institution needs to reclaim the financial resources after the exchange period expires. Before the exchange period expires, a serious adverse change in the financial condition of the credit user is likely to affect the user's ability to perform, leading to risks such as doubtful accounts and bad debts. Therefore, to reduce the probability of such resource recovery risks, the financial service institution needs to assess the resource recovery risk of the credit user before the financial resource exchange.

At present, resource recovery risk is mainly evaluated with machine learning models, such as a scorecard model based on the logistic regression algorithm or an XGBoost model. Because the sample distributions of resource recovery services differ widely, these methods generally suffer from low evaluation accuracy and poor stability and generalization, which degrades the evaluation.

Disclosure of Invention

The invention aims to solve the technical problem of poor effect of the existing resource recovery risk assessment.

In order to solve the above technical problem, a first aspect of the present invention provides a resource recovery risk prediction method, including:

acquiring user behavior data as an original sample set, and selecting an original characteristic variable from the original sample set;

randomly sampling from the original sample set to obtain a new sample set;

generating a binary characteristic variable through a first model based on the new sample set;

respectively training a plurality of classification models of different types by adopting characteristic variables; the characteristic variables comprise the binary characteristic variables, or the characteristic variables are generated by splicing the original characteristic variables and the binary characteristic variables;

respectively acquiring characteristic variables of a training sample set and a test sample set;

fusing the classification models of different types based on the characteristic variables of the training sample set to obtain a prediction model;

and inputting the characteristic variables of the test sample set into a prediction model to obtain a risk prediction sequence.

According to a preferred embodiment of the invention, the first model is a random forest model.

According to a preferred embodiment of the present invention, the generating the binary feature variable by the first model based on the new sample set comprises:

inputting the new sample set into a random forest model, and training a tree model;

acquiring leaf node IDs of the trained tree model;

encoding the leaf node ID to construct leaf node characteristics;

and taking the leaf node characteristics as binary characteristic variables.

According to a preferred embodiment of the present invention, a new sample set is obtained by randomly sampling from the original sample set using the Bootstrap method.

According to a preferred embodiment of the invention, the classification model comprises: at least two of a random forest model, a logistic regression model, and a Fisher model.

According to a preferred embodiment of the present invention, AdaBoost is used to fuse the plurality of different types of classification models based on feature variables of a training sample set.

In order to solve the above technical problem, a second aspect of the present invention provides a resource recovery risk prediction apparatus, including:

the first acquisition module is used for acquiring user behavior data as an original sample set and selecting an original characteristic variable from the original sample set;

the sampling module is used for randomly sampling the original sample set to obtain a new sample set;

a generating module, configured to generate a binary feature variable through a first model based on the new sample set;

the training module is used for respectively training a plurality of different types of classification models by adopting the characteristic variables; the characteristic variables comprise the binary characteristic variables, or the characteristic variables are generated by splicing the original characteristic variables and the binary characteristic variables;

the second acquisition module is used for respectively acquiring the characteristic variables of the training sample set and the test sample set;

the fusion module is used for fusing the classification models of different types based on the characteristic variables of the training sample set to obtain a prediction model;

and the prediction module is used for inputting the characteristic variables of the test sample set into a prediction model to obtain a risk prediction sequence.

According to a preferred embodiment of the present invention, the generating module includes:

the sub-training module is used for inputting the new sample set into a random forest model and training a tree model;

the sub-acquisition module is used for acquiring leaf node IDs of the trained tree model;

the coding module is used for coding the leaf node ID to construct leaf node characteristics;

and the determining module is used for taking the leaf node characteristics as binary characteristic variables.

According to a preferred embodiment of the present invention, the sampling module randomly samples the original sample set using the Bootstrap method to obtain a new sample set.

According to a preferred embodiment of the present invention, the fusion module fuses the plurality of different types of classification models by using AdaBoost based on feature variables of a training sample set.

To solve the above technical problem, a third aspect of the present invention provides an electronic device, comprising:

a processor; and

a memory storing computer executable instructions that, when executed, cause the processor to perform the method described above.

To solve the above technical problems, a fourth aspect of the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs which, when executed by a processor, implement the above method.

The method constructs original characteristic variables from an original sample set, generates binary characteristic variables through a first model after random sampling, and trains a plurality of classification models of different types based on either the binary characteristic variables alone or the binary characteristic variables together with the original characteristic variables; the plurality of classification models are then fused to obtain a final prediction model. The original characteristic variables ensure the interpretability and stability of the prediction model in the resource recovery service, and the binary characteristic variables improve its generalization and stability. The number of characteristic variables used to train the classification models (the binary characteristic variables alone, or the binary characteristic variables together with the original characteristic variables) can exceed 100, so that the phenomenon of user credit in the resource recovery service can be avoided at the source and the credibility of the prediction model is improved. Compared with a pure machine learning model, the method maintains high prediction accuracy in the resource recovery service while also maintaining good stability and generalization, and can effectively prevent and control the resource recovery risk.

Drawings

In order to make the technical problems solved by the present invention, the technical means adopted and the technical effects obtained more clear, the following will describe in detail the embodiments of the present invention with reference to the accompanying drawings. It should be noted, however, that the drawings described below are only illustrations of exemplary embodiments of the invention, from which other embodiments can be derived by those skilled in the art without inventive step.

FIG. 1 is a schematic flow chart of a resource recovery risk prediction method according to the present invention;

FIG. 2 is a schematic diagram of the present invention for generating binary features;

FIG. 3 is a schematic diagram of a structural framework of a resource recovery risk prediction apparatus according to the present invention;

FIG. 4 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention;

FIG. 5 is a schematic diagram of one embodiment of a computer-readable medium of the present invention.

Detailed Description

Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The exemplary embodiments may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.

The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.

In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.

The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.

The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.

The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, and thus a repetitive description thereof may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms. That is, these terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.

Referring to fig. 1, fig. 1 is a flowchart illustrating a resource recovery risk prediction method according to the present invention. As shown in fig. 1, the method includes:

S1, acquiring user behavior data as an original sample set, and selecting an original characteristic variable from the original sample set;

The user behavior data must be acquired with user authorization and in a legal manner. The user behavior data includes: location information, communication information, device information, operator information, social information, resource-related behavior information, credit information acquired through a third-party platform, and the like. The communication information may include address book information and call record information; the device information refers to the model of the terminal device used by the user; the social information may be data in the user's social software; the resource-related behavior information mainly refers to behavior information related to financial resources, including but not limited to: registration information, resource request information, overdue resource return information, and the like. The credit information includes: general debt information, credit rating information, and the like.

After the user behavior data is acquired, data cleaning is required, which mainly includes data mapping, time variable processing, one-hot encoding, missing-value processing, parsing of special string variables, and the like, so that the data becomes an original sample set from which a model can be built.

After the original sample set is obtained, the feature variables need to be screened. Following a univariate analysis approach, a ranking of the feature variables can be obtained through methods such as IV (Information Value) analysis, correlation analysis, and hypothesis testing, and the feature variables with high predictive power are then selected according to a preset variable threshold. In addition, the variables can be further cross-analyzed, and business-related variables can be cross-transformed. For example, the city field and the multi-head (multi-lender) joint-debt field can be cross-transformed to analyze whether multi-head borrowing in a certain city follows a certain pattern, so that the cross variable is also used as an original characteristic variable in the machine learning process.
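As a non-limiting illustration of the univariate screening described above, the following Python sketch computes the Information Value of already-binned variables and keeps those above a threshold. The function names, the smoothing pseudo-count, the threshold of 0.02, and the toy data are assumptions made for this example and are not taken from the original disclosure.

```python
import math
from collections import Counter


def information_value(bins, labels, smoothing=0.5):
    """Information Value of one binned variable.

    bins      -- bin (category) of each sample
    labels    -- 1 for a bad sample, 0 for a good sample
    smoothing -- pseudo-count per bin so that empty cells do not give log(0)
                 (an assumption of this sketch)
    """
    good = Counter(b for b, y in zip(bins, labels) if y == 0)
    bad = Counter(b for b, y in zip(bins, labels) if y == 1)
    keys = set(good) | set(bad)
    total_good = sum(good.values()) + smoothing * len(keys)
    total_bad = sum(bad.values()) + smoothing * len(keys)
    iv = 0.0
    for k in keys:
        p_good = (good[k] + smoothing) / total_good
        p_bad = (bad[k] + smoothing) / total_bad
        iv += (p_good - p_bad) * math.log(p_good / p_bad)
    return iv


def select_features(columns, labels, threshold=0.02):
    """Rank variables by IV (descending) and keep those at or above the threshold."""
    ranked = sorted(
        ((information_value(values, labels), name) for name, values in columns.items()),
        reverse=True,
    )
    return [name for iv, name in ranked if iv >= threshold]


# Hypothetical toy data: "city" carries no signal, "device" does.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "city": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "device": ["x", "x", "x", "y", "y", "y", "y", "x"],
}
selected = select_features(columns, labels)
```

In this sketch a cross variable, such as the city field combined with the joint-debt field, would simply be passed in as one more binned column.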

S2, randomly sampling from the original sample set to obtain a new sample set;

In the invention, a new sample set is obtained by randomly sampling from the original sample set using the Bootstrap method. The Bootstrap method is a statistical inference method based on simulated sampling from the original data; it can be used to study the distribution characteristics of a statistic of a set of data, and is particularly suitable for problems such as interval estimation and hypothesis testing of parameters that are difficult to derive by conventional methods. The basic idea is as follows: the original data is resampled with replacement, the sample size is still n, and each observation unit in the original data has the same probability 1/n of being drawn each time; the resulting sample is called a Bootstrap sample. An estimate $\hat{\theta}^*$ of the parameter $\theta$ is then obtained from the Bootstrap sample, and this is repeated several times.

Let the original sample set $X = [x_1, x_2, \ldots, x_n]$ consist of independent and identically distributed samples $x_i \sim F(x)$, $i = 1, 2, \ldots, n$. $R(X, F)$ is some preselected random variable that is a function of $X$ and $F$. The task is to estimate the distribution characteristics of $R(X, F)$ from the original sample set $[x_1, x_2, \ldots, x_n]$. For example, let $\theta = \theta(F)$ be a certain parameter of the overall distribution $F$, let $F_n$ be the empirical distribution function of the original sample set $[x_1, x_2, \ldots, x_n]$, and let $\hat{\theta} = \hat{\theta}(F_n)$ be an estimate of $\theta$, noting the estimation error as:

$$T_n = \hat{\theta}(F_n) - \theta(F)$$

The distribution characteristics of $R(X, F)$ are now estimated from the original sample set $X = [x_1, x_2, \ldots, x_n]$. The essence of Bootstrap is a resampling process, and the basic steps for calculating the distribution characteristics of $R(X, F)$ are as follows:

S21, constructing an empirical distribution function $F_n$ according to the original sample set $X = [x_1, x_2, \ldots, x_n]$;

S22, extracting a sample $X^* = [x_1^*, x_2^*, \ldots, x_n^*]$ from the empirical distribution function $F_n$, which is called a Bootstrap sample;

S23, calculating the corresponding Bootstrap statistic $R^*(X^*, F_n)$, whose expression is:

$$R_n = \hat{\theta}(F_n^*) - \hat{\theta}(F_n)$$

wherein $F_n^*$ is the empirical distribution function of the Bootstrap sample, and $R_n$ is the Bootstrap statistic for $T_n$;

S24, repeating the steps S22-S23 N times to obtain N possible values of the Bootstrap statistic $R^*(X^*, F_n)$;

S25, approximating the distribution of $R(X, F)$ by the distribution of $R^*(X^*, F_n)$, i.e. approximating the distribution of $T_n$ by the distribution of $R_n$, obtaining N possible values of the parameter $\theta(F)$, and then calculating the distribution of the parameter $\theta$ and its characteristic values.
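As a non-limiting illustration of steps S21 to S25, the following Python sketch draws Bootstrap samples with replacement and collects the values of the Bootstrap statistic. The function names, the choice of the sample mean as the estimator, the number of rounds, and the seed are assumptions made for this example.

```python
import random
import statistics


def bootstrap_sample(sample, rng):
    """Draw n units with replacement; each unit has probability 1/n per draw (S22)."""
    n = len(sample)
    return [sample[rng.randrange(n)] for _ in range(n)]


def bootstrap_statistics(sample, estimator, n_rounds=1000, seed=0):
    """Return the estimate on the original sample and N values of the Bootstrap statistic."""
    rng = random.Random(seed)
    theta_hat = estimator(sample)  # estimate computed on the original sample set
    r_values = []
    for _ in range(n_rounds):  # S24: repeat S22-S23 N times
        resample = bootstrap_sample(sample, rng)
        r_values.append(estimator(resample) - theta_hat)  # S23: Bootstrap statistic
    return theta_hat, r_values


# Hypothetical data: the estimator here is the sample mean.
data = list(range(1, 11))
theta_hat, r_values = bootstrap_statistics(data, statistics.mean, n_rounds=500, seed=42)
# S25: the spread of r_values approximates the spread of the estimation error.
spread = statistics.stdev(r_values)
```

When only a new sample set is needed, as in step S2, a single call to `bootstrap_sample` is sufficient in this sketch.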

S3, generating a binary characteristic variable through a first model based on the new sample set;

In the invention, the first model can be any classification model, and is preferably a random forest model. A random forest model is a classifier comprising a plurality of decision trees, and its output class is the mode of the classes output by the individual trees.

Specifically, the method comprises the following steps:

S31, inputting the new sample set into a random forest model, and training a tree model;

S32, acquiring leaf node IDs of the trained tree model;

S33, encoding the leaf node IDs to construct leaf node characteristics;

and S34, taking the leaf node characteristics as binary characteristic variables.

For example, as shown in fig. 2, the random forest classifier provided by the Python library scikit-learn (sklearn) may be used as the trainer. A tree model is trained on the new sample set Y by means of Tree Splits, and Transformed Features are constructed using the trained tree model: the application interface provided by the random forest classifier is used to obtain the IDs of the leaf nodes, a one-hot encoder then performs one-hot encoding on the leaf node IDs to construct the leaf node features $W_i$, and the leaf node features $W_i$ are used as the binary characteristic variables, where $i = 1, \ldots, m$ and m is the total dimension of the leaf node features. One-hot encoding encodes N states with an N-bit state register: each state has its own register bit, and only one bit is valid at any time.
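As a non-limiting illustration of steps S32 to S34, the following Python sketch one-hot encodes a matrix of leaf node IDs into binary characteristic variables. The leaf IDs below are hand-written, hypothetical values; in scikit-learn such a matrix (one row per sample, one column per tree) is what `RandomForestClassifier.apply` returns, and `OneHotEncoder` could perform the same encoding. The function name `one_hot_leaves` is an assumption of this example.

```python
def one_hot_leaves(leaf_ids):
    """One-hot encode leaf node IDs.

    leaf_ids[s][t] is the ID of the leaf that sample s reaches in tree t.
    Returns one binary feature vector per sample.
    """
    n_trees = len(leaf_ids[0])
    # Distinct leaf IDs seen in each tree, in sorted order.
    vocab = [sorted({row[t] for row in leaf_ids}) for t in range(n_trees)]
    # Column offset of each tree inside the concatenated feature vector.
    offsets, total = [], 0
    for ids in vocab:
        offsets.append(total)
        total += len(ids)
    index = [
        {leaf: offsets[t] + k for k, leaf in enumerate(ids)}
        for t, ids in enumerate(vocab)
    ]
    features = []
    for row in leaf_ids:
        w = [0] * total
        for t, leaf in enumerate(row):
            w[index[t][leaf]] = 1  # exactly one active bit per tree
        features.append(w)
    return features


# Hypothetical leaf IDs for 3 samples and 2 trees.
leaf_ids = [[2, 5], [3, 5], [2, 6]]
W = one_hot_leaves(leaf_ids)
```

Each row of `W` contains exactly one active bit per tree, so the total dimension m equals the number of distinct leaves over all trees.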

S4, respectively training a plurality of classification models of different types by using the characteristic variables;

The characteristic variables may comprise the binary characteristic variables alone. For example, 254-dimensional leaf node features can be obtained through step S3 based on the new sample set, and the 254-dimensional leaf node features $W_i$ shown in fig. 2 can then be input directly into the classification model G as the characteristic variables. Alternatively, the characteristic variables are generated by splicing the original characteristic variables and the binary characteristic variables. For example, 254-dimensional leaf node features can be obtained through step S3 based on the new sample set and combined with the 115-dimensional original characteristic variables extracted in step S1, so that all 369 dimensions are input into the classification model G as the characteristic variables.

In the present invention, the classification models include at least two of a random forest model, a logistic regression model, and a Fisher model. Before training, the characteristic variables can be further screened by analyzing their collinearity and correlation, so as to improve the discriminating capability of the classification models.
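As a non-limiting illustration of the splicing option, the following Python sketch concatenates, sample by sample, the original characteristic variables with the binary characteristic variables. The function name `splice_features` and the all-zero placeholder data are assumptions made for this example; only the dimensions 115, 254 and 369 come from the text.

```python
def splice_features(original, binary):
    """Concatenate original and binary feature vectors sample by sample."""
    if len(original) != len(binary):
        raise ValueError("both feature sets must describe the same samples")
    return [list(o) + list(b) for o, b in zip(original, binary)]


# Placeholder data: 3 samples, 115 original dimensions, 254 leaf node dimensions.
original = [[0.0] * 115 for _ in range(3)]
binary = [[0] * 254 for _ in range(3)]
spliced = splice_features(original, binary)
```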

S5, respectively obtaining characteristic variables of the training sample set and the testing sample set;

the training sample may be behavior data of a historical user, and the testing sample may be behavior data of a current user. The user behavior data may be the same as the behavior data in step S1.

S6, fusing the classification models of different types based on the characteristic variables of the training sample set to obtain a prediction model;

In the invention, AdaBoost is used to fuse the classification models of different types based on the characteristic variables of the training sample set. AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine them into a stronger final classifier (strong classifier). The algorithm works by changing the data distribution: it determines the weight of each sample according to whether that sample was classified correctly in the current round and according to the overall classification accuracy of the previous round. The data set with the updated weights is passed to the next classifier for training, and the classifiers obtained in all rounds are finally fused into the final decision classifier. The specific steps are as follows:

S60, inputting a training data set $T = \{(x_i, y_i)\}$, where $x_i$ is a feature vector, $y_i \in \{1, -1\}$ is a category label, $i = 1, 2, \ldots, N$, and the number of iterations is M.

S61, initializing the weights of the training samples: $u_{1,i} = \frac{1}{N}$, $i = 1, 2, \ldots, N$;

S62, learning the m-th classification model $G_m(x)$ according to the current weights $u_{m,i}$ of the training samples;

S63, obtaining the current error rate of the m-th classification model $G_m(x)$ on the training sample set:

$$e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} u_{m,i}\, I(G_m(x_i) \neq y_i)$$

wherein P represents a probability and $I(\cdot)$ is the indicator function.

S64, calculating the weight of the m-th classification model $G_m(x)$ in the fusion model:

$$\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$$

S65, calculating a scaling factor $\exp(-\alpha_m y_i G_m(x_i))$;

S66, updating the weights of the training samples:

$$u_{m+1,i} = \frac{u_{m,i}}{Z_m} \exp(-\alpha_m y_i G_m(x_i))$$

wherein $Z_m = \sum_{i=1}^{N} u_{m,i} \exp(-\alpha_m y_i G_m(x_i))$ is the normalization factor; that is, the weights of correctly classified samples are decreased and the weights of incorrectly classified samples are increased.

S67, repeating the steps S62-S66 M times to obtain the final fused prediction model $G(x)$.
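As a non-limiting illustration of steps S60 to S67, the following Python sketch runs the weight-update loop over a list of learner types. Several points are assumptions of this example rather than part of the original disclosure: the two simple learners (a weighted decision stump and a weighted nearest-centroid rule) merely stand in for the random forest, logistic regression and Fisher models; the learner type is chosen round-robin; the final model takes the sign of the weighted vote; and the toy data are hypothetical.

```python
import math


def fit_stump(X, y, u):
    """Weighted decision stump: best (feature, threshold, sign) under weights u."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] >= thr else -sign for row in X]
                err = sum(w for w, p, t in zip(u, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda row: sign if row[j] >= thr else -sign


def fit_centroid(X, y, u):
    """Weighted nearest-centroid rule (a simple linear classifier)."""
    dims = range(len(X[0]))

    def centroid(label):
        total = sum(w for w, t in zip(u, y) if t == label)
        return [sum(w * row[j] for w, row, t in zip(u, X, y) if t == label) / total
                for j in dims]

    pos, neg = centroid(1), centroid(-1)

    def predict(row):
        d_pos = sum((row[j] - pos[j]) ** 2 for j in dims)
        d_neg = sum((row[j] - neg[j]) ** 2 for j in dims)
        return 1 if d_pos <= d_neg else -1

    return predict


def adaboost(X, y, learners, M):
    n = len(X)
    u = [1.0 / n] * n                                    # S61: uniform initial weights
    ensemble = []
    for m in range(M):                                   # S67: M rounds
        g = learners[m % len(learners)](X, y, u)         # S62: round-robin learner type
        pred = [g(row) for row in X]
        e = sum(w for w, p, t in zip(u, pred, y) if p != t)   # S63: weighted error rate
        e = min(max(e, 1e-10), 1 - 1e-10)                # guard against log(0)
        alpha = 0.5 * math.log((1 - e) / e)              # S64: model weight
        scaled = [w * math.exp(-alpha * t * p)           # S65: scaling factor
                  for w, p, t in zip(u, pred, y)]
        z = sum(scaled)                                  # normalization factor
        u = [s / z for s in scaled]                      # S66: updated weights
        ensemble.append((alpha, g))

    def fused(row):
        score = sum(a * g(row) for a, g in ensemble)
        return 1 if score >= 0 else -1

    return fused


# Hypothetical one-dimensional toy data.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
stump_model = adaboost(X, y, [fit_stump], M=3)
mixed_model = adaboost(X, y, [fit_stump, fit_centroid], M=4)
```

A learner whose weighted error rate exceeds 0.5 receives a negative weight in this sketch, so its vote is effectively inverted rather than discarded.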

And S7, inputting the characteristic variables of the test sample set into a prediction model to obtain a risk prediction sequence.

Table 1 compares the model evaluation results obtained by inputting the actual samples into the prediction model of the present invention with those obtained by inputting the actual samples into the Logistic model. Table 2 makes the same comparison between the prediction model of the present invention and the XGBoost model. The larger the K-S value, the better the model distinguishes good customers from bad customers. The AR (Accuracy Ratio) value is a relatively common index for evaluating resource recovery risk control models: AR = (area between the actual CAP curve and the random curve) / (area between the ideal CAP curve and the random curve). The CAP (Cumulative Accuracy Profile) curve measures the ability of the risk prediction model to detect risk (i.e., bad users). The larger the AR value, the more discriminative the model and the better it separates positive and negative samples. In Tables 1 and 2, the K-S and AR values are the actual values multiplied by 100.
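As a non-limiting illustration of the two indices, the following Python sketch computes the K-S value and the AR value from risk scores and labels, using the area definition of AR given above. The function name, the convention that a higher score means higher risk, and the toy data are assumptions made for this example; the values are returned on their actual scale, not multiplied by 100.

```python
def ks_and_ar(scores, labels):
    """K-S and AR for risk scores (higher = riskier) and labels (1 = bad, 0 = good)."""
    n = len(scores)
    n_bad = sum(labels)
    n_good = n - n_bad
    order = sorted(range(n), key=lambda i: -scores[i])
    cum_bad = cum_good = 0
    ks = 0.0
    area = 0.0                      # area under the actual CAP curve (trapezoids)
    prev_x = prev_y = 0.0
    i = 0
    while i < n:
        j = i
        # Samples with equal scores are processed as one group.
        while j < n and scores[order[j]] == scores[order[i]]:
            cum_bad += labels[order[j]]
            cum_good += 1 - labels[order[j]]
            j += 1
        x = j / n                   # share of the population reviewed so far
        y = cum_bad / n_bad         # share of bad users captured so far
        area += (x - prev_x) * (y + prev_y) / 2
        ks = max(ks, abs(cum_bad / n_bad - cum_good / n_good))
        prev_x, prev_y = x, y
        i = j
    bad_rate = n_bad / n
    # The random curve has area 0.5; the ideal CAP curve has area 1 - bad_rate / 2.
    ar = (area - 0.5) / ((1 - bad_rate) / 2)
    return ks, ar


# Hypothetical scores: a perfectly separating model and an uninformative one.
perfect = ks_and_ar([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
uninformative = ks_and_ar([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```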

TABLE 1 comparison of model evaluation results of prediction model and Logistic model

TABLE 2 comparison of model evaluation results for the prediction model and the XgBoost model

As can be seen from Tables 1 and 2, compared with the pure machine learning Logistic model, the prediction model has greatly improved discriminating capability, with the K-S/AR values improved by about 0.05, while the stability and interpretability of the Logistic model are carried over; compared with the pure machine learning XGBoost model, the prediction model has an absolute advantage in stability and generalization and also has better discriminating capability.

The invention forms a new prediction model by combining the feature generation technique of the random forest model, the logistic binary classification technique, and the model fusion technique of AdaBoost, so that the final risk model not only maintains a high model effect but also maintains good stability and generalization.

Fig. 3 is a schematic diagram of the architecture of a resource recovery risk prediction apparatus according to the present invention. As shown in fig. 3, the apparatus includes:

the first obtaining module 31 is configured to obtain user behavior data as an original sample set, and select an original feature variable from the original sample set;

a sampling module 32, configured to randomly sample the original sample set to obtain a new sample set;

a generating module 33, configured to generate a binary feature variable through a first model based on the new sample set;

a training module 34, configured to train a plurality of different types of classification models respectively by using the feature variables; the characteristic variables comprise the binary characteristic variables, or the characteristic variables are generated by splicing the original characteristic variables and the binary characteristic variables; wherein the classification model comprises: at least two of a random forest model, a logistic regression model, and a fisher model.

A second obtaining module 35, configured to obtain feature variables of the training sample set and the test sample set respectively;

a fusion module 36, configured to fuse the multiple classification models of different types based on the feature variables of the training sample set to obtain a prediction model;

and the prediction module 37 is configured to input the characteristic variables of the test sample set into a prediction model to obtain a risk prediction sequence.

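In the present invention, the first model is preferably a random forest model. The generating module 33 comprises:

a sub-training module, configured to input the new sample set into a random forest model and train a tree model;

a sub-acquisition module, configured to acquire leaf node IDs of the trained tree model;

a coding module, configured to encode the leaf node IDs to construct leaf node features;

and a determining module, configured to take the leaf node features as binary feature variables.

In one embodiment, the sampling module 32 randomly samples the original sample set using the Bootstrap method to obtain a new sample set.

The fusion module 36 uses AdaBoost to fuse the classification models of different types based on the characteristic variables of the training sample set.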

Those skilled in the art will appreciate that the modules in the above-described embodiments of the apparatus may be distributed as described in the apparatus, and may be correspondingly modified and distributed in one or more apparatuses other than the above-described embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.

In the following, embodiments of the electronic device of the present invention are described, which may be regarded as an implementation in physical form for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.

Fig. 4 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 4, the electronic device 400 of the exemplary embodiment is represented in the form of a general-purpose data processing device. The components of electronic device 400 may include, but are not limited to: at least one processing unit 410, at least one memory unit 420, a bus 430 connecting different electronic device components (including the memory unit 420 and the processing unit 410), a display unit 440, and the like.

The storage unit 420 stores a computer-readable program, which may be a code of a source program or a read-only program. The program may be executed by the processing unit 410 such that the processing unit 410 performs the steps of various embodiments of the present invention. For example, the processing unit 410 may perform the steps as shown in fig. 1.

The storage unit 420 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 4201 and/or a cache memory unit 4202, and may further include a read only memory unit (ROM) 4203. The storage unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 430 may be any bus representing one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 400 may also communicate with one or more external devices 300 (e.g., keyboard, display, network device, bluetooth device, etc.), enable a user to interact with the electronic device 400 via the external devices 300, and/or enable the electronic device 400 to communicate with one or more other data processing devices (e.g., router, modem, etc.). Such communication may occur via input/output (I/O) interfaces 450, and may also occur via a network adapter 460 with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network such as the Internet). The network adapter 460 may communicate with other modules of the electronic device 400 via the bus 430. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in the electronic device 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID electronics, tape drives, and data backup storage electronics, among others.

FIG. 5 is a schematic diagram of one computer-readable medium embodiment of the present invention. As shown in FIG. 5, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, causes the above-described method of the invention to be implemented, namely: acquiring user behavior data as an original sample set, and selecting an original characteristic variable from the original sample set; randomly sampling from the original sample set to obtain a new sample set; generating a binary characteristic variable through a first model based on the new sample set; respectively training a plurality of classification models of different types using characteristic variables, where the characteristic variables comprise the binary characteristic variables, or are generated by splicing the original characteristic variables and the binary characteristic variables; respectively acquiring characteristic variables of a training sample set and a test sample set; fusing the classification models of different types based on the characteristic variables of the training sample set to obtain a prediction model; and inputting the characteristic variables of the test sample set into the prediction model to obtain a risk prediction sequence.
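The sampling, binary-feature, and splicing steps listed above can be illustrated with a minimal sketch. It is not the implementation of the invention: the two hand-written toy trees stand in for a trained first model, and every name in it (bootstrap_sample, leaf_id, binary_features, TREES, LEAVES_PER_TREE) is hypothetical and chosen only for illustration.

```python
import random


def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement (bootstrap resampling)."""
    return [rng.choice(samples) for _ in samples]


def leaf_id(tree, x):
    """Walk a toy tree (nested dicts) and return the ID of the leaf x reaches."""
    node = tree
    while "leaf" not in node:
        goes_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if goes_left else node["right"]
    return node["leaf"]


def binary_features(trees, leaves_per_tree, x):
    """One-hot encode the leaf reached in each tree and concatenate the codes."""
    encoded = []
    for tree, n_leaves in zip(trees, leaves_per_tree):
        one_hot = [0] * n_leaves
        one_hot[leaf_id(tree, x)] = 1
        encoded.extend(one_hot)
    return encoded


# Assumption: two hand-written toy trees stand in for a trained first model.
TREES = [
    {"feature": 0, "threshold": 0.5,
     "left": {"leaf": 0}, "right": {"leaf": 1}},
    {"feature": 1, "threshold": 2.0,
     "left": {"leaf": 0},
     "right": {"feature": 0, "threshold": 0.8,
               "left": {"leaf": 1}, "right": {"leaf": 2}}},
]
LEAVES_PER_TREE = [2, 3]

original = [0.7, 3.0]                                   # original characteristic variables
binary = binary_features(TREES, LEAVES_PER_TREE, original)
spliced = original + binary                             # spliced characteristic variables
print(binary)   # [0, 1, 0, 1, 0]
```

In this sketch each sample activates exactly one leaf per tree, so the binary vector has one entry set to 1 for each tree. Either the binary vector alone or the spliced vector could then be passed to the downstream classification models.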

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (such as a CD-ROM, a USB disk, or a removable hard disk) or on a network, and which includes several instructions that cause a data processing device (such as a personal computer, a server, or a network device) to execute the above-mentioned method according to the present invention.

The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).

In summary, the present invention can be implemented as a method, an apparatus, an electronic device, or a computer-readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).

While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently related to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement the present invention. The invention is not to be considered as limited to the specific embodiments described; any modifications, changes, and equivalents that come within the spirit and scope of the invention are intended to be included within its scope of protection.

Claims

1. A resource recovery risk prediction method, the method comprising:

randomly sampling from the original sample set to obtain a new sample set;

2. The method of claim 1, wherein the first model is a random forest model.

3. The method of claim 2, wherein generating a binary feature variable based on the new sample set via a first model comprises:

acquiring leaf node IDs of the trained tree model;

encoding the leaf node ID to construct leaf node characteristics;

and taking the leaf node characteristics as binary characteristic variables.

4. The method of claim 3, wherein the new sample set is obtained by randomly sampling from the original sample set using bootstrap sampling.

5. The method of claim 3, wherein the classification models comprise at least two of a random forest model, a logistic regression model, and a Fisher model.

6. The method of claim 3, wherein AdaBoost is employed to fuse the plurality of different types of classification models based on feature variables of a training sample set.

7. A resource recovery risk prediction apparatus, the apparatus comprising:

8. An electronic device, comprising:

a processor; and

a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-6.

9. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-6.