CN117408736A - Enterprise fund demand mining method and medium based on improved Stacking fusion algorithm

Info

Publication number: CN117408736A
Application number: CN202311296560.1A
Authority: CN
Inventors: 姜树明; 贾其辉; 刘向阳; 韩露; 张艳青
Original assignee: Shandong Credit Jinqiao Small And Medium Sized Enterprise Development Service Co ltd; Qilu University of Technology
Current assignee: Shandong Credit Jinqiao Small And Medium Sized Enterprise Development Service Co ltd; Qilu University of Technology
Priority date: 2023-10-09
Filing date: 2023-10-09
Publication date: 2024-01-16

Abstract

The invention provides an enterprise fund demand mining method and medium based on an improved Stacking fusion algorithm, belonging to the technical field of machine learning. The method comprises the following steps: acquiring basic information of the enterprises to be mined; establishing three prediction models with differentiated modeling and training them; establishing an improved Stacking model in which the three prediction models serve as the base learning models of the first layer and a kernel ridge regression model serves as the estimation model of the second layer; training the improved Stacking model on the training set data to obtain a fund demand prediction model; and inputting the test set data into the trained fund demand prediction model with the model threshold set to 0.7, so that enterprises whose prediction result is larger than 0.7 are treated as potential clients with a fund demand. The fund demand of current enterprises is thus mined by machine learning trained on the fund demand of historical enterprises.

Description

Enterprise fund demand mining method and medium based on improved Stacking fusion algorithm

Technical Field

The invention relates to an enterprise fund demand mining method and medium based on an improved Stacking fusion algorithm, and belongs to the technical field of machine learning.

Background

Stacking is a model fusion algorithm. Its basic idea is to use one model to fuse the prediction results of several single models in order to reduce the generalization error of the single models. It is an efficient ensemble method in which the predictions generated by various machine learning algorithms are used as the input of a second-layer learning algorithm, and the second-layer algorithm is trained to combine the model predictions optimally into a new set of predictions. At present there is no reliable and accurate way to mine enterprise fund demand, and the fusion effect of the Stacking algorithm can address the problem of prediction accuracy.

Disclosure of Invention

The invention aims to provide an enterprise fund demand mining method and medium based on an improved Stacking fusion algorithm, which are used for mining the fund demand situation of a current enterprise through machine learning and training on the situation of historical enterprise fund demands.

To achieve this aim, the invention adopts the following technical scheme:

Step 1: basic information of the enterprises to be mined is obtained, including enterprise business information, recruitment information, judicial risk conditions, news and public opinion, government procurement information and project detail information; the data are preprocessed and a feature data set is constructed; and the feature data set is divided into a test set and a training set;

Step 2: three prediction models are established with differentiated modeling and are trained; the prediction models are a random forest model, a LightGBM model and an XGBoost model, wherein the random forest model uses RFE-based feature screening and is trained as a single model with grid search tuning, the LightGBM model uses LightGBM-based feature screening and is trained as a single model with a Bayesian optimizer, and the XGBoost model uses XGBoost-based feature screening and is trained as a single model with the Bayesian optimizer;

Step 3: an improved Stacking model is established, in which the three prediction models serve as the base learning models of the first layer of the Stacking model and a kernel ridge regression model serves as the estimation model of the second layer of the Stacking model;

because the same prediction model yields different results when trained on different training samples, the results are weighted and summed according to the prediction accuracy of the base model, and the model parameters are determined;

the first-layer base learning models are run with five-fold cross-validation to obtain predictions on the validation sets, and the 5 prediction outputs on the validation sets are concatenated vertically to serve as the input features of the second layer; the Stacking model and a single CatBoost model serve as the additional-layer models of the improved Stacking model, and the estimation results of the models are weighted and summed; the model weights are assigned by exhaustive search: the prediction accuracy of the two additional-layer models under different weights is calculated, the weight with the highest prediction accuracy is selected as the weight coefficient of the model, and the weight coefficients of the two additional-layer models sum to 1;

Step 4: the improved Stacking model is trained on the training set data to obtain a fund demand prediction model;

Step 5: the test set data are input into the trained fund demand prediction model, the threshold of the model is set to 0.7, and enterprises whose prediction result is larger than 0.7 are treated as potential clients with a fund demand.
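The exhaustive weight assignment of step 3 and the 0.7 threshold of step 5 can be illustrated with a minimal sketch. It assumes that the Stacking model and the CatBoost model have already produced scores in [0, 1]; the function names, the 0.1 weight step and the toy scores are hypothetical and are not taken from the invention.

```python
def accuracy(scores, labels, threshold=0.7):
    """Share of samples whose thresholded score matches the 0/1 label."""
    hits = sum((s > threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def search_fusion_weight(stack_scores, cat_scores, labels, step=0.1, threshold=0.7):
    """Try w = 0, step, ..., 1 for the Stacking output and 1 - w for CatBoost.

    The two weights always sum to 1; the first weight that reaches the
    highest accuracy is returned together with that accuracy.
    """
    n_steps = round(1 / step)
    best_w, best_acc = 0.0, -1.0
    for k in range(n_steps + 1):
        w = k / n_steps
        fused = [w * s + (1 - w) * c for s, c in zip(stack_scores, cat_scores)]
        acc = accuracy(fused, labels, threshold)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc


def potential_clients(scores, threshold=0.7):
    """Indices of enterprises whose fused score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy scores for four enterprises (illustrative values only).
stack_scores = [0.9, 0.8, 0.4, 0.2]
cat_scores = [0.6, 0.9, 0.75, 0.1]
labels = [1, 1, 0, 0]

w, acc = search_fusion_weight(stack_scores, cat_scores, labels)
fused = [w * s + (1 - w) * c for s, c in zip(stack_scores, cat_scores)]
print(w, acc, potential_clients(fused))
```

In practice the weight would be chosen on held-out data rather than on the data used to report the final result; the sketch only shows the mechanics of the search.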

Preferably, the specific mode of single model training by grid search tuning is as follows:

determining the tuning parameters and setting the parameter search space, wherein the tuning parameters are the minimum number of samples contained in a leaf node, the minimum number of samples required to split a node, the maximum number of leaf nodes, the maximum depth of a decision tree, the proportion of evaluation samples, the number of classifiers and the maximum number of features;

the initial search space is (1-3) for the minimum number of samples contained in a leaf node, (6-8) for the minimum number of samples required to split a node, (None, 1, 5, 10) for the maximum number of leaf nodes, (10-15) for the maximum depth of a decision tree, and (0.5, 0.6, 0.7) for the proportion of evaluation samples;

model training: a model and an evaluator are instantiated, the model is trained by grid search over the set parameter search space, the model prediction result is obtained from the best result, and the optimal hyperparameter values of the current round are obtained from the best parameters;

adjusting the parameter search space: the search space is adjusted according to the hyperparameter values of the previous round; if a value is the maximum of its search space, the search space is shifted to larger values, otherwise it is shifted to smaller values; model training then continues and the process is iterated, recording the optimal hyperparameter values and the prediction score of each iteration, until the optimal values of all parameters are contained within the parameter space, at which point iteration stops;

and substituting the optimal solutions of all the parameters into the model.
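The iterative adjustment of the search space described above can be sketched as follows. This is one possible reading of the procedure, not the implementation of the invention: it handles integer ranges only, shifts a range when the best value sits on its edge, and stops once every best value lies inside its range. The scoring function `toy_score` stands in for training and evaluating a random forest, and the parameter names `min_samples_leaf` and `max_depth` are assumed for illustration.

```python
from itertools import product


def grid_search(score_fn, space):
    """Evaluate every combination in `space`; return (best_params, best_score)."""
    names = list(space)
    best_params, best_score = None, float("-inf")
    for values in product(*(space[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def shift_space(values, best):
    """Move an integer range up or down when the optimum sits on its edge."""
    lo, hi = min(values), max(values)
    width = max(1, hi - lo)
    if best == hi:
        return list(range(hi, hi + width + 1)), True
    if best == lo and lo > 1:
        return list(range(max(1, lo - width), lo + 1)), True
    return values, False


def iterative_grid_search(score_fn, space, max_rounds=20):
    """Repeat grid search, shifting ranges until every optimum is interior."""
    space = {name: list(values) for name, values in space.items()}
    history = []
    for _ in range(max_rounds):
        best_params, best_score = grid_search(score_fn, space)
        history.append((best_params, best_score))  # record each round
        moved = False
        for name in space:
            space[name], changed = shift_space(space[name], best_params[name])
            moved = moved or changed
        if not moved:
            break
    return best_params, best_score, history


def toy_score(params):
    """Hypothetical score that peaks at min_samples_leaf=2, max_depth=18."""
    return -(params["min_samples_leaf"] - 2) ** 2 - (params["max_depth"] - 18) ** 2


initial_space = {"min_samples_leaf": [1, 2, 3], "max_depth": list(range(10, 16))}
best, score, history = iterative_grid_search(toy_score, initial_space)
print(best, score, len(history))
```

With the toy score, the first round ends on the upper edge of the depth range (15), so the range is shifted to 15-20 and the second round finds the interior optimum.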

Preferably, the Bayesian optimizer uses the TPE algorithm as a probabilistic surrogate model and EI as an acquisition function.

Preferably, the specific mode of single model training by the Bayesian optimizer is as follows:

defining the parameter space in a dictionary form, wherein the keys of the key-value pairs are set arbitrarily and the values of the key-value pairs are hp functions; the parameters comprise the learning rate, the way of constructing decision trees, the number of leaves on each tree, the maximum depth, the regularization coefficients, the minimum number of records a leaf may hold, the minimum gain required for a split, and the proportion of data used in each iteration;

passing the hp functions to the TPE algorithm for optimization, training the fund demand prediction model on the training set data to obtain a prediction result, and updating the TPE model according to the prediction result;

selecting the most promising hyperparameter combination from the updated TPE model by using the EI acquisition function;

setting the number of iterations of the algorithm to 100, stopping the algorithm after the iterations are completed, and outputting the optimal hyperparameter combination and the optimal value of the objective function.

Preferably, the formula of the TPE algorithm is specifically as follows:

$$p(x \mid y) = \begin{cases} l(x), & y < y^{*} \\ g(x), & y \ge y^{*} \end{cases}$$

where $y$ represents the observed or measured objective function value, $y^{*}$ represents a threshold in the observation domain, $x$ represents an observation, $l(x)$ represents the density estimate formed from the observations $x$ whose objective value is less than $y^{*}$, and $g(x)$ represents the density formed from the observations $x$ whose objective value is greater than or equal to $y^{*}$.

Preferably, the specific formula of the acquisition function EI is as follows:

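$$EI_{y^{*}}(x) = \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y \mid x)\, dy = \int_{-\infty}^{y^{*}} (y^{*} - y)\, \frac{p(x \mid y)\, p(y)}{p(x)}\, dy \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}$$

wherein $\gamma$ is a quantile of the TPE algorithm used to divide $l(x)$ and $g(x)$, with a value in the range (0, 1), and $p(y)$ is the marginal probability distribution;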
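The interplay of the TPE surrogate and the EI acquisition function can be illustrated with a simplified one-dimensional sketch using only the Python standard library. It is not the optimizer used by the invention: the fixed-bandwidth Gaussian kernel density estimate, the candidate sampling scheme and all names and default values (`tpe_minimize`, `gamma=0.25`, `n_candidates=24`) are assumptions made for illustration. The sketch relies on the known property of TPE that maximizing EI is equivalent to maximizing the ratio of the "good" density to the "bad" density.

```python
import math
import random


def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from `points`."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def tpe_minimize(objective, low, high, n_iter=100, n_init=10, gamma=0.25,
                 n_candidates=24, seed=0):
    """Minimize a 1-D objective on [low, high] with a simplified TPE loop."""
    rng = random.Random(seed)
    bandwidth = (high - low) / 10
    trials = []
    for _ in range(n_init):  # random warm-up observations
        x = rng.uniform(low, high)
        trials.append((x, objective(x)))
    for _ in range(n_iter - n_init):
        trials.sort(key=lambda t: t[1])
        n_good = max(1, int(gamma * len(trials)))
        good = [x for x, _ in trials[:n_good]]   # observations below the threshold
        bad = [x for x, _ in trials[n_good:]]    # observations at or above it
        l, g = kde(good, bandwidth), kde(bad, bandwidth)
        # Draw candidates around good points and clip them to the search range.
        candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                      for _ in range(n_candidates)]
        # Maximizing l(x) / g(x) maximizes the expected improvement.
        x_next = max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))
        trials.append((x_next, objective(x_next)))
    return min(trials, key=lambda t: t[1])


best_x, best_y = tpe_minimize(lambda v: (v - 2.0) ** 2, -5.0, 5.0)
print(best_x, best_y)
```

The default of 100 iterations mirrors the iteration count stated above; the quantile `gamma` plays the role of the split between the two densities.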

Preferably, the predictions of the same prediction model trained on different training samples are weighted according to the prediction accuracy of the base model as follows:

the base model is trained on the training set to obtain a training result $R_i$, and its prediction accuracy $P_i$ is calculated against the true labels in the training set; training is repeated five times, and each training result and its corresponding prediction accuracy $P_i$ ($i = 1, \dots, 5$) are recorded;

the ratio of the prediction accuracy $P_i$ of each training run to the sum of the prediction accuracies over the five runs is used as the accuracy weight of that run, $W_i = P_i / \sum_{j=1}^{5} P_j$;

the training result $R_i$ of each base model is multiplied by its weight, and the weighted result $\sum_{i=1}^{5} W_i R_i$ is output.
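The accuracy weighting can be written out in a few lines. The sketch below assumes that each of the five training runs returns a list of numeric predictions for the same samples; the function names and the example accuracies are hypothetical.

```python
def accuracy_weights(accuracies):
    """Weight of each run = its accuracy divided by the sum of all accuracies."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def weighted_output(run_predictions, accuracies):
    """Combine per-run predictions into one accuracy-weighted prediction per sample."""
    weights = accuracy_weights(accuracies)
    n_samples = len(run_predictions[0])
    return [sum(w * run[i] for w, run in zip(weights, run_predictions))
            for i in range(n_samples)]


# Five runs scored on the same two samples (illustrative values only).
accuracies = [0.8, 0.9, 0.85, 0.75, 0.7]
runs = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]

print(accuracy_weights(accuracies))
print(weighted_output(runs, accuracies))
```

Because the weights sum to 1, the combined output stays on the same scale as the individual predictions.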

The invention has the following advantages: it exploits the strengths of the Stacking fusion algorithm and uses a multi-layer weighted-fusion Stacking algorithm to predict which enterprises currently have a demand for funds. First, in the first layer of Stacking fusion, three models that use different optimizers and different feature screening methods are fused, and, for the same learner model, the learners trained on different training samples are weighted according to their prediction accuracy. Second, second-layer Stacking fusion is carried out: the Stacking model and a single model serve as the additional-layer models of the improved Stacking model, their estimation results are weighted and summed, and potential clients with a current demand for funds are mined.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.

FIG. 1 is a schematic flow chart of the method provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of a conventional Stacking integration;

FIG. 3 is a specific training flow and a weighted calculation schematic diagram of training each base learner by using a single model according to an embodiment of the present application;

FIG. 4 is a schematic overall flow chart of multi-layer model fusion according to an embodiment of the present disclosure;

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

The enterprise fund demand mining method based on the improved Stacking fusion algorithm comprises the following steps:

The feature screening methods include an embedded method and a wrapper method (RFE); the embedded method includes LightGBM-based feature screening and XGBoost-based feature screening.

Specifically, the embedded method selects features according to the performance of the model: the algorithm itself decides which features to use, so feature selection and model training are carried out simultaneously. Models are trained with the LightGBM and XGBoost algorithms, and a weight coefficient (between 0 and 1) is obtained for each feature according to the performance of the model. The magnitude of the weight coefficient directly reflects the contribution of the feature to the model: the larger the coefficient, the more important the feature. The 40 most important features are selected based on the LightGBM model and the XGBoost model, respectively.
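The selection step of the embedded method reduces to ranking features by weight coefficient and keeping the top ones. A minimal sketch follows, assuming the raw importances have already been read from a trained model; the feature names and values are hypothetical, and the normalization to the range 0 to 1 by dividing by the total is an assumption about how the weight coefficients are scaled.

```python
def normalize(importances):
    """Scale raw importances so that they lie between 0 and 1 and sum to 1."""
    total = sum(importances.values())
    return {name: value / total for name, value in importances.items()}


def top_k_features(importances, k=40):
    """Return the k feature names with the largest weight coefficients."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:k]


# Hypothetical raw importances taken from a trained gradient boosting model.
raw = {"registered_capital": 120, "job_postings": 60, "lawsuits": 20}
weights = normalize(raw)
print(weights)
print(top_k_features(weights, k=2))
```

The default `k=40` matches the number of features retained in this embodiment.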

In the wrapper method, the algorithm used to screen features is not the algorithm used for modeling but a function dedicated to feature screening, which selects the optimal feature subset. The invention uses recursive feature elimination (RFE). The main idea of RFE is to start from an initial set of features and, in each iteration, train a model and calculate the importance of each feature. The less important features are then deleted and the resulting subset is used as the input of the next iteration; the process is repeated until the desired number of features is reached. The order in which features are eliminated gives the ranking of the features, and the invention selects the 40 most important features in the data set.
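The elimination loop of RFE can be sketched without any machine learning library. In the sketch below the model is a nearest-centroid classifier and the importance of a feature is the accuracy lost when the model is refit without it; both choices are stand-ins chosen to keep the example self-contained and are not the random forest importances used by the invention. The data are illustrative.

```python
def centroid_accuracy(rows, labels, features):
    """Fit a nearest-centroid classifier on `features`; return training accuracy."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(r[f] for r in members) / len(members) for f in features]

    def predict(row):
        return min(centroids, key=lambda c: sum(
            (row[f] - m) ** 2 for f, m in zip(features, centroids[c])))

    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def rfe(rows, labels, n_features, n_keep):
    """Drop the least important feature each round until `n_keep` remain."""
    remaining = list(range(n_features))
    eliminated = []
    while len(remaining) > n_keep:
        base = centroid_accuracy(rows, labels, remaining)
        # Importance of f = accuracy lost when the model is refit without f.
        importance = {
            f: base - centroid_accuracy(rows, labels, [g for g in remaining if g != f])
            for f in remaining
        }
        worst = min(remaining, key=importance.get)
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated


# Feature 0 separates the classes; features 1 and 2 carry no class signal.
rows = [[0.0, 5.0, 1.0], [0.2, 1.0, 9.0], [0.1, 9.0, 5.0],
        [1.0, 5.0, 9.0], [0.9, 9.0, 1.0], [1.1, 1.0, 5.0]]
labels = [0, 0, 0, 1, 1, 1]

kept, order = rfe(rows, labels, n_features=3, n_keep=1)
print(kept, order)
```

The list `order` records the elimination sequence, which corresponds to the feature ranking mentioned above (earliest eliminated means least important).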

Grid search tuning adjusts the parameters step by step within a specified parameter range, trains the learner with each parameter setting, and selects from all settings the one with the highest accuracy on the validation set. The specific mode of single model training by grid search tuning is as follows:

and substituting the optimal solutions of all the parameters into the model.

The Bayesian optimizer uses the TPE algorithm as a probabilistic surrogate model and EI as an acquisition function.

The specific mode of single model training by the Bayesian optimizer is as follows:

The TPE algorithm formula is specifically as follows:

$$p(x \mid y) = \begin{cases} l(x), & y < y^{*} \\ g(x), & y \ge y^{*} \end{cases}$$

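The specific formula of the acquisition function EI is as follows:

$$EI_{y^{*}}(x) = \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y \mid x)\, dy = \int_{-\infty}^{y^{*}} (y^{*} - y)\, \frac{p(x \mid y)\, p(y)}{p(x)}\, dy \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}$$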

The invention carries out 5-fold cross-validation on the base learning models, so the training set is divided into 5 parts. Taking the random forest as an example, the random forest is trained 5 times, each time with a different part held out as the validation set, so that each round trains on four fifths of the rows and validates on the remaining fifth. After the first training of the random forest, the output on the validation set is denoted a1 and the output on the test set is denoted b1. The procedure is carried out 5 times, yielding a1, a2, a3, a4, a5 and b1, b2, b3, b4, b5. a1, a2, a3, a4 and a5 are the outputs of the trained random forest on the validation sets; concatenating them gives A1, the prediction of the trained random forest over the complete original training set. b1, b2, b3, b4 and b5 are the outputs of the trained random forest on the test set; weighting them according to prediction accuracy gives B1, the prediction of the trained random forest on the complete original test set.

The invention uses three base models, so the above operation yields A1, A2, A3 and B1, B2, B3. A1, A2, A3 are combined as the training set of the next layer, and B1, B2, B3 form the test set of the next layer.
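The construction of A1 and B1 for one base model can be sketched as follows. The sketch uses a one-feature threshold classifier (`fit_threshold`) as a stand-in for the random forest, contiguous folds instead of shuffled ones, and the accuracy on each held-out fold as the weight of the corresponding test output; all three are assumptions made for illustration.

```python
def kfold_indices(n_rows, k=5):
    """Split row indices 0..n_rows-1 into k contiguous validation folds."""
    bounds = [round(i * n_rows / k) for i in range(k + 1)]
    return [list(range(bounds[i], bounds[i + 1])) for i in range(k)]


def out_of_fold(fit, X_train, y_train, X_test, k=5):
    """Return (A, B): stitched validation outputs and weighted test outputs."""
    A = [None] * len(X_train)
    test_outputs, accuracies = [], []
    for fold in kfold_indices(len(X_train), k):
        held_out = set(fold)
        tr = [i for i in range(len(X_train)) if i not in held_out]
        predict = fit([X_train[i] for i in tr], [y_train[i] for i in tr])
        a = [predict(X_train[i]) for i in fold]             # a1 .. a5
        for i, p in zip(fold, a):
            A[i] = p
        hits = sum((p > 0.5) == bool(y_train[i]) for i, p in zip(fold, a))
        accuracies.append(hits / len(fold))
        test_outputs.append([predict(x) for x in X_test])   # b1 .. b5
    total = sum(accuracies)
    # Fall back to equal weights if every fold scored zero accuracy.
    weights = [acc / total if total else 1 / k for acc in accuracies]
    B = [sum(w * b[j] for w, b in zip(weights, test_outputs))
         for j in range(len(X_test))]
    return A, B


def fit_threshold(X, y):
    """Stand-in base learner: cut halfway between the two class means."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 if x > cut else 0.0


X_train = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
y_train = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
X_test = [0.05, 0.95]

A1, B1 = out_of_fold(fit_threshold, X_train, y_train, X_test)
print(A1, B1)
```

Repeating the call for each of the three base models would give the columns A1, A2, A3 and B1, B2, B3 that feed the second layer.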

The specific mode for weighting the prediction accuracy of the prediction model trained by different training samples according to the base model under the same prediction model is as follows:

Example 2

Embodiments of the present disclosure provide an enterprise fund demand mining apparatus based on an improved Stacking fusion algorithm, including a processor and a memory. Optionally, the apparatus may further comprise a communication interface and a bus. The processor, the communication interface and the memory communicate with each other through the bus. The communication interface may be used for information transfer. The processor may invoke logic instructions in the memory to perform the enterprise fund demand mining method based on the improved Stacking fusion algorithm of the above embodiments.

Further, the logic instructions in the memory described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product.

The memory is used as a computer readable storage medium for storing a software program, a computer executable program, and program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor executes the program instructions/modules stored in the memory to perform the function applications and data processing, i.e., to implement the enterprise fund demand mining method based on the improved Stacking fusion algorithm in the above embodiments.

The memory may include a program storage area and a data storage area, wherein the program storage area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of the terminal device, etc. Further, the memory may include a high-speed random access memory, and may also include a nonvolatile memory.

Embodiments of the present disclosure provide a computer readable storage medium storing computer executable instructions configured to perform the above-described enterprise fund demand mining method based on an improved Stacking fusion algorithm.

The computer readable storage medium may be a transitory computer readable storage medium or a non-transitory computer readable storage medium.

Embodiments of the present disclosure may be embodied in a software product stored on a storage medium, including one or more instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of a method according to embodiments of the present disclosure. The aforementioned storage medium may be a non-transitory storage medium, including any of a variety of media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, or it may be a transitory storage medium.

Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and does not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described therein or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. An enterprise fund demand mining method based on an improved Stacking fusion algorithm is characterized by comprising the following specific steps:

2. The enterprise fund demand mining method based on the improved Stacking fusion algorithm of claim 1, wherein the specific manner of performing single model training for grid search tuning is as follows:

and substituting the optimal solutions of all the parameters into the model.

3. The enterprise fund demand mining method based on the improved Stacking fusion algorithm of claim 1, wherein the Bayesian optimizer employs the TPE algorithm as a probabilistic surrogate model and EI as an acquisition function.

4. The method for mining enterprise fund requirements based on the improved Stacking fusion algorithm of claim 3, wherein the single model training by the bayesian optimizer is specifically as follows:

5. The enterprise fund demand mining method based on the improved Stacking fusion algorithm of claim 4, wherein the TPE algorithm formula is specifically as follows:

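$$p(x \mid y) = \begin{cases} l(x), & y < y^{*} \\ g(x), & y \ge y^{*} \end{cases}$$

wherein $y$ represents the observed or measured objective function value, $y^{*}$ represents a threshold in the observation domain, $x$ represents an observation, $l(x)$ represents the density estimate formed from the observations $x$ whose objective value is less than $y^{*}$, and $g(x)$ represents the density formed from the observations $x$ whose objective value is greater than or equal to $y^{*}$.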

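6. The enterprise fund demand mining method based on the improved Stacking fusion algorithm of claim 5, wherein the specific formula of the acquisition function EI is as follows:

$$EI_{y^{*}}(x) = \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y \mid x)\, dy = \int_{-\infty}^{y^{*}} (y^{*} - y)\, \frac{p(x \mid y)\, p(y)}{p(x)}\, dy \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}$$

wherein $\gamma$ is a quantile of the TPE algorithm used to divide $l(x)$ and $g(x)$, with a value in the range (0, 1), and $p(y)$ is the marginal probability distribution.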

7. The enterprise fund demand mining method based on the improved Stacking fusion algorithm of claim 1, wherein the specific way of weighting the prediction accuracy according to the base model of the prediction model trained by different training samples under the same prediction model is as follows:

8. An enterprise fund demand mining apparatus based on an improved Stacking fusion algorithm, comprising a processor and a memory storing program instructions, characterized in that the processor is configured to execute the enterprise fund demand mining method based on an improved Stacking fusion algorithm as claimed in any one of claims 1 to 7 when running the program instructions.

9. A storage medium storing program instructions which, when executed, perform the enterprise fund demand mining method based on the improved Stacking fusion algorithm of any one of claims 1 to 7.