CN111242358A - Enterprise information loss prediction method with double-layer structure - Google Patents


Info

Publication number
CN111242358A
CN111242358A
Authority
CN
China
Prior art keywords
prediction model
data set
training
layer
evaluation index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010011877.6A
Other languages
Chinese (zh)
Inventor
陈海峰
杨冬豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Cezhitong Technology Co Ltd
Original Assignee
Hangzhou Cezhitong Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Cezhitong Technology Co Ltd filed Critical Hangzhou Cezhitong Technology Co Ltd
Priority to CN202010011877.6A priority Critical patent/CN111242358A/en
Publication of CN111242358A publication Critical patent/CN111242358A/en
Pending legal-status Critical Current

Classifications

    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem" (G PHYSICS · G06 COMPUTING; CALCULATING OR COUNTING · G06Q ICT specially adapted for administrative, commercial, financial, managerial or supervisory purposes · G06Q10/00 Administration; Management)
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (G06F ELECTRIC DIGITAL DATA PROCESSING · G06F18/00 Pattern recognition · G06F18/20 Analysing · G06F18/21 Design or setup of recognition systems or techniques)
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches (G06F18/24 Classification techniques)
    • G06F18/25 Fusion techniques
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis (G06Q10/06 Resources, workflows, human or project management · G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations)

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Development Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an enterprise information loss prediction method with a double-layer structure: the system acquires a data set and divides it into a training set and a test set; the training set then undergoes double-layer training with XGBoost, LightGBM, AdaBoost and a weighted voting algorithm, and the evaluation indexes of the classification prediction model are output; finally, these indexes are analyzed against those of comparison objects. The double-layer fusion method and its accompanying algorithms improve the accuracy and precision of the customer churn prediction model and further refine it.

Description

Enterprise information loss prediction method with double-layer structure
Technical Field
The invention relates to the field of data processing, in particular to an enterprise information loss prediction method with a double-layer structure.
Background
Today, markets are increasingly saturated and competitive, and industry concentration keeps growing, so enterprises in every industry focus on developing novel customized services to attract new customers and convert existing customers into loyal ones. Research shows that the cost of acquiring a new customer is far higher than that of retaining an existing one, so preventing the churn of existing customers is a problem every enterprise must take seriously.
Customer churn prediction techniques are therefore important for businesses retaining existing customers and launching a variety of customized services. Take a telecom enterprise: a churned customer who no longer uses the operator's services generates no further profit, and for an operator with tens of millions of customers, cutting the churn rate by even one percent would yield a considerable gain in profit. Timely and accurate identification of potential churn customers has become a research focus across industries.
In the field of customer churn prediction, machine learning algorithms such as reinforcement learning have greatly improved model accuracy, but the gains achievable by any single algorithm are limited, so accuracy and precision remain the points where customer churn prediction models most urgently need improvement. The invention adopts a double-layer fusion structure with suitable algorithms, improving the accuracy and precision of the customer churn prediction model and further perfecting it.
Disclosure of Invention
The invention provides a method for predicting information loss of an enterprise with a double-layer structure, and aims to solve the problems of low accuracy and precision in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses an enterprise information loss prediction method with a double-layer structure, which comprises the following steps of:
acquiring a data set, and dividing the data set into a training set and a test set;
performing double-layer training on the training set using XGBoost, LightGBM, AdaBoost and a weighted voting algorithm, and outputting the evaluation indexes of a classification prediction model;
and analyzing and comparing the evaluation index of the classification prediction model with a comparison object.
The method acquires a data set and divides it into a training set and a test set; the training set then undergoes double-layer training with XGBoost, LightGBM, AdaBoost and a weighted voting algorithm, and the evaluation indexes of the classification prediction model are output; finally, these indexes are analyzed against those of a comparison object. The double-layer fusion method and its accompanying algorithms improve the accuracy and precision of the customer churn prediction model and further refine it.
Preferably, the performing double-layer training on the training set by using the XGBoost, the LightGBM, the AdaBoost and the weighted voting algorithm, and outputting an evaluation index of the classification prediction model includes:
building a classification prediction model double-layer structure, and training a training set by a first layer through a corresponding algorithm to obtain a first-layer data set;
the second layer trains the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model, wherein a calculation formula of a strong classifier in the AdaBoost algorithm is as follows:
F(x) = sign( Σ_{t=1}^{T} α_t · f_t(x) )
where x is the input vector, F(x) is the strong classifier, f_t(x) is the t-th weak classifier, α_t is the weight of the t-th weak classifier (a positive number), and T is the number of weak classifiers. Each weak classifier outputs +1 or -1, corresponding to positive and negative samples respectively.
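As an illustration of the strong-classifier formula above, the weighted vote can be computed in a few lines; the decision stumps and weights below are made-up stand-ins for trained weak classifiers, not ones produced by the method:

```python
def adaboost_predict(weak_classifiers, alphas, x):
    """Strong classifier F(x) = sign(sum over t of alpha_t * f_t(x)).

    Each weak classifier returns +1 or -1; each alpha_t is positive.
    """
    score = sum(a * f(x) for f, a in zip(weak_classifiers, alphas))
    return 1 if score >= 0 else -1

# Illustrative weak classifiers: decision stumps on a scalar input.
stumps = [lambda x: 1 if x > 0 else -1,
          lambda x: 1 if x > 2 else -1,
          lambda x: 1 if x > -1 else -1]
alphas = [0.5, 0.3, 0.8]

print(adaboost_predict(stumps, alphas, 1.0))  # weighted score 0.5 - 0.3 + 0.8 = 1.0, so +1
```

In actual AdaBoost the α_t values come out of the boosting rounds (larger for more accurate weak learners); here they are fixed by hand only to show the combination step.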
Preferably, the comparing the evaluation index of the classification prediction model with the result of the comparison object includes:
calculating an evaluation index of a comparison object;
and comparing the evaluation index of the classification prediction model with the evaluation index of the comparison object, and analyzing and comparing the results.
Preferably, the acquiring of the data set divides the data set into a training set and a test set, and outputs the corresponding evaluation set and test set through training, verification and testing.
An enterprise information loss prediction device with a double-layer structure comprises:
the acquisition module acquires a data set and divides the data set into a training set and a test set;
the computing module is used for performing double-layer training on the training set using XGBoost, LightGBM, AdaBoost and a weighted voting algorithm and outputting the evaluation indexes of the classification prediction model;
and the analysis module is used for analyzing and comparing the evaluation index of the classification prediction model with the comparison object.
Preferably, the calculation module includes:
the first layer training unit is used for building a classification prediction model double-layer structure, and the first layer trains a data set through a corresponding algorithm to obtain a first layer data set;
and the second layer training unit is used for training the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model, wherein a calculation formula of a strong classifier in the AdaBoost algorithm is as follows:
F(x) = sign( Σ_{t=1}^{T} α_t · f_t(x) )
where x is the input vector, F(x) is the strong classifier, f_t(x) is the t-th weak classifier, α_t is the weight of the t-th weak classifier (a positive number), and T is the number of weak classifiers. Each weak classifier outputs +1 or -1, corresponding to positive and negative samples respectively.
Preferably, the analysis module comprises:
a calculation unit that calculates an evaluation index of the comparison object;
and a comparison unit for comparing the evaluation index of the classification prediction model with the evaluation index of the comparison object and analyzing and comparing the results.
Preferably, the acquiring module includes:
the dividing unit is used for acquiring a data set and dividing the data set into a training set and a test set;
and the data set is trained, verified and tested, and a corresponding evaluation set and a corresponding test set are output.
An electronic device comprising a memory and a processor, the memory storing one or more computer instructions which, when executed by the processor, implement the enterprise information loss prediction method with a double-layer structure described in any of the above.
A computer-readable storage medium storing a computer program which, when executed by a computer, implements the enterprise information loss prediction method with a double-layer structure described in any of the above.
The invention has the following beneficial effects:
the system acquires a data set, divides the data set into a training set and a testing set, then performs double-layer training on the training set by utilizing XGboost, LightGBM, AdaBoost and a weighted voting algorithm, outputs an evaluation index of a classification prediction model, and finally performs result analysis and comparison on the evaluation index of the classification prediction model and a comparison object. By using a double-layer fusion method and a corresponding algorithm, the accuracy and precision of the customer loss prediction model are improved, and the customer loss prediction model is further improved.
Drawings
FIG. 1 is a first flowchart of a method for predicting loss of information in an enterprise with a two-layer structure according to an embodiment of the present invention;
FIG. 2 is a second flowchart of a method for predicting loss of information in an enterprise with a two-layer structure according to an embodiment of the present invention;
FIG. 3 is a third flowchart of a method for predicting information loss of an enterprise with a two-layer structure according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an embodiment of a method for predicting loss of information in an enterprise with a two-layer structure according to the present invention;
FIG. 5 is a schematic diagram of an enterprise information loss prediction apparatus with a two-layer structure according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computing module of an enterprise information loss prediction apparatus with a two-layer structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an analysis module of an enterprise information loss prediction apparatus with a two-layer structure according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an acquisition module of an enterprise information loss prediction apparatus with a two-layer structure according to an embodiment of the present invention;
FIG. 9 is a flowchart illustrating an embodiment of an enterprise information loss prediction apparatus with a two-layer structure according to the present invention;
fig. 10 is a schematic diagram of an electronic device implementing the enterprise information loss prediction method with a double-layer structure according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Before the technical solution of the present invention is introduced, a scenario to which the technical solution of the present invention may be applicable is exemplarily described.
The following is exemplary: training the training set is one stage of the customer churn prediction model. The training set undergoes the corresponding computation to produce the evaluation indexes of the classification prediction model, for subsequent analysis and comparison.
The training unit is indispensable to the customer churn prediction model; illustratively, the training set outputs corresponding data through the training unit for analysis and comparison, which improves the accuracy and precision of the customer churn prediction model and refines it.
Illustratively, the algorithm the training unit adopts affects the accuracy and precision of the customer churn prediction model to a certain extent; to improve both, a double-layer fusion model together with XGBoost, LightGBM, AdaBoost and a weighted voting algorithm is adopted.
In the prior art, the customer churn prediction model adopts a random forest algorithm, which can overfit on noisy classification or regression problems, leaving the model's accuracy and precision low and affecting the prediction results.
Processing with XGBoost, LightGBM, AdaBoost, the weighted voting algorithm and the double-layer fusion model can improve the accuracy and precision of the customer churn prediction model and further refine it.
Example 1
As shown in fig. 1, a method for predicting information loss of an enterprise with a double-layer structure includes the following steps:
s110, acquiring a data set, and dividing the data set into a training set and a test set;
s120, performing double-layer training on the training set by utilizing the XGboost, the LightGBM, the AdaBoost and a weighted voting algorithm, and outputting an evaluation index of a classification prediction model;
and S130, analyzing and comparing the evaluation index of the classification prediction model with the comparison object.
According to embodiment 1, the system acquires a data set and divides it into a training set and a test set, performs double-layer training on the training set with XGBoost, LightGBM, AdaBoost and a weighted voting algorithm, outputs the evaluation indexes of the classification prediction model, and finally analyzes those indexes against the comparison objects. The method improves the accuracy and precision of the customer churn prediction model and further refines it.
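Steps S110 and S130 can be sketched as follows (S120, the double-layer training itself, is detailed in embodiments 2 to 4); the split ratio and the metric values fed to the comparison are illustrative assumptions:

```python
import random

def split_dataset(dataset, test_ratio=0.2, seed=42):
    """S110: divide the data set into a training set and a test set."""
    data = dataset[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

def compare_indexes(model_metrics, baseline_metrics):
    """S130: result analysis, as the per-index difference against a comparison object."""
    return {k: round(model_metrics[k] - baseline_metrics[k], 4)
            for k in model_metrics}

data = list(range(100))
train, test = split_dataset(data)
diff = compare_indexes({"accuracy": 0.90, "precision": 0.85},
                       {"accuracy": 0.82, "precision": 0.62})
print(len(train), len(test), diff)  # 80 20 {'accuracy': 0.08, 'precision': 0.23}
```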
Example 2
As shown in fig. 2, a method for predicting information loss of an enterprise with a two-layer structure includes:
s210, acquiring a data set, and dividing the data set into a training set and a test set;
s220, building a classification prediction model double-layer structure, and training a data set through a corresponding algorithm in a first layer to obtain a first-layer data set;
s230, the second layer trains the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model, wherein a calculation formula of a strong classifier in the AdaBoost algorithm is as follows:
F(x) = sign( Σ_{t=1}^{T} α_t · f_t(x) )
where x is the input vector, F(x) is the strong classifier, f_t(x) is the t-th weak classifier, α_t is the weight of the t-th weak classifier (a positive number), and T is the number of weak classifiers. Each weak classifier outputs +1 or -1, corresponding to positive and negative samples respectively.
According to embodiment 2, a double-layer classification prediction model structure is built and the training set passes through the first and second layers in turn. The first layer trains on the training set with the corresponding algorithms to obtain a second training set, while prediction on the test set yields a second test set; the first-layer data set comprises this second training set and second test set. The prediction model of the XGBoost algorithm can be calculated as:
ŷ_i = Σ_{k=1}^{K} f_k(x_i)
where K is the total number of trees, f_k denotes the k-th tree, and ŷ_i denotes the predicted result for sample x_i.
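The additive prediction above can be illustrated with toy functions standing in for the K fitted regression trees (a sketch of the summation only, not the XGBoost library itself):

```python
def ensemble_predict(trees, x):
    """Additive ensemble prediction: y_hat_i = sum over k of f_k(x_i)."""
    return sum(f(x) for f in trees)

# Toy stand-ins for K = 2 fitted regression trees.
trees = [lambda x: 0.5 if x > 0 else -0.5,
         lambda x: 0.25 if x > 1 else -0.25]

print(ensemble_predict(trees, 2.0))  # 0.5 + 0.25 = 0.75
```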
The second layer trains the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model, wherein a calculation formula of a strong classifier in the AdaBoost algorithm is as follows:
F(x) = sign( Σ_{t=1}^{T} α_t · f_t(x) )
where x is the input vector, F(x) is the strong classifier, f_t(x) is the t-th weak classifier, α_t is the weight of the t-th weak classifier (a positive number), and T is the number of weak classifiers. Each weak classifier outputs +1 or -1, corresponding to positive and negative samples respectively.
The evaluation indexes of the classification prediction model comprise four indexes: accuracy, precision, recall and F1 value. Using the double-layer fusion method and the corresponding algorithms improves the accuracy and precision of the customer churn prediction model and further refines it.
Example 3
As shown in fig. 3, a method for predicting information loss of an enterprise with a two-layer structure includes:
s310, acquiring a data set, and dividing the data set into a training set and a test set;
s320, performing double-layer training on the training set by utilizing the XGboost, the LightGBM, the AdaBoost and a weighted voting algorithm, and outputting an evaluation index of a classification prediction model;
s330, calculating an evaluation index of a comparison object;
s340, comparing the evaluation index of the classification prediction model with the evaluation index of the comparison object, and analyzing and comparing the results;
the calculation of the evaluation index of the comparison object mentioned in example 3 is merely exemplary and is not a limitation on the calculation of the evaluation index of the comparison object. And calculating the evaluation indexes of the MLP, the MLP fused with the self-encoder, the MLP embedded with the fusion entity, the KNN, the Logistic regression and the Bagging comparison objects, and comparing the evaluation indexes with the evaluation indexes of the classification prediction model. The accuracy and precision of the customer churn prediction model are improved, and the customer churn prediction model can be well played in the application of binary prediction such as credit assessment, disaster prediction and the like.
Example 4
As shown in fig. 4, one specific embodiment may be:
s410, acquiring a data set, and dividing the data set into a training set and a test set;
the data set was divided into training set and test set (assuming training set is 999 pieces of data and test set is 210 pieces of data), then single base classifier 1 in the primary classification model was 3-fold cross validated using 666 pieces in training set as feeding set and the remaining 333 pieces as validation set.
S420, building a classification prediction model double-layer structure, and training a data set by a first layer through a corresponding algorithm to obtain a first-layer data set;
the first Stacking layer trains the dataset through XGboost, LightGBM and AdaBoost algorithms. 666 pieces of data are used for training a model for each verification, 333 pieces of data are obtained by verifying a verification set through the trained model, and meanwhile, 210 pieces of data are obtained by predicting a test set. Thus, after 3 cross-tests, new features, namely 3 × 333 predictors and 3 × 210 predictors of the test data set, were obtained.
The 3 x 333 predictions are then spliced into a 999 row by 1 column matrix labeled training data set a 1. And the predicted results for the test data set of 3 x 210 rows are weighted averaged to obtain a matrix of 210 rows and 1 column, test data set B1. This is the prediction result of a single basic classifier on the data set, and if two basic classifiers, such as basic classifier 2 and basic classifier 3, are further integrated, a total of six matrices, a1, a2, A3, B1, B2 and B3, will be obtained.
Finally, a matrix of A1, A2 and A3 are combined together to form 999 rows and 3 columns as a second training data set, a matrix of B1, B2 and B3 are combined together to form 210 rows and 3 columns as a second test data set, the first layer data set comprises the second training data set and the second test data set, and the secondary classification model is retrained based on the first layer data set.
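The out-of-fold procedure just described (999 training records in 3 folds of 333, 210 test records, per-fold test predictions averaged into B) can be sketched for a single base classifier; the mean-label "learner" below is a deliberately trivial stand-in for XGBoost, LightGBM or AdaBoost:

```python
import random

def stacking_features(train_X, train_y, test_X, fit, n_folds=3, seed=0):
    """Out-of-fold stacking features for one base classifier.

    Returns (A, B): A holds one out-of-fold prediction per training row
    (the 999 x 1 matrix in the text); B averages the per-fold test-set
    predictions (the 210 x 1 matrix). `fit(X, y)` returns a predict function.
    """
    n = len(train_X)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    A = [0.0] * n
    test_sums = [0.0] * len(test_X)
    for f in range(n_folds):
        val = set(folds[f])
        fit_idx = [i for i in idx if i not in val]
        model = fit([train_X[i] for i in fit_idx], [train_y[i] for i in fit_idx])
        for i in folds[f]:                 # predict the held-out rows
            A[i] = model(train_X[i])
        for j, x in enumerate(test_X):     # predict every test row
            test_sums[j] += model(x)
    B = [s / n_folds for s in test_sums]   # average the per-fold test predictions
    return A, B

# Deliberately trivial base learner: always predicts the mean training label.
def mean_learner(X, y):
    m = sum(y) / len(y)
    return lambda x: m

train_X = list(range(999))
train_y = [i % 2 for i in range(999)]
test_X = list(range(210))
A, B = stacking_features(train_X, train_y, test_X, mean_learner)
print(len(A), len(B))  # 999 210
```

Running the same routine for the other two base classifiers and column-stacking the results reproduces the 999 × 3 and 210 × 3 second-level data sets.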
The prediction model of the XGBoost algorithm may have a calculation formula as follows:
ŷ_i = Σ_{k=1}^{K} f_k(x_i)
where K is the total number of trees, f_k denotes the k-th tree, and ŷ_i denotes the predicted result for sample x_i.
The main techniques of the LightGBM algorithm are as follows:
the Gradient-based One-Side Sampling (GOSS) technology removes a large part of data with small Gradient, only uses the rest data to estimate information gain, and avoids the influence of a low-Gradient long tail part.
The Exclusive Feature Bundling (EFB) technique refers to Bundling mutually Exclusive features to reduce the number of features.
The histogram algorithm replaces the traditional Pre-Sorted. The basic idea is to discretize the continuous floating-point eigenvalues into k integers and construct a k-wide histogram. The method is characterized in that a discrete value is used as an index to accumulate statistics in a histogram at the beginning, after data is traversed for one time, the histogram accumulates statistics required by the discretization, and when nodes are split, the optimal dividing point can be found from the k buckets according to the discrete value on the histogram, so that the optimal dividing point can be found more quickly, and because the histogram algorithm does not need to store a Pre-ordering result like Pre-Sorted, but only stores a characteristic discrete value, the consumption of a memory can be reduced by using the histogram.
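The histogram idea can be sketched as follows; this is a simplified illustration of binning continuous feature values into k integer buckets and accumulating split-finding statistics, not LightGBM's actual implementation:

```python
def build_histogram(values, gradients, k):
    """Discretize float feature values into k equal-width bins and
    accumulate per-bin gradient sums and counts (the split-finding stats)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against all-equal values
    grad_sum = [0.0] * k
    count = [0] * k
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), k - 1)
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count

# Illustrative feature values and per-sample gradients.
values = [0.1, 0.4, 0.35, 0.8, 0.95]
grads = [1.0, -0.5, 0.2, 0.3, -0.1]
gs, cnt = build_histogram(values, grads, 4)
print(cnt)  # per-bin sample counts, summing to 5
```

A split finder would then scan only the k bins (here 4) instead of all sorted values, which is what makes the histogram approach fast.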
And S430, the second layer trains the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model.
After the second Voting layer is built, the accuracies of the 3 selected base classifiers under different weightings are compared, and the base-classifier weights are finally set to {AdaBoost: 1, XGBoost: 1, LightGBM: 2}; a base classifier with higher accuracy may receive a larger weight value. The class with the highest probability after calculation determines the final judgment for the sample. The Voting layer trains the first-layer data set with the weighted voting algorithm to obtain the evaluation indexes of the classification prediction model, which generally use four indexes: accuracy, precision, recall and F1 value.
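Using the weights {AdaBoost: 1, XGBoost: 1, LightGBM: 2} determined above, the weighted vote over class probabilities can be sketched as follows; the per-model probabilities are hypothetical:

```python
def weighted_vote(probs_per_model, weights):
    """Combine per-model class-probability vectors by weighted averaging
    and return the index of the class with the highest combined probability."""
    n_classes = len(next(iter(probs_per_model.values())))
    total_w = sum(weights.values())
    combined = [0.0] * n_classes
    for name, probs in probs_per_model.items():
        w = weights[name]
        for c, p in enumerate(probs):
            combined[c] += w * p / total_w
    return max(range(n_classes), key=lambda c: combined[c])

weights = {"AdaBoost": 1, "XGBoost": 1, "LightGBM": 2}
# Hypothetical per-model probabilities for classes [retained, churned].
probs = {"AdaBoost": [0.6, 0.4],
         "XGBoost": [0.55, 0.45],
         "LightGBM": [0.3, 0.7]}
print(weighted_vote(probs, weights))  # 1: LightGBM's doubled weight tips the vote to "churned"
```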
The calculation formula of the strong classifier in the AdaBoost algorithm is as follows:
F(x) = sign( Σ_{t=1}^{T} α_t · f_t(x) )
where x is the input vector, F(x) is the strong classifier, f_t(x) is the t-th weak classifier, α_t is the weight of the t-th weak classifier (a positive number), and T is the number of weak classifiers. Each weak classifier outputs +1 or -1, corresponding to positive and negative samples respectively.
S440, calculating an evaluation index of a comparison object;
using the formula:
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 × precision × recall / (precision + recall)
wherein TP is the number of samples correctly classified as churned customers; TN is the number of samples correctly classified as non-churned customers; FP is the number of samples wrongly classified as churned customers; and FN is the number of samples wrongly classified as non-churned customers. The evaluation indexes are then calculated for the comparison objects MLP, MLP fused with an autoencoder, MLP with entity embedding, KNN, logistic regression and Bagging.
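The four formulas map directly onto code; the confusion-matrix counts below are made-up numbers used only to exercise the formulas:

```python
def evaluation_indexes(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from the confusion counts,
    with churned customers as the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts: 80 churners caught, 90 non-churners kept,
# 20 false alarms, 10 churners missed.
acc, prec, rec, f1 = evaluation_indexes(tp=80, tn=90, fp=20, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))  # 0.85 0.8 0.889 0.842
```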
S450, comparing the evaluation indexes of the classification prediction model with those of the comparison objects, and analyzing the results;
the model presented here performs well on the published data set used for the experiments. 5 strong models based on the tree model are fused, meanwhile, the problems of dimensionality disaster and data sparseness are avoided, the relevance among the characteristics is guaranteed, the accuracy and the precision are greatly improved under the improvement of the time complexity in an acceptable range, the accuracy is averagely higher than 8.81% compared with other selected customer loss prediction models, and the accuracy is higher than 1.7% compared with two improved models based on MLP. In terms of precision and recall rate, although the recall rate of the model is generally realized, the precision rate is improved by about 23%. In comprehensive comparison, the performance of the model provided by the method is superior to that of various models in comparison experiments. Can be well played in the application of binary prediction such as credit assessment, disaster prediction and the like.
The method acquires a data set and divides it into a training set and a test set; the training set then undergoes double-layer training with XGBoost, LightGBM, AdaBoost and a weighted voting algorithm, and the evaluation indexes of the classification prediction model are output; finally, these indexes are analyzed against those of the comparison objects. The double-layer fusion method and its accompanying algorithms improve the accuracy and precision of the customer churn prediction model and further refine it.
Example 5
As shown in fig. 5, an enterprise information loss prediction apparatus with a two-layer structure includes:
the acquisition module 10 acquires a data set, and divides the data set into a training set and a test set;
the calculation module 20 performs double-layer training on the training set by using the XGBoost, the LightGBM, the AdaBoost and the weighted voting algorithm, and outputs an evaluation index of the classification prediction model;
and the analysis module 30 is used for analyzing and comparing the evaluation index of the classification prediction model with the result of the comparison object.
One embodiment of the above apparatus may be: the obtaining module 10 obtains a data set, divides the data set into a training set and a testing set, the calculating module 20 performs double-layer training on the training set by using XGBoost, LightGBM, AdaBoost and a weighted voting algorithm, outputs an evaluation index of a classification prediction model, and finally, the analyzing module 30 performs result analysis and comparison on the evaluation index of the classification prediction model and a comparison object.
Example 6
As shown in fig. 6, the calculation module 20 of an enterprise information loss prediction apparatus with a double-layer structure includes:
the first-layer training unit 22, which builds the double-layer structure of the classification prediction model; the first layer trains on the training set through the corresponding algorithms to obtain the first-layer data set;
and the second-layer training unit 24, which trains on the first-layer data set through the corresponding algorithm to obtain the evaluation indexes of the classification prediction model.
One embodiment of the calculation module 20 of the above apparatus may be: the first-layer training unit 22 builds the double-layer structure of the classification prediction model, and the training set passes through the first and second layers in turn. The first layer trains on the training set through the corresponding algorithms to obtain a second training set, while prediction on the test set yields a second test set; the first-layer data set comprises this second training set and second test set. The calculation formula of the prediction model of the XGBoost algorithm may be:
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)

where K is the total number of trees, f_k denotes the k-th tree, and \hat{y}_i denotes the predicted result for sample x_i.
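This additive prediction can be sketched in plain Python. The following is a toy illustration only, not the actual XGBoost implementation: each "tree" is reduced to an invented depth-1 stump, and all parameter values are made up for the example.

```python
# XGBoost predicts by summing the outputs of K regression trees:
#   y_hat_i = sum_{k=1..K} f_k(x_i)

def make_stump(feature, threshold, left, right):
    """Return f_k: a depth-1 tree that outputs `left` or `right` for a sample."""
    return lambda x: left if x[feature] < threshold else right

# Hypothetical ensemble of K = 3 stumps.
trees = [
    make_stump(0, 0.5, -0.1, 0.3),
    make_stump(1, 1.0, 0.2, -0.2),
    make_stump(0, 2.0, 0.05, 0.4),
]

def predict(x, trees):
    # y_hat is the sum of all K tree outputs for sample x.
    return sum(f(x) for f in trees)

print(round(predict([0.2, 1.5], trees), 6))  # -0.1 + (-0.2) + 0.05 = -0.25
```

In the real algorithm each f_k is fitted to the negative gradient of the loss given the previous trees; only the additive prediction form is shown here.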
The second-layer training unit 24 trains on the first-layer data set through the corresponding algorithm, where the calculation formula of the strong classifier in the AdaBoost algorithm is:

F(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)

where x is the input vector, F(x) is the strong classifier, f_t(x) is a weak classifier, \alpha_t is the weight of the weak classifier (a positive number), and T is the number of weak classifiers. The output value of each weak classifier is +1 or -1, corresponding to positive and negative samples respectively.
The evaluation indexes of the classification prediction model are then obtained; they comprise four indexes: accuracy, precision, recall and F1 score. In this way the accuracy and precision of the customer churn prediction model are improved and the model is further refined.
Example 7
As shown in fig. 7, an analysis module 30 of an enterprise information loss prediction apparatus having a two-layer structure includes:
the calculation unit 32, which calculates the evaluation indexes of the comparison objects;
and the comparison unit 34, which compares the evaluation indexes of the classification prediction model with those of the comparison objects and analyzes the results.
One embodiment of the analysis module 30 of the above apparatus may be: the calculation unit 32 calculates the evaluation indexes of the comparison objects, namely MLP, MLP fused with an autoencoder, MLP with fused entity embedding, KNN, Logistic Regression and Bagging; the comparison unit 34 then compares the evaluation indexes of the classification prediction model with those of the comparison objects and analyzes the results.
Example 8
As shown in fig. 8, an obtaining module 10 of an enterprise information loss prediction apparatus with a two-layer structure includes:
the dividing unit 12, which acquires a data set and divides it into a training set and a test set;
and the output unit 14, which outputs the corresponding evaluation set and test set after the data set goes through the training, verification and testing procedure.
One embodiment of the acquisition module 10 of the above apparatus may be: the dividing unit 12 acquires the data set and divides it into a training set and a test set; the output unit 14 then applies the training, verification and testing procedure and outputs the corresponding evaluation set and test set.
Example 9
As shown in fig. 9, one specific implementation may be:
s910, acquiring a data set, and dividing the data set into a training set and a test set;
The data set is divided into a training set and a test set (assume the training set contains 999 records and the test set 210 records). Base classifier 1 of the first-level classification model is then evaluated with 3-fold cross-validation: in each fold, 666 records of the training set serve as the fitting set and the remaining 333 records as the validation set.
S920, building a classification prediction model double-layer structure, and training a data set by a first layer through a corresponding algorithm to obtain a first-layer data set;
the first Stacking layer trains the dataset through XGboost, LightGBM and AdaBoost algorithms. 666 pieces of data are used for training a model for each verification, 333 pieces of data are obtained by verifying a verification set through the trained model, and meanwhile, 210 pieces of data are obtained by predicting a test set. Thus, after 3 cross-tests, new features, namely 3 × 333 predictors and 3 × 210 predictors of the test data set, were obtained.
The 3 x 333 validation predictions are then stacked into a 999-row, 1-column matrix labeled training data set A1, while the 3 x 210 test predictions are weighted-averaged into a 210-row, 1-column matrix, test data set B1. This is the prediction result of a single base classifier on the data set; if two further base classifiers, say base classifier 2 and base classifier 3, are integrated in the same way, six matrices are obtained in total: A1, A2, A3, B1, B2 and B3.
Finally, A1, A2 and A3 are combined into a 999-row, 3-column matrix serving as the second training data set, and B1, B2 and B3 are combined into a 210-row, 3-column matrix serving as the second test data set. The first-layer data set comprises this second training data set and second test data set, and the second-level classification model is retrained on it.
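The construction of the second training and test data sets described above can be sketched as follows. This is a simplified illustration under assumed shapes (999 training records, 210 test records, 3 folds): the stand-in `mean_predictor` is invented for the example, and real base learners such as XGBoost, LightGBM and AdaBoost would take its place.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(999, 4)), rng.integers(0, 2, size=999)
X_test = rng.normal(size=(210, 4))

def mean_predictor(X_fit, y_fit, X_eval):
    """Stand-in base classifier: predicts the mean training label for every sample."""
    return np.full(len(X_eval), y_fit.mean())

def stacking_features(base_learners, X_train, y_train, X_test, n_folds=3):
    folds = np.array_split(np.arange(len(X_train)), n_folds)
    A = np.zeros((len(X_train), len(base_learners)))  # columns A1, A2, A3
    B = np.zeros((len(X_test), len(base_learners)))   # columns B1, B2, B3
    for j, learner in enumerate(base_learners):
        test_preds = []
        for k in range(n_folds):
            val_idx = folds[k]                                    # 333 validation rows
            fit_idx = np.concatenate(
                [folds[m] for m in range(n_folds) if m != k])     # 666 fitting rows
            # Out-of-fold predictions fill this learner's column of A.
            A[val_idx, j] = learner(X_train[fit_idx], y_train[fit_idx], X_train[val_idx])
            # Each fold also predicts the full test set.
            test_preds.append(learner(X_train[fit_idx], y_train[fit_idx], X_test))
        # Average the 3 x 210 test predictions into one column of B.
        B[:, j] = np.mean(test_preds, axis=0)
    return A, B

A, B = stacking_features([mean_predictor] * 3, X_train, y_train, X_test)
print(A.shape, B.shape)  # (999, 3) (210, 3)
```

The second-level model is then fitted on A (with the original labels) and evaluated on B, matching the retraining step in the text.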
The prediction model of the XGBoost algorithm may have a calculation formula as follows:
Figure BDA0002357430360000151
where K is the total number of trees, fkA (k) th tree is shown,
Figure BDA0002357430360000152
represents a sample xiThe predicted result of (1).
The main techniques of the LightGBM algorithm are as follows:
The Gradient-based One-Side Sampling (GOSS) technique removes a large proportion of the data with small gradients and uses only the remaining data to estimate the information gain, avoiding the influence of the low-gradient long tail.
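A minimal sketch of the GOSS sampling idea follows. The parameter names `a` and `b` and all numbers are assumptions for illustration; the real LightGBM implementation differs in detail.

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Keep the top-a fraction of instances by |gradient|, randomly sample a
    b fraction of the rest, and up-weight the sampled small-gradient
    instances by (1 - a) / b to keep the gradient sum approximately unbiased."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # indices sorted by descending |gradient|
    n_top = int(a * n)
    top_idx = order[:n_top]                  # large-gradient instances are all kept
    rest = order[n_top:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top_idx, sampled])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - a) / b            # compensate for the dropped low-gradient data
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))  # 300 instances retained out of 1000
```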
The Exclusive Feature Bundling (EFB) technique refers to Bundling mutually Exclusive features to reduce the number of features.
The histogram algorithm replaces the traditional pre-sorted approach. Its basic idea is to discretize continuous floating-point feature values into k integers and construct a histogram of width k. The discrete values are used as indexes to accumulate statistics in the histogram; after one pass over the data, the histogram holds the statistics needed, and when a node is split, the optimal split point can be found among the k buckets from the discrete values on the histogram. This makes finding the optimal split point faster, and because the histogram algorithm stores only the discrete feature values rather than pre-sorted results, it also reduces memory consumption.
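The histogram idea can be sketched as follows for a single feature. This is a toy version: the bucket count k and the squared-gradient split gain are simplifications chosen for the example, not LightGBM's actual internals.

```python
import numpy as np

def best_split_histogram(feature, gradients, k=16):
    """Discretize one feature into k buckets, accumulate gradient sums and
    counts per bucket, then scan the k buckets for the best split point."""
    # Quantile edges discretize the continuous values into integer bucket ids.
    bins = np.quantile(feature, np.linspace(0, 1, k + 1)[1:-1])
    bucket = np.searchsorted(bins, feature)          # bucket id in [0, k)
    grad_sum = np.bincount(bucket, weights=gradients, minlength=k)
    count = np.bincount(bucket, minlength=k)

    total_g, total_n = gradients.sum(), len(feature)
    best_gain, best_bucket = -np.inf, None
    g_left, n_left = 0.0, 0
    for b in range(k - 1):                           # split between bucket b and b + 1
        g_left += grad_sum[b]
        n_left += count[b]
        n_right = total_n - n_left
        if n_left == 0 or n_right == 0:
            continue
        g_right = total_g - g_left
        # Simplified variance-reduction style gain over the two sides.
        gain = g_left**2 / n_left + g_right**2 / n_right - total_g**2 / total_n
        if gain > best_gain:
            best_gain, best_bucket = gain, b
    return best_bucket, best_gain

rng = np.random.default_rng(0)
x = rng.uniform(size=500)
g = np.where(x > 0.5, 1.0, -1.0) + rng.normal(scale=0.1, size=500)
b, gain = best_split_histogram(x, g)
print(b, gain > 0)
```

Only the k buckets are scanned for the split, rather than every sorted feature value, which is the source of the speed and memory savings described above.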
S930, the second layer trains the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model.
After the second (Voting) layer is built, the accuracies of the 3 selected base classifiers under different weights are compared, and the weights are finally set to {AdaBoost: 1, XGBoost: 1, LightGBM: 2}; a base classifier with higher accuracy may receive a larger weight. The final decision for a sample is the class with the highest probability after the weighted calculation. The Voting layer trains on the first-layer data set through the weighted voting algorithm to obtain the evaluation indexes of the classification prediction model; four indexes are generally used: accuracy, precision, recall and F1 score.
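The weighted voting step with weights {AdaBoost: 1, XGBoost: 1, LightGBM: 2} can be sketched as below. The per-classifier probability outputs are hypothetical values invented for the example.

```python
def weighted_vote(probas, weights):
    """probas: {classifier name: class-probability list}. Returns the class with
    the highest weighted sum of the base classifiers' probability vectors."""
    n_classes = len(next(iter(probas.values())))
    total = [0.0] * n_classes
    for name, p in probas.items():
        for c in range(n_classes):
            total[c] += weights[name] * p[c]
    return max(range(n_classes), key=lambda c: total[c])

weights = {"AdaBoost": 1, "XGBoost": 1, "LightGBM": 2}
# Hypothetical probabilities for one sample; classes: 0 = retained, 1 = churned.
probas = {"AdaBoost": [0.6, 0.4], "XGBoost": [0.55, 0.45], "LightGBM": [0.3, 0.7]}
print(weighted_vote(probas, weights))  # 1 -> predicted as a churned customer
```

Note how the doubled LightGBM weight flips the decision: with equal weights the two weaker "retained" votes would win, but here class 1 scores 2.25 against 1.75.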
The calculation formula of the strong classifier in the AdaBoost algorithm is as follows:
F(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)

where x is the input vector, F(x) is the strong classifier, f_t(x) is a weak classifier, \alpha_t is the weight of the weak classifier (a positive number), and T is the number of weak classifiers. The output value of each weak classifier is +1 or -1, corresponding to positive and negative samples respectively.
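A minimal numeric sketch of this strong-classifier formula follows; the weak classifiers and their weights are invented for the illustration rather than produced by actual AdaBoost training.

```python
def strong_classify(x, weak_classifiers, alphas):
    """F(x) = sign(sum_t alpha_t * f_t(x)), each f_t returning +1 or -1."""
    score = sum(a * f(x) for f, a in zip(weak_classifiers, alphas))
    return 1 if score >= 0 else -1

# Three toy weak classifiers on a scalar input:
weak = [
    lambda x: 1 if x > 0 else -1,
    lambda x: 1 if x > 2 else -1,
    lambda x: 1 if x > -1 else -1,
]
alphas = [0.5, 0.3, 0.9]  # positive weights, as produced by AdaBoost training

print(strong_classify(1.0, weak, alphas))   # 0.5*1 + 0.3*(-1) + 0.9*1 = 1.1 -> +1
print(strong_classify(-3.0, weak, alphas))  # all three vote -1 -> -1
```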
S940, calculating an evaluation index of a comparison object;
Using the formulas:

\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\mathrm{precision} = \frac{TP}{TP + FP}

\mathrm{recall} = \frac{TP}{TP + FN}

F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
where accuracy is the accuracy rate, precision is the precision rate, recall is the recall rate, TP is the number of samples correctly classified as churned customers, TN is the number of samples correctly classified as non-churned customers, FP is the number of samples wrongly classified as churned customers, and FN is the number of samples wrongly classified as non-churned customers. The evaluation indexes of the comparison objects are calculated: MLP, MLP fused with an autoencoder, MLP with fused entity embedding, KNN, Logistic Regression and Bagging.
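These four indexes can be computed directly from the confusion-matrix counts, as in the plain-Python sketch below; the counts themselves are hypothetical numbers chosen for the example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts on a 210-sample test set:
acc, prec, rec, f1 = classification_metrics(tp=50, tn=130, fp=10, fn=20)
print(f"acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
# acc=0.857 prec=0.833 rec=0.714 f1=0.769
```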
S950, comparing the evaluation index of the classification prediction model with the evaluation index of the comparison object, and analyzing and comparing the results;
The model presented here performs well on the public data set used in the experiments. Five strong tree-based models are fused while the problems of the curse of dimensionality and data sparsity are avoided and the correlations among features are preserved. With the increase in time complexity kept within an acceptable range, accuracy and precision are greatly improved: accuracy is on average 8.81% higher than that of the other selected customer churn prediction models, and 1.7% higher than that of the two improved MLP-based models. In terms of precision and recall, although the recall of the model is only average, its precision is improved by about 23%. In a comprehensive comparison, the proposed model outperforms the various models in the comparison experiments, and it can also perform well in binary prediction applications such as credit assessment and disaster prediction.
The method obtains a data set and divides it into a training set and a test set, then performs double-layer training on the training set using XGBoost, LightGBM, AdaBoost and a weighted voting algorithm, outputs the evaluation indexes of the classification prediction model, and finally analyzes and compares these evaluation indexes against those of the comparison objects. Through the double-layer fusion method and the corresponding algorithms, the accuracy and precision of the customer churn prediction model are improved, and the model is further refined.
Example 10
As shown in fig. 10, an electronic device includes a memory 1001 and a processor 1002, where the memory 1001 stores one or more computer instructions that are executed by the processor 1002 to implement the above enterprise information loss prediction method with a double-layer structure.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the electronic device described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
A computer-readable storage medium stores a computer program which enables a computer to execute the enterprise information loss prediction method with a double-layer structure described above.
Illustratively, the computer programs may be partitioned into one or more modules/units, which are stored in memory 1001 and executed by processor 1002 to implement the present invention. One or more modules/units may be a series of computer program instruction segments capable of performing certain functions, the instruction segments being used to describe the execution of a computer program in a computer device.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device, and may include, but is not limited to, the memory 1001 and the processor 1002. Those skilled in the art will appreciate that the present embodiment is merely an example of a computing device and does not limit it; a computing device may include more or fewer components, some components may be combined, or different components may be used. For example, the computing device may also include input/output devices, network access devices, buses and the like.
The processor 1002 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 1001 may be an internal storage unit of the computer device, such as its hard disk or memory, or an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the computer device. Further, the memory 1001 may include both an internal storage unit and an external storage device of the computer device. The memory 1001 stores the computer program as well as other programs and data required by the computer device, and may also temporarily store data that has been or is to be output.
The above description is only an embodiment of the present invention, but the technical features of the present invention are not limited thereto, and any changes or modifications within the technical field of the present invention by those skilled in the art are covered by the claims of the present invention.

Claims (10)

1. A method for predicting enterprise information loss with a double-layer structure is characterized by comprising the following steps:
acquiring a data set, and dividing the data set into a training set and a test set;
performing double-layer training on the training set by using the XGboost, the LightGBM, the AdaBoost and a weighted voting algorithm, and outputting an evaluation index of a classification prediction model;
and analyzing and comparing the evaluation index of the classification prediction model with a comparison object.
2. The method for predicting information loss of an enterprise with a double-layer structure according to claim 1, wherein the double-layer training is performed on the training set by using the XGBoost, the LightGBM, the AdaBoost and the weighted voting algorithm, and evaluation indexes of a classification prediction model are output, and the method comprises the following steps:
building a classification prediction model double-layer structure, and training a data set by a first layer through a corresponding algorithm to obtain a first-layer data set;
the second layer trains the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model, wherein a calculation formula of a strong classifier in the AdaBoost algorithm is as follows:
F(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)

where x is the input vector, F(x) is the strong classifier, f_t(x) is the weak classifier, \alpha_t is the weight of the weak classifier, which is a positive number, and T is the number of weak classifiers; the output value of the weak classifier is +1 or -1, corresponding to the positive and negative samples respectively.
3. The method of claim 2, wherein the comparing the evaluation index of the classification prediction model with the result of the comparison object comprises:
calculating an evaluation index of a comparison object;
and comparing the evaluation index of the classification prediction model with the evaluation index of the comparison object, and analyzing and comparing the results.
4. The method according to claim 3, wherein the obtaining of the data set comprises dividing the data set into a training set and a test set and outputting the corresponding evaluation set and test set through the training, verification and testing method.
5. An enterprise information loss prediction apparatus with a double-layer structure, characterized by comprising:
the acquisition module acquires a data set and divides the data set into a training set and a test set;
the computing module is used for performing double-layer training on the training set by utilizing the XGboost, the LightGBM, the AdaBoost and a weighted voting algorithm and outputting evaluation indexes of the classification prediction model;
and the analysis module is used for analyzing and comparing the evaluation index of the classification prediction model with the comparison object.
6. The apparatus of claim 5, wherein the computing module comprises:
the first layer training unit is used for building a classification prediction model double-layer structure, and the first layer trains a data set through a corresponding algorithm to obtain a first layer data set;
and the second layer training unit is used for training the first layer data set through a corresponding algorithm to obtain an evaluation index of the classification prediction model, wherein a calculation formula of a strong classifier in the AdaBoost algorithm is as follows:
F(x) = \mathrm{sign}\left( \sum_{t=1}^{T} \alpha_t f_t(x) \right)

where x is the input vector, F(x) is the strong classifier, f_t(x) is the weak classifier, \alpha_t is the weight of the weak classifier, which is a positive number, and T is the number of weak classifiers; the output value of the weak classifier is +1 or -1, corresponding to the positive and negative samples respectively.
7. The apparatus of claim 6, wherein the analysis module comprises:
a calculation unit that calculates an evaluation index of the comparison object;
and a comparison unit for comparing the evaluation index of the classification prediction model with the evaluation index of the comparison object and analyzing and comparing the results.
8. The apparatus of claim 7, wherein the obtaining module comprises:
the dividing unit is used for acquiring a data set and dividing the data set into a training set and a test set;
and an output unit, which trains, verifies and tests the data set and outputs the corresponding evaluation set and test set.
9. An electronic device comprising a memory and a processor, the memory configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the enterprise information loss prediction method with a double-layer structure according to any one of claims 1-4.
10. A computer-readable storage medium storing a computer program, wherein the computer program enables a computer to execute the enterprise information loss prediction method with a double-layer structure according to any one of claims 1-4.
CN202010011877.6A 2020-01-07 2020-01-07 Enterprise information loss prediction method with double-layer structure Pending CN111242358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010011877.6A CN111242358A (en) 2020-01-07 2020-01-07 Enterprise information loss prediction method with double-layer structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010011877.6A CN111242358A (en) 2020-01-07 2020-01-07 Enterprise information loss prediction method with double-layer structure

Publications (1)

Publication Number Publication Date
CN111242358A true CN111242358A (en) 2020-06-05

Family

ID=70876036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010011877.6A Pending CN111242358A (en) 2020-01-07 2020-01-07 Enterprise information loss prediction method with double-layer structure

Country Status (1)

Country Link
CN (1) CN111242358A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111796343A (en) * 2020-06-15 2020-10-20 兰州中心气象台(兰州干旱生态环境监测预测中心) Strong convection weather short-time approaching recognition method based on artificial intelligence algorithm
CN111931648A (en) * 2020-08-10 2020-11-13 成都思晗科技股份有限公司 Hiwari 8 waveband data-based mountain fire real-time monitoring method
CN112070535A (en) * 2020-09-03 2020-12-11 常州微亿智造科技有限公司 Electric vehicle price prediction method and device
CN112153636A (en) * 2020-10-29 2020-12-29 浙江鸿程计算机系统有限公司 Method for predicting number portability and roll-out of telecommunication industry user based on machine learning
CN112199417A (en) * 2020-09-30 2021-01-08 中国平安人寿保险股份有限公司 Data processing method, device, terminal and storage medium based on artificial intelligence
CN112330050A (en) * 2020-11-20 2021-02-05 国网辽宁省电力有限公司营口供电公司 Power system load prediction method considering multiple features based on double-layer XGboost
CN113674087A (en) * 2021-08-19 2021-11-19 工银科技有限公司 Enterprise credit rating method, apparatus, electronic device and medium
CN113827979A (en) * 2021-08-17 2021-12-24 杭州电魂网络科技股份有限公司 LightGBM-based game churn user prediction method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832581A (en) * 2017-12-15 2018-03-23 百度在线网络技术(北京)有限公司 Trend prediction method and device
CN108876034A (en) * 2018-06-13 2018-11-23 重庆邮电大学 A kind of improved Lasso+RBF neural network ensemble prediction model
CN109389143A (en) * 2018-06-19 2019-02-26 北京九章云极科技有限公司 A kind of Data Analysis Services system and method for automatic modeling
CN109544197A (en) * 2017-09-22 2019-03-29 中兴通讯股份有限公司 A kind of customer churn prediction technique and device
CN109934341A (en) * 2017-11-13 2019-06-25 埃森哲环球解决方案有限公司 The model of training, verifying and monitoring artificial intelligence and machine learning
CN110147803A (en) * 2018-02-08 2019-08-20 北大方正集团有限公司 Customer churn early-warning processing method and device
CN110322085A (en) * 2018-03-29 2019-10-11 北京九章云极科技有限公司 A kind of customer churn prediction method and apparatus
CN110472817A (en) * 2019-07-03 2019-11-19 西北大学 A kind of XGBoost of combination deep neural network integrates credit evaluation system and its method


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111796343A (en) * 2020-06-15 2020-10-20 兰州中心气象台(兰州干旱生态环境监测预测中心) Strong convection weather short-time approaching recognition method based on artificial intelligence algorithm
CN111931648A (en) * 2020-08-10 2020-11-13 成都思晗科技股份有限公司 Hiwari 8 waveband data-based mountain fire real-time monitoring method
CN111931648B (en) * 2020-08-10 2023-08-01 成都思晗科技股份有限公司 Mountain fire real-time monitoring method based on Himaware 8-band data
CN112070535A (en) * 2020-09-03 2020-12-11 常州微亿智造科技有限公司 Electric vehicle price prediction method and device
CN112199417A (en) * 2020-09-30 2021-01-08 中国平安人寿保险股份有限公司 Data processing method, device, terminal and storage medium based on artificial intelligence
CN112153636A (en) * 2020-10-29 2020-12-29 浙江鸿程计算机系统有限公司 Method for predicting number portability and roll-out of telecommunication industry user based on machine learning
CN112330050A (en) * 2020-11-20 2021-02-05 国网辽宁省电力有限公司营口供电公司 Power system load prediction method considering multiple features based on double-layer XGboost
CN113827979A (en) * 2021-08-17 2021-12-24 杭州电魂网络科技股份有限公司 LightGBM-based game churn user prediction method and system
CN113674087A (en) * 2021-08-19 2021-11-19 工银科技有限公司 Enterprise credit rating method, apparatus, electronic device and medium

Similar Documents

Publication Publication Date Title
CN111242358A (en) Enterprise information loss prediction method with double-layer structure
CN112100387B (en) Training method and device of neural network system for text classification
WO2017133188A1 (en) Method and device for determining feature set
US9218531B2 (en) Image identification apparatus, image identification method, and non-transitory computer readable medium
CN112633419A (en) Small sample learning method and device, electronic equipment and storage medium
CN113177700B (en) Risk assessment method, system, electronic equipment and storage medium
CN112883990A (en) Data classification method and device, computer storage medium and electronic equipment
US20210342707A1 (en) Data-driven techniques for model ensembles
CN112434884A (en) Method and device for establishing supplier classified portrait
Zhang et al. Feature relevance term variation for multi-label feature selection
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
CN113435499B (en) Label classification method, device, electronic equipment and storage medium
US20200342287A1 (en) Selective performance of deterministic computations for neural networks
CN112632000B (en) Log file clustering method, device, electronic equipment and readable storage medium
CN111784246A (en) Logistics path estimation method
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN113609948B (en) Method, device and equipment for detecting video time sequence action
CN117523218A (en) Label generation, training of image classification model and image classification method and device
CN111242449A (en) Enterprise information loss prediction method
CN115345248A (en) Deep learning-oriented data depolarization method and device
CN114091458A (en) Entity identification method and system based on model fusion
CN113656354A (en) Log classification method, system, computer device and readable storage medium
CN115393914A (en) Multitask model training method, device, equipment and storage medium
CN116894209B (en) Sampling point classification method, device, electronic equipment and readable storage medium
CN116383390B (en) Unstructured data storage method for management information and cloud platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200605