CN111178675A - LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment - Google Patents
- Publication number
- CN111178675A (application CN201911232092.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- indexes
- index
- clients
- electric charge
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Abstract
The invention relates to an electric charge recycling risk prediction method, system, storage medium and computer equipment based on the LR-Bagging algorithm, which can accurately predict the default probability of arrearage from a customer's behavior track, change the passive situation of after-the-fact arrearage management, and reduce the arrearage risk of electricity customers. The method comprises the following steps: establishing an analysis target and extracting positive and negative samples; collecting indexes, selecting indexes related to the arrearage risk of the electricity customer, and deriving further indexes from them to form an index system; index preprocessing, namely screening the indexes with high predictive power to enter the model; constructing a model based on the LR-Bagging algorithm and training it on randomly selected subsets of the samples and the indexes screened during index preprocessing, wherein the termination condition of the iteration is determined by the change rate of the AUC statistic; and taking the multiple logistic regression models obtained after training as base classifiers, with the weighted average of their prediction results as the final prediction probability.
Description
Technical Field
The invention relates to the field of electric power, in particular to an electric charge recycling risk prediction method and system based on an LR-Bagging algorithm, a storage medium and computer equipment.
Background
Power supply enterprises have long followed the market rule of 'use electricity first, pay later'. This rule rests on the premise that the customer, after using the electric energy, pays for the energy consumed within the specified time limit. However, some users break this commitment by refusing or delaying payment of the electric charge, which affects the fund recovery of the power supply enterprise; failure to recover the electric charge in time severely restricts the capital turnover of power grid enterprises, which in turn affects the power supply and creates a vicious circle. Effectively preventing and avoiding the arrearage risk therefore plays an important role in the effective operation of the power enterprise. A power grid generates a large amount of data during operation; this data can be analyzed with data mining technology to extract valuable information. However, the selection of the index system still needs to be improved in connection with the actual business.
The existing arrearage risk prediction model analyzes the causes of arrearage among electric power customers, designs the key influence variables of an arrearage risk recognition model from the available data, builds a model that recognizes the possibility of customer arrearage using the theory and method of logistic regression, and predicts the default probability in advance from the latest available customer information, so that the passive situation of after-the-fact arrearage management is changed and the arrearage risk of electric power customers is reduced. However, this model requires the arrearage causes of the customer group to be identified first. In actual work these causes are varied, and it is difficult to find every cause that can influence customer arrearage, so the model is limited in its index selection.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an electric charge recycling risk prediction method, system, storage medium and computer equipment based on LR-Bagging algorithm, which can accurately predict default probability of arrearages according to the behavior track of customers, thereby changing the passive situation of after-the-fact arrearages management and achieving the purpose of reducing the arrearages risk of power customers.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows: the electric charge recycling risk prediction method based on the LR-Bagging algorithm comprises the following steps:
step 1, establishing an analysis target, and extracting appropriate positive and negative samples in proportion;
step 2, collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
step 3, index preprocessing, namely screening indexes with high forecasting power to enter a model;
step 4, constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the randomly selected subset of the samples and the indexes screened in the step 3, and determining the termination iteration condition of the model by the change rate of the AUC statistic; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
Preferably, step 3 first checks the quality of the index data through index preprocessing, then constructs derived variables by processing the original index data to obtain variables with more predictive power and interpretability, and finally screens the indexes with high predictive power into the model.
Preferably, when the indexes with high predictive power are screened in the step 3 and enter the model, the IV value is used for measuring the predictive power of the indexes;
introducing WOE (Weight of Evidence) to obtain the IV value; to perform WOE encoding on a variable, the variable is first grouped, and the WOE code of group i is calculated as follows:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding clients in group i to all responding clients in the sample, $pn_i$ is the proportion of unresponsive clients in group i to all unresponsive clients in the sample, $y_i$ is the number of responding clients in group i, $n_i$ is the number of unresponsive clients in group i, $y$ is the number of all responding clients in the sample, and $n$ is the number of all unresponsive clients in the sample;
for group i, there is a corresponding $IV_i$ value, calculated as follows:
$$IV_i = (py_i - pn_i) \times WOE_i$$
the $IV_i$ values of all groups are added to obtain the IV value of the whole variable:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
The invention relates to an electric charge recovery risk prediction system based on an LR-Bagging algorithm, which comprises:
the sample extraction module is used for establishing an analysis target and extracting appropriate positive and negative samples in proportion;
the index collection module is used for collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
the index preprocessing module is used for preprocessing the indexes and screening the indexes with high forecasting power to enter the model;
the model construction module is used for constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the subset of randomly selected samples and the screened indexes, and determining the termination iteration condition of the model by the change rate of AUC statistics; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
The storage medium of the present invention has stored thereon computer instructions which, when executed by a processor, implement the steps of the electricity charge recycling risk prediction method.
The computer device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor runs the computer program, the electric charge recycling risk prediction method is realized.
The model constructed by the invention can accurately predict the default probability of arrearage according to the behavior track of the customer, thereby changing the passive situation of after-the-fact arrearage management and achieving the purpose of reducing the arrearage risk of the power customer. Compared with the prior art, the invention has the following beneficial effects:
The invention selects the basic information, payment channel information (such as payment channel preference over half a year), payment duration, and electric quantity and electricity charge data of electricity customers to construct an index system, and establishes an improved LR-Bagging algorithm model based on feature selection. Indexes with an IV value greater than 0.02 and a correlation coefficient less than 0.6 are input into the LR-Bagging model; a subset of the samples is randomly selected to train each logistic regression model, so that the records and fields of each trained LR base classifier are obtained through random sampling; and the termination condition of the algorithm is determined by the change rate of the AUC statistic. The trained logistic regression models serve as base classifiers, and the weighted average of their prediction results is taken as the final prediction probability. All high-voltage users were analyzed and predicted, and the classification effect of the model was evaluated and verified using Accuracy, Precision, Recall and the F1 value; the results show that the algorithm effectively improves the predictive power of the model.
Drawings
Fig. 1 is a flowchart of an electric charge recycling risk prediction method based on an LR-Bagging algorithm.
Detailed Description
The present invention will be described in further detail with reference to the drawings and specific embodiments, but the present invention is not limited thereto.
Examples
As shown in fig. 1, the electric charge recycling risk prediction model method based on LR-Bagging algorithm in this embodiment mainly includes the following steps:
Step 1, establishing an analysis target and extracting samples: carrying out stratified sampling on all samples by prefecture-level city, and extracting positive and negative samples in an appropriate proportion;
in this embodiment, the high-voltage users of a certain province are selected as the analysis object, and the basic information, payment channel preference, payment duration, and electric quantity and electricity charge data of electricity customers over half a year (for example, from September 2018 to February 2019) are selected to construct an index system; 70% of the samples are randomly selected as a training set. Because target users account for only 5.9% of all samples, this embodiment performs stratified sampling by prefecture-level city when training the model, so that the ratio of target to non-target samples in the training set is 1:10; from the sampled data, 70% of the samples are randomly selected each time to train a logistic regression model, which is then used to predict on the original data.
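The sampling described in this embodiment can be sketched as follows. This is only an illustration under stated assumptions: the record layout (dicts with hypothetical `city` and `label` keys), the choice to keep every target sample and down-sample non-target samples within each city, and the function names are not taken from the source, which specifies only the 1:10 ratio and the 70% random subset.

```python
import random
from collections import defaultdict

def stratified_downsample(records, ratio=10, seed=0):
    """Per city, keep all target samples (label 1) and draw at most
    `ratio` non-target samples (label 0) for each target sample."""
    rng = random.Random(seed)
    by_city = defaultdict(lambda: {1: [], 0: []})
    for rec in records:
        by_city[rec["city"]][rec["label"]].append(rec)
    sample = []
    for city in sorted(by_city):
        pos, neg = by_city[city][1], by_city[city][0]
        k = min(len(neg), ratio * len(pos))
        sample.extend(pos)
        sample.extend(rng.sample(neg, k))  # without replacement inside a stratum
    return sample

def random_subset(sample, fraction=0.7, seed=0):
    """Randomly select a fraction of the sampled data for one training run."""
    rng = random.Random(seed)
    return rng.sample(sample, int(round(fraction * len(sample))))
```

Calling `random_subset` with a different `seed` for each base model gives each logistic regression its own 70% subset.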
Step 2, collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
in this embodiment, the archive data, electric quantity and electricity charge data, 95598 customer service data, and other electric power marketing data (such as data on contract violations, electricity theft and electric power policy) of electricity customers are extracted, and an index system is formed on the basis of a statistical analysis of the factors that influence user arrearage. Taking all high-voltage users in a certain province as an example, the basic information, payment channel information, payment duration information, and electric quantity and electricity charge information of each customer are selected as indexes, and further indexes are derived from them to form the index system.
Step 3, index pretreatment: filling missing values, replacing abnormal values, performing dimensionality reduction on indexes, and screening the indexes with high predictive power to enter a model;
in this embodiment, on the basis of the index data obtained in step 2, the quality of the index data is first checked by index preprocessing, which covers the uniqueness of user numbers, sample integrity, and the range, values, missing values and abnormal values of each variable; derived variables are then constructed, that is, the original index data is processed to obtain variables with more predictive and explanatory power, such as customer seasonal factors and regional factors; finally, the indexes with high predictive power are screened into the model.
Variable screening for the model in this step is a relatively complex process that relies on the IV value (Information Value), which measures the predictive power of an index. The IV value plays a role similar to information gain, information gain ratio and Gini impurity in feature selection. When a decision tree is built, feature importance is calculated as part of the construction process; logistic regression does not calculate feature importance while fitting, and unimportant features mixed into the model make it prone to over-fitting. Therefore, when modeling with logistic regression, the IV value is calculated first and used for feature screening. This embodiment introduces WOE (Weight of Evidence) to obtain the IV value. WOE is a form of encoding of the original independent variables. To perform WOE encoding on a variable, the variable first needs to be grouped (also called discretization or binning); after grouping, the WOE code of the i-th group (group i) is calculated as follows:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding customers in group i to all responding customers in the sample (in the risk model, responding customers are the defaulting customers, i.e. individuals whose target variable takes the value "yes", namely 1), $pn_i$ is the proportion of unresponsive customers in the group to all unresponsive customers in the sample, $y_i$ is the number of responding customers in the group, $n_i$ is the number of unresponsive customers in the group, $y$ is the number of all responding customers in the sample, and $n$ is the number of all unresponsive customers in the sample.
Similarly, for group i there is a corresponding $IV_i$ value, calculated as follows:
$$IV_i = (py_i - pn_i) \times WOE_i$$
Once the $IV_i$ value of each group is known, the IV value of the whole variable is simply their sum:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
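As a minimal sketch of the WOE and IV calculation defined above, the function below takes the per-group counts $(y_i, n_i)$ of one binned variable. The function name and the decision to raise an error for empty groups are illustrative assumptions; the source does not say how groups with a zero count are handled.

```python
import math

def woe_iv(groups):
    """groups: list of (y_i, n_i) = (responding, unresponsive) counts per group.
    Returns (WOE per group, IV per group, IV of the whole variable)."""
    y = sum(y_i for y_i, _ in groups)   # all responding customers
    n = sum(n_i for _, n_i in groups)   # all unresponsive customers
    woe, iv = [], []
    for y_i, n_i in groups:
        if y_i == 0 or n_i == 0:
            # ln(0) or division by zero: such groups would need merging first
            raise ValueError("each group needs at least one customer of each kind")
        py_i, pn_i = y_i / y, n_i / n
        w = math.log(py_i / pn_i)
        woe.append(w)
        iv.append((py_i - pn_i) * w)
    return woe, iv, sum(iv)

# Two groups: 10 of 100 customers default in the first, 30 of 100 in the second.
woe, iv, total_iv = woe_iv([(10, 90), (30, 70)])
```

Each $IV_i$ is non-negative because $(py_i - pn_i)$ and $WOE_i$ always share the same sign, so a variable whose groups all have the same response rate gets an IV of zero.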
The IV value and the correlation of each variable are calculated, and the indexes with an IV value greater than 0.02 and a correlation less than 0.6 are kept, giving the indexes that finally enter the model together with their IV values.
In this step, the index variables are reduced in dimension using the ECC correlation coefficient, and the IV value is used to screen the independent variables with high predictive power; discrete variables are binned manually and continuous variables are binned by quantile.
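The screening rule (IV greater than 0.02, correlation less than 0.6) can be sketched as below. Several points are assumptions rather than the source's method: the Pearson coefficient stands in for the "ECC correlation coefficient", which the text does not define; a correlated pair is resolved greedily by keeping the index with the higher IV; and the names `pearson` and `screen_indexes` are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def screen_indexes(columns, iv, iv_min=0.02, corr_max=0.6):
    """columns: {name: list of values}; iv: {name: IV value}.
    Keep indexes with IV > iv_min, then drop any index whose absolute
    correlation with an already kept, higher-IV index reaches corr_max."""
    candidates = sorted((name for name in columns if iv[name] > iv_min),
                        key=lambda name: -iv[name])
    kept = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[k])) < corr_max for k in kept):
            kept.append(name)
    return kept

columns = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 11],
           "c": [5, 1, 4, 2, 3], "d": [1, 0, 1, 0, 1]}
iv = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.01}
selected = screen_indexes(columns, iv)
```

Here `b` is dropped because it is almost collinear with `a`, and `d` is dropped because its IV is below the threshold.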
Step 4, model construction: constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the subset of randomly selected samples and the indexes screened in the step 3, wherein the termination iteration condition of the model is determined by the change rate of AUC statistics; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
After the dimension reduction in step 3, the selected indexes enter the LR-Bagging algorithm model. In each round, m training samples are drawn from the training set and returned to it after training, so an original training sample may appear several times or not at all in a given round's training set; this yields multiple trained logistic regression models.
In the LR-Bagging algorithm model, the records and fields used to train each base classifier are obtained by random sampling, and the termination criterion of the algorithm is determined by the change rate of the AUC statistic. The algorithm takes into account the strong generalization capability of LR, the high accuracy of Bagging, and the diversity among the LR base classifiers brought by feature selection.
The logistic regression model is a probabilistic nonlinear regression model, a multivariate analysis method for studying the relationship between a binary outcome and a set of influencing factors $x_1, x_2, \ldots, x_n$. Let the n independent variables be $X = (x_1, x_2, \ldots, x_n)$, and let the conditional probability $p(Y=1 \mid X)$ be the probability that the event occurs given the value X. The logistic regression model is then expressed as:
$$p(Y=1 \mid X) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)} \qquad (1)$$
The probability that the event does not occur given X is:
$$p(Y=0 \mid X) = 1 - p(Y=1 \mid X) = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)} \qquad (2)$$
Assume there are m observed samples with observed values $y_1, y_2, \ldots, y_m$. Let $p_i = p(y_i = 1 \mid X_i)$ be the probability that $y_i = 1$ given $X_i$; then $p(y_i = 0 \mid X_i) = 1 - p_i$. Each observation follows a Bernoulli distribution, and because the observed samples are independent of each other, the joint distribution of the m samples is the product of the marginal distributions, which gives the likelihood function:
$$L(w) = \prod_{i=1}^{m} p_i^{y_i} (1 - p_i)^{1 - y_i} \qquad (3)$$
where $w = (\beta_0, \beta_1, \ldots, \beta_n)$ denotes the parameter vector.
Parameter estimation seeks the values $\beta_0, \beta_1, \ldots, \beta_n$ that maximize the likelihood function $L(w)$. Taking the logarithm of $L(w)$ gives:
$$\ln L(w) = \sum_{i=1}^{m} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right] \qquad (4)$$
Taking the partial derivatives of formula (4) with respect to $\beta_0, \beta_1, \ldots, \beta_n$ and setting them to zero gives the equation set:
$$\frac{\partial \ln L(w)}{\partial \beta_j} = \sum_{i=1}^{m} (y_i - p_i)\, x_{ij} = 0, \quad j = 0, 1, \ldots, n \qquad (5)$$
where $x_{ij}$ is the value of the j-th variable for sample i and $x_{i0} = 1$.
Solving the equation set (5) yields the model parameters $\beta_0, \beta_1, \ldots, \beta_n$ of the logistic regression model.
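The equation set (5) has no closed-form solution, so in practice it is solved iteratively. The sketch below uses plain gradient ascent on the log-likelihood (4); this solver, the learning rate and the iteration count are assumptions chosen for illustration, since the source does not say how the equations are solved.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(beta, x):
    """p(Y=1 | X) from model (1); beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))

def log_likelihood(beta, X, y):
    """ln L(w) from formula (4), with probabilities clipped away from 0 and 1."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = min(max(predict_proba(beta, x_i), 1e-12), 1 - 1e-12)
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Move beta along the gradient sum_i (y_i - p_i) x_ij from (5)."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for x_i, y_i in zip(X, y):
            r = y_i - predict_proba(beta, x_i)
            grad[0] += r                      # x_i0 = 1 for the intercept
            for j, v in enumerate(x_i, start=1):
                grad[j] += r * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]
beta = fit_logistic(X, y)
```

At convergence the gradient in (5) is close to zero, which is exactly the condition the equation set expresses.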
The logistic regression model trained in this step is a weak classifier, and multiple logistic regression models are integrated with the Bagging algorithm to form an ensemble algorithm based on logistic regression. The Bagging algorithm was proposed by Leo Breiman in his 1994 technical report "Bagging Predictors". Its main idea is as follows: given a weak learning algorithm and a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, m training samples are drawn from the training set in each round and returned to it after training, so an original training sample may appear several times or not at all in a given round's training set. After training, a sequence of prediction functions $h_1, h_2, \ldots, h_T$ is obtained, and for regression problems a new example is predicted by weighted averaging to obtain the final prediction:
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
where $\alpha_t$ is the contribution weight of the t-th learner. The flow of the Bagging algorithm is as follows:
(1) Given the original data set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$;
(2) Initialize the data set;
(3) For t = 1, …, T:
(4) In each round t, draw m samples from the original data set S by bootstrap sampling to form a new training set $S' = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
(5) Train the base learning algorithm on the new training set S' to obtain a learning model $h_t$;
(6) Save the round-t learner model $h_t$. The individual learners $h_1, h_2, \ldots, h_T$ are integrated into an ensemble learner by weighted averaging with weights $\alpha_t$ (t = 1, 2, …, T); the same value may be used as the contribution weight of all T individuals.
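The flow above, combined with the random selection of fields and the AUC-based stopping rule described earlier, can be sketched as follows. This is a sketch under assumptions, not the patented implementation: the base learner is fitted by gradient ascent, AUC is evaluated on the training data, the stopping threshold `tol` and the number of fields per base classifier `n_fields` are hypothetical parameters, and equal contribution weights are used.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(beta, cols, x):
    # Probability from one base classifier that only sees the fields in `cols`.
    return sigmoid(beta[0] + sum(b * x[c] for b, c in zip(beta[1:], cols)))

def fit_lr(X, y, lr=0.1, n_iter=200):
    # Base learner: logistic regression fitted by gradient ascent (assumed solver).
    beta = [0.0] * (len(X[0]) + 1)
    cols = list(range(len(X[0])))
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for x_i, y_i in zip(X, y):
            r = y_i - score(beta, cols, x_i)
            grad[0] += r
            for j, v in enumerate(x_i, start=1):
                grad[j] += r * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def auc(y_true, scores):
    # Probability that a random positive outranks a random negative (ties count 0.5).
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def predict(models, x):
    # Equal contribution weights, which the text allows, give a plain average.
    return sum(score(beta, cols, x) for cols, beta in models) / len(models)

def lr_bagging(X, y, m, n_fields, max_rounds=20, tol=1e-3, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    models, prev_auc = [], None
    for _ in range(max_rounds):
        rows = [rng.randrange(n) for _ in range(m)]     # records: drawn with replacement
        cols = sorted(rng.sample(range(d), n_fields))   # fields: random subset
        beta = fit_lr([[X[i][c] for c in cols] for i in rows], [y[i] for i in rows])
        models.append((cols, beta))
        cur_auc = auc(y, [predict(models, x) for x in X])
        # Stop once the relative change in AUC between rounds falls below `tol`.
        if prev_auc is not None and abs(cur_auc - prev_auc) / max(prev_auc, 1e-12) < tol:
            break
        prev_auc = cur_auc
    return models
```

A call such as `lr_bagging(X, y, m=int(0.7 * len(X)), n_fields=2)` returns a list of `(fields, coefficients)` pairs, and `predict(models, x)` gives the averaged default probability for one customer.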
Step 5, model application: the constructed LR-Bagging algorithm-based model can accurately predict default probability of arrearages according to the behavior track of customers, so that the passive situation of after-the-fact arrearage management is changed, and the purpose of reducing the arrearage risk of power customers is achieved.
After the model is constructed, its accuracy must be evaluated. The best and the most important evaluation criterion of a model is the application effect in practice. First, 4 basic definitions are introduced:
True Positive (TP): the number of observed objects that the model predicts as 1 and that are actually 1;
True Negative (TN): the number of observed objects that the model predicts as 0 and that are actually 0;
False Positive (FP): the number of observed objects that the model predicts as 1 but that are actually 0;
False Negative (FN): the number of observed objects that the model predicts as 0 but that are actually 1.
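Based on the above definitions, a confusion matrix is constructed as shown in Table 1:
Table 1. Confusion matrix
                 Predicted 1             Predicted 0
Actual 1         True Positive (TP)      False Negative (FN)
Actual 0         False Positive (FP)     True Negative (TN)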
Based on the above definitions, a number of evaluation indexes can be derived; this embodiment selects the Accuracy, Precision, Recall and F1 values.
Accuracy is the proportion of correct predictions (positive and negative classes together) among all predictions:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \qquad (6)$$
Precision is the proportion of correct positive predictions among all positive predictions:
$$Precision = \frac{TP}{TP + FP} \qquad (7)$$
Recall is the proportion of correct positive predictions among all samples that are actually positive:
$$Recall = \frac{TP}{TP + FN} \qquad (8)$$
The F1 value is the harmonic mean of Precision and Recall, and the larger it is the better. Substituting the above equations for Precision and Recall into equation (9) below shows that F1 is large when True Positive is relatively large and the false predictions (FP and FN) are relatively small, that is, when Precision and Recall are both relatively high; F1 therefore weighs Precision and Recall together.
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN} \qquad (9)$$
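The four evaluation indexes follow directly from the confusion-matrix counts. The sketch below computes them from lists of actual and predicted labels; the function names and the example labels are illustrative and are not results from the source.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def evaluate(tp, tn, fp, fn):
    """Accuracy, Precision, Recall and F1 as in equations (6) to (9).
    Assumes at least one predicted positive and one actual positive."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = evaluate(*counts)
```

Because the target class is rare in arrearage data (5.9% of samples in this embodiment), Precision, Recall and F1 are more informative than Accuracy alone.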
The verification results of the invention are shown in the following table two:
Table 2. Verification results
The invention also provides an electric charge recycling risk prediction system based on the LR-Bagging algorithm, which comprises the following modules:
the sample extraction module is used for realizing the step 1, establishing an analysis target and extracting appropriate positive and negative samples in proportion;
the index collection module is used for realizing step 2: collecting indexes, analyzing the basic information, historical payment data and electricity consumption behavior data of the electricity customer, selecting indexes related to the arrearage risk of the electricity customer, and deriving further indexes from them to form an index system;
the index preprocessing module is used for realizing step 3: preprocessing the indexes and screening the indexes with high predictive power to enter the model;
the model construction module is used for realizing the step 4, constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the randomly selected subset of the samples and the indexes screened in the step 3, and determining the termination iteration condition of the model by the change rate of AUC statistics; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
The invention also provides a storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the electric charge recycling risk prediction method.
The invention also provides computer equipment which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor runs the computer program, the electric charge recycling risk prediction method is realized.
The electric charge recycling risk prediction method, system, storage medium and computer equipment based on the LR-Bagging algorithm greatly enhance the transparency and understandability of the model. It will be apparent to those skilled in the art that various changes and modifications may be made to the present invention without departing from its spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include them.
Claims (10)
1. An electric charge recycling risk prediction method based on the LR-Bagging algorithm, characterized by comprising the following steps:
step 1, establishing an analysis target, and extracting positive and negative samples in an appropriate proportion;
step 2, collecting indexes: analyzing basic information, historical payment data and electricity consumption behavior data of electricity customers, selecting indexes related to the arrearage risk of the electricity customers, and deriving further indexes from the related indexes to form an index system;
step 3, index preprocessing: screening indexes with high predictive power for entry into the model;
step 4, constructing a model based on the LR-Bagging algorithm and training it on randomly selected subsets of the samples and the indexes screened in step 3, wherein the iteration termination condition of the model is determined by the rate of change of the AUC statistic; the multiple logistic regression models obtained after training are taken as base classifiers and integrated through the Bagging algorithm to form a logistic regression-based ensemble algorithm, and the weighted average of the prediction results of the base classifiers is taken as the final prediction probability.
2. The electric charge recycling risk prediction method according to claim 1, wherein step 3 comprises: first, preprocessing the indexes to check the quality of the index data; second, constructing derived variables by processing and transforming the original index data to obtain variables with greater predictive and explanatory power; and then screening the indexes with high predictive power for entry into the model.
3. The electric charge recycling risk prediction method according to claim 1, wherein when the indexes with high predictive power are screened in the step 3 and enter the model, the predictive power of the indexes is measured by an IV value;
the weight of evidence (WOE) is introduced to obtain the IV value; a variable must first be divided into groups before WOE encoding, and the WOE of group i is calculated as:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding clients in group i to all responding clients in the sample, $pn_i$ is the proportion of unresponsive clients in group i to all unresponsive clients in the sample, $y_i$ is the number of responding clients in group i, $n_i$ is the number of unresponsive clients in group i, $y$ is the number of all responding clients in the sample, and $n$ is the number of all unresponsive clients in the sample;
for group i, the corresponding $IV_i$ value is calculated as:
$$IV_i = (py_i - pn_i) \cdot WOE_i$$
the $IV_i$ values of all groups are summed to obtain the IV value of the whole variable:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
4. The electric charge recycling risk prediction method according to claim 1, wherein in step 4, given n independent variables $X = (x_1, x_2, \ldots, x_n)$ and letting the conditional probability $p(Y = 1 \mid X)$ be the probability that the event $Y = 1$ occurs given the value of X, the logistic regression model is expressed as:
$$p(Y = 1 \mid X) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}$$
wherein $\beta_0, \beta_1, \ldots, \beta_n$ are the model parameters of the logistic regression model.
5. An electric charge recycling risk prediction system based on the LR-Bagging algorithm, characterized by comprising:
a sample extraction module, used for establishing an analysis target and extracting positive and negative samples in an appropriate proportion;
an index collection module, used for collecting indexes: analyzing basic information, historical payment data and electricity consumption behavior data of electricity customers, selecting indexes related to the arrearage risk of the electricity customers, and deriving further indexes from the related indexes to form an index system;
an index preprocessing module, used for preprocessing the indexes and screening the indexes with high predictive power for entry into the model;
a model construction module, used for constructing a model based on the LR-Bagging algorithm and training it on randomly selected subsets of the samples and the screened indexes, with the iteration termination condition of the model determined by the rate of change of the AUC statistic; the multiple logistic regression models obtained after training are taken as base classifiers and integrated through the Bagging algorithm to form a logistic regression-based ensemble algorithm, and the weighted average of the prediction results of the base classifiers is taken as the final prediction probability.
6. The electric charge recycling risk prediction system according to claim 5, wherein the index preprocessing module first preprocesses the indexes to check the quality of the index data, second constructs derived variables by processing and transforming the original index data to obtain variables with greater predictive and explanatory power, and then screens the indexes with high predictive power for entry into the model.
7. The electric charge recycling risk prediction system according to claim 5, wherein when the index preprocessing module screens the index with high predictive power into the model, the IV value is used for measuring the predictive power of the index;
the weight of evidence (WOE) is introduced to obtain the IV value; a variable must first be divided into groups before WOE encoding, and the WOE of group i is calculated as:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding clients in group i to all responding clients in the sample, $pn_i$ is the proportion of unresponsive clients in group i to all unresponsive clients in the sample, $y_i$ is the number of responding clients in group i, $n_i$ is the number of unresponsive clients in group i, $y$ is the number of all responding clients in the sample, and $n$ is the number of all unresponsive clients in the sample;
for group i, the corresponding $IV_i$ value is calculated as:
$$IV_i = (py_i - pn_i) \cdot WOE_i$$
the $IV_i$ values of all groups are summed to obtain the IV value of the whole variable:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
8. The electric charge recycling risk prediction system according to claim 5, wherein the preprocessing of the indexes by the index preprocessing module includes missing value filling, abnormal value replacement, and dimension reduction.
9. A storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the electric charge recycling risk prediction method of any one of claims 1-4.
10. Computer equipment comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the electric charge recycling risk prediction method according to any one of claims 1 to 4 when executing the computer program.
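The WOE and IV calculation recited in claims 3 and 7 can be sketched in Python as follows. This is an illustrative sketch only: the function name `woe_iv` and the input layout (one `(y_i, n_i)` count pair per group) are assumptions, and the formulas used are the conventional definitions WOE_i = ln(py_i / pn_i) and IV_i = (py_i - pn_i) * WOE_i, which match the symbol definitions given in the claims. Groups with a zero count are rejected rather than smoothed, which is also an assumption.

```python
import math

def woe_iv(groups):
    """Compute WOE and IV for one grouped variable.

    groups: list of (y_i, n_i) pairs, where y_i is the number of responding
    clients and n_i the number of unresponsive clients in group i.
    Returns (list of WOE_i, list of IV_i, total IV of the variable).
    """
    y = sum(yi for yi, _ in groups)  # all responding clients in the sample
    n = sum(ni for _, ni in groups)  # all unresponsive clients in the sample
    woes, ivs = [], []
    for yi, ni in groups:
        if yi == 0 or ni == 0:
            # WOE is undefined when a group lacks one of the two classes
            raise ValueError("each group needs at least one client of each class")
        py_i = yi / y                # share of responders falling in group i
        pn_i = ni / n                # share of non-responders falling in group i
        woe_i = math.log(py_i / pn_i)
        woes.append(woe_i)
        ivs.append((py_i - pn_i) * woe_i)
    return woes, ivs, sum(ivs)

# Hypothetical counts for a variable split into K = 3 groups
woes, ivs, iv = woe_iv([(10, 90), (30, 70), (60, 40)])
```

Each IV_i is non-negative because (py_i - pn_i) and WOE_i always share the same sign, so a larger total IV indicates a variable that separates responding from unresponsive clients more strongly.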
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911232092.5A CN111178675A (en) | 2019-12-05 | 2019-12-05 | LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911232092.5A CN111178675A (en) | 2019-12-05 | 2019-12-05 | LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111178675A true CN111178675A (en) | 2020-05-19 |
Family
ID=70653875
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911232092.5A Pending CN111178675A (en) | 2019-12-05 | 2019-12-05 | LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111178675A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102074A (en) * | 2020-10-14 | 2020-12-18 | 深圳前海弘犀智能科技有限公司 | Grading card modeling method |
CN112116225A (en) * | 2020-09-07 | 2020-12-22 | 中国人民解放军63921部队 | Fighting efficiency evaluation method and device for equipment system, and storage medium |
CN112308293A (en) * | 2020-10-10 | 2021-02-02 | 北京贝壳时代网络科技有限公司 | Default probability prediction method and device |
CN112331285A (en) * | 2020-07-10 | 2021-02-05 | 青岛国新健康产业科技有限公司 | Case grouping method, case grouping device, electronic equipment and storage medium |
CN112486842A (en) * | 2020-12-17 | 2021-03-12 | 中国农业银行股份有限公司 | Product testing method and device |
CN112734560A (en) * | 2020-12-31 | 2021-04-30 | 深圳前海微众银行股份有限公司 | Variable construction method, device, equipment and computer readable storage medium |
CN113469374A (en) * | 2021-09-02 | 2021-10-01 | 北京易真学思教育科技有限公司 | Data prediction method, device, equipment and medium |
CN113485910A (en) * | 2021-06-07 | 2021-10-08 | 广发银行股份有限公司 | Test risk early warning method, system, equipment and storage medium |
CN113537607A (en) * | 2021-07-23 | 2021-10-22 | 国网青海省电力公司信息通信公司 | Power failure prediction method |
CN114491416A (en) * | 2022-02-23 | 2022-05-13 | 北京百度网讯科技有限公司 | Characteristic information processing method and device, electronic equipment and storage medium |
CN116433403A (en) * | 2023-06-14 | 2023-07-14 | 国网安徽省电力有限公司营销服务中心 | Account tracking-based electric enterprise accounts receivable early warning method and system |
CN117391836A (en) * | 2023-07-26 | 2024-01-12 | 人上融融(江苏)科技有限公司 | Method for modeling overdue probability based on heterogeneous integration of different labels |
CN118094339A (en) * | 2024-04-17 | 2024-05-28 | 中海油田服务股份有限公司 | Stratum temperature prediction method and device and computing equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106251241A (en) * | 2016-08-02 | 2016-12-21 | 贵州电网有限责任公司信息中心 | A kind of feature based selects the LR Bagging algorithm improved |
CN109063931A (en) * | 2018-09-06 | 2018-12-21 | 盈盈(杭州)网络技术有限公司 | A kind of model method for predicting freight logistics driver Default Probability |
CN109272396A (en) * | 2018-08-20 | 2019-01-25 | 平安科技(深圳)有限公司 | Customer risk method for early warning, device, computer equipment and medium |
CN109727066A (en) * | 2018-12-27 | 2019-05-07 | 浙江华云信息科技有限公司 | A kind of big industrial electricity consumers load forecasting method based on XGBoost algorithm |
-
2019
- 2019-12-05 CN CN201911232092.5A patent/CN111178675A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106251241A (en) * | 2016-08-02 | 2016-12-21 | 贵州电网有限责任公司信息中心 | A kind of feature based selects the LR Bagging algorithm improved |
CN109272396A (en) * | 2018-08-20 | 2019-01-25 | 平安科技(深圳)有限公司 | Customer risk method for early warning, device, computer equipment and medium |
CN109063931A (en) * | 2018-09-06 | 2018-12-21 | 盈盈(杭州)网络技术有限公司 | A kind of model method for predicting freight logistics driver Default Probability |
CN109727066A (en) * | 2018-12-27 | 2019-05-07 | 浙江华云信息科技有限公司 | A kind of big industrial electricity consumers load forecasting method based on XGBoost algorithm |
Non-Patent Citations (1)
Title |
---|
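Wu Yang; Zhu Zhou: "Prediction of residential customers at risk of electricity arrears based on an LR-Bagging algorithm improved by feature selection" (基于特征选择改进LR-Bagging算法的电力欠费风险居民客户预测), 电子产品世界 (Electronic Engineering & Product World), no. 04, page 70 * |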
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112331285B (en) * | 2020-07-10 | 2023-01-10 | 青岛国新健康产业科技有限公司 | Case grouping method, case grouping device, electronic equipment and storage medium |
CN112331285A (en) * | 2020-07-10 | 2021-02-05 | 青岛国新健康产业科技有限公司 | Case grouping method, case grouping device, electronic equipment and storage medium |
CN112116225A (en) * | 2020-09-07 | 2020-12-22 | 中国人民解放军63921部队 | Fighting efficiency evaluation method and device for equipment system, and storage medium |
CN112308293A (en) * | 2020-10-10 | 2021-02-02 | 北京贝壳时代网络科技有限公司 | Default probability prediction method and device |
CN112102074A (en) * | 2020-10-14 | 2020-12-18 | 深圳前海弘犀智能科技有限公司 | Grading card modeling method |
CN112102074B (en) * | 2020-10-14 | 2024-01-30 | 深圳前海弘犀智能科技有限公司 | Score card modeling method |
CN112486842A (en) * | 2020-12-17 | 2021-03-12 | 中国农业银行股份有限公司 | Product testing method and device |
CN112734560A (en) * | 2020-12-31 | 2021-04-30 | 深圳前海微众银行股份有限公司 | Variable construction method, device, equipment and computer readable storage medium |
CN112734560B (en) * | 2020-12-31 | 2024-05-14 | 深圳前海微众银行股份有限公司 | Variable construction method, device, equipment and computer readable storage medium |
CN113485910A (en) * | 2021-06-07 | 2021-10-08 | 广发银行股份有限公司 | Test risk early warning method, system, equipment and storage medium |
CN113537607A (en) * | 2021-07-23 | 2021-10-22 | 国网青海省电力公司信息通信公司 | Power failure prediction method |
CN113537607B (en) * | 2021-07-23 | 2022-08-05 | 国网青海省电力公司信息通信公司 | Power failure prediction method |
CN113469374A (en) * | 2021-09-02 | 2021-10-01 | 北京易真学思教育科技有限公司 | Data prediction method, device, equipment and medium |
CN114491416A (en) * | 2022-02-23 | 2022-05-13 | 北京百度网讯科技有限公司 | Characteristic information processing method and device, electronic equipment and storage medium |
CN116433403A (en) * | 2023-06-14 | 2023-07-14 | 国网安徽省电力有限公司营销服务中心 | Account tracking-based electric enterprise accounts receivable early warning method and system |
CN117391836A (en) * | 2023-07-26 | 2024-01-12 | 人上融融(江苏)科技有限公司 | Method for modeling overdue probability based on heterogeneous integration of different labels |
CN118094339A (en) * | 2024-04-17 | 2024-05-28 | 中海油田服务股份有限公司 | Stratum temperature prediction method and device and computing equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111178675A (en) | LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment | |
Yang et al. | SMAA-based model for decision aiding using regret theory in discrete Z-number context | |
CN108520357B (en) | Method and device for judging line loss abnormality reason and server | |
Wang et al. | Predicting construction cost and schedule success using artificial neural networks ensemble and support vector machines classification models | |
Cho et al. | A hybrid approach based on the combination of variable selection using decision trees and case-based reasoning using the Mahalanobis distance: For bankruptcy prediction | |
CN107633265A (en) | For optimizing the data processing method and device of credit evaluation model | |
CN110852856B (en) | Invoice false invoice identification method based on dynamic network representation | |
Kou et al. | An integrated expert system for fast disaster assessment | |
CN104036360B (en) | User data processing system and processing method based on magcard attendance behaviors | |
CN110930198A (en) | Electric energy substitution potential prediction method and system based on random forest, storage medium and computer equipment | |
CN112700324A (en) | User loan default prediction method based on combination of Catboost and restricted Boltzmann machine | |
TWI677830B (en) | Method and device for detecting key variables in a model | |
CN111652403A (en) | Feedback correction-based work platform task workload prediction method | |
CN113393316A (en) | Loan overall process accurate wind control and management system based on massive big data and core algorithm | |
CN111090679B (en) | Time sequence data representation learning method based on time sequence influence and graph embedding | |
CN115545342A (en) | Risk prediction method and system for enterprise electric charge recovery | |
CN114626940A (en) | Data analysis method and device and electronic equipment | |
CN114066173A (en) | Capital flow behavior analysis method and storage medium | |
CN113837481A (en) | Financial big data management system based on block chain | |
Nureni et al. | Loan approval prediction based on machine learning approach | |
Zeng | A comparison study on the era of internet finance China construction of credit scoring system model | |
Lv et al. | Detecting pyramid scheme accounts with time series financial transactions | |
Nascimento et al. | Applying Machine Learning to Improve Collection and to Reduce Write-Offs in Utilities | |
Wasito et al. | TIME SERIES CLASSIFICATION FOR FINANCIAL STATEMENT FRAUD DETECTION USING RECURRENT NEURAL NETWORKS BASED APPROACHES | |
Chandrasekaran et al. | Uncertainty-Aware Functional Analysis for Electricity Consumption Prediction Using Multi-Task Optimization Learning Model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||