CN111178675A - LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment - Google Patents


Publication number
CN111178675A
Authority
CN
China
Prior art keywords
model
indexes
index
clients
electric charge
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911232092.5A
Other languages
Chinese (zh)
Inventor
姜磊
杨钊
杨军仓
陈素琴
成强
赵军辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Brilliant Data Analytics Inc
Original Assignee
Brilliant Data Analytics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Brilliant Data Analytics Inc
Priority to CN201911232092.5A
Publication of CN111178675A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an electric charge recycling risk prediction method, system, storage medium and computer equipment based on the LR-Bagging algorithm, which can accurately predict the probability that a customer defaults on electricity payments from the customer's behavior track, replacing passive after-the-fact arrearage management and reducing the arrearage risk of electricity customers. The method comprises the following steps: establishing an analysis target and extracting positive and negative samples; collecting indexes, selecting indexes related to the arrearage risk of electricity customers, and deriving further indexes from them to form an index system; preprocessing the indexes and screening those with high predictive power into the model; and constructing a model based on the LR-Bagging algorithm, in which logistic regression models are trained on randomly selected subsets of the samples using the screened indexes, the iteration termination condition of the model being determined by the change rate of the AUC statistic. The trained logistic regression models are taken as base classifiers, and the weighted average of their prediction results is taken as the final prediction probability.

Description

LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment
Technical Field
The invention relates to the field of electric power, in particular to an electric charge recycling risk prediction method and system based on an LR-Bagging algorithm, a storage medium and computer equipment.
Background
Power suppliers have long followed the market rule of "electricity first, payment later", which rests on the premise that customers pay for the electricity they have consumed within the specified time limit. However, some customers break this commitment by refusing or delaying payment, which hampers the recovery of funds by the power supply enterprise; failure to recover electricity charges in time severely restricts the capital turnover of power grid enterprises, which in turn affects power supply and creates a vicious circle. Effectively preventing and avoiding arrearage risk is therefore important to the operation of a power enterprise. A power grid generates a large amount of data during operation; this data can be analyzed with data mining techniques to extract valuable information. However, the selection of the index system still needs to be improved in connection with the actual business.
Existing arrearage risk prediction models analyze the causes of arrearage among electric power customers, use the available data to design the key influencing variables of an arrearage risk identification model, build a model that identifies the likelihood of customer arrearage using logistic regression theory and methods, and predict the default probability in advance from the latest available customer information, so as to replace passive after-the-fact arrearage management and reduce the arrearage risk of power customers. However, such a model requires the causes of arrearage in the customer group to be identified. In practice these causes are varied, and it is difficult to find every factor that can influence customer arrearage, so the model is limited in its choice of indexes.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an electric charge recycling risk prediction method, system, storage medium and computer equipment based on LR-Bagging algorithm, which can accurately predict default probability of arrearages according to the behavior track of customers, thereby changing the passive situation of after-the-fact arrearages management and achieving the purpose of reducing the arrearages risk of power customers.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows: the electric charge recycling risk prediction method based on the LR-Bagging algorithm comprises the following steps:
step 1, establishing an analysis target, and extracting appropriate positive and negative samples in proportion;
step 2, collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
step 3, index preprocessing, namely screening indexes with high forecasting power to enter a model;
step 4, constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the randomly selected subset of the samples and the indexes screened in the step 3, and determining the termination iteration condition of the model by the change rate of the AUC statistic; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
Preferably, step 3 first checks the quality of the index data through index preprocessing, then constructs derived variables by processing the original index data to obtain variables with more predictive power and interpretability, and finally screens the indexes with high predictive power into the model.
Preferably, when the indexes with high predictive power are screened in the step 3 and enter the model, the IV value is used for measuring the predictive power of the indexes;
introducing the WOE (weight of evidence) to obtain the IV value; to perform WOE encoding on a variable, the variable must first be grouped, and the WOE of group i is calculated as follows:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding clients in group i to all responding clients in the sample, $pn_i$ is the proportion of unresponsive clients in group i to all unresponsive clients in the sample, $y_i$ is the number of responding clients in group i, $n_i$ is the number of unresponsive clients in group i, $y$ is the number of all responding clients in the sample, and $n$ is the number of all unresponsive clients in the sample;
for group i, there is a corresponding $IV_i$ value, calculated as follows:
$$IV_i = (py_i - pn_i) \times WOE_i$$
the $IV_i$ values of all groups are added to obtain the IV value of the whole variable:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
The invention relates to an electric charge recovery risk prediction system based on an LR-Bagging algorithm, which comprises:
the sample extraction module is used for establishing an analysis target and extracting appropriate positive and negative samples in proportion;
the index collection module is used for collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
the index preprocessing module is used for preprocessing the indexes and screening the indexes with high forecasting power to enter the model;
the model construction module is used for constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the subset of randomly selected samples and the screened indexes, and determining the termination iteration condition of the model by the change rate of AUC statistics; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
The storage medium of the present invention has stored thereon computer instructions which, when executed by a processor, implement the steps of the electricity charge recycling risk prediction method.
The computer device comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor runs the computer program, the electric charge recycling risk prediction method is realized.
The model constructed by the invention can accurately predict the default probability of arrearage according to the behavior track of the customer, thereby changing the passive situation of after-the-fact arrearage management and achieving the purpose of reducing the arrearage risk of the power customer. Compared with the prior art, the invention has the following beneficial effects:
Basic information, payment channel information (such as payment channel preference over the past half year), payment duration, and electricity quantity and electricity charge data of electricity customers are selected to construct an index system, and an improved LR-Bagging algorithm model based on feature selection is established. Indexes with an IV value larger than 0.02 and a correlation coefficient smaller than 0.6 are input into the LR-Bagging model; a subset of the samples is randomly selected to train each logistic regression model, so that the records and fields used by each trained LR base classifier are obtained through random sampling, and the iteration termination condition of the algorithm is determined by the change rate of the AUC statistic. The trained logistic regression models are taken as base classifiers, and the weighted average of their prediction results is taken as the final prediction probability. All high-voltage users are analyzed and predicted, and the classification effect of the model is evaluated and verified using Accuracy, Precision, Recall and the F1 value; the results show that the algorithm effectively improves the predictive power of the model.
Drawings
Fig. 1 is a flowchart of an electric charge recycling risk prediction method based on an LR-Bagging algorithm.
Detailed Description
The present invention will be described in further detail with reference to the drawings and specific embodiments, but the present invention is not limited thereto.
Examples
As shown in fig. 1, the electric charge recycling risk prediction model method based on LR-Bagging algorithm in this embodiment mainly includes the following steps:
step 1, establishing an analysis target and extracting samples: stratified sampling is carried out on all samples by prefecture-level city, and appropriate positive and negative samples are extracted in proportion;
in this embodiment, the high-voltage users of a certain province are selected as the analysis object; basic information of the electricity customers, payment channel preference, payment duration, and electricity quantity and electricity charge data over half a year (for example, from September 2018 to February 2019) are selected to construct an index system, and 70% of the samples are randomly selected as the training set. Because target users account for only 5.9% of all samples, during model training this embodiment performs stratified sampling by prefecture-level city so that the ratio of target to non-target samples in the training set is 1:10; from the sampled data, 70% of the samples are randomly selected each time to train a logistic regression model, which is then used to predict on the original data.
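The sampling in step 1 can be illustrated with a minimal sketch. The field names (`city`, `user_id`, `label`), the toy population and the choice to keep every target user while down-sampling non-target users within each city are assumptions made for illustration; the patent only states that sampling is stratified by city and yields a 1:10 ratio.

```python
import random
from collections import defaultdict


def stratified_downsample(records, neg_per_pos=10, seed=0):
    """Per city, keep every positive and draw up to neg_per_pos negatives for each."""
    rng = random.Random(seed)
    strata = defaultdict(lambda: {0: [], 1: []})
    for rec in records:
        strata[rec["city"]][rec["label"]].append(rec)
    sample = []
    for city in sorted(strata):
        pos, neg = strata[city][1], strata[city][0]
        k = min(len(neg), neg_per_pos * len(pos))
        sample.extend(pos)
        sample.extend(rng.sample(neg, k))  # sampling without replacement
    return sample


def random_subset(records, fraction=0.7, seed=0):
    """Randomly pick a fraction of the records (e.g. 70% for one training round)."""
    rng = random.Random(seed)
    return rng.sample(records, round(fraction * len(records)))


# Hypothetical toy population: two cities, 1000 users each, 5% in arrears (label 1).
population = [
    {"city": city, "user_id": f"{city}-{i}", "label": 1 if i < 50 else 0}
    for city in ("city_a", "city_b")
    for i in range(1000)
]
balanced = stratified_downsample(population)
train = random_subset(balanced, fraction=0.7)
positives = sum(r["label"] for r in balanced)
print(len(balanced), positives, len(train))
```

Down-sampling only the majority class keeps all of the scarce target users, which is one common way to reach a fixed class ratio; other schemes would satisfy the same description.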
Step 2, collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
in this embodiment, the archive data, electricity quantity and electricity charge data, 95598 customer service data, and other electric power marketing data (such as records of contract violations and electricity theft, and electric power policies) of electricity customers are extracted, and an index system is formed on the basis of a statistical analysis of the factors that influence user arrearage. Taking all high-voltage users in a certain province as an example, the basic information, payment channel information, payment duration information, and electricity quantity and electricity charge information of the customers are selected as indexes, and further indexes are derived from them to form the index system.
Step 3, index pretreatment: filling missing values, replacing abnormal values, performing dimensionality reduction on indexes, and screening the indexes with high predictive power to enter a model;
in this embodiment, on the basis of the index data obtained in step 2, index preprocessing first checks the quality of the index data, including the uniqueness of user numbers, sample integrity, and the range, values, missing values and abnormal values of the variables; secondly, derived variables are constructed, that is, the original index data is processed to obtain variables with more predictive power and interpretability, such as customer seasonal factors and regional factors; the indexes with high predictive power are then screened into the model.
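The quality checks listed above can be sketched as follows. The field names and toy rows are hypothetical, and filling missing values with the median and clipping abnormal values to a fixed range are assumed strategies; the patent does not say how values are filled or replaced.

```python
import statistics


def quality_report(rows, id_field, fields):
    """Count duplicate user numbers and, per field, missing values and value range."""
    ids = [r[id_field] for r in rows]
    report = {"duplicate_ids": len(ids) - len(set(ids)), "fields": {}}
    for f in fields:
        present = [r[f] for r in rows if r.get(f) is not None]
        report["fields"][f] = {
            "missing": len(rows) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
        }
    return report


def fill_and_clip(rows, field, lower, upper):
    """Fill missing values with the median, then clip values to [lower, upper]."""
    median = statistics.median(r[field] for r in rows if r.get(field) is not None)
    cleaned = []
    for r in rows:
        v = r.get(field)
        v = median if v is None else v
        cleaned.append({**r, field: min(max(v, lower), upper)})
    return cleaned


# Hypothetical rows: one duplicated user number, one missing value, one outlier.
rows = [
    {"user_id": "u1", "monthly_kwh": 120.0},
    {"user_id": "u2", "monthly_kwh": None},
    {"user_id": "u2", "monthly_kwh": 95.0},
    {"user_id": "u3", "monthly_kwh": 1e9},
]
report = quality_report(rows, "user_id", ["monthly_kwh"])
cleaned = fill_and_clip(rows, "monthly_kwh", lower=0.0, upper=10000.0)
print(report)
print([r["monthly_kwh"] for r in cleaned])
```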
The variable screening to be completed in this step is a relatively complex process that relies on the IV value (Information Value), which is used to measure the predictive power of an index. The IV value plays a role similar to information gain, information gain ratio and Gini impurity in that it is used for feature selection. When a decision tree is built, feature importance is computed as part of constructing the tree, whereas logistic regression does not compute feature importance during training, and mixing unimportant features into the model makes it prone to overfitting. Therefore, when modeling with logistic regression, the IV value is calculated first and used for feature screening. This embodiment introduces the WOE (Weight of Evidence) to obtain the IV value. WOE is a form of encoding of the original independent variables. To perform WOE encoding on a variable, the variable must first be grouped (also called discretization or binning); after grouping, the WOE of the i-th group (group i) is calculated as follows:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding customers in group i to all responding customers in the sample (in the risk model, responding customers are the defaulting customers, i.e. the individuals whose predicted variable takes the value "yes", or 1), $pn_i$ is the proportion of unresponsive customers in group i to all unresponsive customers in the sample, $y_i$ is the number of responding customers in group i, $n_i$ is the number of unresponsive customers in group i, $y$ is the number of all responding customers in the sample, and $n$ is the number of all unresponsive customers in the sample.
Similarly, for group i there is a corresponding $IV_i$ value, calculated as follows:
$$IV_i = (py_i - pn_i) \times WOE_i$$
Once the $IV_i$ value of each group of a variable has been calculated, the IV value of the whole variable is obtained simply by adding them up:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
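As a worked illustration of the WOE and IV definitions above, the sketch below computes both from per-group counts. The function name and the toy counts are hypothetical, and the code assumes every group contains at least one responding and one unresponsive customer (otherwise the logarithm is undefined and some smoothing would be needed).

```python
import math


def woe_iv(groups):
    """groups: list of (y_i, n_i) = (responding, unresponsive) counts per group.

    Returns ([(WOE_i, IV_i), ...], IV) following the definitions in the text.
    """
    y = sum(y_i for y_i, _ in groups)  # all responding customers
    n = sum(n_i for _, n_i in groups)  # all unresponsive customers
    rows, iv = [], 0.0
    for y_i, n_i in groups:
        py_i, pn_i = y_i / y, n_i / n
        woe_i = math.log(py_i / pn_i)
        iv_i = (py_i - pn_i) * woe_i
        rows.append((woe_i, iv_i))
        iv += iv_i
    return rows, iv


# Hypothetical variable split into K = 3 groups.
groups = [(10, 190), (30, 270), (60, 440)]
rows, iv = woe_iv(groups)
for i, (woe_i, iv_i) in enumerate(rows, start=1):
    print(f"group {i}: WOE = {woe_i:+.4f}, IV_i = {iv_i:.4f}")
print(f"IV = {iv:.4f}")
```

Each $IV_i$ is non-negative because $(py_i - pn_i)$ and $WOE_i$ always share the same sign, so the total IV grows with how differently the two classes are distributed across the groups.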
The IV value and the correlation of each variable are calculated, and the indexes with an IV value larger than 0.02 and a correlation smaller than 0.6 are kept, giving the indexes that finally enter the model together with their IV values.
In this step, the index variables are reduced in dimension using the ECC correlation coefficient, and the IV value is used to screen the independent variables with high predictive power; discrete variables are binned manually, and continuous variables are binned by quantile.
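The two screening thresholds can be combined as in the sketch below. Several points are assumptions: the Pearson coefficient stands in for the correlation measure (the text names an "ECC correlation coefficient" without defining it), the greedy order (highest IV first, dropping any index too correlated with one already kept) is one possible reading of the rule, and the index names and IV values are made up.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def screen_indexes(columns, iv_values, iv_min=0.02, corr_max=0.6):
    """Keep indexes with IV > iv_min whose |correlation| with every kept index < corr_max."""
    kept = []
    for name in sorted(iv_values, key=iv_values.get, reverse=True):
        if iv_values[name] <= iv_min:
            continue
        if all(abs(pearson(columns[name], columns[k])) < corr_max for k in kept):
            kept.append(name)
    return kept


# Hypothetical index columns and precomputed IV values.
columns = {
    "overdue_count": [0, 1, 2, 3, 4, 5],
    "overdue_days": [0, 2, 4, 6, 8, 11],  # nearly collinear with overdue_count
    "tenure_years": [5, 1, 4, 2, 6, 3],
    "meter_digit": [3, 1, 4, 1, 5, 9],
}
iv_values = {
    "overdue_count": 0.35,
    "overdue_days": 0.30,
    "tenure_years": 0.08,
    "meter_digit": 0.01,
}
kept = screen_indexes(columns, iv_values)
print(kept)
```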
Step 4, model construction: constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the subset of randomly selected samples and the indexes screened in the step 3, wherein the termination iteration condition of the model is determined by the change rate of AUC statistics; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
After the dimension reduction in step 3, the selected indexes enter the LR-Bagging algorithm model. Each time, m training samples are drawn from the training set with replacement, so that an original training sample may appear several times in a given training subset or not at all; repeating this yields a plurality of trained logistic regression models.
In the LR-Bagging algorithm model, the records and fields used to train each base classifier are obtained through random sampling, and the iteration termination criterion of the algorithm is determined by the change rate of the AUC statistic. The algorithm thus takes into account the strong generalization capability of LR, the high accuracy of Bagging, and the diversity among the LR classifiers brought by feature selection.
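A termination rule driven by the change rate of the AUC statistic might look like the sketch below. The relative-change definition, the tolerance of 0.001 and the function names are assumptions; the text does not give the exact formula or threshold. AUC is computed here by pair counting, i.e. the share of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def auc_change_rate(previous, current):
    """Relative change of the AUC between two consecutive iterations."""
    return abs(current - previous) / previous


def should_stop(auc_history, tol=0.001):
    """Stop adding base classifiers once the AUC has effectively stopped changing."""
    if len(auc_history) < 2:
        return False
    return auc_change_rate(auc_history[-2], auc_history[-1]) < tol


# Hypothetical ensemble scores for five customers.
labels = [1, 0, 1, 0, 0]
scores = [0.90, 0.40, 0.35, 0.30, 0.10]
print(auc(labels, scores))
print(should_stop([0.80, 0.84]), should_stop([0.8400, 0.8404]))
```

In an LR-Bagging loop, the AUC of the current ensemble would be appended to `auc_history` after each new base classifier is trained, and training would end the first time `should_stop` returns true.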
The logistic regression model is a probabilistic nonlinear regression model, a multivariate analysis method for studying the relationship between a binary outcome and a set of influencing factors $x_1, x_2, \ldots, x_n$. Let the n independent variables be $X = (x_1, x_2, \ldots, x_n)$ and let the conditional probability $p(Y=1 \mid X)$ be the probability that the event occurs given X; the logistic regression model is then expressed as:
$$p(Y=1 \mid X) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}$$
the probability that the event does not occur given X is:
$$p(Y=0 \mid X) = 1 - p(Y=1 \mid X) = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}$$
assuming there are m observed samples with observed values $y_1, y_2, \ldots, y_m$, let $p_i = p(y_i = 1 \mid X_i)$ be the probability that $y_i = 1$ given $X_i$; then $p(y_i = 0 \mid X_i) = 1 - p_i$. Since each observation follows a Bernoulli distribution, the probability of observing $y_i$ is
$$p(y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$$
Because the observed samples are independent of each other, the joint distribution of the m samples is the product of the marginal distributions, and the likelihood function is obtained as:
$$L(w) = \prod_{i=1}^{m} p_i^{y_i} (1 - p_i)^{1 - y_i}$$
the parameter estimates are the values that maximize this likelihood function, i.e. the parameters $w = (\beta_0, \beta_1, \ldots, \beta_n)$ are chosen so that $L(w)$ takes its maximum value. Taking the logarithm of $L(w)$ gives:
$$\ln L(w) = \sum_{i=1}^{m} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right] \quad (4)$$
taking the partial derivative of formula (4) with respect to each of $\beta_0, \beta_1, \ldots, \beta_n$ and setting it to zero gives the equation set:
$$\frac{\partial \ln L(w)}{\partial \beta_j} = \sum_{i=1}^{m} (y_i - p_i)\, x_{ij} = 0, \quad j = 0, 1, \ldots, n \quad (5)$$
where $x_{ij}$ is the value of $x_j$ for sample i and $x_{i0} = 1$;
solving the equation set (5) gives the model parameters $\beta_0, \beta_1, \ldots, \beta_n$ of the logistic regression model.
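The maximum-likelihood estimation described above has no closed-form solution, so in practice the equations are solved iteratively. The sketch below uses plain gradient ascent on the log-likelihood, where each step moves the parameters along the sum of $(y_i - p_i)\,x_{ij}$; the optimizer, learning rate, iteration count and toy data are assumptions, since the patent does not say which solver is used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(beta, x):
    """p(Y=1 | X) for parameters beta = [beta_0, beta_1, ..., beta_n]."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def log_likelihood(beta, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(beta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Gradient ascent: the gradient for beta_j is sum_i (y_i - p_i) * x_ij, with x_i0 = 1."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(beta, xi)
            grad[0] += err
            for j, v in enumerate(xi, start=1):
                grad[j] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta


# Hypothetical one-index data set that is not linearly separable.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 1, 0, 1, 0, 1, 1]
ll_start = log_likelihood([0.0, 0.0], X, y)
beta = fit_logistic(X, y)
ll_end = log_likelihood(beta, X, y)
print(beta, ll_start, ll_end)
```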
The logistic regression model trained in this step is a weak classifier, and a plurality of logistic regression models are integrated using the Bagging algorithm to form an integration algorithm based on logistic regression. The Bagging algorithm was proposed by Leo Breiman in his 1994 technical report "Bagging Predictors". Its main idea is as follows: given a weak learning algorithm and a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, m training samples are drawn from the training set each time with replacement, so that an original training sample may appear several times in a given training subset or not at all. After training, a sequence of prediction functions $h_1, h_2, \ldots, h_T$ is obtained; for regression problems, a new example is predicted by the weighted average method to obtain the final prediction
$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
where $\alpha_t$ is the weight of the prediction function $h_t$.
The flow of the Bagging algorithm is as follows:
(1) Given the original data set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$;
(2) Initialize the data set;
(3) For t = 1, …, T:
(4) in each round t, draw m samples from the original data set S by bootstrap sampling to form a new training set $S' = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$;
(5) train the base learning algorithm on the new training set S' to obtain a learning model $h_t$;
(6) save the round-t learner model $h_t$; the individual learners $h_1, h_2, \ldots, h_T$ are integrated into an ensemble learner by weighted averaging with weights $\alpha_t$ (t = 1, 2, …, T), and the same value may be taken as the contribution weight of all T individual learners.
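The flow above can be sketched in a few lines. The base learner here is a deliberately small one-index logistic regression, the data set is made up, and equal weights of 1/T are used by default, as the text permits; none of these details are prescribed by the patent. Only records are resampled in this sketch, not fields.

```python
import math
import random


def bootstrap(data, m, rng):
    """Draw m samples with replacement; some records repeat, others are left out."""
    return [rng.choice(data) for _ in range(m)]


def fit_logistic_1d(sample, lr=0.5, n_iter=300):
    """Tiny base learner: one-index logistic regression fitted by gradient ascent."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in sample:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / len(sample)
        b1 += lr * g1 / len(sample)
    return lambda x: 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def bagging(data, fit, T=25, m=None, weights=None, seed=0):
    """Train T base learners on bootstrap samples and return their weighted average."""
    rng = random.Random(seed)
    m = m or len(data)
    models = [fit(bootstrap(data, m, rng)) for _ in range(T)]
    alphas = weights or [1.0 / T] * T  # equal contribution weights by default

    def ensemble(x):
        return sum(a * h(x) for a, h in zip(alphas, models))

    return ensemble


# Hypothetical (index value, arrears label) pairs.
data = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 1), (0.0, 0),
        (0.5, 1), (1.0, 0), (1.5, 1), (2.0, 1), (2.5, 1)]
H = bagging(data, fit_logistic_1d)
low, high = H(-2.0), H(2.0)
print(low, high)
```

Because the weights sum to one and every base learner outputs a probability, the ensemble output also stays between 0 and 1 and can be read as the final predicted default probability.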
Step 5, model application: the constructed LR-Bagging algorithm-based model can accurately predict default probability of arrearages according to the behavior track of customers, so that the passive situation of after-the-fact arrearage management is changed, and the purpose of reducing the arrearage risk of power customers is achieved.
After the model is constructed, its accuracy must be evaluated. The best and the most important evaluation criterion of a model is the application effect in practice. First, 4 basic definitions are introduced:
True Positive (TP): the number of observed objects that the model predicts as 1 and that are actually 1;
True Negative (TN): the number of observed objects that the model predicts as 0 and that are actually 0;
False Positive (FP): the number of observed objects that the model predicts as 1 but that are actually 0;
False Negative (FN): the number of observed objects that the model predicts as 0 but that are actually 1.
A confusion matrix is constructed from the above definitions, as shown in Table 1:
                 Actual 1    Actual 0
Predicted 1      TP          FP
Predicted 0      FN          TN
Table 1: Confusion matrix
Based on the above definitions, a number of evaluation indexes can be derived; Accuracy, Precision, Recall and the F1 value are selected in this embodiment.
Accuracy: the proportion of all predictions (positive class and negative class) that are correct:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision: the proportion of the samples predicted as positive that are actually positive:
$$Precision = \frac{TP}{TP + FP}$$
Recall: the proportion of the actually positive samples that are correctly predicted as positive:
$$Recall = \frac{TP}{TP + FN}$$
F1 value: the harmonic mean of Precision and Recall; the larger the better. Substituting the above expressions for Precision and Recall into equation (9) below gives $F1 = 2TP / (2TP + FP + FN)$, so a larger F1 value means that true positives increase relative to false positives and false negatives, i.e. Precision and Recall both increase; F1 thus balances Precision and Recall.
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (9)$$
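The four evaluation indexes follow directly from the confusion-matrix counts, as the short sketch below shows. The label vectors are made up for illustration, and the code assumes at least one predicted positive and one actual positive so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluate(tp, tn, fp, fn):
    """Accuracy, Precision, Recall and F1 from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical actual and predicted arrears labels for ten customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
scores = evaluate(tp, tn, fp, fn)
print((tp, tn, fp, fn), scores)
```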
The verification results of the invention are shown in the following table two:
Figure BDA0002303834140000083
Table 2: Verification results
The invention also provides an electric charge recycling risk prediction system based on the LR-Bagging algorithm, which comprises the following modules:
the sample extraction module is used for realizing the step 1, establishing an analysis target and extracting appropriate positive and negative samples in proportion;
the index collection module is used for realizing the step 2, collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and forming an index system according to the related indexes in a derivative mode;
the index preprocessing module is used for realizing the step 3, preprocessing the indexes, screening the indexes with high forecasting power and entering the model;
the model construction module is used for realizing the step 4, constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the randomly selected subset of the samples and the indexes screened in the step 3, and determining the termination iteration condition of the model by the change rate of AUC statistics; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
The present invention also contemplates a storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method for forecasting electric charge recycling risk.
The invention also provides computer equipment which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein when the processor runs the computer program, the electric charge recycling risk prediction method is realized.
The electric charge recycling risk prediction model method, the electric charge recycling risk prediction model system, the electric charge recycling risk prediction storage medium and the computer equipment based on the LR-Bagging algorithm greatly enhance the transparency and the understandability of the model. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include such modifications and variations.

Claims (10)

1. The electric charge recycling risk prediction method based on the LR-Bagging algorithm is characterized by comprising the following steps of:
step 1, establishing an analysis target, and extracting appropriate positive and negative samples in proportion;
step 2, collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
step 3, index preprocessing, namely screening indexes with high forecasting power to enter a model;
step 4, constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the randomly selected subset of the samples and the indexes screened in the step 3, wherein the termination iteration condition of the model is determined by the change rate of the AUC statistic; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
2. The electric charge recycling risk prediction method according to claim 1, wherein step 3 comprises: first, preprocessing the indexes to check the quality of the index data; second, constructing derived variables by processing the original index data to obtain variables with greater predictive and explanatory power; and then screening the indexes with high predictive power into the model.
3. The electric charge recycling risk prediction method according to claim 1, wherein when the indexes with high predictive power are screened in the step 3 and enter the model, the predictive power of the indexes is measured by an IV value;
introducing the weight of evidence (WOE) to obtain the IV value; before WOE encoding, a variable must first be divided into groups, and the WOE of group i is calculated as:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding clients in group i to all responding clients in the sample, $pn_i$ is the proportion of unresponsive clients in group i to all unresponsive clients in the sample, $y_i$ is the number of responding clients in group i, $n_i$ is the number of unresponsive clients in group i, $y$ is the number of all responding clients in the sample, and $n$ is the number of all unresponsive clients in the sample;
for group i, there is a corresponding $IV_i$ value, calculated as:
$$IV_i = (py_i - pn_i) \cdot WOE_i$$
the $IV_i$ values of all groups are added to obtain the IV value of the whole variable:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
4. The electric charge recycling risk prediction method according to claim 1, wherein in step 4, given n independent variables $X = (x_1, x_2, \ldots, x_n)$ and the conditional probability $p(Y = 1 \mid X)$ that the event occurs given the value of X, the logistic regression model is expressed as:
$$p(Y = 1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}$$
wherein $\beta_0, \beta_1, \ldots, \beta_n$ are the model parameters of the logistic regression model.
5. LR-Bagging algorithm-based electric charge recycling risk prediction system is characterized by comprising:
the sample extraction module is used for establishing an analysis target and extracting appropriate positive and negative samples in proportion;
the index collection module is used for collecting indexes, analyzing basic information, historical payment data and electricity consumption behavior data of the electricity consumption customer, selecting indexes related to the arrearage risk of the electricity consumption customer, and deriving to form an index system according to the related indexes;
the index preprocessing module is used for preprocessing the indexes and screening the indexes with high forecasting power to enter the model;
the model construction module is used for constructing an LR-Bagging algorithm-based model, training the LR-Bagging algorithm model according to the subset of randomly selected samples and the screened indexes, and determining the termination iteration condition of the model by the change rate of the AUC statistic; and taking the multiple logistic regression models obtained after training as base classifiers, integrating the multiple logistic regression models through a Bagging algorithm to form a logistic regression-based integration algorithm, and taking the weighted average of the prediction results of the multiple base classifiers as the final prediction probability.
6. The electric charge recycling risk prediction system according to claim 5, wherein the index preprocessing module first preprocesses the indexes to check the quality of the index data, second constructs derived variables by processing the original index data to obtain variables with greater predictive and explanatory power, and then screens the indexes with high predictive power into the model.
7. The electric charge recycling risk prediction system according to claim 5, wherein when the index preprocessing module screens the index with high predictive power into the model, the IV value is used for measuring the predictive power of the index;
introducing the weight of evidence (WOE) to obtain the IV value; before WOE encoding, a variable must first be divided into groups, and the WOE of group i is calculated as:
$$WOE_i = \ln\left(\frac{py_i}{pn_i}\right) = \ln\left(\frac{y_i / y}{n_i / n}\right)$$
where $py_i$ is the proportion of responding clients in group i to all responding clients in the sample, $pn_i$ is the proportion of unresponsive clients in group i to all unresponsive clients in the sample, $y_i$ is the number of responding clients in group i, $n_i$ is the number of unresponsive clients in group i, $y$ is the number of all responding clients in the sample, and $n$ is the number of all unresponsive clients in the sample;
for group i, there is a corresponding $IV_i$ value, calculated as:
$$IV_i = (py_i - pn_i) \cdot WOE_i$$
the $IV_i$ values of all groups are added to obtain the IV value of the whole variable:
$$IV = \sum_{i=1}^{K} IV_i$$
where K is the number of groups of the variable.
8. The electric charge recycling risk prediction system according to claim 5, wherein the index preprocessing module preprocesses the index including missing value filling, abnormal value replacement processing, and dimension reduction processing.
9. A storage medium having stored thereon computer instructions, wherein the computer instructions, when executed by a processor, implement the steps of the electric charge recycling risk prediction method of any one of claims 1-4.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the electric charge recycling risk prediction method according to any one of claims 1 to 4 when executing the computer program.
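As an illustration of the WOE and IV quantities recited in claims 3 and 7, the following minimal Python sketch computes them from per-group counts. It assumes the standard definitions, WOE_i = ln(py_i / pn_i) and IV_i = (py_i - pn_i) * WOE_i, which match the symbol definitions given in those claims. The function name `woe_iv` and the decision to reject groups with a zero count are hypothetical choices, not part of the claims.

```python
import math


def woe_iv(groups):
    """Compute WOE and IV for one grouped variable.

    groups: list of (y_i, n_i) pairs, the numbers of responding and
    unresponsive clients in each group i.
    Returns (woe_per_group, iv_per_group, iv_total).
    """
    y = sum(g[0] for g in groups)  # all responding clients in the sample
    n = sum(g[1] for g in groups)  # all unresponsive clients in the sample
    woe, iv = [], []
    for y_i, n_i in groups:
        # Assumption: empty cells are rejected, since ln(0) is undefined;
        # in practice such groups are usually merged with a neighbour.
        if y_i == 0 or n_i == 0:
            raise ValueError("each group needs clients of both classes")
        py_i, pn_i = y_i / y, n_i / n
        w = math.log(py_i / pn_i)
        woe.append(w)
        iv.append((py_i - pn_i) * w)
    return woe, iv, sum(iv)


# Hypothetical counts for a variable split into K = 3 groups.
woe, iv, iv_total = woe_iv([(10, 90), (30, 70), (60, 40)])
```

Each `IV_i` is non-negative because `py_i - pn_i` and `WOE_i` always share the same sign, so a larger total IV indicates a variable whose groups separate responding from unresponsive clients more strongly.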
CN201911232092.5A 2019-12-05 2019-12-05 LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment Pending CN111178675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911232092.5A CN111178675A (en) 2019-12-05 2019-12-05 LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN111178675A true CN111178675A (en) 2020-05-19

Family

ID=70653875

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911232092.5A Pending CN111178675A (en) 2019-12-05 2019-12-05 LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111178675A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251241A (en) * 2016-08-02 2016-12-21 贵州电网有限责任公司信息中心 A kind of feature based selects the LR Bagging algorithm improved
CN109063931A (en) * 2018-09-06 2018-12-21 盈盈(杭州)网络技术有限公司 A kind of model method for predicting freight logistics driver Default Probability
CN109272396A (en) * 2018-08-20 2019-01-25 平安科技(深圳)有限公司 Customer risk method for early warning, device, computer equipment and medium
CN109727066A (en) * 2018-12-27 2019-05-07 浙江华云信息科技有限公司 A kind of big industrial electricity consumers load forecasting method based on XGBoost algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
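吴漾;朱州;: "基于特征选择改进LR-Bagging算法的电力欠费风险居民客户预测", 电子产品世界, no. 04, pages 70 * [Wu Yang; Zhu Zhou: "Prediction of residential customers at risk of electricity arrears using an LR-Bagging algorithm improved by feature selection", Electronic Engineering & Product World]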

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331285B (en) * 2020-07-10 2023-01-10 青岛国新健康产业科技有限公司 Case grouping method, case grouping device, electronic equipment and storage medium
CN112331285A (en) * 2020-07-10 2021-02-05 青岛国新健康产业科技有限公司 Case grouping method, case grouping device, electronic equipment and storage medium
CN112116225A (en) * 2020-09-07 2020-12-22 中国人民解放军63921部队 Fighting efficiency evaluation method and device for equipment system, and storage medium
CN112308293A (en) * 2020-10-10 2021-02-02 北京贝壳时代网络科技有限公司 Default probability prediction method and device
CN112102074A (en) * 2020-10-14 2020-12-18 深圳前海弘犀智能科技有限公司 Grading card modeling method
CN112102074B (en) * 2020-10-14 2024-01-30 深圳前海弘犀智能科技有限公司 Score card modeling method
CN112486842A (en) * 2020-12-17 2021-03-12 中国农业银行股份有限公司 Product testing method and device
CN112734560A (en) * 2020-12-31 2021-04-30 深圳前海微众银行股份有限公司 Variable construction method, device, equipment and computer readable storage medium
CN112734560B (en) * 2020-12-31 2024-05-14 深圳前海微众银行股份有限公司 Variable construction method, device, equipment and computer readable storage medium
CN113485910A (en) * 2021-06-07 2021-10-08 广发银行股份有限公司 Test risk early warning method, system, equipment and storage medium
CN113537607A (en) * 2021-07-23 2021-10-22 国网青海省电力公司信息通信公司 Power failure prediction method
CN113537607B (en) * 2021-07-23 2022-08-05 国网青海省电力公司信息通信公司 Power failure prediction method
CN113469374A (en) * 2021-09-02 2021-10-01 北京易真学思教育科技有限公司 Data prediction method, device, equipment and medium
CN114491416A (en) * 2022-02-23 2022-05-13 北京百度网讯科技有限公司 Characteristic information processing method and device, electronic equipment and storage medium
CN116433403A (en) * 2023-06-14 2023-07-14 国网安徽省电力有限公司营销服务中心 Account tracking-based electric enterprise accounts receivable early warning method and system
CN117391836A (en) * 2023-07-26 2024-01-12 人上融融(江苏)科技有限公司 Method for modeling overdue probability based on heterogeneous integration of different labels
CN118094339A (en) * 2024-04-17 2024-05-28 中海油田服务股份有限公司 Stratum temperature prediction method and device and computing equipment

Similar Documents

Publication Publication Date Title
CN111178675A (en) LR-Bagging algorithm-based electric charge recycling risk prediction method, system, storage medium and computer equipment
Yang et al. SMAA-based model for decision aiding using regret theory in discrete Z-number context
CN108520357B (en) Method and device for judging line loss abnormality reason and server
Wang et al. Predicting construction cost and schedule success using artificial neural networks ensemble and support vector machines classification models
Cho et al. A hybrid approach based on the combination of variable selection using decision trees and case-based reasoning using the Mahalanobis distance: For bankruptcy prediction
CN107633265A (en) For optimizing the data processing method and device of credit evaluation model
CN110852856B (en) Invoice false invoice identification method based on dynamic network representation
Kou et al. An integrated expert system for fast disaster assessment
CN104036360B (en) User data processing system and processing method based on magcard attendance behaviors
CN110930198A (en) Electric energy substitution potential prediction method and system based on random forest, storage medium and computer equipment
CN112700324A (en) User loan default prediction method based on combination of Catboost and restricted Boltzmann machine
TWI677830B (en) Method and device for detecting key variables in a model
CN111652403A (en) Feedback correction-based work platform task workload prediction method
CN113393316A (en) Loan overall process accurate wind control and management system based on massive big data and core algorithm
CN111090679B (en) Time sequence data representation learning method based on time sequence influence and graph embedding
CN115545342A (en) Risk prediction method and system for enterprise electric charge recovery
CN114626940A (en) Data analysis method and device and electronic equipment
CN114066173A (en) Capital flow behavior analysis method and storage medium
CN113837481A (en) Financial big data management system based on block chain
Nureni et al. Loan approval prediction based on machine learning approach
Zeng A comparison study on the era of internet finance China construction of credit scoring system model
Lv et al. Detecting pyramid scheme accounts with time series financial transactions
Nascimento et al. Applying Machine Learning to Improve Collection and to Reduce Write-Offs in Utilities
Wasito et al. TIME SERIES CLASSIFICATION FOR FINANCIAL STATEMENT FRAUD DETECTION USING RECURRENT NEURAL NETWORKS BASED APPROACHES
Chandrasekaran et al. Uncertainty-Aware Functional Analysis for Electricity Consumption Prediction Using Multi-Task Optimization Learning Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination