CN115660730A - Loss user analysis method and system based on classification algorithm - Google Patents

Loss user analysis method and system based on classification algorithm

Info

Publication number
CN115660730A
Authority
CN
China
Prior art keywords
data
user
model
target
target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211331442.5A
Other languages
Chinese (zh)
Inventor
冯瑞雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Communication Information System Co Ltd
Original Assignee
Inspur Communication Information System Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Communication Information System Co Ltd filed Critical Inspur Communication Information System Co Ltd
Priority to CN202211331442.5A priority Critical patent/CN115660730A/en
Publication of CN115660730A publication Critical patent/CN115660730A/en
Pending legal-status Critical Current

Abstract

The invention discloses a loss user analysis method and system based on a classification algorithm, belonging to the technical field of data processing, and addresses the technical problem of how to analyze the users of a telecommunication operator in order to predict lost users. The method comprises the following steps: taking the preprocessed historical data as target data, and screening the target data based on its degree of influence on user loss; performing dimension reduction processing and importance sorting on the target data; constructing a logistic regression model and a random forest model for user loss prediction, and training each model on the sample data and the corresponding labels to obtain a trained logistic regression model and a trained random forest model; and performing user loss prediction on the input data through the trained logistic regression model and the trained random forest model respectively.

Description

Loss user analysis method and system based on classification algorithm
Technical Field
The invention relates to the technical field of data processing, in particular to a loss user analysis method and system based on a classification algorithm.
Background
Telecommunication operators have very large user bases. The technical problem to be solved is how to retain users and reduce user loss: how to give early warning of and analyze users who may be lost, take timely measures to win them back, and reduce the enterprise's losses as far as possible.
Disclosure of Invention
In view of the above deficiencies, the technical task of the invention is to provide a loss user analysis method and system based on a classification algorithm, so as to solve the technical problem of how to analyze the users of a telecommunication operator in order to predict lost users.
In a first aspect, the invention relates to a loss user analysis method based on a classification algorithm, which comprises the following steps:
obtaining historical user data of a telecom operator, and labeling the historical user data, wherein the historical user data comprises user basic information, user contract information, user usage information and user change information, and the label indicates whether the user is a lost user;
with the historical user data as target data, carrying out data preprocessing on the target data, deleting abnormal values through the data preprocessing, and carrying out code conversion on qualitative data to obtain the preprocessed historical user data;
taking the preprocessed historical data as target data, and screening the target data based on the influence degree of the target data on user loss to obtain screened historical user data;
taking the screened historical user data as target data, performing dimensionality reduction processing and importance sorting on the target data, and taking the sorted historical user data as sample data;
constructing a logistic regression model and a random forest model for user loss prediction, and respectively performing model training on the logistic regression model and the random forest model based on the sample data and corresponding labels to obtain a trained logistic regression model and a trained random forest model;
acquiring real-time user data of a telecom operator, wherein the real-time user data comprises user basic information, user contract information, user usage information and user change information;
taking the real-time user data as target data, carrying out data preprocessing on the target data, deleting abnormal values through data preprocessing and carrying out coding conversion on qualitative data to obtain preprocessed real-time user data, and taking the preprocessed real-time user data as input data;
and respectively carrying out user loss prediction on the input data through the trained logistic regression model and the trained random forest model.
Preferably, abnormal values are deleted through data preprocessing, wherein the abnormal values comprise user data that show voice, data or short message usage within a month but a consumption amount of zero;
the qualitative data, which is non-quantifiable, is encoded in the form of Boolean values and thereby converted into quantitative data.
Preferably, based on the influence degree of the target data on the user loss, the data screening is performed on the target data, and the method comprises the following steps:
performing data analysis on the preprocessed user data in a visual analysis mode, screening out data with the influence on user loss smaller than a threshold value, and obtaining the analyzed user data;
and carrying out correlation analysis on the analyzed user data based on the label corresponding to the user data, and screening out data with correlation smaller than a threshold value with the user loss to obtain the screened user data.
Preferably, the target data is subjected to dimension reduction processing through a principal component analysis method, and importance ranking is performed on the target data subjected to dimension reduction processing through a trained GBDT model based on the influence degree of the target data on user churn.
Preferably, after user loss prediction is performed on the input data through the trained logistic regression model and the trained random forest model, whichever of the two models has the higher prediction accuracy, judged against the actual user loss, is selected as the target model, and the target model is further trained based on the real-time user data and the corresponding actual user loss.
In a second aspect, the present invention provides a loss user analysis system based on a classification algorithm, for performing loss user analysis on a telecom operator by the loss user analysis method based on a classification algorithm according to any one of the first aspect, the system comprising:
the data acquisition module is used for acquiring historical user data of a telecom operator and labeling the historical user data, wherein the historical user data comprises user basic information, user contract information, user usage information and user change information, and the label indicates whether the user is a lost user; the data acquisition module is also used for acquiring real-time user data of the telecom operator, wherein the real-time user data comprises user basic information, user contract information, user usage information and user change information;
the data preprocessing module is used for preprocessing the target data by taking the historical user data as the target data, deleting abnormal values through data preprocessing and carrying out code conversion on qualitative data to obtain preprocessed historical user data; the data preprocessing module is used for preprocessing the target data by taking the real-time user data as the target data, deleting abnormal values through data preprocessing and carrying out code conversion on qualitative data to obtain preprocessed real-time user data, and taking the preprocessed real-time user data as input data;
the data screening module is used for screening the target data by taking the preprocessed historical data as the target data based on the influence degree of the target data on user loss to obtain the screened historical user data;
the data dimension reduction sorting module is used for taking the screened historical user data as target data, performing dimension reduction processing on the target data, performing importance sorting, and taking the sorted historical user data as sample data;
the model training module is used for constructing a logistic regression model and a random forest model for user loss prediction, and respectively performing model training on the logistic regression model and the random forest model based on the sample data and corresponding labels to obtain a trained logistic regression model and a trained random forest model;
and the prediction analysis module is used for predicting the user loss of the input data through the trained logistic regression model and the trained random forest model respectively.
Preferably, the data preprocessing module is used for deleting abnormal values through data preprocessing, wherein the abnormal values comprise user data that show voice, data or short message usage within a month but a consumption amount of zero;
the data preprocessing module is also used for encoding the qualitative data, which is non-quantifiable, in the form of Boolean values, thereby converting it into quantitative data.
Preferably, the data screening module is configured to perform the following:
performing data analysis on the preprocessed user data in a visual analysis mode, screening out data with the influence on user loss smaller than a threshold value, and obtaining the analyzed user data;
and carrying out correlation analysis on the analyzed user data based on the label corresponding to the user data, and screening out data with correlation smaller than a threshold value with the user loss to obtain the screened user data.
Preferably, the data dimension reduction sorting module is configured to perform dimension reduction processing on the target data through a principal component analysis method, and is configured to perform importance sorting on the target data after the dimension reduction processing through a trained GBDT model based on the influence degree of the target data on the user churn.
Preferably, the system further comprises a model iterative training module. After the trained logistic regression model and the trained random forest model have performed user loss prediction on the input data, the model iterative training module selects whichever of the two models has the higher prediction accuracy, judged against the actual user loss, as the target model, and further trains the target model based on the real-time user data and the corresponding actual user loss.
The loss user analysis method and system based on a classification algorithm have the following advantages:
1. The key factors influencing user loss are analyzed, a random forest model and a logistic regression model are trained on the user data that influences user loss, and the screened real-time user data is predicted through the trained random forest model and the trained logistic regression model respectively, so that the accuracy of user loss analysis can be improved;
2. Dimension reduction and sorting are performed on the screened data, which improves operating efficiency and accuracy.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow block diagram of an attrition user analysis method based on a classification algorithm according to embodiment 1;
FIG. 2 is a schematic diagram illustrating analysis of user data by a pie chart in an attrition user analysis method based on a classification algorithm according to embodiment 1;
FIG. 3 is a schematic diagram illustrating analysis of user data by a histogram in an attrition user analysis method based on a classification algorithm according to embodiment 1;
FIG. 4 is a schematic diagram illustrating analysis of user data by a density map in an attrition user analysis method based on a classification algorithm according to embodiment 1;
FIG. 5 is a schematic diagram illustrating analysis of user data by a density map in an attrition user analysis method based on a classification algorithm according to embodiment 1.
Detailed Description
The present invention is further described in the following with reference to the drawings and the specific embodiments so that those skilled in the art can better understand the present invention and can implement the present invention, but the embodiments are not to be construed as limiting the present invention, and the embodiments and the technical features of the embodiments can be combined with each other without conflict.
The embodiment of the invention provides a lost user analysis method and system based on a classification algorithm, which are used for solving the technical problem of how to analyze users of a telecommunication operator to predict lost users.
Example 1:
The invention relates to a loss user analysis method based on a classification algorithm, which comprises the following steps:
S100, obtaining historical user data of a telecom operator, and labeling the historical user data, wherein the historical user data comprises user basic information, user contract information, user usage information and user change information, and the label indicates whether the user is a lost user;
S200, with the historical user data as target data, carrying out data preprocessing on the target data, deleting abnormal values through the data preprocessing, and carrying out code conversion on qualitative data to obtain the preprocessed historical user data;
S300, taking the preprocessed historical data as target data, and carrying out data screening on the target data based on the influence degree of the target data on user loss to obtain screened historical user data;
S400, with the screened historical user data as target data, performing dimensionality reduction processing and importance sorting on the target data, and taking the sorted historical user data as sample data;
S500, constructing a logistic regression model and a random forest model for user loss prediction, and respectively performing model training on the logistic regression model and the random forest model based on the sample data and corresponding labels to obtain a trained logistic regression model and a trained random forest model;
S600, acquiring real-time user data of a telecom operator, wherein the real-time user data comprises user basic information, user contract information, user usage information and user change information;
S700, with the real-time user data as target data, carrying out data preprocessing on the target data, deleting abnormal values through the data preprocessing, and carrying out code conversion on qualitative data to obtain preprocessed real-time user data;
S800, taking the preprocessed real-time user data as target data, performing data screening on the target data based on the screened historical user data, and taking the real-time user data that affects user loss as the screened real-time user data;
S900, taking the screened real-time user data as target data, performing dimension reduction processing and importance sorting on the target data, and taking the sorted real-time user data as input data;
SA00, performing user loss prediction on the input data through the trained logistic regression model and the trained random forest model respectively.
The user basic information, the user contract information, the user usage information, and the user change information obtained in this embodiment are respectively as follows:
TABLE 1 basic information of users
Figure BDA0003913638470000071
TABLE 2 user contract information
Figure BDA0003913638470000072
TABLE 3 user usage information
Figure BDA0003913638470000073
TABLE 4 user Change information
Figure BDA0003913638470000081
Step S200 preprocesses the target data, including abnormal value deletion and data encoding.
Abnormal value deletion: data showing voice, data or short message usage within the month but a consumption amount of zero are deleted.
Data encoding: qualitative data, that is, non-quantifiable data such as gender and address (as distinct from quantitative data such as age and weight), is encoded. Boolean values are used for the encoding; for example, the gender field is converted into two fields, "is male" and "is female".
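The two preprocessing operations described above can be sketched in Python as follows. This is a minimal illustration only: the field names (`voice_minutes`, `data_mb`, `sms_count`, `monthly_fee`, `gender`) are hypothetical and do not come from Tables 1 to 4, and "consumption amount of zero" is assumed to mean a zero monthly charge.

```python
def drop_abnormal(records):
    """Drop records that show usage in the month but a zero consumption amount."""
    kept = []
    for r in records:
        has_usage = r["voice_minutes"] > 0 or r["data_mb"] > 0 or r["sms_count"] > 0
        if has_usage and r["monthly_fee"] == 0:
            continue  # abnormal value: usage recorded, nothing charged
        kept.append(r)
    return kept


def encode_gender(record):
    """Replace the qualitative 'gender' field with two Boolean fields."""
    encoded = {k: v for k, v in record.items() if k != "gender"}
    encoded["is_male"] = record["gender"] == "male"
    encoded["is_female"] = record["gender"] == "female"
    return encoded


# Hypothetical records, for illustration only.
records = [
    {"gender": "male", "voice_minutes": 120, "data_mb": 500, "sms_count": 3, "monthly_fee": 0},
    {"gender": "female", "voice_minutes": 80, "data_mb": 0, "sms_count": 10, "monthly_fee": 35},
]
cleaned = [encode_gender(r) for r in drop_abnormal(records)]
```

Here the first record is dropped as abnormal, and the second keeps all of its fields except `gender`, which becomes `is_male` and `is_female`.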
The data analysis of step S300 includes data analysis in two aspects, which are respectively:
(1) Performing data analysis on the preprocessed user data in a visual analysis mode, screening out data with the influence on user loss smaller than a threshold value, and obtaining the analyzed user data;
(2) And performing correlation analysis on the analyzed user data based on the label corresponding to the user data, and screening out data with the correlation with the user loss smaller than a threshold value to obtain the screened user data.
Step (1) is mainly descriptive data analysis: the data are given a preliminary analysis through visualization and charts. The aim is to analyze the distribution of the data, find regularities and draw conclusions: a factor with a large influence on user loss is subsequently considered as model input, while a factor with a small influence is discarded.
In a specific implementation, the proportion of lost users is analyzed through a pie chart (FIG. 2). As seen from the pie chart, lost customers account for less than one tenth of all customers, with a loss rate of 7.41%.
The influence of the user's gender, payment type and age on user loss is analyzed through histograms. FIG. 3 shows that lost customers are concentrated among post-paid customers. Because the data come from an operator in the Middle East region, there are few female customers, so gender is not taken into account. In terms of age, the loss rate is high between 25 and 35.
The user consumption attributes are analyzed through density maps. FIG. 4 shows that the consumption of lost users is concentrated in the range 0-200, and FIG. 5 shows that users are most likely to be lost at about 1000 days after network access.
Data items that affect user loss, such as age, gender and data traffic consumption, are referred to as features. The data analysis described above is a descriptive analysis whose conclusions are drawn from charts, in order to identify the features that affect user loss.
Based on the above data analysis, the features that are not to be considered are deleted; the remaining features are then preliminarily screened through correlation analysis, and features with weak correlation are deleted.
The correlation analysis is also presented visually. Linear correlation between features is computed with the following formula, where r takes a value between -1 and 1 and corresponds to the red-to-green color scale in the following graph. Features with zero correlation to the label (whether the user is lost) are removed.
Figure BDA0003913638470000091
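The formula referred to above is not reproduced in this text. Assuming it is the standard Pearson correlation coefficient (an assumption, consistent with r lying between -1 and 1), the correlation screening can be sketched in Python as below. The feature names, values and threshold are hypothetical, and returning 0 for a constant feature is a convention chosen for this sketch.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated in this sketch
    return cov / math.sqrt(var_x * var_y)


def screen_features(columns, labels, threshold):
    """Keep the features whose absolute correlation with the label reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson_r(values, labels)) >= threshold]


# Hypothetical data: label 1 = lost user, 0 = retained user.
labels = [1, 0, 1, 0]
columns = {
    "monthly_fee": [10.0, 50.0, 20.0, 60.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],
}
kept = screen_features(columns, labels, threshold=0.1)
```

In this example `monthly_fee` is strongly negatively correlated with the label and is kept, while `constant_flag` carries no information and is removed.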
After data analysis is performed to screen out data affecting user churn, data dimension reduction and sorting are performed through step S400.
In this embodiment, the features with high correlation are reduced in dimension by principal component analysis, and importance ranking is performed by the trained GBDT.
Tree models are models built on decision trees, including GBDT, RF, LightGBM and the like. One of their major uses is that they expose the decision process and show which features are used for splitting.
The base model is the model on which feature selection is based. There are many feature selection approaches; when a model is used for feature selection, that model is the base model.
The GBDT model is an extension of the decision tree model formed by accumulating a plurality of decision trees. Each decision tree fits the residual of the combined prediction of the preceding trees, acting as a "correction" to the previous model result.
Owing to these characteristics, GBDT can show the decision process and determine the features and their ranking (similar to ordering a tree structure from top to bottom), so that a single training pass over the data yields the importance ranking of the features. In this scheme, training is performed with the splitting criterion set to entropy.
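The idea that each tree fits the residual left by the preceding trees can be illustrated with a deliberately simplified Python sketch. It assumes a single numeric feature, one-split trees (stumps) and squared-error residuals, none of which is stated in the source; a production GBDT for classification would use deeper trees and a classification loss.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on xs that best fits the residuals (least squares)."""
    best = None
    # The largest value is excluded so that the right-hand side is never empty;
    # this sketch therefore needs at least two distinct feature values.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_trees=20, learning_rate=0.5):
    """Accumulate stumps, each one fitted to the residual of the current combined prediction."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def boosting_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


# Hypothetical one-feature data set.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = fit_boosting(xs, ys)
```

Each added stump halves the remaining residual here, so the combined prediction moves step by step toward 0 for small feature values and toward 1 for large ones.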
GBDT, one of the tree models, is used as the base model for feature selection: the SelectFromModel class of the feature_selection library is combined with the GBDT model, and the resulting ranking is as follows:
Figure BDA0003913638470000101
Step S500 is the construction and training of the prediction models. The top 50 features by importance are used for modeling and training, and the models mainly adopted are the logistic regression model and the random forest model.
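Selecting the top-ranked features can be sketched in Python as follows. The importance scores and feature names are invented for the example and are not results of this embodiment; k is set to 2 rather than 50 to keep the example short.

```python
def top_k_features(importances, k):
    """Return the names of the k features with the highest importance score."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores, as a trained tree model might report them.
importances = {"tenure_days": 0.42, "monthly_fee": 0.31, "age": 0.18, "sms_count": 0.09}
selected = top_k_features(importances, k=2)
```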
1. For the logistic regression model
(1) Model input:
Label: whether the customer is lost.
Features: the features remaining after feature engineering screening, including the customer's basic attribute features, usage features and ordering features.
(2) Model construction:
Model:
Figure BDA0003913638470000111
The model output is a value between 0 and 1, interpreted as a probability.
Objective function:
Figure BDA0003913638470000112
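The model and objective function appear only as images in the original publication and are not reproduced in this text. Assuming the standard logistic regression formulation (an assumption, since the source formulas are unavailable), with feature vector x, parameter vector θ, labels y^{(i)} in {0, 1} and m training samples, they would read:

```latex
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}, \qquad
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Bigl[ y^{(i)} \log h_\theta\bigl(x^{(i)}\bigr)
  + \bigl(1 - y^{(i)}\bigr) \log\bigl(1 - h_\theta\bigl(x^{(i)}\bigr)\bigr) \Bigr]
```

Under this formulation h_θ(x) lies strictly between 0 and 1, which matches the probability interpretation given above.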
(3) Model training:
The objective function is minimized by gradient-based optimization to obtain the parameter solution θ, which determines the model.
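A minimal Python sketch of such gradient-based training is given below. It assumes the standard cross-entropy objective and plain batch gradient descent; the learning rate, number of epochs and toy data are hypothetical choices for illustration, not values from the source.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Predicted probability of the positive class (lost user) for one sample."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def log_loss(theta, xs, ys):
    """Average cross-entropy objective; probabilities are clipped to avoid log(0)."""
    eps = 1e-12
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict_proba(theta, x), eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(xs)


def train(xs, ys, learning_rate=0.5, epochs=2000):
    """Batch gradient descent on the cross-entropy objective."""
    theta = [0.0] * len(xs[0])
    m = len(xs)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            error = predict_proba(theta, x) - y
            for j, xj in enumerate(x):
                grad[j] += error * xj / m
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical samples; the first component is a constant 1.0 acting as the intercept.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]
theta = train(xs, ys)
```

After training, `predict_proba(theta, x)` can be compared with a threshold such as 0.5 to decide whether a user is judged likely to be lost.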
(4) Model evaluation:
The AUC is used as the model performance index.
The classification model divides the samples into positive and negative classes, so each prediction falls into one of four cases:
TP: a positive sample predicted as positive
FP: a negative sample predicted as positive
TN: a negative sample predicted as negative
FN: a positive sample predicted as negative
Precision is defined with respect to the predictions: it indicates how many of the samples predicted to be positive are truly positive.
Recall is defined with respect to the original samples: it indicates how many of the positive samples are correctly predicted.
The true positive rate measures the proportion of positive samples that are correctly predicted.
The false positive rate measures the proportion of negative samples that are mistaken for positive.
Figure BDA0003913638470000121
Figure BDA0003913638470000122
Figure BDA0003913638470000123
Figure BDA0003913638470000124
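The four formulas appear only as images in the original publication. Assuming the standard definitions (precision = TP / (TP + FP), recall = true positive rate = TP / (TP + FN), false positive rate = FP / (FP + TN)), they can be computed as in the following Python sketch. The labels and predictions are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN and FN, with 1 as the positive class and 0 as the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Standard evaluation rates derived from the confusion counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical labels and predictions (1 = lost user).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
metrics = rates(tp, fp, tn, fn)
```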
The x-axis of the ROC curve is the false positive rate and the y-axis is the true positive rate; the curve describes how the number of correctly classified positive samples changes with the number of negative samples misclassified as positive. The closer the curve is to the upper left corner, the better the prediction performance. The area under the curve (AUC) is an important index of model quality: the larger the area, the better the model separates positive and negative samples.
The AUC is thus the evaluation index of the model: the larger the AUC, the better the model performance.
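As an illustration, the area under the ROC curve can be computed in Python without plotting the curve, using the known equivalence between the AUC and the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties counted as one half). The scores below are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs that the scores rank correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = lost user) and predicted loss probabilities.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
auc = roc_auc(y_true, scores)
```

Three of the four positive/negative pairs are ranked correctly in this example, so the AUC is 0.75; a model that ranks every pair correctly would reach 1.0.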
(5) Modeling results:
With cross-validation, the highest mean AUC was 0.862.
(6) Online use of the model:
Given the basic information, usage information and ordering information of a user, the model predicts the user's loss probability; if the probability is greater than 0.5, the user is judged likely to be lost.
2. For the random forest model
(1) Model input:
Label: whether the customer is lost.
Features: the features remaining after feature engineering screening, including the customer's basic attribute features, usage features and ordering features.
(2) Model construction:
Model: random forest model
A random forest is a forest built in a random manner: it consists of many decision trees, and the decision trees in the forest are independent of one another. Once the forest is obtained, each decision tree judges a new input sample and decides which class it should belong to; the sample is then assigned to the class chosen by the most trees.
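The voting step can be sketched in Python as below. The three "trees" are hand-written stand-ins with hypothetical feature names and thresholds; in practice each tree would be learned from data.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Let every tree vote and return the majority class with its share of the votes."""
    votes = [tree(sample) for tree in trees]
    winner, count = Counter(votes).most_common(1)[0]
    return winner, count / len(votes)


# Hypothetical hand-written trees: each maps a sample to 1 (lost) or 0 (retained).
trees = [
    lambda s: 1 if s["tenure_days"] < 400 else 0,
    lambda s: 1 if s["monthly_fee"] < 50 else 0,
    lambda s: 1 if s["complaints"] > 2 else 0,
]
label, share = forest_predict(trees, {"tenure_days": 300, "monthly_fee": 80, "complaints": 3})
```

Two of the three trees vote for class 1 here, so the sample is assigned to class 1 with a vote share of two thirds.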
(3) Model training:
A training set S, a test set T and a feature dimension F are given.
Parameters to be determined: the number of decision trees t, the depth of each tree d, the number of features used at each node f, and the termination conditions, namely the minimum number of samples on a node s and the minimum information gain on a node m. This scheme uses grid search to find the optimal parameters (splitting criterion = entropy, maximum depth of a decision tree = 10, minimum number of samples required to split a node = 4).
For the i-th tree, i = 1 to t:
A training set S(i) of the same size as S is drawn from S by sampling with replacement and used as the samples of the root node, and training starts from the root node. If the termination condition is met on the current node, the current node is set as a leaf node whose prediction output is the class c(j) with the largest count in the node's sample set, with probability p equal to the proportion of c(j) in that sample set; otherwise the node is split further.
This is repeated until all nodes are trained or marked as leaf nodes.
The above steps are repeated until all the decision trees are trained.
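The sampling and leaf-node rules of the training procedure above can be sketched in Python as follows. The sketch covers only drawing S(i), the termination test and the leaf prediction (c(j), p); node splitting is omitted. The data are hypothetical, the depth and sample limits reuse the grid search values quoted above, and the extra "all labels identical" stop is an assumption added for the sketch.

```python
import random
from collections import Counter


def bootstrap_sample(training_set, rng):
    """Draw a sample of the same size as the training set, with replacement."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


def leaf_prediction(labels):
    """Return the most frequent class c(j) in the node and its proportion p."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def should_stop(labels, depth, max_depth=10, min_samples_split=4):
    """Termination test: depth limit reached, too few samples, or a pure node."""
    return depth >= max_depth or len(labels) < min_samples_split or len(set(labels)) == 1


# Hypothetical training set S of (features, label) pairs; label 1 = lost user.
rng = random.Random(0)
S = [({"age": 30}, 1), ({"age": 45}, 0), ({"age": 27}, 1), ({"age": 52}, 0), ({"age": 33}, 1)]
S_i = bootstrap_sample(S, rng)
c_j, p = leaf_prediction([label for _, label in S_i])
```

Because sampling is done with replacement, S(i) usually repeats some samples of S and omits others, which is what makes the trees of the forest differ from one another.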
(4) Model evaluation:
The AUC is used as the model performance index.
The classification model divides the samples into positive and negative classes, so each prediction falls into one of four cases:
TP: a positive sample predicted as positive
FP: a negative sample predicted as positive
TN: a negative sample predicted as negative
FN: a positive sample predicted as negative
Precision is defined with respect to the predictions: it indicates how many of the samples predicted to be positive are truly positive.
Recall is defined with respect to the original samples: it indicates how many of the positive samples are correctly predicted.
The true positive rate measures the proportion of positive samples that are correctly predicted.
The false positive rate measures the proportion of negative samples that are mistaken for positive.
Figure BDA0003913638470000141
Figure BDA0003913638470000142
Figure BDA0003913638470000143
Figure BDA0003913638470000144
The x-axis of the ROC curve is the false positive rate and the y-axis is the true positive rate; the curve describes how the number of true positive samples changes with the number of false positive samples. The closer the curve is to the upper left corner, the better the prediction performance. The area under the curve (AUC) is an important index of model quality: the larger the area, the better the model separates positive and negative samples.
The AUC is thus the evaluation index of the model: the larger the AUC, the better the model performance.
(5) Modeling results:
With cross-validation, the highest mean AUC was 0.786.
(6) Online use of the model:
Given the basic information, usage information and ordering information of a user, the model predicts the user's loss probability; if the probability is greater than 0.5, the user is judged likely to be lost.
Lost users are predicted through the trained logistic regression model and the trained random forest model.
The collected real-time user data is preprocessed in step S700, including abnormal value deletion and data encoding of the target data.
Abnormal value deletion: data showing voice, data or short message usage within the month but a consumption amount of zero are deleted.
Data encoding: qualitative data, that is, non-quantifiable data such as gender and address (as distinct from quantitative data such as age and weight), is encoded. Boolean values are used for the encoding; for example, the gender field is converted into two fields, "is male" and "is female".
Steps S800 to SA00 are then executed to perform lost user analysis on the preprocessed real-time user data.
In this embodiment, after user loss prediction is performed on the input data through the trained logistic regression model and the trained random forest model, whichever of the two models has the higher prediction accuracy, judged against the actual user loss, is selected as the target model, and the target model is further trained based on the real-time user data and the corresponding actual user loss.
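The selection of the target model can be sketched in Python as follows, assuming that prediction accuracy means the share of users whose predicted status matches the status later observed. The predictions and outcomes are hypothetical and are not results of this embodiment.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the observed outcome."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def select_target_model(predictions_by_model, actual):
    """Return the name of the most accurate model together with all accuracy scores."""
    scores = {name: accuracy(actual, preds) for name, preds in predictions_by_model.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical observed outcomes (1 = lost user) and per-model predictions.
actual = [1, 0, 0, 1, 0]
predictions_by_model = {
    "logistic_regression": [1, 0, 0, 1, 1],
    "random_forest": [1, 1, 0, 0, 1],
}
target, scores = select_target_model(predictions_by_model, actual)
```

The selected model would then be retrained on the real-time user data together with the observed outcomes, as described above.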
Example 2:
The invention discloses a loss user analysis system based on a classification algorithm, which comprises a data acquisition module, a data preprocessing module, a data screening module, a data dimension reduction sorting module, a model training module and a prediction analysis module; the system performs loss user analysis on a telecom operator based on the method disclosed in Example 1.
The data acquisition module is used for acquiring historical user data of a telecom operator and labeling the historical user data, wherein the historical user data comprises user basic information, user contract information, user usage information and user change information, and the label indicates whether the user is a lost user. The module is also used for acquiring real-time user data of the telecom operator, wherein the real-time user data comprises user basic information, user contract information, user usage information and user change information.
In this embodiment, the collected user basic information, user contract information, user usage information, and user change information are all the same as those disclosed in embodiment 1.
The data preprocessing module is used for preprocessing the target data by taking the historical user data as the target data, deleting abnormal values through data preprocessing and performing code conversion on qualitative data to obtain preprocessed historical user data; and the data preprocessing unit is used for preprocessing the target data by taking the real-time user data as the target data, deleting abnormal values through data preprocessing and carrying out code conversion on qualitative data to obtain preprocessed real-time user data, and the preprocessed real-time user data is taken as input data.
As a specific implementation, the data preprocessing module is configured to perform the following preprocessing:
(1) Outlier deletion: deleting records that show voice, data and short message usage in the month but a consumption amount of zero.
(2) Data encoding: qualitative data is code-converted. Qualitative data is non-quantifiable data, such as gender and address, as distinct from quantitative data, such as age and weight. Boolean values are used for the code conversion, for example converting a gender field into two fields: whether the user is male and whether the user is female.
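The two preprocessing rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names (`voice_minutes`, `data_mb`, `sms_count`, `monthly_spend`, `gender`) are hypothetical, and the sketch assumes that usage in any of the three services combined with zero consumption marks a record as abnormal.

```python
def preprocess(records):
    """Drop abnormal rows, then Boolean-encode the qualitative gender field."""
    cleaned = []
    for row in records:
        has_usage = (
            row["voice_minutes"] > 0 or row["data_mb"] > 0 or row["sms_count"] > 0
        )
        # Outlier rule (assumed reading): usage in the month but zero consumption.
        if has_usage and row["monthly_spend"] == 0:
            continue
        encoded = {key: value for key, value in row.items() if key != "gender"}
        # One Boolean field per category, as in the gender example.
        encoded["is_male"] = row["gender"] == "male"
        encoded["is_female"] = row["gender"] == "female"
        cleaned.append(encoded)
    return cleaned


# Hypothetical sample records; the second one has usage but zero spend.
records = [
    {"voice_minutes": 120, "data_mb": 500, "sms_count": 3,
     "monthly_spend": 45.0, "gender": "male"},
    {"voice_minutes": 30, "data_mb": 0, "sms_count": 0,
     "monthly_spend": 0.0, "gender": "female"},
    {"voice_minutes": 0, "data_mb": 0, "sms_count": 0,
     "monthly_spend": 0.0, "gender": "female"},
]
result = preprocess(records)
```

A record with no usage and no consumption is kept, since the rule only targets the inconsistent combination.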
And the data screening module is used for screening the target data by taking the preprocessed historical data as the target data based on the influence degree of the target data on the user loss to obtain the screened historical user data.
As a specific implementation, the data screening module is configured to perform the following:
(1) Performing data analysis on the preprocessed user data in a visual analysis mode, screening out data with the influence on user loss smaller than a threshold value, and obtaining the analyzed user data;
(2) And carrying out correlation analysis on the analyzed user data based on the label corresponding to the user data, and screening out data with correlation smaller than a threshold value with the user loss to obtain the screened user data.
Step (1) is mainly descriptive data analysis; the basic approach is a preliminary analysis of the data through visualization and charts. The aim is to analyze the distribution of the data, find patterns, and draw conclusions: a factor with a large influence on user loss is subsequently considered as model input, while a factor with a small influence is discarded.
In a specific implementation, the proportion of lost users is analyzed with a pie chart (Fig. 2). The pie chart shows that lost customers account for a little under one tenth of all customers, with a loss rate of 7.41%.
The influence of the user's gender, payment type, and age attributes on user loss is analyzed with histograms. Fig. 3 shows that lost customers are concentrated among post-paid customers. Because the data comes from a carrier in the Middle East region, there are few female customers, so gender is not taken into account. In terms of age, the loss rate is high between 25 and 35.
User consumption attributes are analyzed with density plots. Fig. 4 shows that the consumption of lost users is concentrated in the 0-200 range, and Fig. 5 shows that users are more likely to be lost when the number of days on the network is about 1000.
Data that affect user loss, such as age, gender, and data-traffic consumption, are called features. The data analysis described above is a descriptive analysis that draws conclusions from visual charts in order to identify the features that affect user churn.
Based on the above data analysis, features that are not considered are deleted; the remaining features are then preliminarily screened through correlation analysis, and features with weak correlation are deleted.
The correlation analysis is also a visual method. Linear correlation analysis between features is performed using the following formula, where r takes a value between -1 and 1 and corresponds to the red-green color scale in the graph below. Features with zero correlation to the label (whether the user is lost) are removed.
Figure BDA0003913638470000171
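The formula image is not reproduced in this text. Assuming it is the Pearson correlation coefficient, which matches the description of a linear correlation value r between -1 and 1, the screening step might look like the sketch below. The feature names, the sample values, and the 0.5 threshold are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def screen_features(columns, labels, threshold):
    """Keep features whose absolute correlation with the label meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson_r(values, labels)) >= threshold
    ]


# Hypothetical feature columns and churn labels (1 = lost, 0 = retained).
columns = {
    "days_on_network": [100, 200, 300, 400, 500, 600],
    "monthly_spend": [50, 20, 45, 25, 40, 30],
    "constant_flag": [1, 1, 1, 1, 1, 1],
}
labels = [1, 1, 1, 0, 0, 0]
kept = screen_features(columns, labels, threshold=0.5)
```

The absolute value is used so that a strong negative correlation, such as fewer days on the network going with more churn in this toy data, is retained as well.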
And the data dimension reduction sorting module is used for performing dimension reduction processing and importance sorting on the target data by taking the screened historical user data as target data, and taking the sorted historical user data as sample data.
As a specific implementation, the data dimension reduction sorting module is used for performing dimension reduction on highly correlated features by principal component analysis, and for performing importance sorting with the trained GBDT model.
A tree model is a model based on decision trees, including GBDT, RF, LightGBM, and the like. A major benefit of tree models is that they can show the decision process and which features are used for splitting.
The base model is the model used for feature selection. There are many feature selection methods; in model-based selection, the model that performs the selection is called the base model.
The GBDT model is an extension of the decision tree model and is formed by accumulating a plurality of decision trees; each decision tree fits the residual of the combined prediction of the preceding trees, acting as a 'correction' to the previous model result.
Owing to these characteristics, GBDT can show the decision process and determine the features and their ranking (similar to the top-to-bottom order of a tree structure); the result of training on the data once is the importance ranking of the features. This scheme is trained with criterion = entropy.
GBDT is used as the base model for feature selection, in combination with the SelectFromModel class of the feature_selection library.
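A minimal sketch of GBDT-based feature selection with scikit-learn is given below, assuming that the feature_selection library mentioned above is `sklearn.feature_selection`. The synthetic data, the hyperparameters, and the `"mean"` threshold are illustrative assumptions; the sketch uses the default split criterion of `GradientBoostingClassifier` rather than the entropy setting mentioned in the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (hypothetical): only the first column determines the label.
rng = np.random.default_rng(0)
n = 400
informative = rng.normal(size=n)
noise_a = rng.normal(size=n)
noise_b = rng.normal(size=n)
X = np.column_stack([informative, noise_a, noise_b])
y = (informative > 0).astype(int)

# GBDT as the base model; SelectFromModel keeps features whose importance
# is at least the mean importance.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
selector = SelectFromModel(gbdt, threshold="mean").fit(X, y)

importances = selector.estimator_.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
selected_mask = selector.get_support()   # Boolean mask of retained features
```

Here `ranking` gives the importance order and `selected_mask` the retained columns; `selector.transform(X)` would return the reduced feature matrix.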
The model training module is used for constructing a logistic regression model and a random forest model for user loss prediction, and respectively performing model training on the logistic regression model and the random forest model based on the sample data and the corresponding labels to obtain the trained logistic regression model and the trained random forest model.
As a specific implementation:
1. For the logistic regression model
(1) Model input:
Label: whether the customer is lost
Features: the features retained after feature engineering screening, including basic attribute features of the customer, customer usage features and customer ordering features.
(2) Model construction:
Model:
Figure BDA0003913638470000181
The model output is a value between 0 and 1, interpreted as a probability.
Determining an objective function:
Figure BDA0003913638470000182
(3) Model training:
The objective function is minimized by gradient-based optimization to solve for the parameter θ, which determines the model.
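The model and objective-function images are not reproduced in this text. The sketch below assumes the standard logistic regression form, that is, a sigmoid of a linear combination of the features as the model and the mean cross-entropy as the objective, minimized by batch gradient descent. The learning rate, epoch count, and toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function; output lies in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, x):
    """theta[0] is the intercept; theta[1:] pairs with the feature vector x."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    return sigmoid(z)


def log_loss(theta, X, y):
    """Mean cross-entropy objective (assumed form of the objective function)."""
    eps = 1e-12
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(theta, x), eps), 1 - eps)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(X)


def fit(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean cross-entropy."""
    theta = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for x, label in zip(X, y):
            err = predict_proba(theta, x) - label
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        theta = [t - lr * g / len(X) for t, g in zip(theta, grad)]
    return theta


# Hypothetical one-feature training data (1 = lost, 0 = retained).
X = [[0.1], [0.3], [0.4], [0.6], [0.8], [0.9]]
y = [0, 0, 0, 1, 1, 1]
theta = fit(X, y)
```

After training, `predict_proba(theta, x)` returns the estimated loss probability for a feature vector `x`.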
(4) Model evaluation:
The AUC index is used as the model performance index.
The classification model divides the samples into positive and negative classes, and each prediction falls into one of four cases:
TP: positive class predicted as positive class
FP: negative class predicted as positive class
TN: negative class predicted as negative class
FN: positive class predicted as negative class
Precision concerns the prediction result: it indicates how many of the samples predicted to be positive are truly positive.
Recall concerns the original samples: it indicates how many of the positive samples are correctly predicted.
The true positive rate measures the proportion of positive samples that are correctly predicted as positive.
The false positive rate measures the proportion of negative samples that are mistaken for positive.
Figure BDA0003913638470000191
Figure BDA0003913638470000192
Figure BDA0003913638470000193
Figure BDA0003913638470000194
The x-axis of the ROC curve is the false positive rate and the y-axis is the true positive rate; the curve describes how the number of true positive samples changes with the number of false positive samples. The closer the curve is to the upper left, the better the prediction performance. The area under the curve (AUC) is an important index for measuring the effect of the model: the larger the area, the better the model separates positive and negative samples.
AUC is an index for evaluating the model; the larger the AUC, the better the model performance.
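The evaluation quantities can be computed as in the sketch below. It uses the conventional definitions (FP is a negative sample predicted as positive, FN is a positive sample predicted as negative), since the four formula images are not reproduced here, and it computes AUC through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The labels and scores are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def rates(tp, fp, tn, fn):
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "tpr": _ratio(tp, tp + fn),  # true positive rate equals recall
        "fpr": _ratio(fp, fp + tn),
    }


def auc(y_true, scores):
    """Share of positive/negative pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted loss probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s > 0.5 else 0 for s in scores]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
metrics = rates(tp, fp, tn, fn)
area = auc(y_true, scores)
```

The `auc` function assumes that both classes are present in `y_true`.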
(5) Modeling results:
With cross-validation, the highest mean AUC was 0.862.
(6) Online use of the model:
Given the user's basic information, usage information, and ordering information as input, the model predicts the user's loss probability; if the probability is greater than 0.5, the user is judged likely to be lost.
2. For the random forest model
(1) Model input:
Label: whether the customer is lost
Features: the features retained after feature engineering screening, including basic attribute features of the customer, customer usage features and customer ordering features.
(2) Model construction:
Model: random forest model
A random forest is a forest built in a random manner. The forest contains many decision trees, and the decision trees in a random forest are independent of one another. Once the forest is obtained, each decision tree in it judges a new input sample separately to decide which class the sample should belong to, and the sample is assigned to the class chosen by the most trees.
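The voting step can be illustrated with a short sketch. The three "trees" below are hypothetical stand-in rules rather than trained decision trees; only the majority-vote aggregation is the point.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Each tree votes for a class; return the majority class and its vote share."""
    votes = Counter(tree(sample) for tree in trees)
    label, count = votes.most_common(1)[0]
    return label, count / len(trees)


# Hypothetical stand-ins for trained decision trees (1 = lost, 0 = retained).
trees = [
    lambda s: 1 if s["days_on_network"] < 1200 else 0,
    lambda s: 1 if s["monthly_spend"] < 200 else 0,
    lambda s: 1 if s["age"] < 25 else 0,
]
sample = {"days_on_network": 1000, "monthly_spend": 150, "age": 30}
label, share = forest_predict(trees, sample)
```

Two of the three stand-in trees vote for class 1, so the sample is assigned to class 1 with a vote share of two thirds.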
(3) Model training:
A training set S, a test set T, and a feature dimension F are given.
Parameters to determine: the number of decision trees t, the depth of each tree d, the number of features used at each node f, and the termination conditions: the minimum number of samples on a node s and the minimum information gain on a node m. This scheme uses grid search to find the optimal parameters (criterion = entropy, maximum depth of a decision tree = 10, minimum number of samples required to split a node = 4).
For the i-th tree, i = 1, ..., t:
A training set S(i) of the same size as S is drawn from S with replacement and used as the sample of the root node, and training starts from the root node. If the termination condition is reached on the current node, the current node is set as a leaf node; the prediction output of the leaf node is the class c(j) with the largest count in the sample set of the current node, and the probability p is the proportion of c(j) in the current sample set.
The node training step is repeated until all nodes have been trained or marked as leaf nodes.
The per-tree steps are repeated until all decision trees have been trained.
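Two pieces of the training procedure above, drawing S(i) from S with replacement and deriving a leaf's output class c(j) and probability p, can be sketched as follows. The helper names and the toy data are hypothetical, the termination check covers only the depth and sample-count conditions, and the split search itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(S, rng):
    """Draw len(S) samples from S with replacement to form S(i)."""
    return [S[rng.randrange(len(S))] for _ in range(len(S))]


def is_terminal(node_samples, depth, max_depth=10, min_samples_split=4):
    """Stop when the depth limit is hit, too few samples remain, or the node is pure."""
    labels = {label for _, label in node_samples}
    return (
        depth >= max_depth
        or len(node_samples) < min_samples_split
        or len(labels) == 1
    )


def leaf_prediction(node_samples):
    """Return the majority class c(j) and its proportion p in the node."""
    counts = Counter(label for _, label in node_samples)
    c_j, n_j = counts.most_common(1)[0]
    return c_j, n_j / len(node_samples)


# Hypothetical training set: (feature vector, label) pairs.
S = [([0.1], 0), ([0.2], 0), ([0.4], 0), ([0.6], 1), ([0.8], 1), ([0.9], 1)]
rng = random.Random(0)
S_i = bootstrap_sample(S, rng)
node = [([0.6], 1), ([0.8], 1), ([0.4], 0)]
terminal = is_terminal(node, depth=2)
c_j, p = leaf_prediction(node)
```

The defaults `max_depth=10` and `min_samples_split=4` mirror the grid-search result quoted in the text.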
(4) Model evaluation:
The AUC index is used as the model performance index.
The classification model divides the samples into positive and negative classes, and each prediction falls into one of four cases:
TP: positive class predicted as positive class
FP: negative class predicted as positive class
TN: negative class predicted as negative class
FN: positive class predicted as negative class
Precision concerns the prediction result: it indicates how many of the samples predicted to be positive are truly positive.
Recall concerns the original samples: it indicates how many of the positive samples are correctly predicted.
The true positive rate measures the proportion of positive samples that are correctly predicted as positive.
The false positive rate measures the proportion of negative samples that are mistaken for positive.
Figure BDA0003913638470000211
Figure BDA0003913638470000212
Figure BDA0003913638470000213
Figure BDA0003913638470000214
The x-axis of the ROC curve is the false positive rate and the y-axis is the true positive rate; the curve describes how the number of correctly classified positive samples changes with the number of incorrectly classified negative samples. The closer the curve is to the upper left, the better the prediction performance. The area under the curve (AUC) is an important index for measuring the effect of the model: the larger the area, the better the model separates positive and negative samples.
AUC is an index for evaluating the model; the larger the AUC, the better the model performance.
(5) Modeling results:
With cross-validation, the highest mean AUC was 0.786.
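The text does not state how the cross-validation folds were formed. As one possible reading, a plain k-fold split with the per-fold scores averaged could look like this; the fold count and the per-fold AUC values are hypothetical and are not results from the patent.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous (train, test) folds."""
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


def mean_score(fold_scores):
    """Average a metric, for example AUC, over the folds."""
    return sum(fold_scores) / len(fold_scores)


folds = k_fold_indices(10, 3)
# Hypothetical per-fold AUC values, for illustration only.
example_mean = mean_score([0.8, 0.7, 0.9])
```

In practice the data would usually be shuffled, or stratified by label, before splitting, since churned users are a small minority.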
(6) Online use of the model:
Given the user's basic information, usage information, and ordering information as input, the model predicts the user's loss probability; if the probability is greater than 0.5, the user is judged likely to be lost.
Lost users are predicted through the trained logistic regression model and the trained random forest model.
And the prediction analysis module is used for performing user loss prediction on the input data through the trained logistic regression model and the trained random forest model respectively.
As an improvement, the system further comprises a model iterative training module. After the trained logistic regression model and the trained random forest model perform user loss prediction on the input data, the model iterative training module is used for selecting whichever of the two models has the higher prediction accuracy as the target model based on the actual user loss outcomes, and for further training the target model on the real-time user data and the corresponding actual user loss outcomes.
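The selection step can be sketched as a comparison of each model's predictions against the observed outcomes. This is only an illustration under assumptions: accuracy is taken as the share of matching predictions, and the model names and prediction lists are hypothetical.

```python
def accuracy(predictions, actual):
    """Share of predictions that match the observed outcomes."""
    return sum(p == a for p, a in zip(predictions, actual)) / len(actual)


def select_target_model(predictions_by_model, actual):
    """Return the name of the most accurate model and all accuracy scores."""
    scores = {
        name: accuracy(preds, actual)
        for name, preds in predictions_by_model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical observed outcomes and model predictions (1 = lost).
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predictions_by_model = {
    "logistic_regression": [1, 0, 0, 1, 0, 1, 1, 0],
    "random_forest": [1, 0, 1, 0, 0, 1, 1, 0],
}
best, scores = select_target_model(predictions_by_model, actual)
```

The selected model would then be retrained on the real-time user data paired with these observed outcomes.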
While the invention has been shown and described in detail in the drawings and in the preferred embodiments, it is not intended to limit the invention to the embodiments disclosed, and it will be apparent to those skilled in the art that various combinations of the code auditing means in the various embodiments described above may be used to obtain further embodiments of the invention, which are also within the scope of the invention.

Claims (10)

1. A loss user analysis method based on a classification algorithm is characterized by comprising the following steps:
obtaining historical user data of a telecom operator, and labeling a label for the historical user data, wherein the historical user data comprises user basic information, user contract information, user usage information and user change information, and the label is used for indicating whether the historical user data is lost or not;
with the historical user data as target data, carrying out data preprocessing on the target data, deleting abnormal values through the data preprocessing, and carrying out code conversion on qualitative data to obtain the preprocessed historical user data;
taking the preprocessed historical data as target data, and screening the target data based on the influence degree of the target data on user loss to obtain screened historical user data;
taking the screened historical user data as target data, performing dimensionality reduction processing and importance sorting on the target data, and taking the sorted historical user data as sample data;
constructing a logistic regression model and a random forest model for user loss prediction, and respectively performing model training on the logistic regression model and the random forest model based on the sample data and corresponding labels to obtain a trained logistic regression model and a trained random forest model;
acquiring real-time user data of a telecom operator, wherein the real-time user data comprises user basic information, user contract information, user usage information and user change information;
taking the real-time user data as target data, carrying out data preprocessing on the target data, deleting abnormal values through the data preprocessing, and carrying out code conversion on qualitative data to obtain preprocessed real-time user data;
taking the preprocessed real-time user data as target data, performing data screening on the target data based on screened historical user data, and taking the real-time user data affecting user loss as screened real-time user data;
taking the screened real-time user data as target data, performing dimension reduction processing and importance sorting on the target data, and taking sorted real-time user data as input data;
and respectively carrying out user loss prediction on the input data through the trained logistic regression model and the trained random forest model.
2. The attrition user analysis method based on the classification algorithm as claimed in claim 1 wherein the abnormal values are deleted by data preprocessing, including deleting user data with voice, data and short message usage in the month, but with zero consumption;
the qualitative data is transcoded in the form of Boolean values and the non-quantifiable data is transcoded into the quantitative data.
3. The attrition user analysis method based on the classification algorithm as claimed in claim 1 wherein the data screening of the target data is performed based on the influence degree of the target data on user attrition, comprising the steps of:
performing data analysis on the preprocessed user data in a visual analysis mode, screening out data with the influence on user loss smaller than a threshold value, and obtaining the analyzed user data;
and carrying out correlation analysis on the analyzed user data based on the label corresponding to the user data, and screening out data with correlation smaller than a threshold value with the user loss to obtain the screened user data.
4. The attrition user analysis method based on the classification algorithm as claimed in claim 1 wherein the target data is dimension-reduced by a principal component analysis method, and the target data after dimension-reduction is ranked in importance by the trained GBDT model based on the degree of influence of the target data on the user attrition.
5. The attrition user analysis method based on the classification algorithm as claimed in any one of claims 1 to 4 wherein after the trained logistic regression model and random forest model are used to predict the user attrition of the input data, the logistic regression model and random forest model with high prediction accuracy are selected as the target model based on the actual situation of the user attrition, and the target data is model trained based on the real-time user data and the corresponding actual situation of the user attrition.
6. An attrition user analysis system based on a classification algorithm, for performing attrition user analysis on a telecom operator by the attrition user analysis method based on a classification algorithm according to any one of claims 1-5, the system comprising:
the data acquisition module is used for acquiring historical user data of a telecom operator and labeling a label for the historical user data, wherein the historical user data comprises user basic information, user contract information, user usage information and user change information, and the label is used for indicating whether the user is a lost user; and is used for acquiring real-time user data of the telecom operator, wherein the real-time user data comprises user basic information, user contract information, user usage information and user change information;
the data preprocessing module is used for preprocessing the target data by taking the historical user data as the target data, deleting abnormal values through data preprocessing and carrying out code conversion on qualitative data to obtain preprocessed historical user data; and is used for carrying out data preprocessing on the target data by taking the real-time user data as target data, deleting abnormal values through the data preprocessing and carrying out code conversion on qualitative data to obtain preprocessed real-time user data, and taking the preprocessed real-time user data as input data;
the data screening module is used for screening the target data by taking the preprocessed historical data as the target data based on the influence degree of the target data on user loss to obtain the screened historical user data;
the data dimension reduction sorting module is used for taking the screened historical user data as target data, performing dimension reduction processing on the target data, performing importance sorting, and taking the sorted historical user data as sample data;
the model training module is used for constructing a logistic regression model and a random forest model for user loss prediction, and respectively performing model training on the logistic regression model and the random forest model based on the sample data and corresponding labels to obtain a trained logistic regression model and a trained random forest model;
and the prediction analysis module is used for performing user loss prediction on the input data through the trained logistic regression model and the trained random forest model respectively.
7. The attrition user analysis system of claim 6 wherein the data preprocessing module is configured to delete abnormal values by data preprocessing, including deleting user data with voice, data and short message usage in the month but with zero consumption;
the data preprocessing module is used for carrying out code conversion on the qualitative data in a Boolean value mode and converting the unquantizable data into the quantitative data.
8. The system according to claim 6, wherein the data filtering module is configured to perform the following:
performing data analysis on the preprocessed user data in a visual analysis mode, screening out data with the influence on user loss smaller than a threshold value, and obtaining the analyzed user data;
and carrying out correlation analysis on the analyzed user data based on the label corresponding to the user data, and screening out data with correlation smaller than a threshold value with the user loss to obtain the screened user data.
9. The attrition user analysis system of claim 6 wherein the data dimension reduction ranking module is configured to perform dimension reduction processing on the target data through a principal component analysis method, and is configured to rank the importance of the target data after dimension reduction processing through the trained GBDT model based on the influence degree of the target data on the user attrition.
10. The attrition user analysis system based on the classification algorithm as claimed in any one of claims 6 to 9 further comprising a model iterative training module, wherein after the user attrition prediction is performed on the input data through the trained logistic regression model and the random forest model, the model iterative training module is configured to select the logistic regression model and the random forest model with high prediction accuracy as target models based on actual situations of the user attrition, and perform model training on the target data based on the real-time user data and corresponding actual situations of the user attrition.
CN202211331442.5A 2022-10-28 2022-10-28 Loss user analysis method and system based on classification algorithm Pending CN115660730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211331442.5A CN115660730A (en) 2022-10-28 2022-10-28 Loss user analysis method and system based on classification algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211331442.5A CN115660730A (en) 2022-10-28 2022-10-28 Loss user analysis method and system based on classification algorithm

Publications (1)

Publication Number Publication Date
CN115660730A true CN115660730A (en) 2023-01-31

Family

ID=84993595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211331442.5A Pending CN115660730A (en) 2022-10-28 2022-10-28 Loss user analysis method and system based on classification algorithm

Country Status (1)

Country Link
CN (1) CN115660730A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112857A (en) * 2023-10-23 2023-11-24 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Machining path recommending method suitable for industrial intelligent manufacturing
CN117112857B (en) * 2023-10-23 2024-01-05 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Machining path recommending method suitable for industrial intelligent manufacturing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination