CN111652661B - Mobile phone client user loss early warning processing method - Google Patents

Mobile phone client user loss early warning processing method

Publication number
CN111652661B
CN111652661B CN202010769480.3A CN202010769480A CN111652661B CN 111652661 B CN111652661 B CN 111652661B CN 202010769480 A CN202010769480 A CN 202010769480A CN 111652661 B CN111652661 B CN 111652661B
Authority
CN
China
Prior art keywords
user
variable
value
data
variables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010769480.3A
Other languages
Chinese (zh)
Other versions
CN111652661A (en
Inventor
邵俊
蔺静茹
张磊
曹新建
支磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Suoxinda Data Technology Co ltd
Soxinda Beijing Data Technology Co ltd
Original Assignee
Shenzhen Suoxinda Data Technology Co ltd
Soxinda Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Suoxinda Data Technology Co ltd, Soxinda Beijing Data Technology Co ltd filed Critical Shenzhen Suoxinda Data Technology Co ltd
Priority to CN202010769480.3A priority Critical patent/CN111652661B/en
Publication of CN111652661A publication Critical patent/CN111652661A/en
Application granted granted Critical
Publication of CN111652661B publication Critical patent/CN111652661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q30/0202 Market predictions or forecasting for commercial activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2115 Selection of the most significant subset of features by evaluating different subsets according to an optimisation criterion, e.g. class separability, forward selection or backward elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G06Q50/40

Abstract

The invention relates to a mobile phone client user loss early warning processing method, which comprises the following steps: collecting user information at regular intervals to form a first user information set, and digitizing it to form a first user data set; estimating a first probability value for each user in the first user data set; when the first probability value is greater than a first threshold, classifying the user as a first type of user and performing a calculation on that user's data; matching the calculation result with a question bank of the corresponding type; sending alarm information to a management platform and sending a first request to the first type of user; and adopting corresponding countermeasures based on the first request. Compared with the prior art, the method eliminates collinearity by a splitting method while preserving accuracy as far as possible. It thereby avoids the loss of important variables and of accuracy that occurs when only the single most representative variable in a cluster (for example, the variable most correlated with the principal component) is kept in order to eliminate collinearity, and so improves the accuracy of the early warning processing.

Description

Mobile phone client user loss early warning processing method
Technical Field
The invention belongs to the field of big data analysis and data mining, relates to a user information classification method, and particularly relates to a mobile phone client user loss early warning processing method.
Background
With the development of the mobile internet, the mobile phone is gradually replacing the operator as the first interface chosen by the user, and mobile phone marketing occupies an increasingly important position in operators' marketing strategies. At present, the three operators are all increasing their purchase and sale of mobile phones. An operator's transition to the mobile internet cannot do without the support of the mobile phone: the architecture of the mobile internet comprises three parts, namely the cloud, the pipe (the network) and the terminal (the mobile phone). Operators need to build intelligent pipes, and the development of the mobile internet and of 4G/5G services is clearly driven by the mobile phone. The main channel for mobile internet applications today is the software application store built into the mobile phone, so the quality of the service delivered on the mobile phone directly affects how users use it.
The mobile phone has become the user's first interface and sways the user's choice of operator. Each change of mobile phone can become a key opportunity for the user to reselect an operator. In the 4G/5G era, mobile phones and networks are relatively tightly bound because of differences in technical standards. Cases in which a user chooses a mobile phone because of an application, and then chooses an operator because of the mobile phone, are common, so that a chain of related choices is formed. This means that the user often chooses the mobile phone first, and the choice of network falls to second place. For example, a user may choose the China Unicom network as a result of choosing an Apple phone, or may choose a Xiaomi phone out of a preference for the MiTalk service and only then choose the operator behind it. As the mobile communication subscriber market approaches saturation, the focus of competition among operators has gradually shifted to competing for the subscribers of other networks. Therefore, how to effectively identify users who are likely to be lost, find the causes, and take targeted measures to retain them is a problem that urgently needs to be solved.
In addition, regression analysis is a widely used statistical method for determining the quantitative relationship of interdependence between two or more variables. According to the number of variables involved, regression analysis is divided into univariate and multivariate regression analysis; according to the number of independent variables, into simple and multiple regression analysis; and according to the type of relationship between the independent variables and the dependent variable, into linear and nonlinear regression analysis. If a regression analysis includes only one independent variable and one dependent variable, and the relationship between them can be approximated by a straight line, it is called a univariate linear regression analysis. If a regression analysis includes two or more independent variables, and the relationship between the dependent variable and the independent variables is linear, it is called a multiple linear regression analysis.
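As a small illustration of the univariate linear case described above, the following sketch fits a straight line by ordinary least squares. The function name `fit_line` and the sample data are assumptions made for illustration only; they do not come from the patent.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative data lying exactly on the line y = 1 + 2x.
intercept, slope = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(intercept, slope)
```

With more than one independent variable, the same least-squares idea leads to the multiple linear regression mentioned in the text.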
Chinese patent ZL201510881058.6 provides an optimization analysis method for eliminating collinearity in the regression data of a complex system; in essence, it screens variables iteratively on the basis of principal component analysis. After each principal component is computed, the method selects the variable most correlated with it, removes the other variables that are highly correlated with that principal component, and then computes the next principal component. Although this does select variables, the method may have two drawbacks: the selected variables may contribute little to the model; and, when variables are eliminated, the judgment of what counts as highly correlated is strongly subjective, so important variables are easily lost. Because the selected variables are not typical and important variables are lost, the system's data analysis is inaccurate and its credibility is low. Therefore, how to quickly and efficiently classify, sort and model massive amounts of acquired data, and extract the valuable or relevant data that meets preset conditions, is a technical problem in the field of big data analysis and data mining.
Disclosure of Invention
In view of the above-mentioned drawbacks in the prior art, an object of the present invention is to provide a method and system for effectively predicting users who are potentially lost and providing corresponding solutions in time.
In order to achieve the above object, the present invention provides a method and a system for processing loss early warning of a mobile phone client user, comprising the following steps:
collecting user information in an operator server at regular time to form a first user information set;
carrying out digital processing on the first user information set to form a first user data set;
estimating, using a first estimation module, a first probability value for each user in the first set of user data based on the first set of user data;
when the first probability value is greater than a first threshold, classifying the user as a first type of user;
calculating the user data of the first type of user based on a second data model to obtain a calculation result, inquiring a database, and matching the calculation result with a corresponding type question bank;
sending alarm information to a management platform and sending a first request to the first type of user;
and adopting corresponding countermeasures based on the first request.
Wherein estimating, using a first estimation module, a first probability value for each user in the first set of user data based on the first set of user data comprises:
estimating the first set of user data based on a first data model, wherein the first probability value is a user churn probability value.
Wherein the establishing of the first data model comprises the steps of:
selecting historical user information for modeling, and dividing a historical user information set into a training set and a test set according to a proportion, wherein the training set is used for modeling and model parameter estimation, and the test set is used for model evaluation;
extracting user characteristic data which can be used for modeling, and establishing a data analysis wide table;
and establishing the first data model based on the data analysis wide table.
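The proportional split into a training set and a test set can be sketched as follows. The 70/30 ratio, the fixed seed, the record layout and the name `split_by_ratio` are assumptions for illustration; the text only says the split is made "according to a proportion".

```python
import random

def split_by_ratio(records, train_ratio=0.7, seed=0):
    """Shuffle a copy of the records and cut it at train_ratio (assumed 70/30)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical historical user records; the field names are illustrative only.
history = [{"user_id": i, "churned": i % 5 == 0} for i in range(100)]
train_set, test_set = split_by_ratio(history, train_ratio=0.7)
print(len(train_set), len(test_set))
```

The training part would then be used for modeling and parameter estimation, and the held-out part for model evaluation.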
Wherein establishing the first data model based on the data analysis wide table specifically comprises:
binning the data set;
performing WOE conversion on each bin to obtain a WOE value;
performing variable clustering operation by a splitting method, and screening variables;
further screening the variables by a backward elimination method: if any variable has a VIF greater than 10, the variable with the largest p-value is eliminated, and the remaining variables are then modeled by logistic regression;
repeating the screening step until all variables have VIF < 10 and p-value < 0.05.
The variable clustering operation performed by the splitting method specifically comprises the following steps:
computing the covariance matrix of the vector formed by all the variables, and calculating its first characteristic root and second characteristic root as well as the corresponding first feature vector and second feature vector;
judging the second characteristic root, and if the second characteristic root is larger than 0.8, dividing the variables into two classes;
computing the covariance matrix of each of the two classes of variables, calculating the first characteristic root and second characteristic root of each as well as the corresponding first feature vector and second feature vector, and returning to the judging step, until the second characteristic root of the covariance matrix of every subclass is not larger than 0.8 or the subclass contains only one variable.
Wherein the screening variables specifically include:
retaining, in each class, the variable with the highest IV value and the variable with the lowest $1-R^2$ ratio; wherein the $1-R^2$ ratio of a variable X is given by:
$$(1-R^2)_{\text{ratio}} = \frac{1 - R^2}{1 - \tilde{R}^2}$$
where $R^2$ measures how representative the variable is within its cluster and can be obtained by squaring the Pearson correlation coefficient between the variable and the first principal component of the class to which it belongs, and $\tilde{R}$ is the largest of the Pearson correlation coefficients between the variable and the first principal components of the classes that do not contain it:
$$\tilde{R} = \max_{j \in \{1,\dots,k\},\ j \neq i} \mathrm{Corr}(X, PC_j)$$
wherein k represents the number of classes, the k classes are numbered from 1 to k in sequence, $PC_j$ represents the first principal component of the j-th class, i represents the number of the class in which X is located, and Corr represents the Pearson correlation coefficient.
Wherein the second data model is a decision tree based multi-classification model.
Wherein adopting corresponding countermeasures based on the first request comprises the following steps:
the first request asks the first type user whether to accept a questionnaire survey;
if the first type user agrees, sending a corresponding network link address to the first type user;
receiving a feedback response from the first type user, wherein the feedback response comprises the first type user's answers to the questions of the matched type;
and adopting corresponding countermeasures based on the feedback response.
The user information comprises user personal information, user behavior related data in a charging system and mobile phone client information of the user.
The mobile phone client information of the user is acquired through a radio resource control connection request (RRC CONNECTION REQUEST) message or a channel request (CHANNEL REQUEST) message.
Compared with the prior art, the early warning processing system provided by the invention digitizes the user information and converts it into data in the specific format of the system. The modeling module eliminates collinearity by the splitting method while preserving accuracy as far as possible, which avoids the loss of important variables and of accuracy that occurs when only the single most representative variable in a cluster (for example, the one most correlated with the principal component) is kept in order to eliminate collinearity, and thereby improves the accuracy of the early warning processing.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1 is a flowchart illustrating a method for processing a mobile phone client user churn early warning according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a user churn warning processing method according to one embodiment of the present invention;
FIG. 3 is a flow diagram illustrating the building of a logistic regression model according to one embodiment of the invention;
FIG. 4 is a flow diagram illustrating variable clustering according to an embodiment of the invention;
FIG. 5 is a flow chart illustrating the discovery of the cause of churn according to one embodiment of the present invention; and
FIG. 6 is a block diagram illustrating a mobile phone client user churn early warning processing system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and "a plurality" typically includes at least two.
It should be understood that although the terms first, second, third, etc. may be used to describe … … in embodiments of the present invention, these … … should not be limited to these terms. These terms are used only to distinguish … …. For example, the first … … can also be referred to as the second … … and similarly the second … … can also be referred to as the first … … without departing from the scope of embodiments of the present invention.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in the article or device in which the element is included.
Alternative embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Example one
Referring to fig. 1, the invention discloses a method for processing loss early warning of a mobile phone client user, which comprises the following steps:
collecting user information in an operator server at regular time to form a first user information set;
carrying out digital processing on the first user information set to form a first user data set;
estimating, using a first estimation module, a first probability value for each user in the first set of user data based on the first set of user data;
when the first probability value is greater than a first threshold, classifying the user as a first type of user;
calculating the user data of the first type of user based on a second data model to obtain a calculation result, inquiring a database, and matching the calculation result with a corresponding type question bank;
sending alarm information to a management platform and sending a first request to the first type of user;
and adopting corresponding countermeasures based on the first request.
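The thresholding step above can be sketched as follows. The threshold value 0.5, the user identifiers and the function name `flag_first_type_users` are assumptions for illustration; the text does not fix a numerical first threshold.

```python
FIRST_THRESHOLD = 0.5  # assumed value; the text does not specify a number

def flag_first_type_users(churn_probabilities, threshold=FIRST_THRESHOLD):
    """Return ids of users whose first probability value exceeds the threshold."""
    return [uid for uid, p in churn_probabilities.items() if p > threshold]

# Hypothetical output of the first estimation module: user id -> churn probability.
probabilities = {"u1": 0.12, "u2": 0.81, "u3": 0.50, "u4": 0.67}
first_type_users = flag_first_type_users(probabilities)
print(first_type_users)
```

Only users strictly above the threshold are flagged, matching the wording "greater than a first threshold"; a user exactly at the threshold is not classified as a first type of user in this sketch.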
Example two
On the basis of the first embodiment, the present embodiment further includes the following contents:
In the internet field, big data technologies are increasingly being applied. In this field, user information classification is a very important aspect, especially for user churn early warning. Because acquiring a new user usually costs several times as much as retaining an existing user, and the revenue an existing user brings to the company far exceeds that of a new user, using big data analysis to warn of potential user churn early and to formulate a retention strategy in time is a critical link in many industries (such as telecommunications and banking). The invention outputs the probability that a user may churn through a logistic regression model, draws up a list of potential churners, attributes the causes of churn, and formulates corresponding measures.
Specifically, referring to FIG. 2, outputting the probability of possible user churn through the logistic regression model and drawing up the potential churn list includes the following steps:
step 1, defining user churn behavior through data; for example, a user is determined to be a churned user when 20% of the user's assets have flowed out within the last three months;
step 2, selecting historical user information for modeling, and dividing it into a training set and a test set according to a proportion;
step 3, extracting user characteristic data which can be used for modeling, such as information on products held, proportions, balances, transaction preferences and the like, and establishing a data analysis wide table;
step 4, establishing a logistic regression model based on the wide table, outputting the probability of possible churn for each user, and sorting by churn probability value to obtain an early warning list;
step 5, discovering the possible churn reasons of the users on the list through a multi-classification model;
step 6, sending the early warning list and the churn reasons to a management platform, which carries out targeted retention according to the churn reasons.
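Steps 1 and 4 can be sketched as follows, assuming that churn is defined as an asset outflow of at least 20% over the last three months and that each user already has a churn probability from the model. The function names, the asset figures and the scores are illustrative assumptions, not values from the patent.

```python
def is_churned(assets_three_months_ago, assets_now, outflow_ratio=0.20):
    """Step 1 (one reading): label a user as churned when at least
    outflow_ratio of the assets held three months ago has flowed out."""
    if assets_three_months_ago <= 0:
        return False  # no baseline assets, so no outflow can be measured
    lost = (assets_three_months_ago - assets_now) / assets_three_months_ago
    return lost >= outflow_ratio

def early_warning_list(scored_users, top_n):
    """Step 4: sort (user_id, churn_probability) pairs from most to least
    likely to churn and keep the first top_n entries."""
    ranked = sorted(scored_users, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

labels = [is_churned(1000.0, 750.0), is_churned(1000.0, 900.0)]
warning = early_warning_list([("a", 0.3), ("b", 0.9), ("c", 0.6)], top_n=2)
print(labels, warning)
```

The labels produced this way would serve as the dependent variable for the logistic regression, and the ranked list is what would be sent on to the management platform together with the churn reasons.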
Example three
On the basis of the second embodiment, the present embodiment further includes the following contents:
Referring to FIG. 3, step 4 may include the following steps:
step 4.1, variable binning and WOE transformation;
step 4.2, performing variable clustering operation by a splitting method, and screening variables;
step 4.3, establishing a logistic regression model, observing the regression result, and removing variables with p-values greater than 0.05; further screening the variables by a backward elimination method: if any variable has a VIF greater than 10, removing the variable with the largest p-value; then performing logistic regression modeling on the remaining variables;
step 4.4, repeating step 4.3 until all variables have VIF < 10 and p-value < 0.05.
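The control flow of steps 4.3 and 4.4 can be sketched as a loop. This is one reading of the text, under stated assumptions: `fit` is a hypothetical callable that refits the logistic regression on the given variable names and returns a p-value and a VIF for each; `toy_fit` returns made-up numbers purely to exercise the loop and is not real model output.

```python
P_MAX = 0.05   # significance cut-off from the text
VIF_MAX = 10   # collinearity cut-off from the text

def screen_variables(variables, fit):
    """Refit and drop variables until every one has p < 0.05 and VIF < 10.

    `fit` (hypothetical) maps a list of names to {name: (p_value, vif)}.
    """
    kept = list(variables)
    while kept:
        stats = fit(kept)
        insignificant = [v for v in kept if stats[v][0] >= P_MAX]
        if insignificant:
            # Remove variables that are not significant, then refit.
            kept = [v for v in kept if v not in insignificant]
            continue
        if any(stats[v][1] >= VIF_MAX for v in kept):
            # Collinearity remains: drop the variable with the largest p-value.
            kept.remove(max(kept, key=lambda v: stats[v][0]))
            continue
        break
    return kept

def toy_fit(names):
    """Stand-in for a real fit: 'balance' and 'income' are collinear (VIF 14)
    only while both are present. All numbers are invented for illustration."""
    collinear = {"balance", "income"} <= set(names)
    p_values = {"age": 0.01, "balance": 0.03, "income": 0.02, "tenure": 0.20}
    return {n: (p_values[n], 14.0 if collinear and n in ("balance", "income") else 2.0)
            for n in names}

selected = screen_variables(["age", "balance", "income", "tenure"], toy_fit)
print(selected)
```

In practice `fit` would wrap a statistics package that reports coefficient p-values and variance inflation factors; that part is deliberately left out here.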
Example four
On the basis of the third embodiment, the present embodiment further includes the following contents:
The purposes of binning in step 4.1 are as follows:
1) values of text variables, which cannot be used in calculation, are converted into numerical values that can;
2) the stability of the model is increased, so that a small perturbation of a value does not cause a large change in the model result.
More specifically, suppose the variable X is divided into three bins x, y and z; the WOE value of bin x is then calculated as:
WOE(X=x)=ln((#{Y=1,X=x}/#{Y=1})/(#{Y=0,X=x}/#{Y=0}))…(1)
where #{A} represents the number of samples satisfying condition A, #{A,B} represents the number of samples satisfying both conditions A and B, and ln() is the natural logarithm function.
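Formula (1) can be computed directly from sample counts. The sketch below assumes that Y=1 marks a churned user and that every bin contains at least one sample of each class (otherwise the ratio is zero or undefined); the function name `woe_by_bin` and the sample data are illustrative.

```python
import math
from collections import Counter

def woe_by_bin(bins, labels):
    """WOE per bin following formula (1):
    ln( (#{Y=1, X=b} / #{Y=1}) / (#{Y=0, X=b} / #{Y=0}) )."""
    pos = Counter(b for b, y in zip(bins, labels) if y == 1)
    neg = Counter(b for b, y in zip(bins, labels) if y == 0)
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    return {
        b: math.log((pos[b] / total_pos) / (neg[b] / total_neg))
        for b in sorted(set(bins))
    }

# Twelve illustrative samples in three bins x, y, z.
bins = ["x"] * 4 + ["y"] * 4 + ["z"] * 4
labels = [1, 1, 1, 0,  1, 1, 0, 0,  1, 0, 0, 0]
woe = woe_by_bin(bins, labels)
print(woe)
```

Here bin x holds 3 of the 6 positive samples and 1 of the 6 negative samples, so its WOE is ln(3); bin y is balanced and gets 0; bin z gets ln(1/3).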
Example five
On the basis of the fourth embodiment, the present embodiment further includes the following contents:
Referring to FIG. 4, the variable clustering operation by the splitting method in step 4.2 may include the following steps:
computing the covariance matrix of the vector composed of all the N variables, and calculating its first characteristic root and second characteristic root (eigenvalues) and the corresponding feature vectors (the first feature vector and the second feature vector, respectively).
If the second characteristic root is greater than 0.8, the N variables are divided into two classes, and the specific classification manner may include the following steps:
calculating, for each variable, the Pearson correlation coefficients between the variable and the principal components defined by the two feature vectors, and comparing the absolute values of the correlation coefficients; if the absolute value of the correlation coefficient with the first is larger than the absolute value of the correlation coefficient with the second, the variable belongs to the first class; otherwise, the variable belongs to the second class.
Covariance matrices are then computed separately for the two classes of variables, and the first characteristic root, the second characteristic root and the corresponding feature vectors are calculated for each. If the second characteristic root of a class is greater than 0.8, the classification steps are repeated on that class, until the second characteristic root of the covariance matrix of every subclass is not greater than 0.8 or the subclass contains only one variable.
The advantage of this method is that whether to stop splitting a group is judged by the size of the second characteristic root, so that weakly correlated variables are not grouped together. Through iteration, the invention ensures that the second characteristic root within each small group does not exceed 0.8, so that the first principal component explains the variance of the variables in the class as a whole.
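The splitting procedure can be sketched in plain Python as follows. This is a minimal sketch under several assumptions: the eigenvalues (characteristic roots) are obtained with a small Jacobi-rotation routine; the correlation of a variable with a "feature vector" is read as its correlation with the corresponding principal-component scores; the variables are assumed to be on comparable scales so that the 0.8 threshold is meaningful; and the data set and all function names are invented for illustration.

```python
import math

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

def covariance_matrix(columns):
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
             for cj, mj in zip(columns, means)] for ci, mi in zip(columns, means)]

def symmetric_eigen(matrix, sweeps=30):
    """Jacobi rotations; returns eigenvalues (largest first) and matching eigenvectors."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-12:
                    continue
                diff = a[q][q] - a[p][p]
                theta = (0.5 * math.atan(2 * a[p][q] / diff) if diff
                         else math.copysign(math.pi / 4, a[p][q]))
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q of a and v
                    a[k][p], a[k][q] = c * a[k][p] - s * a[k][q], s * a[k][p] + c * a[k][q]
                    v[k][p], v[k][q] = c * v[k][p] - s * v[k][q], s * v[k][p] + c * v[k][q]
                for k in range(n):  # rotate rows p and q of a
                    a[p][k], a[q][k] = c * a[p][k] - s * a[q][k], s * a[p][k] + c * a[q][k]
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    return [a[i][i] for i in order], [[v[k][i] for k in range(n)] for i in order]

def split_cluster(data, names, threshold=0.8):
    """Split `names` recursively while the second eigenvalue exceeds the threshold."""
    if len(names) == 1:
        return [names]
    columns = [data[name] for name in names]
    values, vectors = symmetric_eigen(covariance_matrix(columns))
    if values[1] <= threshold:
        return [names]
    n = len(columns[0])
    # Scores of every sample on the first two principal components.
    scores = [[sum(w * col[t] for w, col in zip(vec, columns)) for t in range(n)]
              for vec in vectors[:2]]
    first = [nm for nm, col in zip(names, columns)
             if abs(pearson(col, scores[0])) > abs(pearson(col, scores[1]))]
    second = [nm for nm in names if nm not in first]
    if not first or not second:
        return [names]  # degenerate split: stop rather than recurse forever
    return split_cluster(data, first, threshold) + split_cluster(data, second, threshold)

# Illustrative data: a1 and a2 move together, as do b1 and b2.
data = {
    "a1": [1.2, -1.2, 0.8, -0.8, 1.2, -1.2, 0.8, -0.8],
    "a2": [0.8, -0.8, 1.2, -1.2, 0.8, -0.8, 1.2, -1.2],
    "b1": [1.0, 0.6, -0.6, -1.0, 0.6, 1.0, -1.0, -0.6],
    "b2": [0.6, 1.0, -1.0, -0.6, 1.0, 0.6, -0.6, -1.0],
}
clusters = split_cluster(data, list(data))
print(clusters)
```

For this data the second eigenvalue of the full covariance matrix is about 1.46, so the four variables are split once; within each resulting pair the second eigenvalue is about 0.09, so splitting stops.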
In addition, the screening variables in step 4.2 may include the following steps:
retaining, in each class obtained by variable clustering, the variable with the highest IV value and the variable with the lowest $1-R^2$ ratio. The $1-R^2$ ratio of a variable X is given by:
$$(1-R^2)_{\text{ratio}} = \frac{1 - R^2}{1 - \tilde{R}^2}$$
where $R^2$ measures how representative the variable is within its cluster and can be obtained by squaring the Pearson correlation coefficient between the variable and the first principal component of the class to which it belongs, and $\tilde{R}$ is the largest of the Pearson correlation coefficients between the variable and the first principal components of the classes that do not contain it:
$$\tilde{R} = \max_{j \in \{1,\dots,k\},\ j \neq i} \mathrm{Corr}(X, PC_j)$$
wherein k represents the number of classes, the invention numbers the k classes from 1 to k in sequence, $PC_j$ represents the first principal component of the j-th class, i represents the number of the class in which X is located, and Corr represents the Pearson correlation coefficient.
To make the $1-R^2$ ratio as small as possible, $R^2$ should be as large as possible and $\tilde{R}^2$ as small as possible; that is, the variable should not only be representative within its group but also be as weakly correlated with the other classes as possible.
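The screening metric can be sketched numerically. The sketch assumes the metric has the form (1 - R²) / (1 - R̃²), where R is the Pearson correlation of the variable with the first principal component of its own class and R̃ is the largest Pearson correlation with the first principal components of the other classes, as the surrounding description suggests. The component vectors, the variable names and the function name are illustrative assumptions.

```python
import math

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))

def one_minus_r2_ratio(x, own_component, other_components):
    """(1 - R^2) / (1 - R~^2): small when x represents its own class well
    and is weakly correlated with the other classes."""
    r_own_sq = pearson(x, own_component) ** 2
    # Largest correlation with any other class, taken literally from the text;
    # using absolute values instead would be a common variant.
    r_nearest = max(pearson(x, pc) for pc in other_components)
    return (1 - r_own_sq) / (1 - r_nearest ** 2)

# Hypothetical first principal components of the variable's own class and one other class.
pc_own = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
pc_other = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
candidates = {
    "a1": [1.2, -1.2, 0.8, -0.8, 1.2, -1.2, 0.8, -0.8],
    "x2": [1.5, -0.5, 0.5, -1.5, 1.5, -0.5, 0.5, -1.5],
}
ratios = {name: one_minus_r2_ratio(x, pc_own, [pc_other]) for name, x in candidates.items()}
best = min(ratios, key=ratios.get)
print(ratios, best)
```

In this example "a1" is almost uncorrelated with the other class and gets a ratio of about 0.038, while "x2" leaks into the other class and gets 0.25, so "a1" would be the variable retained on this criterion.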
Example six
On the basis of the fifth embodiment, the present embodiment further includes the following contents:
After the variables of the logistic regression model enter the final regression stage, the effectiveness of the model is generally judged through two indexes: the p-value (the probability value of a hypothesis test) and the VIF (variance inflation factor) value. The p-value reflects the significance of a single variable: a larger p-value means a lower significance, and if the p-value is greater than 0.05, the variable is considered not significant and should be removed from the model. The VIF value reflects the degree of collinearity of the variables: the higher the VIF value, the stronger the collinearity, and generally, if the VIF value is greater than 10, collinearity is considered to exist in the model and the variables need to be adjusted.
Wherein, VIF represents the collinearity coefficient of the model, and the formula is
$\mathrm{VIF} = 1/(1 - R^2)$, where R is the multiple correlation coefficient obtained by regressing the independent variable on the remaining independent variables.
The p-value is the degree of significance that logistic regression characterizes using the z-statistic, i.e.,
p = Pr(|s| > |z|), where s follows a standard normal distribution and Pr denotes probability, that is, the probability that |s| > |z|.
If the p-value is greater than 0.05, the variable is considered not significant and should be removed from the model.
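Both indexes are simple to evaluate once R² and the z-statistic are known. The sketch below uses the identity Pr(|s| > |z|) = erfc(|z| / √2) for a standard normal s; the function names and the input values are illustrative assumptions.

```python
import math

def vif(r_squared):
    """Variance inflation factor from the R^2 of regressing one
    independent variable on the remaining independent variables."""
    return 1.0 / (1.0 - r_squared)

def two_sided_p_value(z):
    """p = Pr(|s| > |z|) for a standard normal s, via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

print(vif(0.9))                 # R^2 = 0.9 sits at the VIF = 10 cut-off
print(two_sided_p_value(1.96))  # |z| = 1.96 sits near the p = 0.05 cut-off
```

So a VIF above 10 corresponds to an R² above 0.9 against the other variables, and a p-value below 0.05 corresponds to a z-statistic of roughly 1.96 or more in absolute value.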
In order to facilitate understanding of the above-described collinearity coefficient, multiple correlation coefficient, and significance, each is described in detail below.
In the invention, the collinearity coefficient is computed for each variable $X_i$.
The relationship between the VIF value and the multiple correlation coefficient is as follows:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where the multiple correlation coefficient is the square root of $R_i^2$. The larger the multiple correlation coefficient, the larger $R_i^2$, and therefore the larger the collinearity coefficient of the variable; that is, $X_i$ is strongly correlated with the other variables, which can prevent stable parameter estimates from being obtained during model training.
The multiple correlation coefficient of $X_i$ with respect to the other variables has the following specific meaning: among all the independent variables, take $X_i$ as the dependent variable and all the other $X_j$ ($j \neq i$) as independent variables, establish a linear regression model, and take the square root of its coefficient of determination $R^2$. In a linear regression model, let y be the dependent variable and X be the independent variables; then
$$R^2 = 1 - \frac{\sum_{n} (y_n - \hat{y}_n)^2}{\sum_{n} (y_n - \bar{y})^2}$$
where $\bar{y}$ is the sample mean of y and $\hat{y}$ is the estimate of y given by the linear model. The equation characterizes the proportion of the total variation of the y values that can be explained by the linear model; the remaining unexplained proportion is attributed to random perturbations caused by sampling. The larger the value, the better y is explained by the model and the stronger the correlation between y and the independent variables. In the scenario of the present invention, the above calculation is made with $X_i$ as the dependent variable y and the $X_j$ ($j \neq i$) as the independent variables.
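The calculation described above can be sketched as follows: one variable is regressed on all the others by ordinary least squares, the coefficient of determination is computed, and the VIF follows as 1 / (1 - R^2). This is a minimal illustration using only the Python standard library, assuming the usual residual-sum-of-squares form of R^2; the function names are hypothetical, and the sketch does not handle perfectly collinear inputs (where R^2 equals 1 and the division fails).

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(y, columns):
    """R^2 of the least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    # Normal equations: (D^T D) beta = D^T y
    dtd = [[sum(design[i][j] * design[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    dty = [sum(design[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve_linear_system(dtd, dty)
    fitted = [sum(bj * dj for bj, dj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns, i):
    """VIF of column i: regress it on all other columns, then 1 / (1 - R^2)."""
    others = [c for j, c in enumerate(columns) if j != i]
    return 1.0 / (1.0 - r_squared(columns[i], others))
```

With only two variables, R^2 reduces to the squared Pearson correlation between them, so uncorrelated variables give a VIF of 1.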
In addition, the significance mentioned above specifically refers to an index for deciding whether the null hypothesis should be rejected in statistical hypothesis testing. For example:
let H0 be that the coefficient of the variable X is 0 and the variable has no explanatory power for the model result, i.e., X should not enter the model;
let H1 be that the coefficient of the variable X is not 0 and X should enter the model;
the p-value is the probability, assuming H0 holds, of obtaining a statistic at least as extreme as the one observed. If the p-value is greater than the set significance level of 0.05, there is considered to be insufficient reason to reject the null hypothesis, i.e., X should not enter the model. The larger the p-value, the more likely it is that the contribution of the variable to the model is due solely to sampling error, and the more the variable should be rejected.
Example seven
On the basis of the sixth embodiment, the present embodiment further includes the following contents:
steps 4.3 and 4.4 may include the following:
performing logistic regression modeling on the retained variables, of which there are at most 2k, and iteratively applying backward elimination until the VIF values and the p-values of all the variables meet the specified conditions;
then, starting from the variables that remain after elimination, adding the eliminated variables back one at a time by forward selection: if, after a variable is added, the VIF of every variable is still less than 10 and no p-value exceeds 0.05, the added variable is kept; this step continues until none of the remaining variables can be added.
The reason for applying forward selection after backward elimination is as follows: backward elimination is a greedy algorithm that removes, at each step, the variable that most deserves removal, so the overall process may become trapped in a locally optimal rather than a globally optimal variable selection. The invention therefore continues with forward selection after backward elimination and adds variables back, so that no variable is eliminated by mistake.
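The two-stage procedure described in this embodiment can be sketched as follows. This is an illustrative outline only: the callback `fit` is a hypothetical stand-in for fitting the logistic regression on a given variable list and returning each variable's (VIF, p-value) pair, and removing the variable with the largest p-value at each backward step is the rule stated elsewhere in this description.

```python
from typing import Callable, Dict, List, Tuple

Stats = Dict[str, Tuple[float, float]]  # variable name -> (VIF, p-value)


def all_ok(stats: Stats, vif_max: float = 10.0, p_max: float = 0.05) -> bool:
    """True when every variable has VIF < vif_max and p-value <= p_max."""
    return all(v < vif_max and p <= p_max for v, p in stats.values())


def backward_then_forward(candidates: List[str],
                          fit: Callable[[List[str]], Stats]) -> List[str]:
    kept = list(candidates)
    removed: List[str] = []
    # Backward elimination: drop the variable with the largest p-value
    # until every remaining variable satisfies both conditions.
    while kept:
        stats = fit(kept)
        if all_ok(stats):
            break
        worst = max(kept, key=lambda name: stats[name][1])
        kept.remove(worst)
        removed.append(worst)
    # Forward selection: try to add each eliminated variable back and keep it
    # only if all variables still satisfy both conditions; repeat until a full
    # pass adds nothing.
    added = True
    while added:
        added = False
        for name in list(removed):
            trial = kept + [name]
            if all_ok(fit(trial)):
                kept = trial
                removed.remove(name)
                added = True
    return kept
```

A variable that was removed early, for instance because its p-value was inflated by a collinear companion, can re-enter in the forward pass once that companion is gone.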
Example eight
On the basis of the seventh embodiment, the present embodiment further includes the following contents:
referring to fig. 5, step 5 can be broken down into the following steps:
step 5.1: based on business experience, the invention divides the reasons for user loss into 3 types: 1. product reasons; 2. customer reasons; 3. external reasons. More specifically, the reasons for user churn can be classified into 6 categories: 1. lack of customer care; 2. mobile phone failure; 3. no network coverage; 4. poor quality of service; 5. no suitable tariff scheme; 6. other reasons.
Step 5.2: randomly draw a sufficient number (more than 5000) of lost users, each carrying one of the above category labels.
Step 5.3: train a decision-tree-based multi-classification model on the data thus obtained, and use the trained model to predict the potential loss reasons of the users in the early warning list.
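Step 5.3 can be illustrated with a very small decision tree classifier. The sketch below is a toy under stated assumptions, not the model of the invention: it uses numeric features, binary threshold splits chosen by Gini impurity, a fixed maximum depth, and string class labels; all function names and the feature layout are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are tuples (feature, threshold, left, right); leaves are labels."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

In practice the training rows would be the feature vectors of the labeled lost users from step 5.2, and `predict` would be applied to each user in the early warning list.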
Example nine
With reference to fig. 1 to 5, on the basis of the above embodiments, an embodiment of the present invention provides a method for processing a loss early warning of a mobile phone client user, which includes the following steps:
collecting user information in an operator server at regular time to form a first user information set;
carrying out digital processing on the first user information set to form a first user data set;
estimating, using a first estimation module, a first probability value for each user in the first set of user data based on the first set of user data;
when the first probability value is greater than a first threshold, classifying the user as a first type of user;
calculating the user data of the first type of user based on a second data model to obtain a calculation result, inquiring a database, and matching the calculation result with a corresponding type question bank; preferably, the second data model is a decision tree based multi-classification model;
sending alarm information to a management platform and sending a first request to the first type of user;
and adopting corresponding countermeasures based on the first request.
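The flow of the steps above, from probability estimation to question bank matching, can be outlined as follows. This is a schematic sketch only: `churn_model` and `reason_model` are hypothetical stand-ins for the first and second data models, `question_banks` is an in-memory stand-in for the database query, and the threshold value is left to the caller because the description does not fix it.

```python
from typing import Callable, Dict, List


def early_warning(user_data: Dict[str, List[float]],
                  churn_model: Callable[[List[float]], float],
                  reason_model: Callable[[List[float]], str],
                  question_banks: Dict[str, List[str]],
                  threshold: float) -> Dict[str, List[str]]:
    """Map each first-type user to the question bank matching the predicted reason."""
    result: Dict[str, List[str]] = {}
    for user_id, features in user_data.items():
        if churn_model(features) > threshold:         # first probability value > first threshold
            reason = reason_model(features)           # calculation by the second data model
            result[user_id] = question_banks[reason]  # stand-in for the database lookup
    return result
```

The returned mapping identifies the users to whom the first request would be sent, together with the questions that would follow if they agree.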
The early warning processing system provided by the embodiment of the invention digitizes the user information and converts it into data in a system-specific format. The modeling module eliminates collinearity by the splitting method while preserving precision as far as possible. This avoids the loss of important variables and of precision that results from simply keeping, in order to eliminate collinearity, the single most representative variable in a cluster (for example, the one most correlated with the principal component), and thereby improves the accuracy of the early warning processing.
Further, in the present invention, estimating a first probability value for each user in the first set of user data using a first estimation module based on the first set of user data may include:
estimating the first set of user data based on a first data model, wherein the first probability value is a user churn probability value.
In a practical application, the establishing of the first data model may include the following steps:
selecting historical user information for modeling, and dividing a historical user information set into a training set and a test set according to a proportion, wherein the training set is used for modeling and model parameter estimation, and the test set is used for model evaluation;
extracting user characteristic data which can be used for modeling, and establishing a data analysis broad table;
and establishing the first data model based on the data analysis broad table.
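The proportional division into a training set and a test set can be sketched as below. The 70/30 ratio, the fixed random seed, and the function name are assumptions chosen for illustration; the description only requires that the historical user information set be divided according to a proportion.

```python
import random


def split_by_ratio(records, train_ratio=0.7, seed=0):
    """Shuffle the records reproducibly and split them into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

The training part is then used for modeling and parameter estimation, and the held-out part only for model evaluation.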
Further, to complete the estimation of the first set of user data, the first data model can be built from the data analysis broad table. Establishing the first data model based on the data analysis broad table may comprise:
binning the data set;
performing WOE conversion on each box to obtain a WOE value;
performing variable clustering operation by a splitting method, and screening variables;
further screening the variables by backward elimination: if the VIF of a variable is greater than 10, the variable with the largest p-value is eliminated, and logistic regression modeling is then performed on the remaining variables;
repeating the screening step until all variables have VIF < 10 and p-value < 0.05.
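The WOE conversion of the binned data, and the IV value used later for screening, can be sketched as follows. The convention used here, WOE = ln(share of retained users in the bin / share of lost users in the bin) and IV as the sum over bins of the share difference times WOE, is the commonly used one and is an assumption, since this section does not spell out the formula; bins with a zero count are not handled.

```python
import math


def woe_iv(bins):
    """bins: list of (retained_count, lost_count) per bin.

    Returns (list of WOE values, one per bin, and the variable's IV).
    """
    total_retained = sum(g for g, _ in bins)
    total_lost = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for retained, lost in bins:
        retained_share = retained / total_retained
        lost_share = lost / total_lost
        woe = math.log(retained_share / lost_share)
        woes.append(woe)
        iv += (retained_share - lost_share) * woe
    return woes, iv
```

Each original value is then replaced by the WOE of the bin it falls into, and a higher IV indicates a variable that separates lost and retained users more strongly.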
To facilitate an understanding of the above-described steps of establishing the first data model, some of the steps present therein will be described in detail for clarity. Wherein the variable clustering operation by the splitting method may include:
solving a covariance matrix for vectors formed by all variables, and calculating a first characteristic root and a second characteristic root as well as a corresponding first characteristic vector and a corresponding second characteristic vector;
judging a second feature vector, and if the second feature vector is larger than 0.8, dividing the variables into two types; the classification standard is as follows: respectively calculating the Pearson correlation coefficient of each variable and the two eigenvectors, and comparing the absolute values of the correlation coefficients; if the absolute value of the correlation coefficient of the variable and the first feature vector is larger than the absolute value of the correlation coefficient of the variable and the second feature vector, the variable belongs to a first class, otherwise, the variable belongs to a second class;
and respectively calculating covariance matrixes of the two classified variables, respectively calculating a first characteristic root and a second characteristic root of the two classified variables, and a corresponding first characteristic vector and a corresponding second characteristic vector, and returning to the judging step until the second characteristic vectors of the covariance matrixes of all the subclasses are not more than 0.8 or only 1 variable exists in the subclasses.
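The classification standard of a single split can be sketched as follows. The sketch assumes that the two eigenvectors have already been computed by some numerical routine, and it interprets "the correlation of a variable with an eigenvector" as the correlation with the component scores obtained by projecting the samples onto that eigenvector; both points, and all function names, are assumptions. Constant variables, whose correlation is undefined, are not handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def project(variables, vector):
    """Component scores: for each sample, the weighted sum of its variable values."""
    names = list(variables)
    n = len(variables[names[0]])
    return [sum(vector[j] * variables[name][s] for j, name in enumerate(names))
            for s in range(n)]


def split_in_two(variables, first_vector, second_vector):
    """Assign each variable to the first or second class by comparing |correlation|."""
    pc1 = project(variables, first_vector)
    pc2 = project(variables, second_vector)
    first, second = [], []
    for name, values in variables.items():
        if abs(pearson(values, pc1)) > abs(pearson(values, pc2)):
            first.append(name)
        else:
            second.append(name)
    return first, second
```

Applying the same split recursively to each resulting class, while the stopping condition is not met, yields the final variable clusters.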
Correspondingly, the screening variables may include:
reserving, in each class, the variable with the highest IV value and the variable with the lowest $1 - R^2_{\mathrm{ratio}}$ value, wherein the $1 - R^2_{\mathrm{ratio}}$ value of a variable X is given by:
$$1 - R^2_{\mathrm{ratio}} = \frac{1 - R^2}{1 - \tilde{R}^2}$$
where $R^2$ is a measure of how representative the variable is within its cluster, obtained by squaring the Pearson correlation coefficient between the variable and the first principal component of the class to which it belongs, and $\tilde{R}^2$ is the largest squared Pearson correlation coefficient between the variable and the first principal components of the classes that do not contain the variable, given by:
$$\tilde{R}^2 = \max_{1 \le j \le k,\ j \neq i} \mathrm{Corr}(X, PC_j)^2$$
where k is the number of classes, the k classes are numbered from 1 to k in sequence, $PC_j$ is the first principal component of the j-th class, i is the number of the class in which X is located, and Corr denotes the Pearson correlation coefficient.
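The screening rule can be illustrated numerically. The sketch assumes that the ratio takes the familiar form (1 - R^2 with the own class) / (1 - R^2 with the nearest other class), with both R^2 terms being squared Pearson correlations against first principal components; the function names and the input layout are hypothetical.

```python
def one_minus_r2_ratio(corr_own, corr_others):
    """(1 - R^2 with own class) / (1 - largest R^2 with any other class)."""
    r2_own = corr_own ** 2
    r2_nearest = max(c ** 2 for c in corr_others)
    return (1.0 - r2_own) / (1.0 - r2_nearest)


def screen_class(members):
    """members: name -> (IV, corr_own, corr_others).

    Keep the variable with the highest IV and the variable with the lowest
    ratio; the two may coincide, so one or two names are returned.
    """
    best_iv = max(members, key=lambda m: members[m][0])
    best_ratio = min(
        members,
        key=lambda m: one_minus_r2_ratio(members[m][1], members[m][2]),
    )
    return sorted({best_iv, best_ratio})
```

Because at most two variables survive per class, k classes leave at most 2k variables for the subsequent logistic regression.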
In the present invention, adopting corresponding countermeasures based on the first request may include:
the first request is to ask the first type user whether to accept a questionnaire survey;
if the first type user agrees, sending a corresponding network link address to the first type user;
receiving a feedback response of the first type user, wherein the feedback response comprises an answer of the first type user to the type question;
and adopting corresponding countermeasures based on the feedback response.
To facilitate understanding of the user loss early warning processing method, some parameters and terms are explained below. The user information may include the user's personal information, user behavior related data in the billing system, and the user's mobile phone client information. The user's personal information may include: the brand of mobile phone used, age bracket, network access time, model, price, occupation and income. The user behavior related data may include: the user's call and profit-and-loss conditions, the user's service usage, the stability of the mobile phone, and the like. In addition, the invention can also periodically collect data such as the user's consumption behavior and payment behavior from the billing system.
According to the invention, corresponding countermeasures are adopted based on the first request. The measures are mainly intended to avoid user loss, which covers two cases: first, the user transfers from the terminal to other terminals; second, the user's average monthly call charges fall, so that the user changes from a high-value user to a low-value user.
In an actual application scenario, the mobile phone client information of the user may be obtained through a radio resource control connection request (RRC CONNECTION REQUEST) message or a channel request (CHANNEL REQUEST) message. More specifically, the information may be obtained through a first message sent when the user accesses the network, where the first message may be the RRC CONNECTION REQUEST message defined in 3GPP TS 25.331 and in the CCSA specification "2 GHz TD-SCDMA Uu interface technical requirements: layer 3 technical requirements", or the CHANNEL REQUEST message defined in 3GPP TS 04.08. To preserve protocol compatibility, a terminal type information element, a terminal manufacturer information element, a terminal model information element and a version information element are added to the extensible part of the RRC CONNECTION REQUEST message; together they occupy four bytes, which carry the terminal type information, the terminal manufacturer information, the terminal model information and the version information of the terminal, respectively.
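The four added bytes can be pictured with a simple packing sketch. It assumes one byte per field holding a numeric code between 0 and 255, which is one reading of the description; it does not reproduce the actual encoding of RRC messages, and the function and field names are hypothetical.

```python
import struct


def pack_terminal_info(terminal_type, manufacturer, model, version):
    """Pack four one-byte codes (each 0..255) into a 4-byte extension field."""
    return struct.pack("BBBB", terminal_type, manufacturer, model, version)


def unpack_terminal_info(extension):
    """Recover the four codes from a 4-byte extension field."""
    terminal_type, manufacturer, model, version = struct.unpack("BBBB", extension)
    return {
        "terminal_type": terminal_type,
        "manufacturer": manufacturer,
        "model": model,
        "version": version,
    }
```

On the network side, the unpacked codes would be mapped to the terminal brand and model that form part of the user's mobile phone client information.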
Example ten
Referring to fig. 6, the embodiment further provides a system 600 for processing loss early warning of a mobile phone client user, which includes:
the early warning server 601 is used for collecting user information in an operator server at regular time to form a first user information set;
carrying out digital processing on the first user information set to form a first user data set;
estimating, using a first estimation module, a first probability value for each user in the first set of user data based on the first set of user data;
when the first probability value is greater than a first threshold, classifying the user as a first type of user;
calculating the user data of the first type of users based on a second data model to obtain a calculation result,
a database 603 for storing a corresponding type question bank matched with the calculation result;
and the management platform 602 is configured to receive alarm information sent by the early warning server, and obtain an answer analysis result of the user for the corresponding type of question bank.
Example eleven
The disclosed embodiments provide a non-volatile computer storage medium having stored thereon computer-executable instructions that may perform the method steps as described in the embodiments above.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.
The foregoing describes preferred embodiments of the present invention, and is intended to provide a clear and concise description of the spirit and scope of the invention, and not to limit the same, but to include all modifications, substitutions, and alterations falling within the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A mobile phone client user loss early warning processing method comprises the following steps:
collecting user information in an operator server at regular time to form a first user information set;
carrying out digital processing on the first user information set to form a first user data set;
estimating, using a first estimation module, a first probability value for each user in the first set of user data based on the first set of user data;
when the first probability value is greater than a first threshold, classifying the user as a first type of user;
calculating the user data of the first type of user based on a second data model to obtain a calculation result, inquiring a database, and matching the calculation result with a corresponding type question bank;
sending alarm information to a management platform and sending a first request to the first type of user;
based on the first request, adopting corresponding countermeasures;
wherein estimating, using a first estimation module, a first probability value for each user in the first set of user data based on the first set of user data comprises:
estimating the first set of user data based on a first data model, wherein the first probability value is a user churn probability value;
wherein the establishing of the first data model comprises the steps of:
selecting historical user information for modeling, and dividing a historical user information set into a training set and a test set according to a proportion, wherein the training set is used for modeling and model parameter estimation, and the test set is used for model evaluation;
extracting user characteristic data which can be used for modeling, and establishing a data analysis broad table;
establishing the first data model based on the data analysis broad table;
wherein the establishing the first data model based on the data analysis broad table specifically comprises:
binning the data set;
performing WOE conversion on each box to obtain a WOE value;
performing variable clustering operation by a splitting method, and screening variables;
further screening variables by a backward elimination method: if the VIF of a variable is more than 10, eliminating the variable with the maximum p-value, wherein VIF is the variance inflation factor and the p-value is the probability value of the hypothesis test, and then performing logistic regression modeling on the remaining variables;
repeating the screening step until all variables have VIF < 10 and p-value < 0.05;
the variable clustering operation performed by the splitting method specifically comprises the following steps:
solving a covariance matrix for vectors formed by all variables, and calculating a first characteristic root and a second characteristic root as well as a corresponding first characteristic vector and a corresponding second characteristic vector;
judging a second feature vector, and if the second feature vector is larger than 0.8, dividing the variables into two types;
and respectively calculating covariance matrixes of the two classified variables, respectively calculating a first characteristic root and a second characteristic root of the two classified variables, and a corresponding first characteristic vector and a corresponding second characteristic vector, and returning to the judging step until the second characteristic vectors of the covariance matrixes of all the subclasses are not more than 0.8 or only 1 variable exists in the subclasses.
2. The method of claim 1, wherein, in performing the variable clustering operation by the splitting method and screening variables, the screening of variables specifically comprises:
reserving, in each class, the variable with the highest IV value and the variable with the lowest $1 - R^2_{\mathrm{ratio}}$ value, a high IV value meaning that the contribution of the variable to the model result is high; wherein the $1 - R^2_{\mathrm{ratio}}$ value of a variable X is given by:
$$1 - R^2_{\mathrm{ratio}} = \frac{1 - R^2}{1 - \tilde{R}^2}$$
where $R^2$ is a measure of how representative the variable is within its cluster, obtained by squaring the Pearson correlation coefficient between the variable and the first principal component of the class to which it belongs, and $\tilde{R}^2$ is the largest squared Pearson correlation coefficient between the variable and the first principal components of the classes that do not contain the variable, given by:
$$\tilde{R}^2 = \max_{1 \le j \le k,\ j \neq i} \mathrm{Corr}(X, PC_j)^2$$
where k is the number of classes, the k classes are numbered from 1 to k in sequence, $PC_j$ is the first principal component of the j-th class, i is the number of the class in which X is located, and Corr denotes the Pearson correlation coefficient.
3. The method of claim 1, wherein the second data model is a decision tree based multi-classification model.
4. The method of claim 1, wherein based on the first request, taking corresponding countermeasures comprises:
the first request is to ask the first type user whether to accept a questionnaire survey;
if the first type user agrees, sending a corresponding network link address to the first type user;
receiving a feedback response of the first type user, wherein the feedback response comprises an answer of the first type user to a type question;
and adopting corresponding countermeasures based on the feedback response.
5. The method of claim 1, wherein the user information includes user personal information, user behavior related data in a billing system, and mobile client information of the user.
6. The method of claim 5, wherein the user's mobile client information is obtained through a radio resource control connection request (RRC CONNECTION REQUEST) message or a channel request (CHANNEL REQUEST) message.
CN202010769480.3A 2020-08-04 2020-08-04 Mobile phone client user loss early warning processing method Active CN111652661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010769480.3A CN111652661B (en) 2020-08-04 2020-08-04 Mobile phone client user loss early warning processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010769480.3A CN111652661B (en) 2020-08-04 2020-08-04 Mobile phone client user loss early warning processing method

Publications (2)

Publication Number Publication Date
CN111652661A CN111652661A (en) 2020-09-11
CN111652661B true CN111652661B (en) 2020-12-08

Family

ID=72348813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010769480.3A Active CN111652661B (en) 2020-08-04 2020-08-04 Mobile phone client user loss early warning processing method

Country Status (1)

Country Link
CN (1) CN111652661B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112686718B (en) * 2021-03-19 2021-06-29 深圳索信达数据技术有限公司 Method and device for acquiring user loss reason, computer equipment and storage medium
CN113139715A (en) * 2021-03-30 2021-07-20 北京思特奇信息技术股份有限公司 Comprehensive assessment early warning method and system for loss of group customers in telecommunication industry
CN113256328B (en) * 2021-05-18 2024-02-23 深圳索信达数据技术有限公司 Method, device, computer equipment and storage medium for predicting target clients

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107730289A (en) * 2016-08-11 2018-02-23 株式会社理光 A kind of user behavior analysis method and user behavior analysis device
CN110147803B (en) * 2018-02-08 2022-02-18 北大方正集团有限公司 User loss early warning processing method and device
CN108537587A (en) * 2018-04-03 2018-09-14 广州优视网络科技有限公司 It is lost in user's method for early warning, device, computer readable storage medium and server
CN111311318A (en) * 2020-02-12 2020-06-19 上海东普信息科技有限公司 User loss early warning method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111652661A (en) 2020-09-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant