Background
With the arrival of the big data era, competition among communication-industry operators has intensified. To attract more users, operators run marketing activities that draw new registrations from various channels; however, some abnormally registered users appear in these activities, such as risk users who leave mobile phone cards unused, farm cards, or commit fraud. To identify such abnormal users, operators build classification algorithm models from the large volumes of data users generate every day, such as basic user information, internet behavior data, communication consumption data, and location data.
At present, abnormal-user identification in the communication industry relies mainly on single-attribute data and a single algorithm. On the data side, this mostly means network data and mobile location data; on the algorithm side, abnormal users are mainly characterized through descriptive statistics and detected with basic algorithm models.
The existing technology for detecting abnormal users in the communication industry has the following problems:
(1) Data is single-dimensional and coverage is incomplete.
Existing abnormal-user detection in the industry relies on user location data or user internet data alone. Against the background of big data, user behaviors and interest preferences are increasingly diverse, so analysis and detection based on single-dimensional data yield low identification and accuracy rates.
(2) The detection method is relatively basic and has weak generalization capability.
In the prior art, descriptive statistics are used, but abnormal users are difficult to identify by counting the mean, variance, and similar statistics of each user index. Basic machine-learning algorithms impose strict limits on data formats: some models cannot use discrete data, others cannot use continuous data. The adopted models are prone to over-fitting and under-fitting, unstable, weak in generalization capability, and poor in output expression.
(3) The calculation is complex and the resource occupation is large.
In the prior art, the data- and model-processing steps are complex, so the computation load is large, model recursion and iteration execute inefficiently, and considerable computer resources are occupied.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an abnormal user identification method, device, and computer-readable storage medium based on the XGBoost algorithm. It aims to fully exploit mobile big data so as to suit wider and more comprehensive service scenarios, solving the problems of single-dimensional data and incomplete coverage. By using an ensemble algorithm, weak learners are integrated to improve the generalization capability of the model and the expressiveness of its output; features are optimized in parallel at fine granularity, improving algorithm efficiency and reducing the computation load.
The technical scheme adopted by the invention is as follows: an abnormal user identification method based on an XGboost algorithm is specifically realized by the following steps:
S1, data preprocessing and feature selection: obtain user data for the batch of users to be identified within a specified period, perform data preprocessing through data cleaning and feature engineering, and output feature vectors and class labels;
S2, model establishment: using the processed feature vectors and class labels as the sample set D input to the model, construct an ensemble classification model and compute the prediction ŷ_i = Σ_{k=1}^K f_k(x_i). Then construct the objective function of the algorithm from the predictions output by the model:

Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), with Ω(f) = γT + (1/2)λ‖ω‖²

The first part of the objective, Σ_i l(y_i, ŷ_i), is the loss function; the second part, Σ_k Ω(f_k), is the regularization term expressing the complexity of the trees: the smaller its value, the lower the complexity and the stronger the generalization capability of the model. In the regularization term, T is the number of leaf nodes of a tree and γ is the coefficient controlling the number of leaves; ‖ω‖² is the squared L2 norm of the leaf scores ω (the L2 norm is the square root of the sum of squares), which prevents over-fitting and keeps the optimization solution stable and fast, and λ is the regularization coefficient ensuring the leaf scores do not grow too large. Iterating yields the optimal loss function and hence the final classification result;
S3, model parameter optimization and model verification: optimize the model parameters, evaluate and verify the trained model multiple times, take the parameters with the best detection performance in verification, and output the model.
Preferably, the user data includes: user personal information, whether the user is real-named, real-name age, attribution, network-access time, activation time, power-on time, ARPU value, calling times, called times, total call count, website access count, APP access count, traffic used, usage period, access IP, resident cell base station, and resident cell field.
Preferably, in step S1, the data preprocessing is implemented as follows:
A1, data cleaning: remove duplicate values from the extracted user data, and uniformly handle missing values and data field formats according to the data type;
A2, feature engineering: standardize the cleaned data, encode categorical variables and convert them into dummy variables, binarize the quantitative field features, and convert text data into numerical data, thereby constructing the feature vectors of the model.
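A minimal sketch of steps A1 and A2 in pure Python; the record fields and category values below are illustrative, not part of the invention:

```python
def dedupe(rows):
    # A1: drop exact duplicate records while preserving order
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def one_hot(value, categories):
    # A2: encode one categorical value as dummy (0/1) variables
    return [1 if value == c else 0 for c in categories]

rows = [{"age": 30, "real_name": "Y"},
        {"age": 30, "real_name": "Y"},   # duplicate record
        {"age": 45, "real_name": "N"}]
clean = dedupe(rows)            # two unique records remain
vec = one_hot("Y", ["Y", "N"])  # → [1, 0]
```

In practice the same steps would be done with pandas (`drop_duplicates`, `get_dummies`), but the logic is as above.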
Preferably, the step S2 is implemented by the following steps:
B1, model construction: build the processed features into a sample set D = {(x_i, y_i)} (|D| = n, x_i ∈ R^m, y_i ∈ R), divide it into K subsets with K-fold cross validation, and construct the ensemble classification model, where x_i is a feature vector, y_i is a class label, R is the set of real numbers, and R^m is the m-dimensional real feature space;
B2, model initialization: initialize the weights ω_0 with a constant p and the function f_0(x) = argmin_γ Σ_{i=1}^N l(y_i, γ), where y_i is the sample label, γ is an adjustable parameter, and N is the total number of samples;
B3, iterative computation of predictions: for feature x_i and class label y_i, the prediction is ŷ_i = Σ_{k=1}^K f_k(x_i), f_k ∈ F, where F = {f(x) = ω_{q(x)}} (q: R^m → {1, …, T}, ω ∈ R^T) is the set of CART decision trees, K is the number of decision trees, T is the number of leaf nodes of a tree, and each classification decision tree f_k corresponds to an independent tree structure q and the leaf weights ω;
B4, iterative computation of the error: the objective function is Obj = Σ_i l(y_i, ŷ_i) + Σ_k Ω(f_k), where y_i is the true value and ŷ_i the prediction. Using the model result of the previous (t−1) rounds, the model is trained on the residual: each new model adds a new function to the original model, so at iteration t

Obj^(t) = Σ_{i=1}^n l(y_i, ŷ_i^(t−1) + f_t(x_i)) + Ω(f_t) + C

where C is a constant term, Ω(f_t) is the regularization term, and f_t(x_i) = ω_{q(x_i)} is the function jointly represented by the tree structure part q and the leaf-node sample weights ω. Applying a second-order Taylor expansion, expanding the regularization term, and dropping the constant term gives

Obj^(t) ≈ Σ_{j=1}^T [G_j ω_j + (1/2)(H_j + λ) ω_j²] + γT

where γ and λ are tuning parameters (γ penalizes the number of leaves; λ is the L2 coefficient that controls the node-splitting threshold), G_j and H_j are the sums over leaf j of the first and second derivatives of the loss, and I_j = {i | q(x_i) = j} is the set of sample indices assigned to the jth leaf. Setting the partial derivative of the objective with respect to ω_j to zero yields the optimal value of the jth leaf, ω′_j = −G_j/(H_j + λ), and the minimum of the objective, obj′ = −(1/2) Σ_{j=1}^T G_j²/(H_j + λ) + γT;
B5, model completion: using gradient descent, when the optimal values ω′_j and the objective obj′ are reached the loss function is minimal; iteration stops and the final model is output.
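The closed-form leaf solution in B4 can be checked numerically. A sketch, assuming squared-error loss (so each h_i = 1) and illustrative gradient values; the function names are hypothetical:

```python
def leaf_weight(grads, hessians, lam=1.0):
    # ω'_j = -G_j / (H_j + λ): optimal score of leaf j,
    # with G_j, H_j the summed first/second derivatives over I_j
    G, H = sum(grads), sum(hessians)
    return -G / (H + lam)

def leaf_objective(grads, hessians, lam=1.0, gamma=0.1):
    # one leaf's contribution to obj' = -1/2 * G^2/(H+λ) + γ
    G, H = sum(grads), sum(hessians)
    return -0.5 * G * G / (H + lam) + gamma

g = [0.5, -1.0, 0.25]  # gradients of the samples in leaf j (I_j)
h = [1.0, 1.0, 1.0]    # hessians: all 1 for squared-error loss
w = leaf_weight(g, h)  # G = -0.25, H = 3 → w = 0.25/4 = 0.0625
```

The negative sign means the leaf score moves against the accumulated gradient, and λ in the denominator shrinks scores toward zero, which is exactly the over-fitting control described above.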
Preferably, in step S3, the method for model evaluation includes the following steps:
C1, confusion matrix: in the confusion matrix, columns represent the predicted class, and each column total is the number of instances predicted as that class; rows represent the true class of the samples, and each row total is the number of instances truly belonging to that class, as shown in Table 1 below.
TABLE 1 confusion matrix
Precision = TP/(TP + FP); the higher the value, the better the effect. Recall = TP/(TP + FN); the higher the value, the better the model. Here TP is the number of positive samples predicted as positive (true value 0, prediction 0), FN the number of positive samples predicted as negative (true value 0, prediction 1), FP the number of negative samples predicted as positive (true value 1, prediction 0), and TN the number of negative samples predicted as negative (true value 1, prediction 1); in this encoding the positive class is labeled 0 and the negative class 1;
C2, ROC curve and AUC value: the ROC curve plots the false positive rate from the confusion matrix on the abscissa against the true positive rate on the ordinate, showing how the two variables grow together as the threshold varies; given a threshold, samples scoring above it are classified as positive and those below as negative, and the steeper the ROC curve the better. The AUC value is the area under the ROC curve; the larger the AUC, the better the model performs, and AUC = 1 corresponds to an ideal model.
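The two metrics in C1 follow directly from the four confusion-matrix counts; a minimal check with made-up counts:

```python
def precision_recall(tp, fp, fn):
    # Precision = TP/(TP+FP): share of predicted positives that are correct
    # Recall    = TP/(TP+FN): share of true positives that are found
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts, not results from the embodiment:
p, r = precision_recall(tp=80, fp=20, fn=25)  # p = 0.8, r = 80/105
```

With libraries such as scikit-learn the same values come from `precision_score` and `recall_score` on the label vectors.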
An abnormal user identification device based on the XGBoost algorithm comprises a storage device and a processor; the storage device stores one or more programs which, when executed by the processor, cause the processor to realize the abnormal user identification method based on the XGBoost algorithm. Preferably, the device further comprises a communication interface for communication and interactive data transmission with external devices.
A computer-readable storage medium stores at least one program which, when executed by a processor, realizes the abnormal user identification method based on the XGBoost algorithm.
The invention has the following beneficial effects: integrating weak learners strengthens the model's expressive power. Feature indexes of each user dimension are input, an XGBoost model is established, the loss function is computed via second-order Taylor derivation, and the weight matrix is adjusted, improving the model's expressiveness and recognition rate. Meanwhile, by customizing the loss function and the input feature variables, a suitable loss function can be defined for different service scenarios and an appropriate weight matrix adjusted accordingly, making the method convenient for user identification in more service scenarios and improving the ability of governments and enterprises to identify various abnormal users.
The method has the following specific beneficial effects:
(1) by using multi-dimensional user data, more user features are extracted and more user patterns are mined, so the data coverage is more comprehensive;
(2) the weak learner selected by the algorithm is the CART algorithm, which accepts both discrete and continuous features, so the model's data format breaks through the conventional limitation; at the same time, the application and improvement of the CART algorithm raise operating efficiency;
(3) the scheme integrates multiple weak learners simultaneously, which strengthens the model's expressiveness, enhances generalization capability, and improves accuracy and recognition rate, making it suitable for more service scenarios, such as telecom-fraud user identification and off-network risk user identification;
(4) column sampling and parallel feature optimization are adopted, reducing the computation load and saving computing resources.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description; it should be noted that, absent conflict, the embodiments or technical features described below can be combined in any way to form new embodiments.
The abnormal users to be identified by the method mainly comprise risk users who transact mobile phone cards but leave them unused, farm cards, or commit fraud. To identify them, an algorithm model is established using the large volumes of data mobile phone users generate every day, such as basic user information, internet behavior data, communication consumption data, and location data. The following mobile phone user data are mainly used. Demographic data: age, industry, real-name status, attribution, real-name certificate attribution; consumption behavior data: monthly consumption, ARPU value, calling/called times and charges, total call count, etc.; internet behavior data: website access count, APP access count, traffic used, usage period, access IP, etc.; location data: resident cell base station, resident cell; other data: network-access time, user state, power-on time, activation time, etc. An XGBoost model is established from these data and the abnormally classified users are output.
Referring to fig. 1 and 2, the present invention is an abnormal user identification method, device, and computer-readable storage medium based on the XGBoost algorithm. The following is one embodiment of the invention, applied to abnormal user identification for users of a mobile big data platform in Guangdong, based on the XGBoost algorithm in ensemble learning; the specific implementation flow is as follows:
S1, data preprocessing and feature selection: based on a communication service provider's big data platform, data of 1,000,000 users registered during the activity period (about three months) are randomly acquired; the data mainly comprise fields such as whether the user is real-named, real-name age, attribution, network-access time, activation time, power-on time, ARPU value, calling times, called times, total call count, website access count, APP access count, traffic used, usage period, access IP, resident cell base station, and resident cell.
S11, data cleaning: remove duplicate values from the user data extracted from the big data platform, and uniformly handle missing values and data field formats according to the data type;
S12, feature engineering: apply z-score standardization to the cleaned data, x′ = (x − μ)/σ, where x′ is the processed feature, x the original feature, μ the mean, and σ the standard deviation; then encode the categorical variables (real-name status, user state, industry, and attribution) as dummy variables; binarize the quantitative field features and convert text data into numerical data, thereby constructing the feature vectors of the model;
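The z-score transform in S12 can be sketched with the standard library; this uses the population standard deviation, which the embodiment does not specify:

```python
from statistics import mean, pstdev

def z_score(xs):
    # x' = (x - μ) / σ, with population mean μ and std σ
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

z = z_score([10, 20, 30])  # μ = 20, σ ≈ 8.165 → roughly [-1.22, 0, 1.22]
```

After the transform each feature column has mean 0 and unit variance, so features measured on very different scales (e.g. ARPU versus call counts) contribute comparably to the model.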
S2, model establishment: the model input is the feature vector X = (x_1, x_2, x_3, …, x_n)^T and the class label Y = (y_1, y_2, y_3, …, y_n)^T, where T denotes vector transposition; the implementation is as follows:
S21, divide the feature vector X = (x_1, x_2, x_3, …, x_n)^T with K-fold cross validation; in the experiment K = 10 (10-fold), so 9 subsets are used for training and 1 for testing, 10 experiments are run with the 10 subsets taking turns as the test set, and the final result is the average of the 10 experiments;
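The 10-fold split in S21 amounts to partitioning the sample indices; a minimal sketch without shuffling (the embodiment says the data are randomly acquired, so no extra shuffle is assumed):

```python
def kfold_indices(n, k=10):
    # partition indices 0..n-1 into k near-equal contiguous folds;
    # each fold serves once as the test set, the other k-1 as training
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

folds = kfold_indices(20, k=10)  # 10 folds of 2 indices each
test_idx = folds[0]              # held out in experiment 1
train_idx = [i for f in folds[1:] for i in f]
```

In practice `sklearn.model_selection.KFold` does the same job and also handles shuffling with a fixed random seed.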
S22, model initialization: initialize the weights ω_0 with a constant p and the function f_0(x) = argmin_γ Σ_{i=1}^N l(y_i, γ), where y_i is the sample label, γ is an adjustable parameter, and N = 1,000,000 is the total number of samples;
S23, compute predictions: iteratively compute, for feature x_i and class label y_i, the prediction ŷ_i = Σ_{k=1}^K f_k(x_i), f_k ∈ F, where F = {f(x) = ω_{q(x)}} (q: R^m → {1, …, T}, ω ∈ R^T) is the set of CART decision trees, K is the number of decision trees, T is the number of leaf nodes of a tree, and each classification decision tree f_k, corresponding to an independent tree structure q and the leaf weights ω, outputs a predicted label value for the sample;
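The embodiment does not state which loss l(y_i, ŷ_i) is used; assuming the common binary logistic loss, the per-sample first and second derivatives that feed the Taylor expansion in S24 are g_i = p_i − y_i and h_i = p_i(1 − p_i):

```python
import math

def grad_hess_logloss(y, margin):
    # margin: raw score ŷ before the sigmoid; p = sigmoid(margin)
    # logistic loss: g = p - y (gradient), h = p(1-p) (hessian)
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)

g, h = grad_hess_logloss(y=1, margin=0.0)  # p = 0.5 → g = -0.5, h = 0.25
```

These g_i and h_i, summed per leaf into G_j and H_j, are exactly what the closed-form leaf weight in S24 consumes.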
S24, compute the error: according to the transformed objective function Obj^(t) ≈ Σ_{j=1}^T [G_j ω_j + (1/2)(H_j + λ) ω_j²] + γT, where γ and λ are tuning parameters (γ penalizes the number of leaves; λ is the L2 coefficient that controls the node-splitting threshold), G_j and H_j are the sums over leaf j of the first and second derivatives of the loss, and I_j = {i | q(x_i) = j} is the set of sample indices assigned to the jth leaf, compute the optimal value of the jth leaf, ω′_j = −G_j/(H_j + λ), and the minimum of the objective, obj′ = −(1/2) Σ_{j=1}^T G_j²/(H_j + λ) + γT, thereby obtaining the optimal weight matrix ω of the tree at this point.
S25, iterate to minimize the target loss function obj′ and obtain the optimal weight matrix ω′_j; the method iterates 1500 times to reach the optimum, stops iterating, and outputs the final model. From the 1,000,000 users, 137,900 abnormal users are finally identified.
S3, model parameter optimization and model verification:
S31, model evaluation: evaluate the trained model with the confusion matrix and output Precision = TP/(TP + FP) and Recall = TP/(TP + FN); after 1500 iterations, the precision of this embodiment is 82.41% and the recall is 80.05%;
S32, ROC curve and AUC value: the model plots the ROC curve with the false positive rate from the confusion matrix on the abscissa and the true positive rate on the ordinate to show the growth of the two variables; the AUC value is 0.804;
S33, model parameter optimization: the XGBoost model parameters are optimized by adjusting eta, max_depth, nround, subsample, alpha, lambda, min_child_weight, etc. in the model. In this embodiment, with the fixed parameters eta = 0.3 and subsample = 0.6, alpha and lambda at their default value of 1, maximum iteration number nround = 35, and tree depth max_depth = 42, the model effect is optimal; the final model evaluation results of the 10 experiments are shown in Table 2 below.
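The tuned configuration in S33 could be expressed as an xgboost parameter dictionary; the key names follow the xgboost Python package, the values are those reported in this embodiment, and the objective key is an assumption since the embodiment does not name one:

```python
params = {
    "eta": 0.3,        # learning rate (fixed in this embodiment)
    "subsample": 0.6,  # row-sampling ratio per tree
    "alpha": 1,        # L1 regularization coefficient (default)
    "lambda": 1,       # L2 regularization coefficient λ (default)
    "max_depth": 42,   # tree depth reported as optimal here
    "min_child_weight": 1,           # default; also tuned per S33
    "objective": "binary:logistic",  # assumption: binary classification
}
nround = 35  # boosting rounds (nround) reported as optimal

# The training call (not executed here) would look like:
# booster = xgboost.train(params, dtrain, num_boost_round=nround)
```

A max_depth of 42 is unusually deep for XGBoost; it is reproduced here only because the embodiment reports it.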
TABLE 2 evaluation results of the model
An abnormal user identification device based on the XGBoost algorithm comprises a storage device and a processor; the storage device stores one or more programs which, when executed by the processor, cause the processor to realize the abnormal user identification method based on the XGBoost algorithm.
Preferably, the device may also include a communication interface for communicating with external devices and for interactive data transmission.
It should be noted that the memory may include a high-speed RAM memory, and may also include a nonvolatile memory (nonvolatile memory), such as at least one disk memory.
In a specific implementation, if the memory, the processor and the communication interface are integrated on a chip, the memory, the processor and the communication interface can complete mutual communication through the internal interface. If the memory, the processor and the communication interface are implemented independently, the memory, the processor and the communication interface may be connected to each other through a bus and perform communication with each other.
The invention also discloses a computer readable storage medium which stores at least one program, and when the program is executed by a processor, the abnormal user identification method based on the XGboost algorithm is realized.
It should be understood that the computer-readable storage medium described above is any data storage device that can store data or programs which can thereafter be read by a computer system. Examples of computer-readable storage media include: read-only memory, random access memory, CD-ROM, HDD, DVD, magnetic tape, optical data storage devices, and the like.
The computer readable storage medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
In some embodiments, the computer-readable storage medium may also be non-transitory.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.