CN112950231A - XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium - Google Patents
- Publication number
- CN112950231A (CN202110297781.5A)
- Authority
- CN
- China
- Prior art keywords
- model
- data
- user identification
- algorithm
- abnormal user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/018—Certifying business or products
- G06Q30/0185—Product, service or business identity fraud
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The invention relates to the field of artificial intelligence and discloses an abnormal user identification method, device and computer-readable storage medium based on the XGBoost algorithm. The method comprises the following steps. Data preprocessing and feature selection: user data to be identified are acquired in batches and preprocessed through data cleaning and feature engineering. Model establishment: the processed feature vectors and class labels are used as the sample set input to the model to construct a classification model and calculate predicted values; an objective function of the algorithm is constructed from the predicted values output by the model, and iteration minimizes the loss function to obtain the final classification result. Model parameter tuning and model verification: the model parameters are optimized. The invention uses multi-dimensional user data for more comprehensive data coverage; the CART algorithm is selected as the weak learner, which improves operating efficiency; the model's performance and generalization capability are strengthened, accuracy and recognition rate are improved, and computing resources are saved.
Description
Technical Field
The invention relates to the field of artificial intelligence, and in particular to an abnormal user identification method, system and device based on the XGBoost algorithm.
Background
With the development of the big data era, competition among telecom operators has intensified. To attract more users, operators run marketing campaigns that draw new registrations from various channels. Some of the users registered during these campaigns are abnormal: risk users who obtain SIM cards without using them, who farm cards, or who commit fraud. To identify such users, operators build classification models on the large volume of data that users generate every day, such as basic user information, internet behavior data, communication consumption data and location data.
At present, abnormal users in the communication industry are mainly detected with single-attribute data and a single algorithm. On the data side, network data and mobile location data are mainly used; on the algorithm side, abnormal users are mainly described through statistical analysis and detected with basic algorithm models.
The existing technology for detecting abnormal users in the communication industry has the following problems:
(1) The data are one-dimensional and coverage is incomplete.
Existing abnormal user detection techniques in the industry rely on user location data alone or user internet data alone for network element detection. In the big data era, user behavior and interest preferences are increasingly diverse, so analysis and detection based on single-dimensional data yield low recognition rates and low accuracy.
(2) The detection method is relatively basic and has weak generalization capability.
The prior art relies on descriptive statistics, and abnormal users are hard to identify from the mean, variance and similar statistics of each user indicator. Basic machine learning algorithms also place strict limits on data formats: some models cannot use discrete data and others cannot use continuous data. The models adopted are prone to overfitting and underfitting, are unstable, generalize poorly and give poor output.
(3) The calculation is complex and resource usage is high.
In the prior art, the data and model processing steps are complex, so the amount of computation is large, recursion and iteration of the model execute inefficiently, and a large amount of computing resources is occupied.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an abnormal user identification method, device and computer-readable storage medium based on the XGBoost algorithm. It aims to make full use of mobile big data so as to suit broader and more comprehensive service scenarios, solving the problem of one-dimensional data and incomplete coverage. By using an ensemble algorithm that integrates weak learners, it improves the generalization capability and output quality of the model; by optimizing feature granularity in parallel, it improves algorithm efficiency and reduces the amount of computation.
The technical scheme adopted by the invention is as follows: an abnormal user identification method based on the XGBoost algorithm, implemented by the following steps:
S1, data preprocessing and feature selection: user data of the users to be identified are obtained in batches for a specified period, preprocessed through data cleaning and feature engineering, and output as feature vectors and class labels;
S2, model establishment: the processed feature vectors and class labels are used as the sample set D input to the model to construct an ensemble classification model, and the model calculates the predicted value $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$. An objective function of the algorithm is then constructed from the predicted values output by the model: $\mathrm{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$, where the first part, $\sum_{i=1}^{n} l(y_i, \hat{y}_i)$, is the loss function and the second part, $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert \omega \rVert^2$, is the regularization term of the objective function. The regularization term expresses the complexity of the tree: the smaller its value, the lower the complexity and the stronger the generalization capability of the model. In the regularization term, T is the number of leaf nodes of the tree and γ is the coefficient that controls the number of leaf nodes; the second summand is the squared L2 norm of the leaf node scores ω (the L2 norm being the square root of the sum of squares), which prevents overfitting and makes the optimization stable and fast, and λ is the regularization coefficient that keeps the leaf node scores from growing too large. Iteration minimizes the loss function to obtain the final classification result;
S3, model parameter optimization and model verification: the model parameters are optimized, the trained model is evaluated and verified multiple times, the parameters giving the best detection performance in verification are selected, and the model is output.
Preferably, the user data include the following fields: user personal information, real-name status, real-name age, home location, network access time, activation time, power-on time, ARPU value, number of outgoing calls, number of incoming calls, total number of calls, number of website visits, number of APP visits, data usage, usage period, access IP, resident cell base station and resident cell.
Preferably, in step S1, the data preprocessing is implemented as follows:
A1, data cleaning: duplicate values are removed from the extracted user data, missing values are handled according to data type, and data field formats are unified;
A2, feature engineering: the cleaned data are standardized, categorical variables are encoded as dummy variables, quantitative fields are binarized, and text data are converted into numerical data to construct the feature vectors of the model.
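Steps A1 and A2 can be illustrated with a minimal sketch in plain Python. The field names (`user_id`, `arpu`, `real_name`, `calls`), the mean-fill strategy for missing values and the binarization threshold are illustrative assumptions; the text does not specify them.

```python
from statistics import mean, pstdev

def drop_duplicates(records, key):
    """A1: keep the first record seen for each key value."""
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(dict(rec))
    return unique

def fill_missing(records, field):
    """A1: replace missing (None) numeric values with the field mean (assumed strategy)."""
    fill = mean(r[field] for r in records if r[field] is not None)
    for rec in records:
        if rec[field] is None:
            rec[field] = fill

def zscore(values):
    """A2: standardize to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def one_hot(values):
    """A2: encode a categorical field as dummy (0/1) variables."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]

def build_features(records):
    """Run cleaning and feature engineering, returning one feature vector per user."""
    rows = drop_duplicates(records, key="user_id")
    fill_missing(rows, "arpu")
    arpu = zscore([r["arpu"] for r in rows])
    real_name = one_hot([r["real_name"] for r in rows])
    has_calls = [int(r["calls"] > 0) for r in rows]  # threshold 0 is an assumption
    return [[a, *d, c] for a, d, c in zip(arpu, real_name, has_calls)]

# Hypothetical sample data: one duplicate row and one missing ARPU value.
records = [
    {"user_id": 1, "arpu": 30.0, "real_name": "yes", "calls": 12},
    {"user_id": 1, "arpu": 30.0, "real_name": "yes", "calls": 12},
    {"user_id": 2, "arpu": None, "real_name": "no", "calls": 0},
    {"user_id": 3, "arpu": 50.0, "real_name": "yes", "calls": 4},
]
features = build_features(records)
```

Each resulting vector holds the standardized ARPU, the dummy variables for real-name status, and the binarized call count.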
Preferably, the step S2 is implemented by the following steps:
B1, model construction: the processed features are constructed as a sample set $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), K-fold cross-validation is used to divide it into K subsets, and an ensemble classification model is constructed, where $x_i$ is a feature vector, $y_i$ is a class label, $\mathbb{R}$ is the set of real numbers, and $\mathbb{R}^m$ is the set of m-dimensional real vectors, m being the number of features;
B2, model initialization: the weights ω are initialized with a constant $p_0$ and the function $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} l(y_i, \gamma)$, where $y_i$ is a sample label, γ is an adjustment parameter and N is the total number of samples;
B3, iterating and calculating the predicted value: the predicted value for feature $x_i$ and class label $y_i$ is $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, $f_k \in F$, where $F = \{f(x) = \omega_{q(x)}\}$ ($q: \mathbb{R}^m \to T$, $\omega \in \mathbb{R}^T$) is the set of CART decision trees, K is the number of decision trees, T is the number of leaf nodes of a decision tree, and each classification decision tree $f_k$ corresponds to an independent tree structure q and leaf weights ω;
B4, iterating and calculating the error: the objective function is $\mathrm{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$, where $y_i$ is the true value and $\hat{y}_i$ is the predicted value. The result of the previous (t-1) rounds is computed and the model is trained on the residual; each new model adds a new function to the existing model, so that at the t-th iteration $\mathrm{obj}^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + C$, where C is a constant term, $\Omega(f_t)$ is the regularization term, and $f_t(x_i) = \omega_{q(x_i)}$ is the function jointly represented by the tree structure q and the leaf weights ω. Applying a second-order Taylor expansion, expanding the regularization term and dropping the constant term gives $\mathrm{obj}^{(t)} \approx \sum_{j=1}^{T} \big[ \big(\sum_{i \in I_j} g_i\big)\omega_j + \frac{1}{2}\big(\sum_{i \in I_j} h_i + \lambda\big)\omega_j^2 \big] + \gamma T$, where γ and λ are tuning parameters: λ is the weight of the L2 regularization term (a larger value makes the model more conservative) and γ controls the node-splitting threshold; $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ are the first and second derivatives of the loss; and $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of the samples assigned to the j-th leaf. Let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. Setting the partial derivative of the objective function with respect to $\omega_j$ to zero gives the optimal value of the j-th leaf, $\omega'_j = -\frac{G_j}{H_j + \lambda}$, and the minimum value of the objective function, $\mathrm{obj}' = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$;
B5, completing the model: the optimal value $\omega'_j$ is obtained by gradient descent; when the objective function reaches $\mathrm{obj}'$, the loss function is at its minimum, iteration stops and the final model is output.
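The leaf-weight step in B4 can be checked numerically with a short sketch. It assumes a logistic loss, which the text does not name, and takes the assignment of samples to leaves as given; the function names are hypothetical. For a fixed tree structure it accumulates the first and second derivatives per leaf and applies the closed-form leaf weight $-G_j/(H_j + \lambda)$ and the matching objective value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess_logistic(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)

def leaf_weights_and_objective(y, raw_scores, leaf_index, n_leaves, lam=1.0, gamma=0.0):
    """Closed-form optimal leaf weights and objective for a fixed tree structure."""
    G = [0.0] * n_leaves  # per-leaf sum of first derivatives
    H = [0.0] * n_leaves  # per-leaf sum of second derivatives
    for yi, fi, j in zip(y, raw_scores, leaf_index):
        g, h = grad_hess_logistic(yi, fi)
        G[j] += g
        H[j] += h
    weights = [-G[j] / (H[j] + lam) for j in range(n_leaves)]
    objective = -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in range(n_leaves)) + gamma * n_leaves
    return weights, objective

# Four samples, previous-round raw score 0 for all, two leaves.
weights, objective = leaf_weights_and_objective(
    y=[1, 1, 0, 0], raw_scores=[0.0, 0.0, 0.0, 0.0], leaf_index=[0, 0, 1, 1], n_leaves=2
)
```

With these inputs the leaf holding the two label-1 samples receives a positive weight and the other leaf a negative one, so the next round moves each group's score toward its label.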
Preferably, in step S3, the method for model evaluation includes the following steps:
C1, confusion matrix: in the confusion matrix, each column represents a predicted class and the column total is the number of samples predicted as that class; each row represents the true class of the samples and the row total is the number of samples that actually belong to that class, as shown in Table 1 below.
TABLE 1 confusion matrix
Precision = TP/(TP + FP), and the higher the value, the better the model; recall = TP/(TP + FN), and likewise the higher the value, the better the model. Here TP is the number of positive samples predicted as positive (truth 0, prediction 0); FN is the number of positive samples predicted as negative (truth 0, prediction 1); FP is the number of negative samples predicted as positive (truth 1, prediction 0); and TN is the number of negative samples predicted as negative (truth 1, prediction 1);
C2, ROC curve and AUC value: the ROC curve plots the false positive rate on the abscissa against the true positive rate on the ordinate, both derived from the confusion matrix, and shows how the two rates increase together as the decision threshold varies. For a given threshold, samples scoring above it are assigned to the positive class and samples scoring below it to the negative class. The steeper the ROC curve, the better. The AUC value is the area under the ROC curve; the larger the AUC, the better the model performs, and AUC = 1 corresponds to an ideal model.
An abnormal user identification device based on the XGBoost algorithm comprises a storage device and a processor. The storage device stores one or more programs; when the one or more programs are executed by the processor, the processor implements the abnormal user identification method based on the XGBoost algorithm. Preferably, the device further comprises a communication interface for communicating and exchanging data with external devices.
A computer-readable storage medium for abnormal user identification based on the XGBoost algorithm stores at least one program which, when executed by a processor, implements the abnormal user identification method based on the XGBoost algorithm.
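The evaluation measures in C1 and C2 can be sketched in plain Python as follows. The function names, the example labels and scores, and the 0.5 threshold are illustrative assumptions. The label that counts as the positive class is a parameter, since the text encodes the positive class as 0. AUC is computed here through its pairwise-ranking form, which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fn, fp, tn

def precision_recall(tp, fn, fp):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN); 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0 for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for the positive class; 0.5 is an assumed decision threshold.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
y_pred = [int(s > 0.5) for s in scores]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fn, fp)
area = auc(y_true, scores)
```

The pairwise form is quadratic in the number of samples, so for a data set of the size described in the embodiment a sort-based computation would be the more practical choice.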
The beneficial effects of the invention are as follows. Integrating weak learners strengthens the model's performance: feature indicators for each dimension of a user are input, an XGBoost model is established, the loss function is computed with a second-order Taylor expansion, and the weight matrix is adjusted, improving the model's performance and recognition rate. In addition, a suitable loss function can be defined and suitable feature variables input for each service scenario, so that the weight matrix is adjusted to fit that scenario. This makes the method applicable to user identification in more service scenarios and improves the ability of governments and enterprises to identify various kinds of abnormal users.
The method has the following specific beneficial effects:
(1) multi-dimensional user data are used, so that more user features are extracted, more user patterns are mined, and the data are covered more comprehensively;
(2) the weak learner selected by the algorithm is CART, which accepts both discrete and continuous features, so the model is freed from conventional data format limitations; the application and improvement of the CART algorithm also improves operating efficiency;
(3) the scheme integrates multiple weak learners, which strengthens the model's performance and generalization capability, improves accuracy and recognition rate, and makes the scheme suitable for more service scenarios, such as identifying telecom fraud users and off-network (churn) risk users;
(4) column sampling and parallel feature optimization are adopted, which reduces the amount of computation and saves computing resources.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a flow chart of a technical implementation of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
The abnormal users to be identified by the method are mainly risk users who obtain SIM cards without using them, farm cards, commit fraud and the like. To identify them, an algorithm model is built on the large volume of data that mobile phone users generate every day, such as basic user information, internet behavior data, communication consumption data and location data. The data used are mainly: demographic data (age, industry, real-name status, home location, home location of the real-name certificate); consumption behavior data (monthly consumption, ARPU value, number of and charges for outgoing and incoming calls, total number of calls, etc.); internet behavior data (number of website visits, number of APP visits, data usage, usage period, access IP, etc.); location data (resident cell base station, resident cell); and other data (network access time, user status, power-on time, activation time, etc.). An XGBoost model is established on these data and outputs the users classified as abnormal.
Referring to FIG. 1 and FIG. 2, the invention is an abnormal user identification method, device and computer-readable storage medium based on the XGBoost algorithm. The following embodiment applies it to abnormal user identification on a mobile big data platform in Guangdong, using the XGBoost algorithm from ensemble learning. The specific implementation flow is as follows:
S1, data preprocessing and feature selection: data for 1,000,000 users registered during a campaign period are randomly sampled from the big data platform of a communication service provider, covering about three months. The data mainly include the following fields: real-name status, real-name age, home location, network access time, activation time, power-on time, ARPU value, number of outgoing calls, number of incoming calls, total number of calls, number of website visits, number of APP visits, data usage, usage period, access IP (Internet Protocol) address, resident cell base station, resident cell, etc.
S11, data cleaning: duplicate values are removed from the user data extracted from the big data platform, missing values are handled according to data type, and data field formats are unified;
S12, feature engineering: the cleaned data are standardized with the z-score $x' = \frac{x - \mu}{\sigma}$, where x' is the processed feature, x is the original feature, μ is the mean and σ is the standard deviation. The categorical variables (real-name status, user status, industry and home location) are then encoded as dummy variables; quantitative fields are binarized and text data are converted into numerical data to construct the feature vectors of the model;
S2, model establishment: the model inputs are the feature vector $X = (x_1, x_2, x_3, \ldots, x_n)^T$ and the class label $Y = (y_1, y_2, y_3, \ldots, y_n)^T$, where T denotes vector transposition. The implementation is as follows:
S21, the feature vector $X = (x_1, x_2, x_3, \ldots, x_n)^T$ is split by 10-fold cross-validation (K = 10): in each experiment 9 subsets are used for training and 1 subset for testing; 10 experiments are carried out so that each of the 10 subsets is used for testing in turn, and the final result is the average over the 10 experiments;
S22, model initialization: the weights ω are initialized with a constant $p_0$ and the function $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} l(y_i, \gamma)$, where $y_i$ is a sample label, γ is an adjustment parameter and N = 1,000,000 is the total number of samples;
S23, calculating the predicted value: the predicted value for feature $x_i$ and class label $y_i$ is computed iteratively as $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, $f_k \in F$, where $F = \{f(x) = \omega_{q(x)}\}$ ($q: \mathbb{R}^m \to T$, $\omega \in \mathbb{R}^T$) is the set of CART decision trees, K is the number of decision trees, T is the number of leaf nodes of a decision tree, and each classification decision tree $f_k$ corresponds to an independent tree structure q and leaf weights ω; the predicted label value of each sample is output;
S24, calculating the error: the transformed objective function is $\mathrm{obj}^{(t)} \approx \sum_{j=1}^{T} \big[ \big(\sum_{i \in I_j} g_i\big)\omega_j + \frac{1}{2}\big(\sum_{i \in I_j} h_i + \lambda\big)\omega_j^2 \big] + \gamma T$, where γ and λ are tuning parameters: λ is the weight of the L2 regularization term and γ controls the node-splitting threshold; $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction $\hat{y}_i^{(t-1)}$; and $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of the samples assigned to the j-th leaf. Let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. The optimal value of the j-th leaf is then $\omega'_j = -\frac{G_j}{H_j + \lambda}$ and the minimum value of the objective function is $\mathrm{obj}' = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$, which gives the optimal weight matrix ω of the tree at this point.
S25, iteration minimizes the target loss function $\mathrm{obj}'$ to obtain the weight matrix of optimal values $\omega'_j$. The method iterates 1500 times to reach the optimum, stops iterating and outputs the final model. In the end, 137,900 abnormal users are identified among the 1,000,000 users.
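The 10-fold procedure of S21 can be sketched as follows. This is an illustrative sketch rather than the embodiment's code: the shuffling, the seed and the `evaluate` callback, which stands in for training on one split and scoring on the other, are assumptions.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) so that each fold is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average the score returned by evaluate(train, test) over the k experiments."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

In use, `evaluate` would fit the model on the training indices and return a metric such as precision on the test indices; the value returned by `cross_validate` is then the averaged final result described in S21.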
S3, model parameter optimization and model verification:
S31, model evaluation: the trained model is evaluated with the confusion matrix, outputting Precision = TP/(TP + FP) and recall = TP/(TP + FN). After 1500 iterations, the precision in this embodiment is 82.41% and the recall is 80.05%;
S32, ROC curve and AUC value: the ROC curve is plotted with the false positive rate from the confusion matrix on the abscissa and the true positive rate on the ordinate, showing how the two rates increase together; the AUC value is 0.804;
S33, model parameter optimization: the XGBoost model is tuned through the parameters eta, max_depth, nround, subsample, alpha, lambda, min_child_weight, etc. In this embodiment the model performs best with eta fixed at 0.3, subsample at 0.6, alpha and lambda at their default value of 1, the maximum number of iterations nround at 35 and the tree depth max_depth at 42. The final model evaluation results of the 10 experiments are shown in Table 2 below.
TABLE 2 evaluation results of the model
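As a sketch of how the S33 settings might be passed to the open-source `xgboost` Python package: the parameter values are those stated in the embodiment, while the `objective` and `eval_metric` entries, the mapping of `nround` to `num_boost_round`, and the function name `train_model` are assumptions, since the text does not say which XGBoost interface was used.

```python
# Parameter values from step S33; "objective" and "eval_metric" are assumptions.
PARAMS = {
    "objective": "binary:logistic",  # assumed: two-class abnormal/normal task
    "eval_metric": "auc",            # assumed: matches the AUC evaluation in S32
    "eta": 0.3,
    "subsample": 0.6,
    "alpha": 1,
    "lambda": 1,
    "max_depth": 42,
}
NUM_BOOST_ROUND = 35  # "nround" in the text

def train_model(features, labels):
    """Train a booster with the settings above (requires the xgboost package)."""
    import xgboost as xgb  # third-party dependency, imported lazily

    dtrain = xgb.DMatrix(features, label=labels)
    return xgb.train(PARAMS, dtrain, num_boost_round=NUM_BOOST_ROUND)
```

No value is given for `min_child_weight` in the embodiment, so it is left at the library default here.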
The abnormal user identification device based on the XGBoost algorithm comprises a storage device and a processor, wherein the storage device stores one or more programs and, when the one or more programs are executed by the processor, the processor implements the abnormal user identification method based on the XGBoost algorithm.
Preferably, the device may also include a communication interface for communicating and exchanging data with external devices.
It should be noted that the memory may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
In a specific implementation, if the memory, the processor and the communication interface are integrated on a chip, the memory, the processor and the communication interface can complete mutual communication through the internal interface. If the memory, the processor and the communication interface are implemented independently, the memory, the processor and the communication interface may be connected to each other through a bus and perform communication with each other.
The invention also discloses a computer-readable storage medium that stores at least one program; when the program is executed by a processor, the abnormal user identification method based on the XGBoost algorithm is implemented.
It should be understood that the computer-readable storage medium described above is any data storage device that can store data or programs which can thereafter be read by a computer system. Examples of computer-readable storage media include: read-only memory, random access memory, CD-ROM, HDD, DVD, magnetic tape, optical data storage devices, and the like.
The computer readable storage medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
In some embodiments, the computer-readable storage medium may also be non-transitory.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.
Claims (7)
1. An abnormal user identification method based on the XGBoost algorithm, characterized in that the method comprises the following steps:
S1, data preprocessing and feature selection: user data of the users to be identified are obtained in batches for a specified period, preprocessed through data cleaning and feature engineering, and output as feature vectors and class labels;
S2, model establishment: the processed feature vectors and class labels are used as the sample set D input to the model to construct an ensemble classification model, and the model calculates the predicted value $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$. An objective function of the algorithm is then constructed from the predicted values output by the model: $\mathrm{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$, where the first part, $\sum_{i=1}^{n} l(y_i, \hat{y}_i)$, is the loss function and the second part, $\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert \omega \rVert^2$, is the regularization term, which expresses the complexity of the tree. In the regularization term, T is the number of leaf nodes of the tree and γ is the coefficient that controls the number of leaf nodes; the second summand is the squared L2 norm of the leaf node scores ω (the L2 norm being the square root of the sum of squares), which prevents overfitting and makes the optimization stable and fast, and λ is the regularization coefficient that keeps the leaf node scores from growing too large. Iteration minimizes the loss function to obtain the final classification result;
S3, model parameter optimization and model verification: the model parameters are optimized, the trained model is evaluated and verified multiple times, the parameters giving the best detection performance in verification are selected, and the model is output.
2. The XGboost algorithm-based abnormal user identification method according to claim 1, wherein: the user data comprises: user personal information, whether the user is real-name or not, real-name age, attribution, network access time, activation time, startup time, apru value, calling times, called times, total call times, website access times, APP access times, usage flow, usage time interval, access IP, resident cell base station and resident cell field.
3. The XGBoost algorithm-based abnormal user identification method according to claim 1, wherein in step S1 the data preprocessing is implemented as follows:
a1, data cleaning: removing duplicate values from the extracted user data, handling missing values according to the data type, and unifying the data field formats;
a2, feature engineering: standardizing the cleaned data, encoding categorical variables and converting them into dummy variables, binarizing quantitative features, and converting text data into numerical data, so as to construct the feature vectors of the model.
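The cleaning and feature engineering steps of this claim can be sketched as below. This is a minimal illustration using only the Python standard library; the field names (`call_count`, `region`, `traffic_mb`), the mean-fill strategy for missing values and the binarization threshold are hypothetical choices, not taken from the claim.

```python
from statistics import mean, pstdev

def clean(records, numeric_fields):
    """Remove exact duplicates, then fill missing numeric values with the column mean."""
    seen, rows = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            rows.append(dict(rec))
    for field in numeric_fields:
        known = [r[field] for r in rows if r[field] is not None]
        fill = mean(known) if known else 0.0
        for r in rows:
            if r[field] is None:
                r[field] = fill
    return rows

def standardize(values):
    """Z-score scaling; a constant column maps to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def one_hot(values):
    """Encode a categorical column as dummy variables, one per sorted category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def build_features(records, numeric_fields, categorical_field, flag_field, threshold):
    """Clean the records and assemble one numeric feature vector per user."""
    rows = clean(records, numeric_fields + [flag_field])
    scaled = [standardize([r[f] for r in rows]) for f in numeric_fields]
    dummies = one_hot([r[categorical_field] for r in rows])
    flags = [1.0 if r[flag_field] > threshold else 0.0 for r in rows]  # binarization
    return [
        [col[i] for col in scaled] + dummies[i] + [flags[i]]
        for i in range(len(rows))
    ]

users = [
    {"call_count": 10, "region": "A", "traffic_mb": 500},
    {"call_count": 10, "region": "A", "traffic_mb": 500},   # exact duplicate
    {"call_count": None, "region": "B", "traffic_mb": 50},  # missing value
    {"call_count": 30, "region": "A", "traffic_mb": 900},
]
features = build_features(users, ["call_count"], "region", "traffic_mb", threshold=100)
```

Each resulting vector holds the standardized call count, the dummy variables for the region, and the binarized traffic flag, in that order.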
4. The XGBoost algorithm-based abnormal user identification method according to claim 1, wherein step S2 is implemented as follows:
b1, model construction: constructing the processed features into a sample set $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), dividing it into K subsets by K-fold cross validation, and constructing an ensemble classification model, wherein $x_i$ represents a feature vector, $y_i$ represents a category label, $\mathbb{R}$ represents the set of real numbers, and $\mathbb{R}^m$ represents the space of m-dimensional real vectors;
b2, model initialization: initializing the weights ω with a constant $p_0$ and the function $f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$, wherein $y_i$ is the sample label, γ is an adjustment parameter, and N is the total number of samples;
b3, iterating and computing the predicted value: for the feature $x_i$ and the category label $y_i$, the predicted value is $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$, $f_k \in F$, wherein $F = \{f(x) = \omega_{q(x)}\}$ ($q: \mathbb{R}^m \to T$, $\omega \in \mathbb{R}^T$) represents the set of CART decision trees, K represents the number of decision trees, and T represents the number of leaf nodes of a decision tree; each classification decision tree $f_k$ corresponds to an independent tree structure q and leaf weights ω;
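b4, iterating and computing the error: the objective function is $Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$, wherein $y_i$ is the true value and $\hat{y}_i$ is the predicted value; the result of the model after the previous (t-1) iterations is computed and the model is trained on the residual, each new model adding a new function to the original model, so that at the t-th iteration $Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + C$, wherein C is a constant term, $\Omega(f_t)$ is the regularization term, and $f_t(x_i) = \omega_{q(x_i)}$ is the function jointly represented by the tree structure q and the leaf node weights ω; applying a second-order Taylor expansion, expanding the regularization term and removing the constant term gives $Obj^{(t)} \approx \sum_{j=1}^{T} \Big[ \big(\sum_{i \in I_j} g_i\big)\,\omega_j + \frac{1}{2}\big(\sum_{i \in I_j} h_i + \lambda\big)\,\omega_j^2 \Big] + \gamma T$, wherein γ and λ are tuning parameters, λ is the weight of the L2 regularization term, γ is the parameter controlling the node splitting threshold, $g_i = \partial_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})$ are the first and second derivatives of the loss, and $I_j = \{i \mid q(x_i) = j\}$ represents the set of indices of the samples assigned to the j-th leaf; letting $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ and taking the partial derivative of the objective function with respect to $\omega_j$ yields the optimal value of the j-th leaf, $\omega'_j = -\frac{G_j}{H_j + \lambda}$, and the minimum value of the objective function, $obj' = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$;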
b5, model completion: obtaining the optimal value $\omega'_j$ by the gradient descent method; when the objective function reaches obj', the loss function is at its minimum, iteration stops, and the final model is output.
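As a worked illustration of the closed-form leaf solution used in this claim, the sketch below computes the per-leaf sums G_j and H_j for one boosting round and then the optimal leaf weights -G_j / (H_j + λ) and the minimum objective -½ Σ G_j² / (H_j + λ) + γT. It assumes binary log loss with labels in {0, 1}, raw scores before the sigmoid, and a tree structure that is already fixed and given as a leaf index per sample; all names and values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess(y_true, raw_score):
    """First and second derivatives of log loss with respect to the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """Optimal leaf score: -G / (H + lambda)."""
    return -G / (H + lam)

def structure_score(leaf_stats, gamma, lam):
    """Minimum objective: -0.5 * sum(G_j^2 / (H_j + lambda)) + gamma * T."""
    quality = sum(G * G / (H + lam) for G, H in leaf_stats)
    return -0.5 * quality + gamma * len(leaf_stats)

def fit_leaves(labels, prev_scores, leaf_of, n_leaves, gamma, lam):
    """Accumulate g_i and h_i per leaf for a fixed structure q, then solve each leaf."""
    stats = [[0.0, 0.0] for _ in range(n_leaves)]
    for y, score, j in zip(labels, prev_scores, leaf_of):
        g, h = grad_hess(y, score)
        stats[j][0] += g
        stats[j][1] += h
    weights = [leaf_weight(G, H, lam) for G, H in stats]
    return weights, structure_score(stats, gamma, lam)

# Hypothetical round: four samples, previous raw scores all zero, two leaves.
labels = [1, 1, 0, 0]
prev_scores = [0.0, 0.0, 0.0, 0.0]
leaf_of = [0, 0, 1, 1]  # q(x_i): index of the leaf each sample falls into
weights, obj = fit_leaves(labels, prev_scores, leaf_of, n_leaves=2, gamma=0.1, lam=1.0)
```

With these inputs each leaf has H_j = 0.5 and G_j = ∓1, so the weights are ±2/3; a larger λ shrinks them toward zero.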
5. The XGBoost algorithm-based abnormal user identification method according to claim 1, wherein in step S3 the model evaluation comprises the following:
c1, confusion matrix: each column of the confusion matrix represents a predicted category, and the column total is the number of samples predicted to be in that category; each row represents the true category of the samples, and the row total is the number of samples actually in that category; the precision is Precision = TP/(TP + FP) and the recall is Recall = TP/(TP + FN), wherein TP is the number of positive samples predicted as positive (truth 0, prediction 0), FN is the number of positive samples predicted as negative (truth 0, prediction 1), FP is the number of negative samples predicted as positive (truth 1, prediction 0), and TN is the number of negative samples predicted as negative (truth 1, prediction 1);
c2, ROC curve and AUC value: the ROC curve plots the false positive rate from the confusion matrix on the abscissa and the true positive rate on the ordinate, showing how the two rates increase together as the threshold varies; given a threshold, samples scoring above the threshold are classified as positive and samples scoring below it as negative; the AUC value is the area under the ROC curve, and AUC = 1 corresponds to an ideal model.
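The evaluation measures of this claim can be illustrated with a short Python sketch. It follows the claim's encoding in which label 0 is the positive class; the function names, the example data, the use of `>=` at the threshold and the trapezoidal rule for the area are assumptions of this sketch.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Return (TP, FP, FN, TN) with `positive` as the positive label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_points(y_true, scores, positive=0):
    """(FPR, TPR) pairs; a higher score means more likely positive."""
    pos = sum(1 for t in y_true if t == positive)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == positive)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t != positive)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical data: 0 = positive class, scores are for the positive class.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
y_pred = [0 if s > 0.5 else 1 for s in scores]

counts = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(counts[0], counts[1], counts[2])
area = auc(roc_points(y_true, scores))
```

Here one negative sample scores above the threshold, giving a precision of 0.75 at a recall of 1.0, and the area under the curve is 8/9 because eight of the nine positive/negative pairs are ranked correctly.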
6. An abnormal user identification device based on the XGBoost algorithm, characterized by comprising a storage device for storing one or more programs and a processor, wherein the XGBoost algorithm-based abnormal user identification method according to any one of claims 1 to 5 is implemented when the one or more programs are executed by the processor; the device preferably further comprises a communication interface for communication and data exchange with external devices.
7. A computer-readable storage medium for XGBoost algorithm-based abnormal user identification, characterized in that at least one program is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the XGBoost algorithm-based abnormal user identification method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110297781.5A CN112950231A (en) | 2021-03-19 | 2021-03-19 | XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110297781.5A CN112950231A (en) | 2021-03-19 | 2021-03-19 | XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112950231A (en) | 2021-06-11 |
Family
ID=76227183
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110297781.5A Pending CN112950231A (en) | 2021-03-19 | 2021-03-19 | XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112950231A (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113298746A (en) * | 2021-07-05 | 2021-08-24 | 北京邮电大学 | Supervised false color image synthesis method based on machine learning algorithm |
CN113435505A (en) * | 2021-06-28 | 2021-09-24 | 中电积至(海南)信息技术有限公司 | Construction method and device for safe user portrait |
CN113469428A (en) * | 2021-06-24 | 2021-10-01 | 珠海卓邦科技有限公司 | Water use property abnormality identification method and device, computer device and storage medium |
CN113554474A (en) * | 2021-08-11 | 2021-10-26 | 上海明略人工智能(集团)有限公司 | Model verification method and device, electronic equipment and computer-readable storage medium |
CN113569949A (en) * | 2021-07-28 | 2021-10-29 | 广州博冠信息科技有限公司 | Abnormal user identification method and device, electronic equipment and storage medium |
CN113762805A (en) * | 2021-09-23 | 2021-12-07 | 国网湖南省电力有限公司 | Mountain forest fire early warning method applied to power transmission line |
CN113837303A (en) * | 2021-09-29 | 2021-12-24 | 中国联合网络通信集团有限公司 | Black product user identification method, TEE node and computer readable storage medium |
CN113947028A (en) * | 2021-10-25 | 2022-01-18 | 浙大城市学院 | RBCC health management method based on XGboost and Datawig machine learning |
CN113965416A (en) * | 2021-12-21 | 2022-01-21 | 江苏移动信息系统集成有限公司 | Website security protection capability scheduling method and system based on workflow |
CN114239823A (en) * | 2021-12-17 | 2022-03-25 | 中国电信股份有限公司 | Modeling and using method of behavior prediction model of number card user and related equipment |
CN114253242A (en) * | 2021-12-21 | 2022-03-29 | 上海纽酷信息科技有限公司 | VPN-based Internet of things cloud equipment data acquisition system |
CN114282940A (en) * | 2021-12-17 | 2022-04-05 | 中国电信股份有限公司 | Method and apparatus for intention recognition, storage medium, and electronic device |
CN114358169A (en) * | 2021-12-30 | 2022-04-15 | 上海应用技术大学 | Colorectal cancer detection system based on XGboost |
CN114528946A (en) * | 2021-12-16 | 2022-05-24 | 浙江省新型互联网交换中心有限责任公司 | Autonomous domain system sibling relation recognition method |
CN114549026A (en) * | 2022-04-26 | 2022-05-27 | 浙江鹏信信息科技股份有限公司 | Method and system for identifying unknown fraud based on algorithm component library analysis |
CN115106615A (en) * | 2022-08-30 | 2022-09-27 | 苏芯物联技术(南京)有限公司 | Welding deviation real-time detection method and system based on intelligent working condition identification |
CN115174170A (en) * | 2022-06-23 | 2022-10-11 | 东北电力大学 | VPN encrypted flow identification method based on ensemble learning |
CN116611022A (en) * | 2023-04-21 | 2023-08-18 | 深圳乐行智慧产业有限公司 | Intelligent campus education big data fusion method and platform |
WO2023179014A1 (en) * | 2022-03-23 | 2023-09-28 | 中兴通讯股份有限公司 | Traffic identification method and apparatus, electronic device, and storage medium |
CN117150282A (en) * | 2023-09-16 | 2023-12-01 | 石家庄正和网络有限公司 | Secondhand equipment recycling evaluation method and system based on prediction model |
CN117235270A (en) * | 2023-11-16 | 2023-12-15 | 中国人民解放军国防科技大学 | Text classification method and device based on belief confusion matrix and computer equipment |
CN117373688A (en) * | 2023-11-07 | 2024-01-09 | 爱奥乐医疗器械(深圳)有限公司 | Chronic disease data processing method, device, electronic equipment and storage medium |
CN117422334A (en) * | 2023-10-27 | 2024-01-19 | 国网北京市电力公司 | Multi-level panoramic carbon efficiency analysis method and system based on multi-energy data |
CN117724949A (en) * | 2023-12-25 | 2024-03-19 | 北京新数科技有限公司 | Database capacity prediction method, system, equipment and readable storage medium based on XGBoost model |
CN118152949A (en) * | 2024-05-09 | 2024-06-07 | 联通时科(北京)信息技术有限公司 | Abnormal user identification method and device and readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344998A (en) * | 2018-09-06 | 2019-02-15 | 盈盈(杭州)网络技术有限公司 | A kind of customer default probability forecasting method based on medical and beauty treatment scene |
CN110309771A (en) * | 2019-06-28 | 2019-10-08 | 南京丰厚电子有限公司 | A kind of EAS sound magnetic system tag recognition algorithm based on GBDT-INSGAII |
CN112202718A (en) * | 2020-09-03 | 2021-01-08 | 西安交通大学 | XGboost algorithm-based operating system identification method, storage medium and device |
CN112418653A (en) * | 2020-11-19 | 2021-02-26 | 重庆邮电大学 | Number portability and network diver identification system and method based on machine learning algorithm |
CN112464058A (en) * | 2020-11-30 | 2021-03-09 | 上海欣方智能系统有限公司 | XGboost algorithm-based telecommunication internet fraud identification method |
Non-Patent Citations (1)
Title |
---|
袁丽欣等: "基于XGBoost方法的社交网络异常用户检测技术", 《计算机应用研究》, vol. 37, no. 3, pages 814 - 817 * |
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113469428A (en) * | 2021-06-24 | 2021-10-01 | 珠海卓邦科技有限公司 | Water use property abnormality identification method and device, computer device and storage medium |
CN113435505A (en) * | 2021-06-28 | 2021-09-24 | 中电积至(海南)信息技术有限公司 | Construction method and device for safe user portrait |
CN113298746A (en) * | 2021-07-05 | 2021-08-24 | 北京邮电大学 | Supervised false color image synthesis method based on machine learning algorithm |
CN113569949A (en) * | 2021-07-28 | 2021-10-29 | 广州博冠信息科技有限公司 | Abnormal user identification method and device, electronic equipment and storage medium |
CN113554474A (en) * | 2021-08-11 | 2021-10-26 | 上海明略人工智能(集团)有限公司 | Model verification method and device, electronic equipment and computer-readable storage medium |
CN113554474B (en) * | 2021-08-11 | 2024-08-20 | 上海明略人工智能(集团)有限公司 | Model verification method and device, electronic equipment and computer readable storage medium |
CN113762805A (en) * | 2021-09-23 | 2021-12-07 | 国网湖南省电力有限公司 | Mountain forest fire early warning method applied to power transmission line |
CN113837303A (en) * | 2021-09-29 | 2021-12-24 | 中国联合网络通信集团有限公司 | Black product user identification method, TEE node and computer readable storage medium |
CN113947028A (en) * | 2021-10-25 | 2022-01-18 | 浙大城市学院 | RBCC health management method based on XGboost and Datawig machine learning |
CN114528946A (en) * | 2021-12-16 | 2022-05-24 | 浙江省新型互联网交换中心有限责任公司 | Autonomous domain system sibling relation recognition method |
CN114239823A (en) * | 2021-12-17 | 2022-03-25 | 中国电信股份有限公司 | Modeling and using method of behavior prediction model of number card user and related equipment |
CN114282940A (en) * | 2021-12-17 | 2022-04-05 | 中国电信股份有限公司 | Method and apparatus for intention recognition, storage medium, and electronic device |
CN113965416A (en) * | 2021-12-21 | 2022-01-21 | 江苏移动信息系统集成有限公司 | Website security protection capability scheduling method and system based on workflow |
CN114253242A (en) * | 2021-12-21 | 2022-03-29 | 上海纽酷信息科技有限公司 | VPN-based Internet of things cloud equipment data acquisition system |
CN114253242B (en) * | 2021-12-21 | 2023-12-26 | 上海纽酷信息科技有限公司 | VPN-based cloud equipment data acquisition system for Internet of things |
CN114358169A (en) * | 2021-12-30 | 2022-04-15 | 上海应用技术大学 | Colorectal cancer detection system based on XGboost |
CN114358169B (en) * | 2021-12-30 | 2023-09-26 | 上海应用技术大学 | Colorectal cancer detection system based on XGBoost |
WO2023179014A1 (en) * | 2022-03-23 | 2023-09-28 | 中兴通讯股份有限公司 | Traffic identification method and apparatus, electronic device, and storage medium |
CN114549026A (en) * | 2022-04-26 | 2022-05-27 | 浙江鹏信信息科技股份有限公司 | Method and system for identifying unknown fraud based on algorithm component library analysis |
CN115174170A (en) * | 2022-06-23 | 2022-10-11 | 东北电力大学 | VPN encrypted flow identification method based on ensemble learning |
CN115174170B (en) * | 2022-06-23 | 2023-05-09 | 东北电力大学 | VPN encryption flow identification method based on ensemble learning |
CN115106615A (en) * | 2022-08-30 | 2022-09-27 | 苏芯物联技术(南京)有限公司 | Welding deviation real-time detection method and system based on intelligent working condition identification |
CN116611022B (en) * | 2023-04-21 | 2024-04-26 | 深圳乐行智慧产业有限公司 | Intelligent campus education big data fusion method and platform |
CN116611022A (en) * | 2023-04-21 | 2023-08-18 | 深圳乐行智慧产业有限公司 | Intelligent campus education big data fusion method and platform |
CN117150282B (en) * | 2023-09-16 | 2024-01-30 | 石家庄正和网络有限公司 | Secondhand equipment recycling evaluation method and system based on prediction model |
CN117150282A (en) * | 2023-09-16 | 2023-12-01 | 石家庄正和网络有限公司 | Secondhand equipment recycling evaluation method and system based on prediction model |
CN117422334A (en) * | 2023-10-27 | 2024-01-19 | 国网北京市电力公司 | Multi-level panoramic carbon efficiency analysis method and system based on multi-energy data |
CN117373688A (en) * | 2023-11-07 | 2024-01-09 | 爱奥乐医疗器械(深圳)有限公司 | Chronic disease data processing method, device, electronic equipment and storage medium |
CN117373688B (en) * | 2023-11-07 | 2024-06-04 | 爱奥乐医疗器械(深圳)有限公司 | Chronic disease data processing method, device, electronic equipment and storage medium |
CN117235270A (en) * | 2023-11-16 | 2023-12-15 | 中国人民解放军国防科技大学 | Text classification method and device based on belief confusion matrix and computer equipment |
CN117235270B (en) * | 2023-11-16 | 2024-02-02 | 中国人民解放军国防科技大学 | Text classification method and device based on belief confusion matrix and computer equipment |
CN117724949A (en) * | 2023-12-25 | 2024-03-19 | 北京新数科技有限公司 | Database capacity prediction method, system, equipment and readable storage medium based on XGBoost model |
CN118152949A (en) * | 2024-05-09 | 2024-06-07 | 联通时科(北京)信息技术有限公司 | Abnormal user identification method and device and readable storage medium |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN112950231A (en) | XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium | |
CN112633962B (en) | Service recommendation method and device, computer equipment and storage medium | |
CN112085615B (en) | Training method and device for graphic neural network | |
CN113627566B (en) | Phishing early warning method and device and computer equipment | |
CN110147389B (en) | Account processing method and device, storage medium and electronic device | |
CN112085172A (en) | Method and device for training graph neural network | |
CN111311030B (en) | User credit risk prediction method and device based on influence factor detection | |
CN112785005B (en) | Multi-objective task assistant decision-making method and device, computer equipment and medium | |
CN110866767A (en) | Method, device, equipment and medium for predicting satisfaction degree of telecommunication user | |
CN115130536A (en) | Training method of feature extraction model, data processing method, device and equipment | |
CN107704868A (en) | Tenant group clustering method based on Mobile solution usage behavior | |
CN111695084A (en) | Model generation method, credit score generation method, device, equipment and storage medium | |
CN117876018A (en) | Method, device, electronic equipment and storage medium for identifying and predicting potential customers | |
CN110855474B (en) | Network feature extraction method, device, equipment and storage medium of KQI data | |
CN111507461A (en) | Interpretability information determining method and device | |
CN112214675B (en) | Method, device, equipment and computer storage medium for determining user purchasing machine | |
CN112463964B (en) | Text classification and model training method, device, equipment and storage medium | |
CN111144430A (en) | Genetic algorithm-based card number identification method and device | |
CN112700277B (en) | Processing method of user behavior data and multi-behavior sequence conversion model training method | |
CN115239068A (en) | Target task decision method and device, electronic equipment and storage medium | |
CN114841588A (en) | Information processing method, device, electronic equipment and computer readable medium | |
CN113935407A (en) | Abnormal behavior recognition model determining method and device | |
CN113255231A (en) | Data processing method, device, equipment and storage medium | |
CN113806517A (en) | Outbound method, device, equipment and medium based on machine learning model | |
CN109308565B (en) | Crowd performance grade identification method and device, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||