CN112950231A - XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium - Google Patents

XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium

Info

Publication number
CN112950231A
Authority
CN
China
Prior art keywords
model
data
user identification
algorithm
abnormal user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110297781.5A
Other languages
Chinese (zh)
Inventor
苏如春
孙少峰
练镜锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Hantele Communication Co ltd
Original Assignee
Guangzhou Hantele Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Hantele Communication Co ltd filed Critical Guangzhou Hantele Communication Co ltd
Priority to CN202110297781.5A priority Critical patent/CN112950231A/en
Publication of CN112950231A publication Critical patent/CN112950231A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/018 Certifying business or products
    • G06Q30/0185 Product, service or business identity fraud
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of artificial intelligence and discloses an abnormal user identification method, device and computer-readable storage medium based on the XGBoost algorithm. The method comprises the following steps. Data preprocessing and feature selection: user data to be identified is acquired in batches and preprocessed through data cleaning and feature engineering. Model establishment: the processed feature vectors and class labels form the sample set input to the model; a classification model is constructed and a predicted value is calculated; an objective function of the algorithm is constructed from the predicted value output by the model, and iteration minimizes the loss function to obtain the final classification result. Model parameter tuning and model verification: the model parameters are optimized. The invention uses multi-dimensional user data, so the data coverage is more comprehensive; the CART algorithm is selected as the weak learner, which improves operating efficiency; and the ensemble strengthens model performance and generalization ability, improves accuracy and recognition rate, and saves computing resources.

Description

XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to an abnormal user identification method, system and device based on the XGBoost algorithm.
Background
With the arrival of the big data era, competition among telecom operators has intensified. To attract more users, operators run marketing campaigns that draw new registrations through various channels. Some of these registrations are abnormal: risk users who register a mobile phone card without using it, engage in card farming, or commit fraud. To identify such users, operators build classification models on the large volume of data that users generate every day, such as basic user information, internet behavior data, communication consumption data and location data.
At present, abnormal users in the communication industry are mainly detected with single-attribute data and a single algorithm. On the data side, the main sources are network data and mobile location data; on the algorithm side, abnormal users are mainly described through statistical analysis and detected with basic algorithm models.
The existing technology for detecting abnormal users in the communication industry has the following problems:
(1) The data is one-dimensional and coverage is incomplete.
Existing abnormal user detection in the industry relies on user location data alone or user internet data alone for network element detection. Against the background of big data, user behavior and interest preferences are increasingly diverse, so analysis and detection on single-dimensional data gives a low recognition rate and low accuracy.
(2) The detection methods are relatively basic and generalize poorly.
The prior art relies on descriptive statistics, and abnormal users are difficult to identify from the mean, variance and similar statistics of each user indicator. Basic machine learning algorithms place strong restrictions on data formats: some models cannot use discrete data and others cannot use continuous data. The models adopted are prone to overfitting and underfitting, are unstable, generalize poorly, and produce poor output.
(3) The calculation is complex and consumes substantial resources.
In the prior art, the data and model processing steps are complex, so the amount of calculation is large, model recursion and iteration execute inefficiently, and a large amount of computing resources is occupied.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides an abnormal user identification method, device and computer-readable storage medium based on the XGBoost algorithm. It aims to make full use of mobile big data so as to suit broader and more comprehensive service scenarios, solving the problem of one-dimensional data and incomplete coverage. By using an ensemble algorithm that integrates weak learners, it improves the generalization ability of the model and the quality of the model output; feature granularity is optimized in parallel, which improves algorithm efficiency and reduces the amount of calculation.
The technical scheme adopted by the invention is as follows: an abnormal user identification method based on the XGBoost algorithm, implemented by the following steps:
S1, data preprocessing and feature selection: user data of the users to be identified is obtained in batches for a specified period, preprocessed through data cleaning and feature engineering, and output as feature vectors and class labels;
S2, model establishment: the processed feature vectors and class labels form the sample set D that is input to the model; an ensemble classification model is constructed and used to calculate the predicted value
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
An objective function of the algorithm is then constructed from the predicted value output by the model
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where the first part of the objective function
$$\sum_{i=1}^{n} l(y_i, \hat{y}_i)$$
is the loss function and the second part
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2$$
is the regularization term of the objective function, which expresses the complexity of the tree: the smaller its value, the lower the complexity and the stronger the generalization ability of the model. In the regularization term, T is the number of leaf nodes of the tree and γ is the coefficient that controls the number of leaf nodes; the second part is the squared L2 norm of the leaf node scores ω (the L2 norm being the square root of the sum of squares), which prevents overfitting and makes the optimization stable and fast, and λ is the regularization coefficient that keeps the leaf node scores from becoming too large. Iteration minimizes the loss function to obtain the final classification result;
S3, model parameter optimization and model verification: the model parameters are optimized, the trained model is evaluated and verified multiple times, the parameters with the best detection performance in verification are selected, and the model is output.
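The two parts of the objective in step S2 can be illustrated with a small numeric sketch. The patent does not name the loss function, so binary log loss is an assumption here, and the helper names (`log_loss`, `tree_complexity`, `objective`) are hypothetical.

```python
import math

def log_loss(y, p):
    """Binary cross-entropy for one sample (assumed loss; not specified in the source)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum of squared leaf scores."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, y_prob, trees, gamma=1.0, lam=1.0):
    """Loss over all samples plus the complexity penalty of every tree."""
    loss = sum(log_loss(y, p) for y, p in zip(y_true, y_prob))
    penalty = sum(tree_complexity(w, gamma, lam) for w in trees)
    return loss + penalty

# One tree with two leaves scoring +0.5 and -0.5.
omega = tree_complexity([0.5, -0.5], gamma=1.0, lam=2.0)
total = objective([1, 0], [0.5, 0.5], [[0.5, -0.5]], gamma=1.0, lam=2.0)
```

Raising `gamma` penalizes every extra leaf, while raising `lam` penalizes large leaf scores, which matches the roles the text assigns to γ and λ.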
Preferably, the user data includes: user personal information, whether the user is registered under a real name, real-name age, home location, network access time, activation time, power-on time, ARPU value, number of outgoing calls, number of incoming calls, total number of calls, number of website visits, number of APP visits, traffic used, usage period, access IP, resident cell base station and resident cell fields.
Preferably, in step S1, the data preprocessing is implemented as follows:
A1, data cleaning: duplicate values are removed from the extracted user data, missing values are handled according to the data type, and the data field formats are unified;
A2, feature engineering: the cleaned data is standardized, categorical variables are encoded as dummy variables, quantitative fields are binarized, and text data is converted into numerical data, so as to construct the feature vectors of the model.
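Steps A1 and A2 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: the field names (`arpu`, `real_name`, `calls`), the mean-imputation rule and the binarization threshold are hypothetical choices, not taken from the patent.

```python
from statistics import mean, pstdev

def drop_duplicates(rows):
    """A1: keep the first occurrence of each identical record."""
    seen, out = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def zscore(values):
    """A2: standardize to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def dummies(values):
    """A2: one 0/1 column per category, in sorted category order."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

def binarize(values, threshold):
    """A2: 1 if the quantity exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

rows = [
    {"arpu": 10.0, "real_name": "yes", "calls": 0},
    {"arpu": 10.0, "real_name": "yes", "calls": 0},   # exact duplicate
    {"arpu": None, "real_name": "no", "calls": 5},    # missing ARPU
    {"arpu": 30.0, "real_name": "yes", "calls": 12},
]

clean = drop_duplicates(rows)
# Assumed imputation rule: fill a missing numeric field with the observed mean.
fill = mean(r["arpu"] for r in clean if r["arpu"] is not None)
arpu = [fill if r["arpu"] is None else r["arpu"] for r in clean]

z = zscore(arpu)
d = dummies([r["real_name"] for r in clean])
b = binarize([r["calls"] for r in clean], threshold=0)

# One feature vector per user: [standardized ARPU, dummy columns..., has_calls]
features = [[zi, *di, bi] for zi, di, bi in zip(z, d, b)]
```

In practice a data-frame library would do the same work in fewer lines; the plain-Python form is used here only to keep each transformation visible.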
Preferably, the step S2 is implemented by the following steps:
B1, model construction: the processed features are assembled into a sample set $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in R^m$, $y_i \in R$), which is divided into K subsets for K-fold cross-validation, and an ensemble classification model is constructed, where $x_i$ denotes a feature vector, $y_i$ denotes a class label, $R$ denotes the set of real numbers, and $R^m$ denotes the m-dimensional real space that contains the feature vectors;
B2, model initialization: the weights ω are initialized with a constant $p_0$ and the function
$$f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$
where $y_i$ is a sample label, γ is an adjustment parameter, and N is the total number of samples;
B3, iterative calculation of the predicted value: for the feature vector $x_i$ and class label $y_i$, the predicted value is
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where $F = \{f(x) = \omega_{q(x)}\}$ ($q: R^m \to T$, $\omega \in R^T$) denotes the set of CART decision trees, K is the number of decision trees, T is the number of leaf nodes of a decision tree, and each classification decision tree $f_k$ corresponds to an independent tree structure q and leaf weights ω;
B4, iterative calculation of the error: the objective function is
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $y_i$ is the true value and
$$\hat{y}_i$$
is the predicted value. Starting from the model result of the previous (t-1) iterations, the model is trained on the residual; each new model adds one new function to the existing model, so that at the t-th iteration
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C$$
where C is a constant term, $\Omega(f_t)$ is the regularization term, and $f_t(x_i) = \omega_{q(x_i)}$ is the function determined jointly by the tree structure q and the leaf weights ω. Applying a second-order Taylor expansion, expanding the regularization term and removing the constant term gives
$$Obj^{(t)} \approx \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) \omega_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) \omega_j^2 \right] + \gamma T$$
where γ and λ are tuning parameters: γ penalizes the number of leaf nodes and so controls the node-splitting threshold, and λ is the weight of the L2 regularization term (the larger its value, the more strongly the model is regularized);
$$g_i = \frac{\partial\, l(y_i, \hat{y}_i^{(t-1)})}{\partial\, \hat{y}_i^{(t-1)}}, \quad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial\, (\hat{y}_i^{(t-1)})^2}$$
and $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of the samples assigned to the j-th leaf. Let
$$G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$
Setting the partial derivative of the objective function with respect to $\omega_j$ to zero gives the optimal value $\omega'_j$ of the j-th leaf and the minimum value obj' of the objective function
$$\omega'_j = -\frac{G_j}{H_j + \lambda}, \quad obj' = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
B5, completing the model: the optimal value $\omega'_j$ is obtained by the gradient descent method; when the objective function reaches obj', the loss function is at its minimum, iteration stops, and the final model is output.
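The leaf-weight step in B4 can be checked numerically with a short sketch. It assumes binary log loss on a raw score (the patent does not fix the loss), and the function names are hypothetical; the closed forms used are the standard second-order ones, ω'_j = -G_j / (H_j + λ) and obj' = -½ Σ G_j² / (H_j + λ) + γT.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess(y, margin):
    """First and second derivatives of log loss with respect to the raw score."""
    p = sigmoid(margin)
    return p - y, p * (1.0 - p)

def leaf_stats(labels, margins, leaf_index, n_leaves):
    """Sum gradients and hessians over the samples that fall in each leaf."""
    G = [0.0] * n_leaves
    H = [0.0] * n_leaves
    for y, m, j in zip(labels, margins, leaf_index):
        g, h = grad_hess(y, m)
        G[j] += g
        H[j] += h
    return G, H

def optimal_weights(G, H, lam):
    """Leaf score that minimizes the second-order approximation of the objective."""
    return [-g / (h + lam) for g, h in zip(G, H)]

def structure_score(G, H, lam, gamma):
    """Minimum objective value for a fixed tree structure."""
    return -0.5 * sum(g * g / (h + lam) for g, h in zip(G, H)) + gamma * len(G)

# Four samples, current raw score 0 for all, a stump that separates the classes.
labels = [1, 1, 0, 0]
margins = [0.0, 0.0, 0.0, 0.0]
leaf_index = [0, 0, 1, 1]

G, H = leaf_stats(labels, margins, leaf_index, n_leaves=2)
w = optimal_weights(G, H, lam=1.0)
score = structure_score(G, H, lam=1.0, gamma=0.1)
```

With these inputs the leaf holding the label-1 samples receives a positive score and the other leaf a negative one, so adding the tree moves each prediction toward its label.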
Preferably, in step S3, the method for model evaluation includes the following steps:
C1, confusion matrix: each column of the confusion matrix represents a predicted class, and the column total is the number of samples predicted as that class; each row represents the true class of the samples, and the row total is the number of samples that actually belong to that class, as shown in Table 1 below.

                     Predicted 0 (positive)   Predicted 1 (negative)
True 0 (positive)    TP                       FN
True 1 (negative)    FP                       TN

TABLE 1 Confusion matrix
Precision = TP/(TP + FP) and Recall = TP/(TP + FN); for both, the higher the value, the better the model performs. Here TP is the number of positive samples predicted as positive (true label 0, predicted 0); FN is the number of positive samples predicted as negative (true label 0, predicted 1); FP is the number of negative samples predicted as positive (true label 1, predicted 0); and TN is the number of negative samples predicted as negative (true label 1, predicted 1);
C2, ROC curve and AUC value: the ROC curve plots the false positive rate on the abscissa against the true positive rate on the ordinate, both derived from the confusion matrix, and shows how the two rates increase together as the classification threshold varies. For a given threshold, samples scoring above the threshold are assigned to the positive class and samples scoring below it to the negative class. The steeper the ROC curve, the better. The AUC value is the area under the ROC curve; the larger the AUC value, the better the model performs, and AUC = 1 corresponds to an ideal model.
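The evaluation quantities in C1 and C2 can be computed as in the following sketch. It follows the text's convention that label 0 is the positive class; the function names and the sample data are hypothetical, and AUC is computed with the pairwise ranking definition (the fraction of positive/negative pairs in which the positive sample scores higher, ties counted as one half), which equals the area under the ROC curve.

```python
def confusion(y_true, y_pred, positive=0):
    """Return (TP, FN, FP, TN) with `positive` as the positive class label."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def auc(y_true, scores, positive=0):
    """Area under the ROC curve; `scores` are scores for the positive class."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]  # hypothetical positive-class scores

tp, fn, fp, tn = confusion(y_true, y_pred)
prec = precision(tp, fp)
rec = recall(tp, fn)
area = auc(y_true, scores)
```

The pairwise form is quadratic in the number of samples, so for a data set of the size described in the embodiment a sort-based implementation would be preferable.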
An abnormal user identification device based on the XGBoost algorithm comprises a storage device and a processor. The storage device stores one or more programs; when the one or more programs are executed by the processor, the processor implements the abnormal user identification method based on the XGBoost algorithm. Preferably, the device also comprises a communication interface for communication and data exchange with external devices.
A computer-readable storage medium for abnormal user identification based on the XGBoost algorithm stores at least one program; when the program is executed by a processor, it implements the abnormal user identification method based on the XGBoost algorithm.
The beneficial effects of the invention are as follows. Integrating weak learners strengthens model performance: feature indicators for each dimension of a user are input, an XGBoost model is established, the loss function is approximated by a second-order Taylor expansion, and the weight matrix is adjusted, which improves model performance and recognition rate. In addition, a loss function suited to each service scenario can be defined by customizing the loss function and the input feature variables, so that a suitable weight matrix can be fitted for different service scenarios. This makes the method convenient for user identification in more service scenarios and improves the ability of governments and enterprises to identify various kinds of abnormal users.
The specific beneficial effects are as follows:
(1) by using multi-dimensional user data, more user features are extracted and more user patterns are mined, so the data is covered more comprehensively;
(2) the weak learner selected by the algorithm is the CART algorithm, which accepts both discrete and continuous features, so the model is no longer subject to the conventional restrictions on data format; applying and improving the CART algorithm also improves operating efficiency;
(3) the scheme integrates multiple weak learners, which strengthens model performance and generalization ability, improves accuracy and recognition rate, and suits more service scenarios, such as identifying telecom fraud users and churn-risk users;
(4) column sampling and parallel feature optimization are adopted, which reduces the amount of calculation and saves computing resources.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a flow chart of a technical implementation of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
The abnormal users to be identified by the method are mainly risk users who register a mobile phone card without using it, engage in card farming, commit fraud and the like. To identify them, an algorithm model is built on the large volume of data that mobile phone users generate every day, such as basic user information, internet behavior data, communication consumption data and location data. The data used are mainly: user demographic data (age, industry, real-name status, home location, home location of the real-name certificate); consumption behavior data (monthly consumption, ARPU value, number and charges of outgoing/incoming calls, total number of calls, etc.); internet behavior data (number of website visits, number of APP visits, traffic used, usage period, access IP, etc.); location data (resident cell base station, resident cell); and other data (network access time, user state, power-on time, activation time, etc.). An XGBoost model is established on these data and outputs the users classified as abnormal.
Referring to Fig. 1 and Fig. 2, the invention is an abnormal user identification method, device and computer-readable storage medium based on the XGBoost algorithm. The following is one embodiment of the invention, applied to abnormal user identification on a mobile big data platform in Guangdong and based on the XGBoost algorithm from ensemble learning. The specific implementation flow is as follows:
S1, data preprocessing and feature selection: data for 1,000,000 users who registered during a campaign period are randomly sampled from the big data platform of a communication service provider, covering about three months. The data mainly include fields such as whether the user is registered under a real name, real-name age, home location, network access time, activation time, power-on time, ARPU value, number of outgoing calls, number of incoming calls, total number of calls, number of website visits, number of APP visits, traffic used, usage period, access IP (Internet Protocol) address, resident cell base station and resident cell.
S11, data cleaning: duplicate values are removed from the user data extracted from the big data platform, missing values are handled according to the data type, and the data field formats are unified;
S12, feature engineering: the cleaned data is standardized with the z-score
$$x' = \frac{x - \mu}{\sigma}$$
where x' is the processed feature, x is the original feature, μ is the mean and σ is the standard deviation; the categorical variables (real-name status, user state, industry and home location) are then encoded as dummy variables; quantitative fields are binarized and text data is converted into numerical data, so as to construct the feature vectors of the model;
S2, model establishment: the model takes as input the feature vector $X = (x_1, x_2, x_3, \ldots, x_n)^T$ and the class label $Y = (y_1, y_2, y_3, \ldots, y_n)^T$, where T denotes vector transposition, and is implemented as follows:
S21, the samples with feature vector $X = (x_1, x_2, x_3, \ldots, x_n)^T$ are evaluated with K = 10, that is, 10-fold cross-validation: in each experiment 9 subsets are used for training and 1 subset for testing; 10 experiments are carried out, with the 10 subsets used for testing in turn, and the final result is the average over the 10 experiments;
S22, model initialization: the weights ω are initialized with a constant $p_0$ and the function
$$f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$
where $y_i$ is the sample label, γ is an adjustment parameter, and N is the total number of samples, 1,000,000;
S23, calculating the predicted value: the predicted value for the feature vector $x_i$ and class label $y_i$ is calculated iteratively as
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
where $F = \{f(x) = \omega_{q(x)}\}$ ($q: R^m \to T$, $\omega \in R^T$) denotes the set of CART decision trees, K is the number of decision trees, T is the number of leaf nodes of a decision tree, and each classification decision tree $f_k$ corresponds to an independent tree structure q and leaf weights ω; the predicted label value of each sample is output;
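S24, calculating the error: according to the transformed objective function
$$Obj^{(t)} \approx \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) \omega_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) \omega_j^2 \right] + \gamma T$$
where γ and λ are tuning parameters: γ penalizes the number of leaf nodes and so controls the node-splitting threshold, and λ is the weight of the L2 regularization term;
$$g_i = \frac{\partial\, l(y_i, \hat{y}_i^{(t-1)})}{\partial\, \hat{y}_i^{(t-1)}}, \quad h_i = \frac{\partial^2 l(y_i, \hat{y}_i^{(t-1)})}{\partial\, (\hat{y}_i^{(t-1)})^2}$$
and $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of the samples assigned to the j-th leaf. Let
$$G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$
The optimal value $\omega'_j$ of the j-th leaf and the minimum value obj' of the objective function are then calculated as
$$\omega'_j = -\frac{G_j}{H_j + \lambda}, \quad obj' = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
which gives the optimal weight matrix ω of the tree at this step.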
S25, iteration continues to minimize the objective function obj' and obtain the optimal weight matrix $\omega'_j$; in this embodiment the optimum is reached after 1500 iterations, at which point iteration stops and the final model is output
Figure BDA0002984973800000087
Finally, 137,900 abnormal users are identified among the 1,000,000 users.
S3, model parameter optimization and model verification:
S31, model evaluation: the trained model is evaluated with the confusion matrix, outputting Precision = TP/(TP + FP) and Recall = TP/(TP + FN); after 1500 iterations, the precision in this embodiment is 82.41% and the recall is 80.05%;
S32, ROC curve and AUC value: the ROC curve is plotted with the false positive rate from the confusion matrix on the abscissa and the true positive rate on the ordinate, showing how the two rates increase together; the AUC value is 0.804;
S33, model parameter optimization: the XGBoost model is tuned by adjusting the parameters eta, max_depth, nround, subsample, alpha, lambda, min_child_weight and so on. In this embodiment, the model performs best with the parameter eta fixed at 0.3, subsample at 0.6, alpha and lambda at their default value of 1, the maximum number of iterations nround at 35, and the tree depth max_depth at 42; the final evaluation results of the 10 experiments are shown in Table 2 below.
Figure BDA0002984973800000091
TABLE 2 evaluation results of the model
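The 10-fold protocol of step S21, whose averaged results Table 2 reports, can be sketched as follows. This is a minimal index-splitting illustration under stated assumptions: `evaluate` is a hypothetical callback that would train a model on the training indices and return one metric for the held-out fold; the placeholder used here only reports the held-out fraction so that the sketch runs without a model.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal subsets
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

def cross_validate(n, k, evaluate, seed=0):
    """Average the metric returned by `evaluate` over the k experiments."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)

# Placeholder metric: fraction of samples held out in each experiment.
def held_out_fraction(train, test):
    return len(test) / (len(train) + len(test))

splits = list(k_fold_indices(100, 10))
average = cross_validate(100, 10, held_out_fraction)
```

Fixing the seed keeps the folds identical across parameter settings, so differences between runs reflect the parameters rather than the split.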
The abnormal user identification device based on the XGBoost algorithm comprises a storage device and a processor, wherein the storage device stores one or more programs, and when the one or more programs are executed by the processor, the processor implements the abnormal user identification method based on the XGBoost algorithm.
Preferably, the device also includes a communication interface for communicating and exchanging data with external devices.
It should be noted that the memory may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
In a specific implementation, if the memory, the processor and the communication interface are integrated on a chip, the memory, the processor and the communication interface can complete mutual communication through the internal interface. If the memory, the processor and the communication interface are implemented independently, the memory, the processor and the communication interface may be connected to each other through a bus and perform communication with each other.
The invention also discloses a computer readable storage medium which stores at least one program, and when the program is executed by a processor, the abnormal user identification method based on the XGboost algorithm is realized.
It should be understood that the computer-readable storage medium described above is any data storage device that can store data or programs which can thereafter be read by a computer system. Examples of computer-readable storage media include: read-only memory, random access memory, CD-ROM, HDD, DVD, magnetic tape, optical data storage devices, and the like.
The computer readable storage medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
In some embodiments, the computer-readable storage medium may also be non-transitory.
The above embodiments are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited thereby, and any insubstantial changes and substitutions made by those skilled in the art based on the present invention are within the protection scope of the present invention.

Claims (7)

1. An abnormal user identification method based on the XGBoost algorithm, characterized in that the method comprises the following steps:
S1, data preprocessing and feature selection: user data of the users to be identified is obtained in batches for a specified period, preprocessed through data cleaning and feature engineering, and output as feature vectors and class labels;
S2, model establishment: the processed feature vectors and class labels form the sample set D that is input to the model; an ensemble classification model is constructed and used to calculate the predicted value
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$
An objective function of the algorithm is then constructed from the predicted value output by the model
$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where the first part of the objective function
$$\sum_{i=1}^{n} l(y_i, \hat{y}_i)$$
is the loss function and the second part
$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2$$
is the regularization term, which expresses the complexity of the tree. In the regularization term, T is the number of leaf nodes of the tree and γ is the coefficient that controls the number of leaf nodes; the second part is the squared L2 norm of the leaf node scores ω (the L2 norm being the square root of the sum of squares), which prevents overfitting and makes the optimization stable and fast, and λ is the regularization coefficient that keeps the leaf node scores from becoming too large. Iteration minimizes the loss function to obtain the final classification result;
S3, model parameter optimization and model verification: the model parameters are optimized, the trained model is evaluated and verified multiple times, the parameters with the best detection performance in verification are selected, and the model is output.
2. The XGBoost algorithm-based abnormal user identification method according to claim 1, wherein the user data comprises: user personal information, whether the user is registered under a real name, real-name age, home location, network access time, activation time, power-on time, ARPU value, number of outgoing calls, number of incoming calls, total number of calls, number of website visits, number of APP visits, traffic used, usage period, access IP, resident cell base station and resident cell fields.
3. The XGBoost algorithm-based abnormal user identification method according to claim 1, wherein in step S1 the data preprocessing is implemented as follows:
A1, data cleaning: duplicate values are removed from the extracted user data, missing values are handled according to the data type, and the data field formats are unified;
A2, feature engineering: the cleaned data is standardized, categorical variables are encoded as dummy variables, quantitative fields are binarized, and text data is converted into numerical data, so as to construct the feature vectors of the model.
4. The XGboost algorithm-based abnormal user identification method according to claim 1, wherein: the step S2 is implemented as follows:
B1, model construction: the processed features are assembled into a sample set $D = \{(x_i, y_i)\}$ ($|D| = n$, $x_i \in R^m$, $y_i \in R$), which is divided into K subsets for K-fold cross-validation, and an ensemble classification model is constructed, where $x_i$ denotes a feature vector, $y_i$ denotes a class label, $R$ denotes the set of real numbers, and $R^m$ denotes the m-dimensional real space that contains the feature vectors;
B2, model initialization: the weights ω are initialized with a constant $p_0$ and the function
$$f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$$
where $y_i$ is a sample label, γ is an adjustment parameter, and N is the total number of samples;
b3, iteratively calculating the predicted value: the predicted value for feature $x_i$ and class label $y_i$ is
Figure RE-FDA0003053166250000022
where $F = \{f(x) = \omega_{q(x)}\}$ ($q: \mathbb{R}^m \to T$, $\omega \in \mathbb{R}^T$) denotes the set of CART decision trees, K denotes the number of decision trees, T denotes the number of leaf nodes of a decision tree, and each classification decision tree $f_k$ corresponds to an independent tree structure q and leaf weights $\omega$;
b4, iterating and calculating the error: objective function
Figure RE-FDA0003053166250000023
where $y_i$ is the true value and $\hat{y}_i$ is the predicted value; the model result of the previous (t-1) iterations is computed and the model is trained on the residual, each new model adding a new function to the existing model, so that at the t-th iteration
Figure RE-FDA0003053166250000025
where C is a constant term, $\Omega(f_t)$ is the regularization term, and $f_t(x_i) = w_{q(x_i)}$ is the function determined jointly by the tree structure q and the leaf-node sample weights w; applying a second-order Taylor expansion, expanding the regularization term and removing the constant term gives
Figure RE-FDA0003053166250000026
where $\gamma$ and $\lambda$ are tuning parameters, $\gamma$ being the weight of the L2 regularization term and $\lambda$ the parameter that controls the node-splitting threshold,
Figure RE-FDA0003053166250000027
$I_j = \{i \mid q(x_i) = j\}$ denotes the set of indices of the samples assigned to the j-th leaf; let
Figure RE-FDA0003053166250000031
Taking the partial derivative of the objective function gives the optimal weight $\omega'_j$ of the j-th leaf and the minimum value obj' of the objective function
Figure RE-FDA0003053166250000032
b5, completing the model: when the optimal value $\omega'_j$ and the objective value obj' have been obtained by the gradient descent method, the loss function is at its minimum; iteration is stopped and the final model is output.
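The formula images for steps b2 to b4 are not reproduced in this text, so the following sketch relies on the commonly published XGBoost formulation rather than on the patent's own equations. In that formulation, with $G_j$ and $H_j$ the sums of first and second loss derivatives over the samples in leaf j, the optimal leaf weight is $-G_j / (H_j + \lambda)$ and the minimized objective is $-\tfrac{1}{2} \sum_j G_j^2 / (H_j + \lambda) + \gamma T$. The logistic loss, the function names, and the parameter names `lam` (L2 weight) and `gamma` (per-leaf penalty) are assumptions; they follow the published convention, which may not match the symbol assignment used in the claim.

```python
import math


def grad_hess_logistic(y_true, raw_score):
    """First and second derivatives of the log loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


def leaf_weights_and_objective(labels, prev_scores, leaf_of, n_leaves, lam, gamma):
    """Closed-form leaf weights and objective for a fixed tree structure.

    labels      : 0/1 labels y_i
    prev_scores : raw scores from the previous (t-1) iterations
    leaf_of     : leaf index q(x_i) assigned to each sample
    lam, gamma  : assumed names for the L2 weight and the per-leaf penalty
    """
    G = [0.0] * n_leaves  # per-leaf sum of gradients
    H = [0.0] * n_leaves  # per-leaf sum of second derivatives
    for y, s, j in zip(labels, prev_scores, leaf_of):
        g, h = grad_hess_logistic(y, s)
        G[j] += g
        H[j] += h
    weights = [-G[j] / (H[j] + lam) for j in range(n_leaves)]
    objective = (
        -0.5 * sum(G[j] ** 2 / (H[j] + lam) for j in range(n_leaves))
        + gamma * n_leaves
    )
    return weights, objective


# Four samples, previous score 0 for all, two leaves that separate the labels.
weights, objective = leaf_weights_and_objective(
    labels=[1, 0, 1, 0],
    prev_scores=[0.0, 0.0, 0.0, 0.0],
    leaf_of=[0, 1, 0, 1],
    n_leaves=2,
    lam=1.0,
    gamma=0.1,
)
```

With these inputs the leaf holding the two positive labels receives a positive weight and the other leaf a negative weight of equal magnitude, which pushes the next round's predictions toward the labels.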
5. The XGBoost algorithm-based abnormal user identification method according to claim 1, wherein: in step S3, the model evaluation comprises the following:
c1, confusion matrix: each column of the confusion matrix represents a predicted class, and the column total is the number of samples predicted as that class; each row represents the true class of the samples, and the row total is the number of samples that actually belong to that class; the precision is Precision = TP / (TP + FP) and the recall is Recall = TP / (TP + FN), where TP is the number of positive samples predicted as positive (true label 0, predicted 0), FN is the number of positive samples predicted as negative (true label 0, predicted 1), FP is the number of negative samples predicted as positive (true label 1, predicted 0), and TN is the number of negative samples predicted as negative (true label 1, predicted 1);
c2, ROC curve and AUC value: the ROC curve plots the false positive rate on the abscissa against the true positive rate on the ordinate, both taken from the confusion matrix, and shows how the two rates increase together as the decision threshold varies; for a given threshold, samples scoring above the threshold are assigned to the positive class and samples scoring below it to the negative class; the AUC value is the area under the ROC curve, and AUC = 1 corresponds to an ideal model.
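The metrics in c1 and c2 can be computed directly, as in the following plain-Python sketch. It keeps the claim's convention that label 0 is the positive class. The AUC is computed with the pairwise-ranking identity (the fraction of positive/negative pairs in which the positive sample scores higher, ties counting one half), which equals the area under the ROC curve; that choice of method, the function names, and the sample scores are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Return (TP, FN, FP, TN) with `positive` as the positive class label."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def precision_recall(tp, fn, fp):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def auc(y_true, scores, positive=0):
    """Area under the ROC curve via pairwise comparison of scores.

    scores[i] is the model's score that sample i is in the positive class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 1, 1, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]  # score for class 0
threshold = 0.5
y_pred = [0 if s > threshold else 1 for s in scores]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fn, fp)
auc_value = auc(y_true, scores)
```

Moving `threshold` trades precision against recall, while `auc_value` summarizes ranking quality across all thresholds.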
6. An XGBoost algorithm-based abnormal user identification device, characterized in that: the device comprises a storage means for storing one or more programs, and a processor; when the one or more programs are executed by the processor, the processor implements the XGBoost algorithm-based abnormal user identification method according to any one of claims 1 to 5; the device preferably further comprises a communication interface for communication and data exchange with an external device.
7. An XGBoost algorithm-based abnormal user identification computer-readable storage medium, characterized in that: the computer-readable storage medium stores at least one program which, when executed by a processor, implements the XGBoost algorithm-based abnormal user identification method according to any one of claims 1 to 5.
CN202110297781.5A 2021-03-19 2021-03-19 XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium Pending CN112950231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110297781.5A CN112950231A (en) 2021-03-19 2021-03-19 XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110297781.5A CN112950231A (en) 2021-03-19 2021-03-19 XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN112950231A true CN112950231A (en) 2021-06-11

Family

ID=76227183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110297781.5A Pending CN112950231A (en) 2021-03-19 2021-03-19 XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN112950231A (en)

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298746A (en) * 2021-07-05 2021-08-24 北京邮电大学 Supervised false color image synthesis method based on machine learning algorithm
CN113435505A (en) * 2021-06-28 2021-09-24 中电积至(海南)信息技术有限公司 Construction method and device for safe user portrait
CN113469428A (en) * 2021-06-24 2021-10-01 珠海卓邦科技有限公司 Water use property abnormality identification method and device, computer device and storage medium
CN113554474A (en) * 2021-08-11 2021-10-26 上海明略人工智能(集团)有限公司 Model verification method and device, electronic equipment and computer-readable storage medium
CN113569949A (en) * 2021-07-28 2021-10-29 广州博冠信息科技有限公司 Abnormal user identification method and device, electronic equipment and storage medium
CN113762805A (en) * 2021-09-23 2021-12-07 国网湖南省电力有限公司 Mountain forest fire early warning method applied to power transmission line
CN113837303A (en) * 2021-09-29 2021-12-24 中国联合网络通信集团有限公司 Black product user identification method, TEE node and computer readable storage medium
CN113947028A (en) * 2021-10-25 2022-01-18 浙大城市学院 RBCC health management method based on XGboost and Datawig machine learning
CN113965416A (en) * 2021-12-21 2022-01-21 江苏移动信息系统集成有限公司 Website security protection capability scheduling method and system based on workflow
CN114239823A (en) * 2021-12-17 2022-03-25 中国电信股份有限公司 Modeling and using method of behavior prediction model of number card user and related equipment
CN114253242A (en) * 2021-12-21 2022-03-29 上海纽酷信息科技有限公司 VPN-based Internet of things cloud equipment data acquisition system
CN114282940A (en) * 2021-12-17 2022-04-05 中国电信股份有限公司 Method and apparatus for intention recognition, storage medium, and electronic device
CN114358169A (en) * 2021-12-30 2022-04-15 上海应用技术大学 Colorectal cancer detection system based on XGboost
CN114528946A (en) * 2021-12-16 2022-05-24 浙江省新型互联网交换中心有限责任公司 Autonomous domain system sibling relation recognition method
CN114549026A (en) * 2022-04-26 2022-05-27 浙江鹏信信息科技股份有限公司 Method and system for identifying unknown fraud based on algorithm component library analysis
CN115106615A (en) * 2022-08-30 2022-09-27 苏芯物联技术(南京)有限公司 Welding deviation real-time detection method and system based on intelligent working condition identification
CN115174170A (en) * 2022-06-23 2022-10-11 东北电力大学 VPN encrypted flow identification method based on ensemble learning
CN116611022A (en) * 2023-04-21 2023-08-18 深圳乐行智慧产业有限公司 Intelligent campus education big data fusion method and platform
WO2023179014A1 (en) * 2022-03-23 2023-09-28 中兴通讯股份有限公司 Traffic identification method and apparatus, electronic device, and storage medium
CN117150282A (en) * 2023-09-16 2023-12-01 石家庄正和网络有限公司 Secondhand equipment recycling evaluation method and system based on prediction model
CN117235270A (en) * 2023-11-16 2023-12-15 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment
CN117373688A (en) * 2023-11-07 2024-01-09 爱奥乐医疗器械(深圳)有限公司 Chronic disease data processing method, device, electronic equipment and storage medium
CN117422334A (en) * 2023-10-27 2024-01-19 国网北京市电力公司 Multi-level panoramic carbon efficiency analysis method and system based on multi-energy data
CN117724949A (en) * 2023-12-25 2024-03-19 北京新数科技有限公司 Database capacity prediction method, system, equipment and readable storage medium based on XGBoost model
CN118152949A (en) * 2024-05-09 2024-06-07 联通时科(北京)信息技术有限公司 Abnormal user identification method and device and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344998A (en) * 2018-09-06 2019-02-15 盈盈(杭州)网络技术有限公司 A kind of customer default probability forecasting method based on medical and beauty treatment scene
CN110309771A (en) * 2019-06-28 2019-10-08 南京丰厚电子有限公司 A kind of EAS sound magnetic system tag recognition algorithm based on GBDT-INSGAII
CN112202718A (en) * 2020-09-03 2021-01-08 西安交通大学 XGboost algorithm-based operating system identification method, storage medium and device
CN112418653A (en) * 2020-11-19 2021-02-26 重庆邮电大学 Number portability and network diver identification system and method based on machine learning algorithm
CN112464058A (en) * 2020-11-30 2021-03-09 上海欣方智能系统有限公司 XGboost algorithm-based telecommunication internet fraud identification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
袁丽欣等: "基于XGBoost方法的社交网络异常用户检测技术", 《计算机应用研究》, vol. 37, no. 3, pages 814 - 817 *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469428A (en) * 2021-06-24 2021-10-01 珠海卓邦科技有限公司 Water use property abnormality identification method and device, computer device and storage medium
CN113435505A (en) * 2021-06-28 2021-09-24 中电积至(海南)信息技术有限公司 Construction method and device for safe user portrait
CN113298746A (en) * 2021-07-05 2021-08-24 北京邮电大学 Supervised false color image synthesis method based on machine learning algorithm
CN113569949A (en) * 2021-07-28 2021-10-29 广州博冠信息科技有限公司 Abnormal user identification method and device, electronic equipment and storage medium
CN113554474A (en) * 2021-08-11 2021-10-26 上海明略人工智能(集团)有限公司 Model verification method and device, electronic equipment and computer-readable storage medium
CN113554474B (en) * 2021-08-11 2024-08-20 上海明略人工智能(集团)有限公司 Model verification method and device, electronic equipment and computer readable storage medium
CN113762805A (en) * 2021-09-23 2021-12-07 国网湖南省电力有限公司 Mountain forest fire early warning method applied to power transmission line
CN113837303A (en) * 2021-09-29 2021-12-24 中国联合网络通信集团有限公司 Black product user identification method, TEE node and computer readable storage medium
CN113947028A (en) * 2021-10-25 2022-01-18 浙大城市学院 RBCC health management method based on XGboost and Datawig machine learning
CN114528946A (en) * 2021-12-16 2022-05-24 浙江省新型互联网交换中心有限责任公司 Autonomous domain system sibling relation recognition method
CN114239823A (en) * 2021-12-17 2022-03-25 中国电信股份有限公司 Modeling and using method of behavior prediction model of number card user and related equipment
CN114282940A (en) * 2021-12-17 2022-04-05 中国电信股份有限公司 Method and apparatus for intention recognition, storage medium, and electronic device
CN113965416A (en) * 2021-12-21 2022-01-21 江苏移动信息系统集成有限公司 Website security protection capability scheduling method and system based on workflow
CN114253242A (en) * 2021-12-21 2022-03-29 上海纽酷信息科技有限公司 VPN-based Internet of things cloud equipment data acquisition system
CN114253242B (en) * 2021-12-21 2023-12-26 上海纽酷信息科技有限公司 VPN-based cloud equipment data acquisition system for Internet of things
CN114358169A (en) * 2021-12-30 2022-04-15 上海应用技术大学 Colorectal cancer detection system based on XGboost
CN114358169B (en) * 2021-12-30 2023-09-26 上海应用技术大学 Colorectal cancer detection system based on XGBoost
WO2023179014A1 (en) * 2022-03-23 2023-09-28 中兴通讯股份有限公司 Traffic identification method and apparatus, electronic device, and storage medium
CN114549026A (en) * 2022-04-26 2022-05-27 浙江鹏信信息科技股份有限公司 Method and system for identifying unknown fraud based on algorithm component library analysis
CN115174170A (en) * 2022-06-23 2022-10-11 东北电力大学 VPN encrypted flow identification method based on ensemble learning
CN115174170B (en) * 2022-06-23 2023-05-09 东北电力大学 VPN encryption flow identification method based on ensemble learning
CN115106615A (en) * 2022-08-30 2022-09-27 苏芯物联技术(南京)有限公司 Welding deviation real-time detection method and system based on intelligent working condition identification
CN116611022B (en) * 2023-04-21 2024-04-26 深圳乐行智慧产业有限公司 Intelligent campus education big data fusion method and platform
CN116611022A (en) * 2023-04-21 2023-08-18 深圳乐行智慧产业有限公司 Intelligent campus education big data fusion method and platform
CN117150282B (en) * 2023-09-16 2024-01-30 石家庄正和网络有限公司 Secondhand equipment recycling evaluation method and system based on prediction model
CN117150282A (en) * 2023-09-16 2023-12-01 石家庄正和网络有限公司 Secondhand equipment recycling evaluation method and system based on prediction model
CN117422334A (en) * 2023-10-27 2024-01-19 国网北京市电力公司 Multi-level panoramic carbon efficiency analysis method and system based on multi-energy data
CN117373688A (en) * 2023-11-07 2024-01-09 爱奥乐医疗器械(深圳)有限公司 Chronic disease data processing method, device, electronic equipment and storage medium
CN117373688B (en) * 2023-11-07 2024-06-04 爱奥乐医疗器械(深圳)有限公司 Chronic disease data processing method, device, electronic equipment and storage medium
CN117235270A (en) * 2023-11-16 2023-12-15 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment
CN117235270B (en) * 2023-11-16 2024-02-02 中国人民解放军国防科技大学 Text classification method and device based on belief confusion matrix and computer equipment
CN117724949A (en) * 2023-12-25 2024-03-19 北京新数科技有限公司 Database capacity prediction method, system, equipment and readable storage medium based on XGBoost model
CN118152949A (en) * 2024-05-09 2024-06-07 联通时科(北京)信息技术有限公司 Abnormal user identification method and device and readable storage medium

Similar Documents

Publication Publication Date Title
CN112950231A (en) XGboost algorithm-based abnormal user identification method, device and computer-readable storage medium
CN112633962B (en) Service recommendation method and device, computer equipment and storage medium
CN112085615B (en) Training method and device for graphic neural network
CN113627566B (en) Phishing early warning method and device and computer equipment
CN110147389B (en) Account processing method and device, storage medium and electronic device
CN112085172A (en) Method and device for training graph neural network
CN111311030B (en) User credit risk prediction method and device based on influence factor detection
CN112785005B (en) Multi-objective task assistant decision-making method and device, computer equipment and medium
CN110866767A (en) Method, device, equipment and medium for predicting satisfaction degree of telecommunication user
CN115130536A (en) Training method of feature extraction model, data processing method, device and equipment
CN107704868A (en) Tenant group clustering method based on Mobile solution usage behavior
CN111695084A (en) Model generation method, credit score generation method, device, equipment and storage medium
CN117876018A (en) Method, device, electronic equipment and storage medium for identifying and predicting potential customers
CN110855474B (en) Network feature extraction method, device, equipment and storage medium of KQI data
CN111507461A (en) Interpretability information determining method and device
CN112214675B (en) Method, device, equipment and computer storage medium for determining user purchasing machine
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN111144430A (en) Genetic algorithm-based card number identification method and device
CN112700277B (en) Processing method of user behavior data and multi-behavior sequence conversion model training method
CN115239068A (en) Target task decision method and device, electronic equipment and storage medium
CN114841588A (en) Information processing method, device, electronic equipment and computer readable medium
CN113935407A (en) Abnormal behavior recognition model determining method and device
CN113255231A (en) Data processing method, device, equipment and storage medium
CN113806517A (en) Outbound method, device, equipment and medium based on machine learning model
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination