CN115700787A - Abnormal object identification method and device, electronic equipment and storage medium - Google Patents

Abnormal object identification method and device, electronic equipment and storage medium

Info

Publication number
CN115700787A
Authority
CN
China
Prior art keywords
information
model
sample data
recognition model
target object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110795543.7A
Other languages
Chinese (zh)
Inventor
孔令凯
李晟
李关乐
高艳铭
白义
冯烨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Chengdu ICT Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Chengdu ICT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Chengdu ICT Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202110795543.7A priority Critical patent/CN115700787A/en
Publication of CN115700787A publication Critical patent/CN115700787A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a method and an apparatus for identifying an abnormal object, an electronic device, and a storage medium. The method comprises: obtaining first information and second information of at least one target object, where the first information is operation behavior information and the second information is activity participation information; obtaining sample data based on the first information and the second information of the at least one target object; training a recognition model to be trained by using the sample data to obtain a recognition model; and determining an abnormal object among the at least one target object based on the first information, the second information, and the recognition model. By using user behavior feature information, the application can fully mine the behavior characteristics of users and identify abnormal users among website users based on data such as their operation behavior information and activity participation information, thereby addressing problems such as channel arbitrage, CP self-consumption, and marketing activity auditing that websites face in various service scenarios.

Description

Abnormal object identification method and device, electronic equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to an abnormal object identification method and device, electronic equipment and a storage medium.
Background
With the continuous development of personal computer (PC) and mobile phone services, the user base keeps growing. From partner promotion and monitoring, to activity auditing, to operation analysis, there is an urgent need to sort, clean, and filter abnormal website data. In some circumstances, a website may contain a large number of abnormal users, such as channel arbitrage users, content providers (CPs) that consume their own content, and abnormal users participating in marketing activities. These users can negatively affect the data statistics, service income, and the like of website services, so an effective method for identifying abnormal users is required.
Disclosure of Invention
In order to solve the above technical problem, embodiments of the present application provide a method and an apparatus for identifying an abnormal object, an electronic device, and a storage medium.
The embodiment of the application provides a method for identifying an abnormal object, which comprises the following steps:
obtaining first information and second information of at least one target object; the first information is operation behavior information, and the second information is activity participation information;
obtaining sample data based on the first information and the second information of the at least one target object;
training the recognition model to be trained by using the sample data to obtain a recognition model;
determining an abnormal object among the at least one target object based on the first information, the second information of the at least one target object and the recognition model.
In an optional embodiment of the present application, the sample data includes positive sample data and negative sample data;
the obtaining sample data based on the first information and the second information of the at least one target object includes:
constructing an input information table based on the first information and the second information of the at least one target object;
matching a plurality of abnormal characteristic information based on the input information table as positive sample data in the sample data;
and matching a plurality of pieces of normal characteristic information based on the input information table to serve as negative sample data in the sample data.
In an optional embodiment of the present application, the determining an abnormal object in the at least one target object based on the first information and the second information of the at least one target object and the recognition model includes:
and inputting the input information table into the recognition model, and determining an abnormal object in the at least one target object by using the recognition model.
In an optional embodiment of the present application, the sample data includes a plurality of feature indicators, and training the recognition model to be trained by using the sample data includes:
determining a correlation between each characteristic index of the plurality of characteristic indexes, and determining at least one important index of the plurality of characteristic indexes based on the correlation;
and training the recognition model to be trained by utilizing the sample data corresponding to the at least one important index in the sample data.
In an optional embodiment of the present application, the recognition model to be trained includes a gradient boosting decision tree (GBDT) model, and training the recognition model to be trained by using the sample data to obtain a recognition model includes:
dividing the sample data according to the operation behavior characteristics and the activity participation characteristics to obtain an operation behavior characteristic set and an activity participation characteristic set;
respectively establishing a first GBDT model corresponding to the operation behavior characteristics and a second GBDT model corresponding to the activity participation characteristics;
traversing the first GBDT model by using the operation behavior feature set to obtain first features output by leaf nodes of the first GBDT model;
traversing the second GBDT model by using the activity participation characteristic set to obtain second characteristics output by leaf nodes of the second GBDT model.
In an optional embodiment of the present application, the recognition model to be trained further includes a logistic regression (LR) model, and training the recognition model to be trained by using the sample data to obtain the recognition model further includes:
training the LR model by using the first characteristic and the second characteristic to obtain a trained LR model; the output of the LR model includes identification information of a training subject included in the sample data and a type identification of the training subject.
In an optional embodiment of the present application, training the recognition model to be trained by using the sample data to obtain a recognition model includes:
training the recognition model to be trained by using K-fold cross-validation based on the sample data to obtain K target recognition models with different parameters;
and selecting the target recognition model with the maximum harmonic mean value from the K target recognition models with different parameters as a recognition model.
In an optional embodiment of the present application, after the training of the recognition model to be trained by using the sample data and obtaining the recognition model, the method further includes:
testing the identification model by using test sample data to determine whether the index of the identification model meets a preset condition; and/or determining whether the recognition result of the recognition model is correct or not based on the third information of each object in the at least one target object;
if the index of the recognition model does not meet the preset condition and/or the recognition result of the recognition model is incorrect, continuing to optimize the recognition model to obtain an optimized recognition model;
the determining an abnormal object in the at least one target object based on the first information, the second information and the recognition model of the at least one target object comprises:
determining an abnormal object of the at least one target object based on the first information, the second information of the at least one target object and the optimized recognition model.
The embodiment of the present application further provides an apparatus for identifying an abnormal object, where the apparatus includes:
a first obtaining unit configured to obtain first information and second information of at least one target object; the first information is operation behavior information, and the second information is activity participation information;
a second obtaining unit, configured to obtain sample data based on the first information and the second information of the at least one target object;
the training unit is used for training the identification model to be trained by utilizing the sample data to obtain the identification model;
a determining unit, configured to determine an abnormal object in the at least one target object based on the first information and the second information of the at least one target object and the recognition model.
In an optional embodiment of the present application, the sample data includes positive sample data and negative sample data; the second obtaining unit is specifically configured to: constructing an input information table based on the first information and the second information of the at least one target object; matching a plurality of abnormal characteristic information based on the input information table as positive sample data in the sample data; and matching a plurality of pieces of normal characteristic information based on the input information table to serve as negative sample data in the sample data.
In an optional embodiment of the present application, the determining unit is specifically configured to: and inputting the input information table into the recognition model, and determining an abnormal object in the at least one target object by using the recognition model.
In an optional embodiment of the present application, the sample data includes a plurality of feature indicators, and the training unit is specifically configured to: determining a correlation between each characteristic index of the plurality of characteristic indexes, and determining at least one important index of the plurality of characteristic indexes based on the correlation; and training the identification model to be trained by using the sample data corresponding to the at least one important index in the sample data.
In an optional embodiment of the present application, the recognition model to be trained includes a gradient boosting decision tree GBDT model, and the training unit is specifically configured to: dividing the sample data according to the operation behavior characteristics and the activity participation characteristics to obtain an operation behavior characteristic set and an activity participation characteristic set; respectively establishing a first GBDT model corresponding to the operation behavior characteristics and a second GBDT model corresponding to the activity participation characteristics; traversing the first GBDT model by using the operation behavior feature set to obtain first features output by leaf nodes of the first GBDT model; traversing the second GBDT model by using the activity participation characteristic set to obtain second characteristics output by leaf nodes of the second GBDT model.
In an optional embodiment of the present application, the recognition model to be trained further includes a logistic regression LR model, and the training unit is further specifically configured to: training the LR model by using the first characteristic and the second characteristic to obtain a trained LR model; the output of the LR model includes identification information of a training object included in the sample data and a type identification of the training object.
In an optional embodiment of the present application, the training unit is specifically configured to: train the recognition model to be trained by using K-fold cross-validation based on the sample data to obtain K target recognition models with different parameters; and select, from the K target recognition models with different parameters, the target recognition model with the largest harmonic mean value as the recognition model.
In an optional embodiment of the present application, the training unit trains the recognition model to be trained by using the sample data, and after obtaining the recognition model, the apparatus further includes:
testing the identification model by using test sample data to determine whether the index of the identification model meets a preset condition; and/or determining whether the recognition result of the recognition model is correct or not based on the third information of each object in the at least one target object;
if the index of the recognition model does not meet the preset condition and/or the recognition result of the recognition model is incorrect, continuing to optimize the recognition model to obtain an optimized recognition model;
the determining an abnormal object in the at least one target object based on the first information, the second information and the recognition model of the at least one target object comprises:
determining an abnormal object of the at least one target object based on the first information, the second information of the at least one target object and the optimized recognition model.
An embodiment of the present application further provides an electronic device. The electronic device includes a memory and a processor; the memory stores computer-executable instructions, and the processor, when running the computer-executable instructions in the memory, can implement the abnormal object identification method of the above embodiment.
The embodiment of the present application further provides a computer storage medium, where the storage medium stores executable instructions, and the executable instructions, when executed by a processor, implement the method for identifying an abnormal object according to the above embodiment.
According to the technical solution of the embodiments of the present application, first information and second information of at least one target object are obtained, where the first information is operation behavior information and the second information is activity participation information; sample data is obtained based on the first information and the second information of the at least one target object; the recognition model to be trained is trained by using the sample data to obtain a recognition model; and an abnormal object among the at least one target object is determined based on the first information, the second information, and the recognition model. With this technical solution, the behavior characteristics of users can be fully mined by using user behavior feature information, and abnormal users among website users can be identified based on data such as their operation behavior information and activity participation information, thereby addressing problems such as channel arbitrage, CP self-consumption, and marketing activity auditing that websites face in various service scenarios.
Drawings
Fig. 1 is a schematic flowchart of a method for identifying an abnormal object according to an embodiment of the present application;
Fig. 2 is a schematic diagram of two GBDT trees according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an identification process for an abnormal object according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an apparatus for identifying an abnormal object according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
So that the manner in which the above recited features and aspects of the present invention can be understood in detail, a more particular description of the embodiments of the invention, briefly summarized above, may be had by reference to the appended drawings, which are included to illustrate, but are not intended to limit the embodiments of the invention.
In general, abnormal users fall into three major categories:
Channel arbitrage users. In channel cooperation and promotion, some channels may create large numbers of zombie users and invalid users in order to obtain settlement payments. This increases the access load on portals and apps and consumes a large amount of settlement expenditure, so an effective method is needed to support the identification of channel arbitrage users.
CP self-consumption. In content promotion, some CPs may consume their own content in bulk in order to increase their revenue share or meet settlement assessment targets. Screening out CP self-consumption by means of an anomaly model helps ensure that content is promoted properly.
Marketing activity participants. The validity of prize-winning users in online marketing activities needs to be audited by means of an abnormal user identification method, so that marketing resources are effectively protected, the rights and interests of active users are guaranteed, and the perception and effect of the activity are improved.
The technical solution of the embodiments of the present application fuses the basic attributes and behavior information of customers, CPs, and users participating in activities. After being processed, derived, and aggregated, this information serves as the input of a recognition model, which is used to identify the three types of abnormal users listed above; the recognition model is then optimized, and a set of abnormal objects is output.
Fig. 1 is a schematic flowchart of an identification method for an abnormal object according to an embodiment of the present application; as shown in fig. 1, the method for identifying an abnormal object provided in the embodiment of the present application includes the following steps:
step 101: obtaining first information and second information of at least one target object; the first information is operation behavior information, and the second information is activity participation information.
In the embodiment of the application, the target object is an internet user. The behavior information of the user is extracted from a plurality of channels in advance and mainly comprises the user's operation behavior information and activity participation information. The operation behavior information may be, for example, the user's registration information, comment information, and like information; the activity participation information may be information about the user's participation in marketing activities, such as purchase information and forwarding information. The operation behavior information and the activity participation information in the present application may also be other information.
Specifically, when the behavior information of the user is extracted, the registration number in the user information table can be used as the unique identifier to extract the user's operation behavior information and activity participation information for T consecutive days, and the extracted data is used as the original data set of the user's behavior information.
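The extraction step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the record fields `msisdn`, `date`, `kind`, and `action`, and the helper `extract_window`, are hypothetical names that do not appear in the application.

```python
from collections import defaultdict
from datetime import date, timedelta


def extract_window(events, end_day, t_days):
    """Group events from the last `t_days` days (ending on `end_day`)
    by registration number. All field names are hypothetical."""
    start_day = end_day - timedelta(days=t_days - 1)
    raw = defaultdict(lambda: {"operation": [], "activity": []})
    for ev in events:
        if start_day <= ev["date"] <= end_day:
            raw[ev["msisdn"]][ev["kind"]].append(ev["action"])
    return dict(raw)


events = [
    {"msisdn": "u1", "date": date(2021, 7, 1), "kind": "operation", "action": "register"},
    {"msisdn": "u1", "date": date(2021, 7, 3), "kind": "activity", "action": "purchase"},
    {"msisdn": "u2", "date": date(2021, 6, 1), "kind": "operation", "action": "comment"},
]

# Keep only the T = 7 consecutive days ending on 2021-07-07.
raw_data = extract_window(events, end_day=date(2021, 7, 7), t_days=7)
```

Here the record for `u2` falls outside the seven-day window and is dropped, so the original data set contains only `u1`.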
Step 102: and obtaining sample data based on the first information and the second information of the at least one target object.
In an optional embodiment of the present application, the sample data includes positive sample data and negative sample data;
the step 102 may be specifically implemented by the following steps:
constructing an input information table based on the first information and the second information of the at least one target object;
matching a plurality of abnormal characteristic information based on the input information table as positive sample data in the sample data;
and matching a plurality of pieces of normal characteristic information based on the input information table to serve as negative sample data in the sample data.
In the embodiment of the application, after user behavior information is extracted from multiple channels to obtain the original data set, the original data set can be summarized and derived by statistical means to form the information table required as input by the recognition model to be trained. At the same time, the feature information of a number of users confirmed to be abnormal is matched and labeled as positive samples for training the recognition model to be trained, and the remaining user feature information serves as negative samples.
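As a rough sketch of this step, the following Python builds a small input table of count features and splits it into positive and negative samples. The feature names (`op_count`, `op_distinct`, `act_count`) and both helper functions are assumptions made for illustration; the application does not specify which statistics are derived.

```python
def build_input_table(raw_data):
    """Summarise raw behavior lists into simple per-user count features.
    The chosen features are hypothetical examples of derived statistics."""
    table = {}
    for msisdn, info in raw_data.items():
        table[msisdn] = {
            "op_count": len(info["operation"]),
            "op_distinct": len(set(info["operation"])),
            "act_count": len(info["activity"]),
        }
    return table


def label_samples(table, confirmed_abnormal):
    """Users confirmed as abnormal become positive samples; the rest are negative."""
    positives = {m: f for m, f in table.items() if m in confirmed_abnormal}
    negatives = {m: f for m, f in table.items() if m not in confirmed_abnormal}
    return positives, negatives


raw_data = {
    "u1": {"operation": ["register", "comment", "comment"], "activity": ["purchase"]},
    "u2": {"operation": ["register"], "activity": []},
    "u3": {"operation": ["register"] * 40, "activity": ["forward"] * 25},
}
table = build_input_table(raw_data)
positives, negatives = label_samples(table, confirmed_abnormal={"u3"})
```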
Step 103: and training the recognition model to be trained by using the sample data to obtain the recognition model.
In an optional embodiment of the present application, the sample data includes a plurality of characteristic indicators, and the step 103 may be specifically implemented in the following manner:
determining a correlation between each of the plurality of characteristic indicators, and determining at least one important indicator in the plurality of characteristic indicators based on the correlation;
and training the identification model to be trained by using the sample data corresponding to the at least one important index in the sample data.
Specifically, after the positive and negative samples in the sample data are extracted, the validity of the sample data needs to be checked and processed. In an optional embodiment of the present application, the Pearson correlation coefficient may be used to calculate the correlation between indicators, and important indicators are extracted according to that correlation, thereby reducing data redundancy.
The Pearson correlation coefficient reflects the degree of linear correlation between two variables and takes values in [-1, 1], where 1 indicates a perfect positive correlation, 0 indicates no linear relationship, and -1 indicates a perfect negative correlation, i.e., one variable rises as the other falls. The closer the correlation coefficient is to 0, the weaker the correlation between the two variables. The correlation is calculated as follows:
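$$r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$$
where X and Y are the two continuous variables, $\bar{X}$ and $\bar{Y}$ are their sample means, and n is the number of observations.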
The correlation is judged on the following basis:
In general, |r| > 0.95 indicates a significant correlation between two variables; |r| ≥ 0.8 indicates a high correlation; 0.5 ≤ |r| < 0.8 indicates a moderate correlation; 0.3 ≤ |r| < 0.5 indicates a low correlation; and |r| < 0.3 indicates a relationship so weak that the two variables can be considered uncorrelated.
By calculating the Pearson correlation coefficient between indicator variables, indicators that are highly correlated with other variables and relatively unimportant can be removed, thereby reducing the indicator redundancy of the model.
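The pruning idea can be illustrated with a short pure-Python sketch. It assumes a greedy rule that the application does not spell out: indicators are visited in a caller-supplied importance order (the hypothetical `priority` list), and an indicator is kept only if its absolute Pearson coefficient with every indicator already kept is below a threshold (0.8 here, matching the "high correlation" boundary).

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / (sd_x * sd_y)


def select_indicators(columns, priority, threshold=0.8):
    """Greedy pruning (an assumed rule): keep an indicator only if it is not
    highly correlated with any indicator that has already been kept."""
    kept = []
    for name in priority:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "logins": [1, 2, 3, 4, 5],
    "page_views": [2, 4, 6, 8, 10],  # exactly twice `logins`, so redundant
    "purchases": [5, 1, 4, 2, 3],
}
kept = select_indicators(columns, priority=["logins", "page_views", "purchases"])
```

In this toy data `page_views` is dropped because it is perfectly correlated with `logins`, while `purchases` (r = -0.3 against `logins`) is retained.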
In an optional embodiment of the present application, the recognition model to be trained includes a gradient boosting decision tree GBDT model, and the step 103 may be specifically implemented as follows:
dividing the sample data according to the operation behavior characteristics and the activity participation characteristics to obtain an operation behavior characteristic set and an activity participation characteristic set;
respectively establishing a first GBDT model corresponding to the operation behavior characteristics and a second GBDT model corresponding to the activity participation characteristics;
traversing the first GBDT model by using the operation behavior feature set to obtain first features output by leaf nodes of the first GBDT model;
traversing the second GBDT model by using the activity participation characteristic set to obtain second characteristics output by leaf nodes of the second GBDT model.
In the embodiment of the application, because the input of the recognition model involves both operation behavior information and activity participation information, the feature dimensionality for training is very high. A logistic regression (LR) algorithm can be adopted, but because the learning capability of an LR model is limited, a large amount of feature engineering is needed to extract effective features and feature combinations and so improve the nonlinear learning capability of the model. The gradient boosting decision tree (GBDT) is an iterative decision tree algorithm that belongs to the Boosting family of ensemble learning and offers high classification accuracy and good generalization. GBDT is a nonlinear model: following the Boosting idea, each iteration builds a new decision tree in the gradient direction that reduces the residual, so the number of decision trees equals the number of iterations. GBDT can therefore discover a variety of discriminative features and feature combinations, greatly saving the time and labor cost of feature engineering. For this reason, the embodiment of the present application selects a fusion of GBDT and LR as its recognition model.
The basic idea of GBDT is as follows. Based on the forward stagewise algorithm, each iteration aims to reduce the residual left by the previous iteration; to do so, a new model is built in the gradient direction in which the residual decreases. In gradient boosting, therefore, the goal of each new model is to reduce the residual of the previous model along the gradient direction, which differs greatly from traditional Boosting algorithms that reweight correctly and incorrectly classified samples. As a result, GBDT can achieve high accuracy with relatively little parameter tuning. In addition, GBDT adopts a robust loss function and is highly robust to abnormal data.
The input of the model in the embodiment of the application mainly comprises user operation behavior information and user activity participation information, but the features corresponding to the activity participation information are very sparse. Trees therefore need to be built separately for the feature indicators of the two kinds of information, so that the feature weights are not skewed.
The training steps of the recognition model to be trained in the embodiment of the application are as follows:
Preprocess the sample data $x_i$ = (msisdn, flag, A1, A2, …, An, B1, B2, …, Bm), where msisdn represents the registration number, flag represents the abnormal-number identifier, i.e., the value to be predicted, A1 to An represent the operation behavior features of the user, and B1 to Bm represent the activity participation features of the user. The sample data $x_i$ is divided according to the operation behavior features and the activity participation features, and the output is as follows:
Operation behavior feature set: $x_{iA}$, i.e., (msisdn, flag, A1, A2, …, An)
Activity participation feature set: $x_{iB}$, i.e., (msisdn, flag, B1, B2, …, Bm)
Dividing the sample data yields two training sets, T1 and T2. As shown in Fig. 2, the operation behavior features $x_{iA}$ in T1 and the activity participation features $x_{iB}$ in T2 can be trained separately, and a corresponding GBDT tree is established for each.
After the GBDT trees for the two kinds of features are constructed, the operation behavior feature set $x_{iA}$ and the activity participation feature set $x_{iB}$ in the sample data each traverse the corresponding GBDT tree, and each output leaf node becomes one feature of the LR model. A value of 0 or 1 indicates whether the sample falls into a given leaf node, which yields the input of the LR model, $x_{input}$ = (msisdn, flag, C1, C2, …, Ck), where k is the number of GBDT leaf nodes.
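The leaf-encoding step can be sketched as follows. This is a simplified illustration, not the application's implementation: each feature group is represented by a single hand-built tree (a real GBDT model contains many trees), and the dictionary-based tree structure and the helpers `leaf`, `split`, `leaf_index`, `count_leaves`, and `one_hot_leaves` are all hypothetical.

```python
def leaf(leaf_id):
    return {"kind": "leaf", "leaf_id": leaf_id}


def split(feature, threshold, left, right):
    return {"kind": "split", "feature": feature, "threshold": threshold,
            "left": left, "right": right}


def leaf_index(tree, x):
    """Walk the tree and return the id of the leaf that sample `x` falls into."""
    node = tree
    while node["kind"] == "split":
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf_id"]


def count_leaves(tree):
    if tree["kind"] == "leaf":
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


def one_hot_leaves(tree, x):
    """One 0/1 entry per leaf; exactly one entry is 1 for a given sample."""
    block = [0] * count_leaves(tree)
    block[leaf_index(tree, x)] = 1
    return block


# Toy tree for operation behavior features (3 leaves) and one for
# activity participation features (2 leaves); leaf ids run from 0 upward.
tree_a = split(0, 10, leaf(0), split(1, 0.5, leaf(1), leaf(2)))
tree_b = split(0, 3, leaf(0), leaf(1))

x_a = [25, 0.2]  # operation behavior features of one user
x_b = [1]        # activity participation features of the same user

# Concatenate both blocks to form the (C1, ..., Ck) part of the LR input, k = 5.
lr_input = one_hot_leaves(tree_a, x_a) + one_hot_leaves(tree_b, x_b)
```

Keeping one tree (or one model) per feature group means the sparse activity participation features get their own leaves instead of competing with the denser operation behavior features for splits.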
Taking the construction of the GBDT tree for the user operation behavior features as an example, the specific steps are as follows:
a) Input:
Training sample set: $\{(x_i, y_i)\}_{i=1}^{N}$
Loss function: $L(y, F(x))$
Number of iterations: $M$.
where $x_i = (A_{i1}, A_{i2}, \ldots, A_{in})$ is the operation behavior feature set of the user; $y_i \in \{0, 1\}$, with 0 representing a negative sample and 1 representing a positive sample; $F(x)$ is the predicted value of model $F$; and $N$ is the number of samples.
The loss function of the GBDT tree for the user operation behavior characteristics takes the form:
Figure BDA0003162588940000103
b) The weak learner is initialized.
Figure BDA0003162588940000111
c) For the mth iteration:
1) Compute the negative gradient of the loss function as the estimate of the residual $r_{im}$:
Figure BDA0003162588940000112
2) Update the training set to $\{(x_i, r_{im})\}_{i=1}^{N}$, which is used to train a model $h_m(x)$ that fits $r_{im}$.
3) Estimate the leaf node region values using a line search, i.e., optimize the following function:
Figure BDA0003162588940000114
Figure BDA0003162588940000115
where $R_{jm}$ is the region of leaf node $j$, $J_m$ is the number of leaves, and $b_{jm}$ is the output value of leaf node $j$.
4) Update the model:
$F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$ (7)
5) After M rounds of iteration, output the strong learner $F_M(x)$.
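For reference, the steps above match the standard gradient boosting formulation, which can be written out as follows. This is an assumption based on the symbols defined in the text ($r_{im}$, $h_m$, $R_{jm}$, $J_m$, $b_{jm}$, $\gamma_m$); the equation images of the application are not reproduced here, and its exact loss function and equation numbering may differ.

```latex
\begin{aligned}
F_0(x) &= \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)
  && \text{(initialize the weak learner)} \\
r_{im} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
  && \text{(negative gradient, } i = 1, \ldots, N\text{)} \\
h_m(x) &= \sum_{j=1}^{J_m} b_{jm}\, \mathbf{1}\bigl(x \in R_{jm}\bigr)
  && \text{(tree fitted to } \{(x_i, r_{im})\}_{i=1}^{N}\text{)} \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{N} L\bigl(y_i, F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr)
  && \text{(line search)} \\
F_m(x) &= F_{m-1}(x) + \gamma_m h_m(x)
  && \text{(model update)}
\end{aligned}
```

Each round therefore fits one tree to the current negative gradients and adds it to the model with a step size chosen by line search, so M rounds produce M trees.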
In an optional embodiment of the present application, the recognition model to be trained further includes a logistic regression LR model, and step 103 may be specifically implemented by the following steps:
training the LR model by using the first characteristic and the second characteristic to obtain a trained LR model; the output of the LR model includes identification information of a training subject included in the sample data and a type identification of the training subject.
Specifically, the features obtained from the two GBDT trees are used as the input of the LR model, and the LR model is built to predict whether the registered number in each sample data set belongs to an abnormal user. An example output is msisdn1: flag, where flag = 0 denotes a non-abnormal number and flag = 1 denotes an abnormal number.
In the embodiment of the application, the specific steps for establishing the LR model are as follows:
a) Inputting a training sample set
Figure BDA0003162588940000116
Loss function:
Figure BDA0003162588940000117
Step size (learning rate): α; maximum number of iterations: max_iter; error tolerance: tol.
Here, x_i = (C_i1, C_i2, …, C_in) is the fused feature set output by the GBDT; y_i ∈ {0, 1}, where 0 denotes a negative sample and 1 denotes a positive sample; and N is the number of samples.
The loss function takes the form of the log-likelihood loss:
Figure BDA0003162588940000121
wherein,
Figure BDA0003162588940000122
b) Initialize the parameters θ = (θ_0, θ_1, θ_2, …, θ_k); they can be set to the all-ones vector.
c) For the j-th iteration, determine whether the error is less than tol. If so, terminate the training; otherwise, perform the following operation:
update
Figure BDA0003162588940000123
d) Output the final LR model parameters θ.
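The LR training steps a) to d) can be sketched as batch gradient descent on the mean log-likelihood loss. This is a hedged sketch: the update rule in the application is given only as an equation image, so the gradient step below is the standard one, and "error" is assumed here to mean the change in loss between two consecutive iterations. The names `predict`, `log_loss`, and `train_lr` are hypothetical.

```python
import math


def predict(theta, x):
    """Probability that x is a positive (abnormal) sample."""
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(theta, xs, ys, eps=1e-12):
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(predict(theta, x), eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(xs)


def train_lr(xs, ys, alpha=0.5, max_iter=1000, tol=1e-6):
    theta = [1.0] * (len(xs[0]) + 1)  # step b): all-ones start, theta[0] is the intercept
    prev = log_loss(theta, xs, ys)
    for _ in range(max_iter):
        errors = [predict(theta, x) - y for x, y in zip(xs, ys)]
        grad = [sum(errors) / len(xs)]
        for j in range(len(xs[0])):
            grad.append(sum(e * x[j] for e, x in zip(errors, xs)) / len(xs))
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # step c): update
        cur = log_loss(theta, xs, ys)
        if abs(prev - cur) < tol:  # assumed meaning of "error is less than tol"
            break
        prev = cur
    return theta  # step d): final parameters


# Toy 0/1 leaf features: the first leaf is reached by normal users, the second by abnormal ones.
xs = [[1, 0], [1, 0], [0, 1], [0, 1]]
ys = [0, 0, 1, 1]
theta = train_lr(xs, ys)
```

A flag can then be derived by thresholding `predict(theta, x)`; the threshold value is a design choice that the application does not specify.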
In an optional embodiment of the present application, the step 103 may be specifically implemented by the following steps:
training the recognition model to be trained by using K-fold cross-validation based on the sample data to obtain K target recognition models with different parameters;
and selecting the target recognition model with the maximum harmonic mean value from the K target recognition models with different parameters as a recognition model.
Specifically, for the GBDT trees, the number of trees, the attribute dimensions, and the tree depth need to be tuned manually. Model training can be performed using K-fold cross-validation (for example, ten-fold cross-validation), and the GBDT parameters that give the most accurate classification result are selected as the optimization result; that is, the larger the F1 value, the better the recognition performance of the model, where
Figure BDA0003162588940000124
Precision is the precision and Recall is the recall.
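The selection by harmonic mean can be sketched as follows, using the standard definition F1 = 2 · Precision · Recall / (Precision + Recall). The fold assignment (round-robin by index) and the names `f1_score`, `k_fold_splits`, and `select_best` are illustrative assumptions; the candidate labels in the example are made up.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (abnormal) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    for fold in range(k):
        val = [i for i in range(n_samples) if i % k == fold]
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, val


def select_best(candidates, y_true):
    """candidates maps a parameter label to that model's validation predictions."""
    return max(candidates, key=lambda name: f1_score(y_true, candidates[name]))


y_true = [1, 1, 0, 0, 1]
candidates = {
    "depth=3": [1, 0, 0, 0, 1],  # precision 1.0, recall 2/3
    "depth=5": [1, 1, 1, 1, 1],  # precision 0.6, recall 1.0
}
best = select_best(candidates, y_true)
```

Selecting on F1 rather than raw accuracy matters when abnormal users are rare, because a model that flags nobody can still score a high accuracy.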
In an optional embodiment of the present application, after the step 103 is executed, the obtained recognition model may be further optimized by using the following method:
testing the identification model by using test sample data to determine whether the index of the identification model meets a preset condition; and/or determining whether the recognition result of the recognition model is correct or not based on the third information of each object in the at least one target object;
if the index of the recognition model does not meet the preset condition and/or the recognition result of the recognition model is incorrect, continuing to optimize the recognition model to obtain an optimized recognition model.
In the embodiment of the application, the optimized algorithm rules are tested with test samples. On the one hand, indexes such as the recognition error rate, the accuracy rate, and the recall rate of the recognition model can be calculated by comparing the identified abnormal users against known positive and negative samples; on the other hand, the user's own attributes and behavior characteristics are used to verify whether the predictions of the recognition model are reasonable. Finally, the two checks are combined to confirm the optimal algorithm rules of the recognition model.
Step 104: determining an abnormal object in the at least one target object based on the first information, the second information and the recognition model of the at least one target object.
In an optional embodiment of the present application, the step 104 may be specifically implemented by:
and inputting the input information table into the recognition model, and determining an abnormal object in the at least one target object by using the recognition model.
Specifically, in the embodiment of the present application, after the trained recognition model is obtained, the input information table built from the operation behavior information and the activity participation information of the plurality of users is input to the trained recognition model, which then identifies whether each of the plurality of users is an abnormal user.
In an optional embodiment of the present application, in a case that the recognition model is an optimized recognition model, the present application can determine an abnormal object in the at least one target object based on the first information and the second information of the at least one target object and the optimized recognition model.
Specifically, in the embodiment of the application, after the identification model is obtained, the identification model can be further optimized, and whether the plurality of users are abnormal users or not can be predicted by using the optimized identification model, so that the prediction accuracy is improved.
According to the technical scheme, the basic attribute and behavior information of the client, the CP and the users participating in the activity can be fused, the information is processed and subjected to derivative statistics and then serves as model input, the abnormal users are identified by the aid of the identification model, the algorithm is optimized, and an abnormal object set is output.
Fig. 3 is a schematic diagram of a training process of a recognition model to be trained according to an embodiment of the present application. As shown in fig. 3, the training process of the recognition model to be trained includes the following steps:
Step 301: data acquisition.
A data acquisition channel is determined, and behavior information of a plurality of users is extracted from a plurality of channels.
Step 302: acquire basic information, operation behavior information, and activity participation information.
The basic information, operation behavior information, and activity participation information of the users are extracted from the extracted behavior information of the plurality of users.
Step 303: establish the feature index information table (i.e., the input information table) required by the model.
After the basic information, the operation behavior information, and the activity participation information of the users are extracted, the raw data set of user behavior information is aggregated and derived by statistical means to form the information table required as input to the recognition model to be trained.
Step 304: GBDT feature extraction.
After the feature index information table required by the model is established, the sample data input to the model is constructed from the feature index information table, and the sample data is divided to obtain a sample data set used for training the operation behavior information GBDT tree and a sample data set used for training the activity participation information GBDT tree.
The two GBDT trees are trained on the two sample data sets respectively to obtain the leaf nodes output by the two GBDT trees.
Step 305: LR model training.
The features output by the leaf nodes of the two GBDT trees are used as the input for training the LR model.
Step 306: model output.
The LR model outputs a prediction of whether the registered number in each sample data set belongs to an abnormal user; the output has the form msisdn1: flag, where flag = 0 denotes a non-abnormal number and flag = 1 denotes an abnormal number.
Step 307: determine whether the model is reasonable.
Based on the output of the LR model, it is determined whether the model's predictions of abnormal users are reasonable. There are mainly two ways of determining whether the prediction result is correct: step 308 and step 309.
Step 308: reversely verify the identification accuracy for abnormal users according to the basic attributes and behavior information of the users.
Model training personnel determine whether a user is an abnormal user according to the user's basic attributes, behavior information, and the like, and compare this determination with the prediction result of the model to decide whether the prediction of the model is correct.
Step 309: algorithm optimization.
When the accuracy of the prediction result of the model is judged to be low, steps 304 to 307 are executed in a loop to optimize the model.
Step 310: determine the algorithm rule.
Once the algorithm model has been optimized so that the accuracy of the obtained model meets the condition, the finally optimized algorithm model is taken as the final recognition model, which is subsequently used to identify abnormal users.
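The loop over steps 304 to 310 can be sketched as a simple search that retrains until an acceptance threshold is met. This is only an illustration of the control flow: `train_fn`, `evaluate_fn`, the parameter grid, and the threshold `min_score` are hypothetical placeholders, and the application does not state a concrete acceptance condition.

```python
def optimize_until_reasonable(train_fn, evaluate_fn, param_grid, min_score=0.8):
    """Repeat training and checking until a model meets the acceptance threshold."""
    best_model, best_score = None, -1.0
    for params in param_grid:
        model = train_fn(params)      # steps 304-306: GBDT features, LR training, output
        score = evaluate_fn(model)    # steps 307-308: check whether predictions are reasonable
        if score > best_score:
            best_model, best_score = model, score
        if score >= min_score:        # step 310: accept this model as the algorithm rule
            break
    return best_model, best_score


# Stand-in callables: the "model" is just its parameters, with made-up scores.
made_up_scores = {1: 0.5, 2: 0.85, 3: 0.9}
grid = [{"depth": 1}, {"depth": 2}, {"depth": 3}]
final_model, final_score = optimize_until_reasonable(
    train_fn=lambda params: params,
    evaluate_fn=lambda model: made_up_scores[model["depth"]],
    param_grid=grid,
)
```

If no candidate reaches the threshold, the sketch returns the best model seen, leaving the decision on further optimization to the caller.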
Fig. 4 is a schematic structural composition diagram of an abnormal object recognition apparatus 400 provided in the embodiment of the present application, and as shown in fig. 4, the abnormal object recognition apparatus 400 includes:
a first obtaining unit 401, configured to obtain first information and second information of at least one target object; the first information is operation behavior information, and the second information is activity participation information;
a second obtaining unit 402, configured to obtain sample data based on the first information and the second information of the at least one target object;
a training unit 403, configured to train, by using the sample data, an identification model to be trained, to obtain an identification model;
a determining unit 404, configured to determine an abnormal object in the at least one target object based on the first information and the second information of the at least one target object and the recognition model.
In an optional embodiment of the present application, the sample data includes positive sample data and negative sample data; the second obtaining unit 402 is specifically configured to: constructing an input information table based on the first information and the second information of the at least one target object; matching a plurality of abnormal characteristic information based on the input information table as positive sample data in the sample data; and matching a plurality of pieces of normal characteristic information based on the input information table to serve as negative sample data in the sample data.
In an optional embodiment of the application, the determining unit 404 is specifically configured to: and inputting the input information table into the recognition model, and determining an abnormal object in the at least one target object by using the recognition model.
In an optional embodiment of the present application, the sample data includes a plurality of feature indicators, and the training unit 403 is specifically configured to: determining a correlation between each characteristic index of the plurality of characteristic indexes, and determining at least one important index of the plurality of characteristic indexes based on the correlation; and training the identification model to be trained by using the sample data corresponding to the at least one important index in the sample data.
In an optional embodiment of the present application, the recognition model to be trained includes a gradient boosting decision tree GBDT model, and the training unit 403 is specifically configured to: dividing the sample data according to the operation behavior characteristics and the activity participation characteristics to obtain an operation behavior characteristic set and an activity participation characteristic set; respectively establishing a first GBDT model corresponding to the operation behavior characteristics and a second GBDT model corresponding to the activity participation characteristics; traversing the first GBDT model by using the operation behavior feature set to obtain first features output by leaf nodes of the first GBDT model; traversing the second GBDT model by using the activity participation characteristic set to obtain second characteristics output by leaf nodes of the second GBDT model.
In an optional embodiment of the present application, the recognition model to be trained further includes a logistic regression LR model, and the training unit 403 is further specifically configured to: training the LR model by using the first characteristic and the second characteristic to obtain a trained LR model; the output of the LR model includes identification information of a training subject included in the sample data and a type identification of the training subject.
In an optional embodiment of the present application, the training unit 403 is specifically configured to: training the recognition model to be trained by using K-fold cross-validation based on the sample data to obtain K target recognition models with different parameters; and selecting the target recognition model with the maximum harmonic mean value from the K target recognition models with different parameters as a recognition model.
In an optional implementation manner of this application, the training unit 403 trains the recognition model to be trained by using the sample data, and after obtaining the recognition model, the apparatus further includes:
an optimizing unit 405, configured to test the recognition model by using test sample data, and determine whether an index of the recognition model satisfies a preset condition; and/or determining whether the recognition result of the recognition model is correct or not based on third information of each object in the at least one target object; if the index of the recognition model does not meet the preset condition and/or the recognition result of the recognition model is incorrect, continuing to optimize the recognition model to obtain an optimized recognition model;
the determining unit 404 is further specifically configured to: determining an abnormal object of the at least one target object based on the first information, the second information of the at least one target object and the optimized recognition model.
It should be understood by those skilled in the art that the implementation functions of the units in the abnormal object recognition apparatus 400 shown in fig. 4 can be understood by referring to the related description of the abnormal object recognition method. The functions of the units in the apparatus 400 for identifying an abnormal object shown in fig. 4 may be implemented by a program running on a processor, or may be implemented by a specific logic circuit.
The embodiment of the application also provides the electronic equipment. Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application, and as shown in fig. 5, the electronic device includes: a communication component 503 for data transmission, at least one processor 501 and a memory 502 for storing computer programs capable of running on the processor 501. The various components in the terminal are coupled together by a bus system 504. It is understood that the bus system 504 is used to enable communications among the components. The bus system 504 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 504 in fig. 5.
Wherein the processor 501 executes the computer program to perform at least the steps of the method shown in fig. 1.
It will be appreciated that the memory 502 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 502 described in the embodiments herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The method disclosed in the embodiments of the present application may be applied to the processor 501, or implemented by the processor 501. The processor 501 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in software form in the processor 501. The processor 501 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 501 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 502, and the processor 501 reads the information in the memory 502 and performs the steps of the aforementioned methods in conjunction with its hardware.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), FPGAs, general purpose processors, controllers, MCUs, microprocessors, or other electronic components for performing the aforementioned abnormal object identification method.
An embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is configured to, when executed by a processor, perform at least the steps of the method shown in fig. 1. The computer readable storage medium may be specifically a memory. The memory may be memory 502 as shown in fig. 5.
The technical solutions described in the embodiments of the present application can be arbitrarily combined without conflict.
In the several embodiments provided in the present application, it should be understood that the disclosed method and intelligent device may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one second processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application.

Claims (11)

1. A method for identifying an abnormal object, the method comprising:
obtaining first information and second information of at least one target object; the first information is operation behavior information, and the second information is activity participation information;
obtaining sample data based on the first information and the second information of the at least one target object;
training the identification model to be trained by using the sample data to obtain an identification model;
determining an abnormal object in the at least one target object based on the first information, the second information and the recognition model of the at least one target object.
2. The method of claim 1, wherein the sample data comprises positive and negative sample data;
the obtaining sample data based on the first information and the second information of the at least one target object includes:
constructing an input information table based on the first information and the second information of the at least one target object;
matching a plurality of abnormal characteristic information based on the input information table as positive sample data in the sample data;
and matching a plurality of pieces of normal characteristic information based on the input information table to serve as negative sample data in the sample data.
3. The method of claim 2, wherein the determining abnormal objects in the at least one target object based on the first information and the second information of the at least one target object and the recognition model comprises:
and inputting the input information table into the recognition model, and determining an abnormal object in the at least one target object by using the recognition model.
4. The method according to any one of claims 1 to 3, wherein the sample data includes a plurality of feature indicators, and the training of the recognition model to be trained using the sample data includes:
determining a correlation between each characteristic index of the plurality of characteristic indexes, and determining at least one important index of the plurality of characteristic indexes based on the correlation;
and training the identification model to be trained by using the sample data corresponding to the at least one important index in the sample data.
5. The method according to any one of claims 1 to 3, wherein the recognition model to be trained comprises a gradient-boosting decision tree (GBDT) model, and the training of the recognition model to be trained using the sample data to obtain a recognition model comprises:
dividing the sample data according to the operation behavior characteristics and the activity participation characteristics to obtain an operation behavior characteristic set and an activity participation characteristic set;
respectively establishing a first GBDT model corresponding to the operation behavior characteristics and a second GBDT model corresponding to the activity participation characteristics;
traversing the first GBDT model by using the operation behavior feature set to obtain first features output by leaf nodes of the first GBDT model;
traversing the second GBDT model by using the activity participation characteristic set to obtain second characteristics output by leaf nodes of the second GBDT model.
6. The method of claim 5, wherein the recognition model to be trained further comprises a Logistic Regression (LR) model, and wherein training the recognition model to be trained using the sample data to obtain a recognition model further comprises:
training the LR model by using the first characteristic and the second characteristic to obtain a trained LR model; the output of the LR model includes identification information of a training subject included in the sample data and a type identification of the training subject.
7. The method according to any one of claims 1 to 3, wherein training a recognition model to be trained using the sample data to obtain a recognition model comprises:
training the recognition model to be trained by using K-fold cross-validation based on the sample data to obtain K target recognition models with different parameters;
and selecting the target recognition model with the maximum harmonic mean value from the K target recognition models with different parameters as a recognition model.
8. The method according to any one of claims 1 to 3, wherein after training a recognition model to be trained by using the sample data and obtaining the recognition model, the method further comprises:
testing the identification model by using test sample data to determine whether the index of the identification model meets a preset condition; and/or determining whether the recognition result of the recognition model is correct or not based on the third information of each object in the at least one target object;
if the index of the recognition model does not meet the preset condition and/or the recognition result of the recognition model is incorrect, continuing to optimize the recognition model to obtain an optimized recognition model;
the determining an abnormal object in the at least one target object based on the first information, the second information and the recognition model of the at least one target object comprises:
determining an abnormal object of the at least one target object based on the first information, the second information of the at least one target object and the optimized recognition model.
9. An apparatus for identifying an abnormal object, the apparatus comprising:
a first obtaining unit configured to obtain first information and second information of at least one target object; the first information is operation behavior information, and the second information is activity participation information;
a second obtaining unit, configured to obtain sample data based on the first information and the second information of the at least one target object;
the training unit is used for training the identification model to be trained by utilizing the sample data to obtain the identification model;
a determining unit, configured to determine an abnormal object in the at least one target object based on the first information and the second information of the at least one target object and the recognition model.
10. An electronic device, characterized in that the electronic device comprises: a memory having computer-executable instructions stored thereon and a processor operable to implement the method of any of claims 1 to 8 when executing the computer-executable instructions on the memory.
11. A computer storage medium having stored thereon executable instructions that when executed by a processor implement the method of any one of claims 1 to 8.
CN202110795543.7A 2021-07-14 2021-07-14 Abnormal object identification method and device, electronic equipment and storage medium Pending CN115700787A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110795543.7A CN115700787A (en) 2021-07-14 2021-07-14 Abnormal object identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110795543.7A CN115700787A (en) 2021-07-14 2021-07-14 Abnormal object identification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115700787A true CN115700787A (en) 2023-02-07

Family

ID=85120359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110795543.7A Pending CN115700787A (en) 2021-07-14 2021-07-14 Abnormal object identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115700787A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521042A (en) * 2024-01-05 2024-02-06 创旗技术有限公司 High-risk authorized user identification method based on ensemble learning
CN117521042B (en) * 2024-01-05 2024-05-14 创旗技术有限公司 High-risk authorized user identification method based on ensemble learning

Similar Documents

Publication Publication Date Title
CN111444952B (en) Sample recognition model generation method, device, computer equipment and storage medium
CN108090567B (en) Fault diagnosis method and device for power communication system
EP3413221A1 (en) Risk assessment method and system
CN110442712B (en) Risk determination method, risk determination device, server and text examination system
WO2018040068A1 (en) Knowledge graph-based semantic analysis system and method
CN112801498B (en) Training method of risk identification model, risk identification method, device and equipment
CN106027577A (en) Exception access behavior detection method and device
CN113221104B (en) Detection method of abnormal behavior of user and training method of user behavior reconstruction model
CN112508580A (en) Model construction method and device based on rejection inference method and electronic equipment
CN112966865B (en) Number-carrying network-switching prediction method, device and equipment
CN113298121B (en) Message sending method and device based on multi-data source modeling and electronic equipment
CN110322254B (en) Online fraud identification method, device, medium and electronic equipment
CN110348471B (en) Abnormal object identification method, device, medium and electronic equipment
CN111797320A (en) Data processing method, device, equipment and storage medium
CN111695938A (en) Product pushing method and system
CN110162958B (en) Method, apparatus and recording medium for calculating comprehensive credit score of device
CN105405051B (en) Financial event prediction method and device
CN111951008A (en) Risk prediction method and device, electronic equipment and readable storage medium
WO2018036402A1 (en) Method and device for determining key variable in model
CN115700787A (en) Abnormal object identification method and device, electronic equipment and storage medium
CN113010785A (en) User recommendation method and device
CN116739795A (en) Knowledge graph-based insurance risk assessment method and device and electronic equipment
CN115186759A (en) Model training method and user classification method
CN115907936A (en) Money laundering risk self-evaluation method and system
CN114581209A (en) Method, device and equipment for training financial analysis model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination