CN111612037A

CN111612037A - Abnormal user detection method, device, medium and electronic equipment

Info

Publication number: CN111612037A
Application number: CN202010331880.6A
Authority: CN
Inventors: 胡青宇; 古承炬; 何振; 尹小亮; 林育芳; 陈炯其
Original assignee: Ping An Zhitong Consulting Co Ltd Shanghai Branch
Current assignee: Ping An Zhitong Consulting Co Ltd Shanghai Branch
Priority date: 2020-04-24
Filing date: 2020-04-24
Publication date: 2020-09-01

Abstract

The application relates to the technical field of artificial intelligence, can be applied to intelligent security scenarios, and provides an abnormal user detection method comprising the following steps: starting from a first-level user behavior feature data set and proceeding level by level to a target level, searching a plurality of sub data sets from each previous-level user behavior feature data set to serve as the next-level user behavior feature data sets; clustering the user samples in each searched sub data set to obtain user sample clusters, and then calculating the silhouette coefficient of the sub data set; taking each sub data set whose silhouette coefficient is larger than a preset threshold as a feature data set to be detected; inputting each feature data set to be detected into an isolation forest anomaly detection model to obtain predicted abnormal user samples; and scoring the feature data sets to be detected so as to determine, based on the scores, the abnormal user samples in the first-level user behavior feature data set. The embodiments of the application effectively improve the accuracy and reliability of abnormal user detection.

Description

Abnormal user detection method, device, medium and electronic equipment

Technical Field

The application relates to the technical field of artificial intelligence, in particular to an abnormal user detection method, device, medium and electronic equipment.

Background

With the rise and rapid development of artificial intelligence (AI), the practical application of AI technology to anomaly detection is attracting increasing attention across industries. However, anomaly detection with AI faces the technical difficulty that labeled data is scarce or entirely absent, so unsupervised learning is almost the only choice, and the isolation forest algorithm is a commonly used unsupervised technique. Under normal conditions, the isolation forest algorithm outputs global relative outliers. Abnormal users, however, often disguise themselves as normal users and appear abnormal only on local features. Because it is not known in advance which behaviors of an abnormal user will be abnormal, as many features as possible usually have to be developed, which leads to the curse of dimensionality: although the feature dimensions are enriched, they also introduce noise that the isolation forest model cannot identify. A global relative outlier therefore does not always correspond to a truly abnormal user. As a result, abnormal user detection in the prior art has low accuracy and reliability.

Therefore, there is a need to provide a new abnormal user detection scheme.

It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.

Disclosure of Invention

The application aims to provide an abnormal user detection scheme that effectively improves, at least to a certain extent, the accuracy and reliability of abnormal user detection.

According to an aspect of the present application, there is provided an abnormal user detection method, including:

starting from a first-level user behavior feature data set and proceeding level by level to a target level, searching a plurality of sub data sets from each previous-level user behavior feature data set to serve as a plurality of next-level user behavior feature data sets, wherein any two sub data sets at the same level have the same number of features and differ in at least one feature;

clustering the user samples in each searched sub data set to obtain user sample clusters, and calculating the silhouette coefficient of the sub data set based on the user sample clusters;

taking each sub data set whose silhouette coefficient is larger than a preset threshold as a feature data set to be detected;

inputting each feature data set to be detected into an isolation forest anomaly detection model to obtain the predicted abnormal user samples in each feature data set to be detected;

and scoring the feature data sets to be detected by using the user sample clusters and the predicted abnormal user samples corresponding to each feature data set to be detected, so as to determine, based on the scores, the abnormal user samples in the first-level user behavior feature data set.

In an exemplary embodiment of the application, the plurality of next-level user behavior feature data sets satisfy the following:

the number of behavior features in a next-level user behavior feature data set is smaller than the number of behavior features in the previous-level user behavior feature data set by a first preset number.

In an exemplary embodiment of the present application, the method further comprises:

the number of behavior features in the target-level user behavior feature data set is greater than or equal to a second preset number.

In an exemplary embodiment of the present application, the user behavior feature data set includes:

a feature data set of fraud detection related behaviors, including a feature data set of fraud detection related behaviors of a set of user samples corresponding to a group of fraudulent users.

In an exemplary embodiment of the present application, clustering the user samples in the searched sub data set to obtain user sample clusters and then calculating the silhouette coefficient of the sub data set based on the user sample clusters includes:

based on the formula:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

calculating the silhouette coefficient s(i) of each user sample in the sub data set, where a(i) is the average distance from the user sample to the other user samples in the user sample cluster to which it belongs, and b(i) is the minimum, over the user sample clusters to which it does not belong, of the average distance from the user sample to the user samples in that cluster;

and calculating the average of the silhouette coefficients of all user samples in the sub data set, this average being the silhouette coefficient of the sub data set.

In an exemplary embodiment of the present application, scoring the feature data sets to be detected by using the user sample clusters and the predicted abnormal user samples corresponding to each feature data set to be detected, so as to determine, based on the scores, the abnormal user samples in the first-level user behavior feature data set, includes:

calculating the abnormal confidence of each feature data set to be detected by using the user sample clusters and the predicted abnormal user samples corresponding to that feature data set;

and determining the abnormal user samples in the first-level user behavior feature data set based on the abnormal confidence.

In an exemplary embodiment of the present application, calculating the abnormal confidence of each feature data set to be detected by using the user sample clusters and the predicted abnormal user samples corresponding to that feature data set includes:

based on the formula

calculating the abnormal confidence β of each feature data set to be detected, where Xn1 is the distance between a predicted abnormal user sample and the center point of the user sample cluster to which it belongs, Xn2 is the distance between the non-abnormal user samples in that user sample cluster and the center point, and n is the number of abnormal user samples in the feature data set to be detected.

In an exemplary embodiment of the present application, determining the abnormal user samples in the first-level user behavior feature data set based on the abnormal confidence includes:

taking each feature data set to be detected whose abnormal confidence is lower than a preset confidence threshold as an abnormal feature data set;

determining the predicted abnormal user samples in the abnormal feature data set as abnormal user samples in the first-level user behavior feature data set; and further storing the abnormal user samples in a blockchain.

According to an aspect of the present application, there is provided an abnormal user detection apparatus, including:

a searching module, configured to, starting from a first-level user behavior feature data set and proceeding level by level to a target level, search a plurality of sub data sets from each previous-level user behavior feature data set to serve as a plurality of next-level user behavior feature data sets, wherein any two sub data sets at the same level have the same number of features and differ in at least one feature;

a calculating module, configured to cluster the user samples in each searched sub data set to obtain user sample clusters, and then calculate the silhouette coefficient of the sub data set based on the user sample clusters;

an acquisition module, configured to take each sub data set whose silhouette coefficient is larger than a preset threshold as a feature data set to be detected;

a prediction module, configured to input each feature data set to be detected into an isolation forest anomaly detection model to obtain the predicted abnormal user samples in each feature data set to be detected;

and a determining module, configured to score the feature data sets to be detected by using the user sample clusters and the predicted abnormal user samples corresponding to each feature data set to be detected, so as to determine, based on the scores, the abnormal user samples in the first-level user behavior feature data set.

According to an aspect of the application, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any of the above.

According to an aspect of the present application, there is provided an electronic device including:

a processor; and

a memory for storing computer program instructions for the processor; wherein the processor is configured to perform any of the methods described above via execution of the computer program instructions.

In the method, first, starting from a first-level user behavior feature data set and proceeding level by level to a target level, a plurality of sub data sets are searched from each previous-level user behavior feature data set to serve as a plurality of next-level user behavior feature data sets, where any two sub data sets at the same level have the same number of features and differ in at least one feature. Sampling and combining features level by level from the first level to the target level generates sub data sets that correspond to low-dimensional feature sampling spaces at every level, so that subsequent detection avoids the curse of dimensionality. At the same time, because the sub data sets searched from one level serve as the user behavior feature data sets of the next level, the search is a gradient, correlated sampling search in which each next-level data set further explains the data set of the previous level.

Second, the user samples in each searched sub data set are clustered to obtain user sample clusters, and the silhouette coefficient of the sub data set is calculated based on these clusters. The silhouette coefficient evaluates the clustering quality of the user samples in each sub data set, that is, whether the cohesion and separation of the user samples across all user sample clusters are relatively good.

Then, each sub data set whose silhouette coefficient is larger than a preset threshold is taken as a feature data set to be detected. This yields feature data sets in which the clustering quality of the user samples meets the requirement, and thereby determines the optimized cluster to which each user sample belongs.

Next, each feature data set to be detected is input into an isolation forest anomaly detection model to obtain the predicted abnormal user samples in that data set. The isolation forest model thus selects abnormal user samples from low-dimensional feature subspaces (the feature data sets to be detected) with good clustering quality, which keeps both the workload and the noise low.

Finally, the feature data sets to be detected are scored by using the user sample clusters and the predicted abnormal user samples corresponding to each of them, so as to determine, based on the scores, the abnormal user samples in the first-level user behavior feature data set. Clustering labels each user sample in the subspace of each feature data set to be detected with the cluster to which it belongs, while isolation forest anomaly detection labels each such user sample as abnormal or not. The predicted abnormal user samples and the user sample clusters together allow an anomaly evaluation of each feature data set to be detected, which further ensures the reliability of selecting abnormal user samples from the feature subspaces (the feature data sets to be detected). The accuracy and reliability of abnormal user detection, and in particular of artificial-intelligence-based abnormal user detection, are thereby effectively improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

Fig. 1 schematically shows a flow chart of an abnormal user detection method.

Fig. 2 schematically shows an application scenario example of an abnormal user detection method.

Fig. 3 schematically shows a flow chart of an abnormal user sample determination method.

Fig. 4 schematically shows a block diagram of an abnormal user detection apparatus.

Fig. 5 schematically shows an example block diagram of an electronic device for implementing the above-described abnormal user detection method.

Fig. 6 schematically illustrates a computer-readable storage medium for implementing the above-described abnormal user detection method.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present application.

Furthermore, the drawings are merely schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

In the present exemplary embodiment, a method for detecting an abnormal user is first provided, where the method for detecting an abnormal user may be run on a server, or may be run on a server cluster or a cloud server, and of course, a person skilled in the art may also run the method of the present invention on other platforms as needed, and this is not particularly limited in the present exemplary embodiment. Referring to fig. 1, the abnormal user detection method may include the steps of:

step S110, starting from a first-level user behavior feature data set and proceeding level by level to a target level, searching a plurality of sub data sets from each previous-level user behavior feature data set to serve as a plurality of next-level user behavior feature data sets, wherein any two sub data sets at the same level have the same number of features and differ in at least one feature;

step S120, clustering the user samples in each searched sub data set to obtain user sample clusters, and calculating the silhouette coefficient of the sub data set based on the user sample clusters;

step S130, taking each sub data set whose silhouette coefficient is larger than a preset threshold as a feature data set to be detected;

step S140, inputting each feature data set to be detected into an isolation forest anomaly detection model to obtain the predicted abnormal user samples in each feature data set to be detected;

step S150, scoring the feature data sets to be detected by using the user sample clusters and the predicted abnormal user samples corresponding to each feature data set to be detected, so as to determine, based on the scores, the abnormal user samples in the first-level user behavior feature data set.
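The control flow of steps S120 to S150 can be sketched as follows, assuming the sub data sets of step S110 are already available. This is only an illustrative skeleton: the function name `detect_abnormal_users`, the pluggable callables, and the toy values are hypothetical and not part of the application. The final rule, keeping the predicted abnormal samples of data sets whose confidence falls below a threshold, follows the embodiment described in the disclosure above.

```python
def detect_abnormal_users(subsets, cluster_fn, silhouette_fn, detect_fn, confidence_fn,
                          silhouette_threshold, confidence_threshold):
    """Walk through steps S120-S150 for sub data sets already produced by S110."""
    abnormal = set()
    for subset in subsets:
        labels = cluster_fn(subset)                      # S120: cluster the user samples
        if silhouette_fn(subset, labels) <= silhouette_threshold:
            continue                                     # S130: skip poorly clustered subsets
        predicted = detect_fn(subset)                    # S140: predicted abnormal sample ids
        confidence = confidence_fn(subset, labels, predicted)  # S150: score the data set
        if confidence < confidence_threshold:            # low confidence marks an abnormal set
            abnormal.update(predicted)
    return abnormal

# Toy stand-ins (made-up numbers) so the control flow can be exercised without real models.
toy = {
    "A": {"silhouette": 0.8, "predicted": {"u1", "u2"}, "confidence": 0.2},
    "B": {"silhouette": 0.3, "predicted": {"u3"}, "confidence": 0.1},
    "C": {"silhouette": 0.9, "predicted": {"u4"}, "confidence": 0.7},
}
result = detect_abnormal_users(
    subsets=toy,
    cluster_fn=lambda name: None,
    silhouette_fn=lambda name, labels: toy[name]["silhouette"],
    detect_fn=lambda name: toy[name]["predicted"],
    confidence_fn=lambda name, labels, predicted: toy[name]["confidence"],
    silhouette_threshold=0.5,
    confidence_threshold=0.5,
)
print(sorted(result))
```

In the toy run, sub data set B is dropped at S130 because its silhouette value is below the threshold, and C is dropped at S150 because its confidence is not below the confidence threshold, so only the samples predicted in A are reported.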

In the above abnormal user detection method, first, starting from a first-level user behavior feature data set and proceeding level by level to a target level, a plurality of sub data sets are searched from each previous-level user behavior feature data set to serve as a plurality of next-level user behavior feature data sets, where any two sub data sets at the same level have the same number of features and differ in at least one feature. Sampling and combining features level by level from the first level to the target level generates sub data sets that correspond to low-dimensional feature sampling spaces at every level, so that subsequent detection avoids the curse of dimensionality. At the same time, because the sub data sets searched from one level serve as the user behavior feature data sets of the next level, the search is a gradient, correlated sampling search in which each next-level data set further explains the data set of the previous level.

Second, the user samples in each searched sub data set are clustered to obtain user sample clusters, and the silhouette coefficient of the sub data set is calculated based on these clusters. The silhouette coefficient evaluates the clustering quality of the user samples in each sub data set, that is, whether the cohesion and separation of the user samples across all user sample clusters are relatively good.

Then, each sub data set whose silhouette coefficient is larger than a preset threshold is taken as a feature data set to be detected. This yields feature data sets in which the clustering quality of the user samples meets the requirement, and thereby determines the optimized cluster to which each user sample belongs.

Next, each feature data set to be detected is input into an isolation forest anomaly detection model to obtain the predicted abnormal user samples in that data set. The isolation forest model thus selects abnormal user samples from low-dimensional feature subspaces (the feature data sets to be detected) with good clustering quality, which keeps both the workload and the noise low.

Finally, the feature data sets to be detected are scored by using the user sample clusters and the predicted abnormal user samples corresponding to each of them, so as to determine, based on the scores, the abnormal user samples in the first-level user behavior feature data set. The user sample clusters of each feature data set to be detected reflect the cluster to which each sample belongs, and the predicted abnormal user samples reflect the abnormal condition of the user samples within each cluster. Together they allow an anomaly evaluation of each feature data set to be detected, which further ensures the reliability of selecting abnormal user samples from the feature subspaces (the feature data sets to be detected). The accuracy and reliability of abnormal user detection are thereby effectively improved.

The scheme of the application can be applied to scenarios such as security monitoring in intelligent security, thereby promoting the construction of smart cities.

Hereinafter, each step of the above abnormal user detection method in the present exemplary embodiment will be explained and described in detail with reference to the drawings.

In step S110, starting from the first-level user behavior feature data set and proceeding level by level to the target level, a plurality of sub data sets are searched from each previous-level user behavior feature data set to serve as a plurality of next-level user behavior feature data sets, where any two sub data sets at the same level have the same number of features and differ in at least one feature.

In this example embodiment, referring to Fig. 2, in one scenario, after the server 201 obtains the first-level user behavior feature data set from the server 202, the server 201 may, starting from the first-level user behavior feature data set and proceeding level by level to the target level, search a plurality of sub data sets from each previous-level user behavior feature data set to serve as a plurality of next-level user behavior feature data sets, where any two sub data sets at the same level have the same number of features and differ in at least one feature. The server 201 and the server 202 may be any terminals capable of storing and executing program instructions, such as a cloud server, a mobile phone, or a computer.

The first-level user behavior feature data set may be the acquired behavior feature data set of the set of users to be detected, for example, financial transaction related behavior feature data of mobile banking customers. It is the initial data set used for detecting abnormal users. In the first-level user behavior feature data set, a fraudster is often disguised as a normal user and appears abnormal only on local features, and performing anomaly recognition on this data set directly can lead to the curse of dimensionality.

Searching level by level from the first-level user behavior feature data set to the target level means the following. The plurality of sub data sets obtained by searching the first-level user behavior feature data set (any two of which have the same number of features and differ in at least one feature) serve as a plurality of second-level user behavior feature data sets. Each second-level user behavior feature data set is then searched separately to obtain a plurality of sub data sets (again with the same number of features and at least one differing feature) that serve as a plurality of third-level user behavior feature data sets, and so on until the target level is reached. Once the sub data sets corresponding to every level from the first level to the target level have been obtained, the search is complete.

The following parameters can be specified. start_len: the maximum number of features in the subspace of a sub data set (the number of features of the sub data sets searched from the first-level user behavior feature data set), with start_len < the number of input features (the number of features at the first level). end_len: the minimum number of features in the subspace of a sub data set (the number of features at the target level), with end_len <= start_len (that is, the number of features at the target level does not exceed start_len). step_len: the number of features removed each time sub data sets are searched, that is, the difference between the numbers of features of the user behavior feature data sets at two adjacent levels.

Given the original feature space (the first-level user behavior feature data set) as input, all feature subsets (sub data sets) are searched in turn, with step_len as the step, from feature subsets of length start_len (the second-level user behavior feature data sets) down to feature subsets of length end_len (the target-level user behavior feature data sets), at which point the search is complete.

In one example, start_len = 5, end_len = 3, and step_len = 1 are specified. When the original feature space of the first-level user behavior feature data set contains the features [f1, f2, f3, f4, f5, f6], six sub data sets with 5 features each can be searched: [f1, f2, f3, f4, f5], [f1, f2, f3, f4, f6], [f1, f2, f3, f5, f6], [f1, f2, f4, f5, f6], [f1, f3, f4, f5, f6], and [f2, f3, f4, f5, f6]. These six sub data sets are then taken as second-level user behavior feature data sets and searched in turn, and so on until the fourth level (the target level) is completed.
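The example above can be reproduced with a short sketch. The function name `search_feature_subsets` is hypothetical, and merging identical subsets that are reached from different parent data sets is an assumption of this sketch; the application itself does not say how such duplicates are handled.

```python
from itertools import combinations

def search_feature_subsets(features, start_len, end_len, step_len):
    """Return {subset_length: sorted list of unique feature subsets}, level by level."""
    if not (end_len <= start_len < len(features)) or step_len < 1:
        raise ValueError("expected end_len <= start_len < len(features) and step_len >= 1")
    levels = {}
    parents = [tuple(features)]      # the first-level data set
    length = start_len
    while length >= end_len:
        found = set()
        for parent in parents:
            # Every subset of this parent that keeps `length` features.
            found.update(combinations(parent, length))
        levels[length] = sorted(found)
        parents = levels[length]     # searched subsets become the next level's data sets
        length -= step_len
    return levels

levels = search_feature_subsets(["f1", "f2", "f3", "f4", "f5", "f6"], 5, 3, 1)
for length, subsets in levels.items():
    print(length, len(subsets))
```

With six input features this yields 6 subsets of length 5, 15 of length 4, and 20 of length 3, which shows how quickly the number of sub data sets grows as the search descends toward the target level.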

In this way, sampling and combining features level by level from the first level to the target level generates sub data sets that correspond to low-dimensional feature sampling spaces at every level, so that subsequent detection avoids the curse of dimensionality. At the same time, because the sub data sets searched from one level serve as the user behavior feature data sets of the next level, the search is a gradient, correlated sampling search in which each next-level data set further explains the data set of the previous level.

In one embodiment, the number of behavior features in a next-level user behavior feature data set is smaller than the number of behavior features in the previous-level user behavior feature data set by a first preset number.

That is, when sub data sets are searched from the user behavior feature data set of any level, the number of behavior features in each sub data set is smaller than the number of features in that user behavior feature data set by the first preset number. For example, if the user behavior feature data set of a certain level has 6 behavior features and the first preset number is 1, each searched sub data set has 5 behavior features, so the user behavior feature data sets of the following level have 5 behavior features.

The first preset number can be set according to requirements, and the larger the first preset number is, the larger the search interval is, and the faster the search speed is.

In one embodiment, the number of behavior features in the target-level user behavior feature data set is greater than or equal to a second preset number.

The second preset number can be set as required. Requiring the number of behavior features in the target-level user behavior feature data set to be greater than or equal to the second preset number ensures that every searched sub data set has at least the second preset number of features, which ensures the validity of the data sets.

In one embodiment, the user behavior feature data set includes a feature data set of fraud detection related behaviors, including a feature data set of fraud detection related behaviors of a set of user samples corresponding to a group of fraudulent users.

The fraud detection related behaviors are preset, relevant data for monitoring fraudulent users. For example, if fraudulent users usually fall within a certain age group while fraud is not strongly associated with gender, the fraud detection related behaviors include age feature data but not gender feature data. This ensures the accuracy and efficiency of fraudulent user detection.

Further, a group of fraudulent users means that a plurality of users act jointly in a fraud, that is, the fraud is carried out by a gang. The feature data set of fraud detection related behaviors of the set of user samples corresponding to a group of fraudulent users covers user behaviors that may exhibit joint fraud anomalies, for example, behaviors in a joint guarantee process in bank lending, or joint fraud related behaviors in an insurance claim settlement process.

Because fraud often carries the risk of being committed by gangs, abnormal user detection for gang fraud can be carried out effectively based on the subsequent embodiments of the application.

In step S120, the user samples in each searched sub data set are clustered to obtain user sample clusters, and the silhouette coefficient of the sub data set is then calculated based on the user sample clusters.

In this example embodiment, the user samples in each searched sub data set are clustered. The clustering model may be a distance-based clustering algorithm such as k-means, and the optimal model can be determined using the elbow method. Clustering forms compact or dispersed user sample clusters on the local, low-dimensional sub data set, which gives the cluster to which each user sample belongs in the corresponding sub data set.
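As an illustration of distance-based clustering and the elbow method, the following is a minimal k-means sketch in plain Python. The deterministic initialisation, the toy points, and the function names `assign` and `kmeans` are assumptions made for this example; the application does not prescribe a particular implementation. The elbow method is shown only as the within-cluster sum of squared distances (inertia) for several values of k, and choosing the bend is left to inspection.

```python
import math

def assign(points, centers):
    """Label each point with the index of its nearest center (Euclidean distance)."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c])) for p in points]

def kmeans(points, k, iterations=50):
    """Plain Lloyd's k-means. Returns (labels, centers, inertia)."""
    # Deterministic initialisation (a simplification of this sketch):
    # take k evenly spaced points from the input as the starting centers.
    centers = [tuple(points[(i * len(points)) // k]) for i in range(k)]
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, inertia

# Toy data: two visibly separated groups of user samples in a 2-feature subspace.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
inertias = {k: kmeans(points, k)[2] for k in (1, 2, 3)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia drops sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, so the elbow sits at k = 2.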

Calculating the contour coefficient of the sub data set based on the user sample clusters may consist of calculating the contour coefficient of each user sample in the user sample clusters and then averaging these values to obtain the contour coefficient of the sub data set.

The contour coefficient (commonly known as the silhouette coefficient) thus evaluates the clustering effect of the user samples in each sub data set, that is, whether the cohesion and separation of the user samples across all user sample clusters are good. The larger the contour coefficient, the better the clustering effect of the user samples in the sub data set.

In one embodiment, clustering the user samples in the searched sub data sets to obtain user sample clusters and then calculating the contour coefficients of the sub data sets based on the user sample clusters includes:

based on the formula:

For example, the user samples in the sub data set are clustered to obtain k user sample clusters, and the contour coefficient s(i) is calculated for each user sample vector in each cluster.

For a point i among the user sample vectors:

calculate a(i) = average(distance from vector i to the other points in the cluster to which it belongs);

calculate b(i) = min(average distance from vector i to all points of each cluster other than its own);

then the contour coefficient of vector i is:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

It can be seen that the contour coefficient takes values in [-1, 1], and a value approaching 1 means that both cohesion and separation are good. If s(i) is close to 1, sample i is reasonably clustered; if s(i) is close to -1, sample i would be better assigned to another cluster; if s(i) is approximately 0, sample i lies on the boundary between two clusters. Here a(i) is the average dissimilarity between vector i and the other points in the same cluster, and b(i) is the minimum, over the other clusters, of the average dissimilarity between vector i and the points of that cluster.

Averaging the contour coefficients of all points gives the overall contour coefficient of the clustering result for the data set being searched.
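As a non-limiting illustration, the per-sample quantities a(i) and b(i) and the averaged contour coefficient described above might be computed as in the following sketch. The function name `contour_coefficient` and the convention of assigning 0 to samples in single-member clusters are assumptions of this sketch.

```python
import math

def euclid(p, q):
    # Euclidean distance between two coordinate sequences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def contour_coefficient(points, labels):
    """Mean contour (silhouette) coefficient of a clustered sub data set."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        # Assumption: a sample alone in its cluster, or a data set with a
        # single cluster, contributes a coefficient of 0.
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance of p to itself is 0, hence the divisor len(own) - 1).
        a = sum(euclid(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(sum(euclid(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

A sub data set whose clusters are tight and far apart scores close to 1, whereas a labeling that mixes distant samples into the same cluster yields a negative value.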

In step S130, the sub data set with the contour coefficient greater than the predetermined threshold is obtained as the feature data set to be detected.

In this exemplary embodiment, the predetermined threshold is set according to how reasonable the clustering needs to be for anomaly monitoring; it may be, for example, 0.6 or 0.8. Taking the sub data sets whose contour coefficient is greater than the predetermined threshold as feature data sets to be detected yields feature data sets in which the clustering effect of the user samples meets the requirement, that is, the cohesion and separation of the user samples across all user sample clusters are good, and thereby determines the optimized cluster group to which each user sample belongs.

In one embodiment, among the sub data sets found at each level, those whose contour coefficient is greater than the predetermined threshold are retained, while those whose contour coefficient is smaller than the predetermined threshold are used as the user behavior feature data sets of the next level, from which the corresponding next-level sub data sets are searched; the search proceeds level by level in this way until the target level is reached.

Accordingly, if at some level between the first level and the target level no searched data set has a contour coefficient smaller than the predetermined threshold, the search ends immediately. In addition, sub data sets whose contour coefficient is greater than the predetermined threshold are not searched any further, which avoids repeated calculation.
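As a non-limiting illustration, the level-by-level search with threshold filtering described above might be sketched as follows. Several points are assumptions of this sketch rather than details taken from the application: each sub data set is formed by dropping exactly one feature from its parent, the target level is expressed as a minimum feature count `min_features`, and `score` is a hypothetical callback that returns the contour coefficient of the sub data set restricted to the given features.

```python
from itertools import combinations

def search_levels(features, score, threshold, min_features):
    """Return the feature subsets retained as data sets to be detected."""
    kept = []                     # subsets whose coefficient exceeds the threshold
    frontier = [tuple(features)]  # data sets still to be searched at this level
    seen = set()                  # subsets already scored (avoids recomputation)
    while frontier:
        next_frontier = []
        for parent in sorted(frontier):
            if len(parent) - 1 < min_features:
                continue  # the target level has been reached
            # Assumption: children drop exactly one feature from the parent.
            for child in combinations(parent, len(parent) - 1):
                if child in seen:
                    continue
                seen.add(child)
                if score(child) > threshold:
                    kept.append(child)           # retained, not searched further
                else:
                    next_frontier.append(child)  # searched again at the next level
        # An empty frontier ends the search before the target level.
        frontier = next_frontier
    return kept
```

For example, with a toy `score` that returns a low coefficient whenever a noisy feature is present, the search keeps only the subsets that exclude that feature and stops once nothing is left to expand.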

In step S140, each feature data set to be detected is input into an isolated forest anomaly detection model to obtain the predicted abnormal user samples in each feature data set to be detected.

In this exemplary embodiment, the isolated forest anomaly detection model is a pre-trained machine learning model based on the isolated forest (isolation forest) algorithm. The model is applied to each low-dimensional feature data set to be detected to obtain the predicted abnormal user samples in that space.

In this way, the isolated forest anomaly detection model selects abnormal user samples from low-dimensional feature subspaces (the feature data sets to be detected) with a good clustering effect, which keeps both the workload and the noise low.

The isolated forest anomaly detection model is trained by taking feature data set samples to be detected as input and adjusting the model parameters so that the model outputs the abnormal user samples in those input samples; training is complete when the prediction accuracy of the model reaches a predetermined threshold.
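As a non-limiting illustration, the generic isolation forest algorithm on which such a model is based might be sketched as follows. This is the textbook form of the algorithm, not the trained model of the application; the function names, the defaults `n_trees=100` and `sample_size=256`, and the tuple-based tree representation are assumptions of this sketch.

```python
import math
import random

def c_factor(n):
    # Average path length of an unsuccessful binary-search-tree lookup
    # over n points; used to normalize path lengths.
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    # Recursively split on a random feature at a random value.
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(row, tree, depth=0):
    # Depth at which the row lands, plus a correction for unsplit leaves.
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(row, left if row[dim] < split else right, depth + 1)

def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Scores in (0, 1); values near 1 indicate easily isolated samples."""
    rng = random.Random(seed)
    size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(max(size, 2)))
    trees = [build_tree(rng.sample(rows, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(size)
    return [2.0 ** (-sum(path_length(r, t) for t in trees) / (n_trees * norm))
            for r in rows]
```

Samples that are separated from the rest by only a few random splits have short average path lengths and therefore high scores; a cut-off on the score (not shown) would turn the scores into predicted abnormal user samples.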

In step S150, each feature data set to be detected is scored using its corresponding user sample clusters and predicted abnormal user samples, so as to determine the abnormal user samples in the first-level user behavior feature data set based on the scores.

In this exemplary embodiment, clustering yields a plurality of user sample clusters for each feature data set to be detected, so that each user sample in the subspace of each feature data set to be detected can be labeled with the group to which it belongs. At the same time, the isolated forest anomaly detection labels each user sample in the subspace of each feature data set to be detected as abnormal or not. Each feature data set to be detected is then given an anomaly score based on its predicted abnormal user samples and user sample clusters.

For example, the number of predicted abnormal user samples within the user sample clusters of a feature data set to be detected may be used as its score; the higher the score, the more abnormal the feature data set to be detected.

Furthermore, based on the scores, a feature data set to be detected with a high score can be determined to be abnormal, and the predicted abnormal user samples in that abnormal feature data set can then be determined to be abnormal user samples; an abnormal user sample may represent, for example, a fraud anomaly or a repayment-risk anomaly. This further ensures the reliability of selecting abnormal user samples based on feature subspaces (the feature data sets to be detected). The accuracy and reliability of abnormal user detection, and in particular of artificial-intelligence-based abnormal user detection, are thereby effectively improved.

In an embodiment, referring to fig. 3, scoring the feature data sets to be detected by using the user sample cluster corresponding to each feature data set to be detected and the predicted abnormal user sample, so as to determine the abnormal user sample in the first-level user behavior feature data set based on the scoring includes:

step S310, calculating the abnormal confidence of each feature data set to be detected by using the user sample clusters corresponding to each feature data set to be detected and the predicted abnormal user samples;

step S320, determining the abnormal user samples in the first-level user behavior feature data set based on the abnormal confidence.

The abnormal confidence is based on the distance from each predicted abnormal user sample in a feature data set to be detected to the center point of the cluster to which it belongs, and evaluates the degree of abnormality of the groups containing the predicted abnormal user samples in the sampling space corresponding to each feature data set to be detected. The lower the abnormal confidence, the more compact the clusters in the feature data set to be detected, and thus the higher the degree of abnormality. The abnormal feature data sets among the feature data sets to be detected can then be determined based on the abnormal confidence, and the predicted abnormal user samples in the abnormal feature data sets are determined as the final abnormal user samples in the first-level user behavior feature data set.

In one embodiment, calculating the abnormal confidence of each feature data set to be detected by using the user sample cluster corresponding to each feature data set to be detected and the predicted abnormal user sample includes:

based on the formula

Based on the formula

The abnormal confidence of each feature data set to be detected can thus be calculated from the distance between each predicted abnormal user sample and the center point of the user sample cluster to which it belongs, together with the distances between all non-abnormal user samples in that cluster and the center point.
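As a non-limiting illustration, the two distance quantities named above can be computed per cluster as in the following sketch. The formula of the application is not reproduced in this text, so the way the two quantities are combined in `abnormal_confidence` (the mean, over clusters, of the ratio of the non-abnormal distance to the abnormal distance) is purely a hypothetical stand-in, chosen only because it decreases as the cluster becomes more compact; the function names are likewise assumptions.

```python
import math

def euclid(p, q):
    # Euclidean distance between two coordinate sequences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_summary(points, labels, is_abnormal):
    """Per cluster: (mean center distance of predicted abnormal samples,
    mean center distance of the remaining samples)."""
    clusters = {}
    for p, l, flag in zip(points, labels, is_abnormal):
        clusters.setdefault(l, []).append((p, flag))
    summary = {}
    for l, members in clusters.items():
        # Center point of the cluster, computed over all of its members.
        center = tuple(sum(col) / len(members)
                       for col in zip(*(p for p, _ in members)))
        abnormal = [euclid(p, center) for p, f in members if f]
        normal = [euclid(p, center) for p, f in members if not f]
        if abnormal and normal:
            summary[l] = (sum(abnormal) / len(abnormal),
                          sum(normal) / len(normal))
    return summary

def abnormal_confidence(points, labels, is_abnormal):
    # HYPOTHETICAL combination, not the formula of the application:
    # mean over clusters of (non-abnormal distance / abnormal distance).
    summary = distance_summary(points, labels, is_abnormal)
    ratios = [n / a for a, n in summary.values() if a > 0]
    return sum(ratios) / len(ratios) if ratios else None
```

Under this stand-in, a compact cluster with a distant predicted abnormal sample yields a small value, which matches the qualitative statement that lower confidence accompanies more compact clusters.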

In one embodiment, determining an abnormal user sample in the first-level user behavior feature dataset based on the abnormal confidence level comprises:

determining the predicted abnormal user sample in the abnormal feature data set as an abnormal user sample in the first-level user behavior feature data set; further comprising storing the abnormal user sample into the blockchain.

Corresponding digest information is obtained from the abnormal user sample; specifically, the digest information is obtained by hashing the abnormal user sample, for example with the SHA-256 algorithm. Uploading the digest information to the blockchain ensures security as well as fairness and transparency for the user. The user device can download the digest information from the blockchain to verify whether the abnormal user sample has been tampered with.
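As a non-limiting illustration, the digest computation and the later tamper check might be sketched as follows. The serialization of a sample as canonical JSON, the field names in the usage example, and the function names `sample_digest` and `is_untampered` are assumptions of this sketch; uploading to and downloading from a blockchain is outside its scope.

```python
import hashlib
import hmac
import json

def sample_digest(sample: dict) -> str:
    """SHA-256 digest (hex) of an abnormal user sample."""
    # Assumption: the sample is serialized as JSON with sorted keys so that
    # the same content always produces the same bytes.
    payload = json.dumps(sample, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def is_untampered(sample: dict, stored_digest: str) -> bool:
    # Recompute the digest and compare it with the one retrieved from the chain.
    return hmac.compare_digest(sample_digest(sample), stored_digest)
```

A user device would call `is_untampered(sample, digest_from_chain)` with hypothetical data such as `{"user_id": "u-001", "age": 42}`; any change to the sample content changes the digest and makes the check fail.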

The predetermined confidence threshold is set according to the required degree of abnormality. In a feature data set to be detected whose abnormal confidence is lower than the predetermined confidence threshold, the clusters to which the predicted abnormal user samples belong are more compact, so the feature data set as a whole exhibits a higher degree of abnormality and can be taken as an abnormal feature data set. The predicted abnormal user samples in the abnormal feature data set can then be accurately determined as the abnormal user samples in the first-level user behavior feature data set.

In short, the embodiments of the application realize an isolated forest anomaly detection method based on evaluating the degree of anomaly deviation of subspace groups. Instead of detecting global anomalies directly, anomaly detection is performed on sampled feature spaces, and the degree of abnormality of the group containing each anomalous point in a sampled space is evaluated. The method effectively addresses the curse of dimensionality (the dimension disaster) while reliably locating anomaly clues.

The application also provides an abnormal user detection device. Referring to fig. 4, the abnormal user detecting apparatus may include a searching module 410, a calculating module 420, an obtaining module 430, a predicting module 440, and a determining module 450. Wherein:

the searching module 410 may be configured to search, starting from the first-level user behavior feature data set and proceeding level by level to the target level, a plurality of sub data sets from the user behavior feature data set of the previous level as the plurality of user behavior feature data sets of the next level, where the sub data sets at each level have the same number of features and any two of them differ in at least one feature;

the calculating module 420 may be configured to cluster the user samples in the searched sub data sets to obtain user sample clusters, and then calculate the contour coefficients of the sub data sets based on the user sample clusters;

the obtaining module 430 may be configured to obtain the sub data sets whose contour coefficient is greater than a predetermined threshold as feature data sets to be detected;

the prediction module 440 may be configured to input each feature data set to be detected into an isolated forest anomaly detection model, so as to obtain the predicted abnormal user samples in each feature data set to be detected;

the determining module 450 may be configured to score the feature data sets to be detected by using the user sample clusters corresponding to each feature data set to be detected and the predicted abnormal user samples, so as to determine the abnormal user samples in the first-level user behavior feature data set based on the scores.

The specific details of each module in the above abnormal user detection apparatus have been described in detail in the corresponding abnormal user detection method, and therefore are not described herein again.

It should be noted that although several modules or units of the device for performing actions are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

Moreover, although the steps of the methods herein are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.

In an exemplary embodiment of the present application, there is also provided an electronic device capable of implementing the above method.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device 500 according to this embodiment of the invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, and a bus 530 that couples various system components including the memory unit 520 and the processing unit 510.

Wherein the storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 510 may perform the following as shown in fig. 1:

The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.

Storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 500 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a client to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interface 550, and may also include a display unit 540 coupled to input/output (I/O) interface 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 over the bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present application.

In an exemplary embodiment of the present application, referring to fig. 6, there is also provided a computer readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

Referring to fig. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the client computing device, partly on the client device, as a stand-alone software package, partly on the client computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the client computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

Claims

1. An abnormal user detection method, comprising:

2. The method of claim 1, wherein the plurality of user behavior feature data sets of the subsequent stage comprise:

3. The method of claim 1, wherein the user behavior feature dataset comprises:

4. The method of claim 1, wherein after clustering the user samples in the searched sub data set to obtain a user sample cluster, calculating the profile coefficient of the sub data set based on the user sample cluster comprises:

based on the formula:

5. The method according to claim 1, wherein the scoring the feature data sets to be detected by using the user sample cluster corresponding to each feature data set to be detected and the predicted abnormal user sample to determine the abnormal user sample in the first-level user behavior feature data set based on the scoring comprises:

6. The method according to claim 5, wherein the calculating the abnormal confidence level of each feature data set to be detected by using the user sample cluster corresponding to each feature data set to be detected and the predicted abnormal user sample comprises:

based on the formula

7. The method of claim 5, wherein determining the anomalous user sample in the first level of user behavior feature dataset based on the anomalous confidence level comprises:

determining the predicted abnormal user sample in the abnormal characteristic data set as an abnormal user sample in the first-level user behavior characteristic data set;

further comprising storing the anomalous user samples into a blockchain.

8. An abnormal user detection apparatus, comprising:

9. A computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1-7.

10. An electronic device, comprising:

a processor; and

a memory for storing computer program instructions for the processor; wherein the processor is configured to perform the method of any one of claims 1-7 via execution of the computer program instructions.