CN111614690A

CN111614690A - Abnormal behavior detection method and device

Info

Publication number: CN111614690A
Application number: CN202010465586.4A
Authority: CN
Inventors: 汲丽; 钱沁莹; 魏国富; 葛胜利; 钟丹阳
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2020-09-01
Anticipated expiration: 2040-05-28
Also published as: CN111614690B

Abstract

The invention provides an abnormal behavior detection method and device. The method comprises the following steps: 1) acquiring original data corresponding to users to be detected, wherein the original data comprises the device attribute information, the risk control data and the service data of each user; 2) identifying first abnormal users among the users to be detected by using an ARIMA model based on the stationary sequences corresponding to the original data; 3) acquiring second abnormal users among the users to be detected by using a clustering algorithm based on the original data; 4) carrying out risk rating on the first abnormal users and the second abnormal users by using a density- and grid-based clustering algorithm to obtain the users to be detected with high abnormal risk. Applying the embodiment of the invention improves security.

Description

Abnormal behavior detection method and device

Technical Field

The invention relates to the technical field of network security, in particular to an abnormal behavior detection method and device.

Background

Today, as the Internet continues to develop, more and more people shop online, so e-commerce platforms often have a large number of visiting customers. To attract more shoppers, merchants on these platforms frequently launch promotional activities, including but not limited to cash vouchers, discount coupons, cash-back coupons and free gifts. While attracting normal users, these benefits also attract various bad actors, giving rise to attacks on e-commerce platforms such as promotion abuse ("wool pulling"), account theft, placing orders on behalf of customers, theft of membership benefits and leakage of personal information. How to identify these behaviors is therefore a technical problem that urgently needs to be solved.

In the prior art, the invention patent application No. 201911200324.9 discloses a method and device for identifying IP groups with abnormal user logins. The method comprises: acquiring login logs, counting the login logs in each preset period, and obtaining the login frequency sequences of all IPs; training an isolation forest algorithm with the login frequency sequences as the sample set to obtain a score for each IP address; obtaining the mode of the scores and the set of login logs corresponding to the mode; screening out, from the login frequency sequences, the frequency sequences of the login logs corresponding to the mode, and binarizing the screened frequency sequences to obtain a mark for each IP in each period; and, according to the mark of each IP in each period, obtaining the kappa coefficients among the data in the login log set by using the kappa algorithm, and taking the login log sets whose kappa coefficient is larger than a preset threshold as abnormal login groups. In this way, black-market ("black production") behavior in which the IPs are independent of one another can be recognized.

However, the prior art can only discover abnormal groups according to IP addresses. Because it uses few types of data, it cannot discover other types of abnormal groups, and security is therefore low.

Disclosure of Invention

The technical problem to be solved by the invention is how to improve security.

The invention solves the technical problems through the following technical means:

the invention provides an abnormal behavior detection method, which comprises the following steps:

1) acquiring original data corresponding to a user to be detected, wherein the original data comprises: the device attribute information, the risk control data and the service data of the user;

2) identifying a first abnormal user among the users to be detected by using an ARIMA model based on the stationary sequence corresponding to the original data;

3) acquiring a second abnormal user among the users to be detected by using a clustering algorithm based on the original data;

4) carrying out risk rating on the first abnormal user and the second abnormal user by using a density- and grid-based clustering algorithm to obtain the users to be detected with high abnormal risk.

By applying the embodiment of the invention, the original data comprises the device attribute information, the risk control data and the service data of the user. Because richer types of data are used, more dimensions are considered when identifying abnormal users, so more abnormal users can be found and security is further improved.

Optionally, the step 2) includes:

21) acquiring original data corresponding to a user to be detected; denoising the original data to obtain denoised original data; according to the operation session of the user to be detected, ordering the original data of the user according to the order of the operation scenes in the operation session to obtain the scene sequence of the user to be detected; and converting the scene sequence into a stationary sequence, wherein the original data comprises: the device attribute information, the risk control data and the service data of the user;

22) determining a corresponding baseline fixed value by using a simple average algorithm aiming at a stationary sequence with the data length smaller than a first preset threshold value in the stationary sequence;

23) acquiring an autocorrelation coefficient plot and a partial autocorrelation plot of each stationary sequence, and determining a current optimal level and a current optimal order according to the curve with the highest accuracy in the autocorrelation coefficient plot and the curve with the highest accuracy in the partial autocorrelation plot; establishing a current ARIMA model according to the optimal level and the optimal order; repeatedly iterating the current ARIMA model to obtain a target ARIMA model, and obtaining the predicted value of the user to be detected by using the target ARIMA model;

24) obtaining a predicted value of the user to be detected by using an exponential-smoothing baseline prediction model;

25) calculating the mean square error of the target ARIMA model and the mean square error of the exponential-smoothing baseline prediction model, taking the model with the lower mean square error as the baseline model, and acquiring the baseline value corresponding to the baseline model; and calculating the ratio of the actual value of each user to be detected to the baseline value, and judging the user to be detected to be a first abnormal user if the ratio is greater than a second preset threshold.

Optionally, before step 23), the method further includes:

calculating the scene missing rate in the scene sequence, and marking the scene sequence as an abnormal sequence if the scene missing rate falls outside a set confidence interval, wherein the confidence interval is calculated from the scene missing rates of all users to be detected.
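The missing-rate check can be sketched as follows. This is only an illustration under stated assumptions: the set of expected scenes is known in advance, the confidence interval is taken as the mean plus or minus z standard deviations of all users' missing rates, and the names `scene_missing_rate`, `flag_abnormal_sequences` and `z` are hypothetical and do not come from the patent.

```python
from statistics import mean, pstdev


def scene_missing_rate(observed, expected):
    """Fraction of expected scene codes that never occur in the observed sequence."""
    expected = set(expected)
    if not expected:
        return 0.0
    return len(expected - set(observed)) / len(expected)


def flag_abnormal_sequences(sequences, expected, z=1.96):
    """Return the user IDs whose missing rate lies outside mean +/- z * std.

    The interval is computed from the missing rates of all users (assumption:
    a normal-approximation interval; the source does not fix the formula).
    """
    rates = {user: scene_missing_rate(seq, expected) for user, seq in sequences.items()}
    values = list(rates.values())
    mu = mean(values)
    sigma = pstdev(values)
    low, high = mu - z * sigma, mu + z * sigma
    return sorted(user for user, rate in rates.items() if rate < low or rate > high)
```

With nine users who pass through every expected scene and one user who only reaches the payment scene, the last user's missing rate of 0.75 falls outside the interval and that sequence is marked abnormal.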

Optionally, before step 25), the method further comprises:

for a first abnormal user whose order placement rate or order return rate exceeds a third preset threshold, sending a verification code to the user for verification; or,

calculating a weighted probability value of the current operation session according to the occurrence probability of each scene in the historical operation sessions of each user and the main (trunk) scene categories in the current operation session of the user, and sending a verification code to the user for verification when the weighted probability value exceeds a fourth preset threshold;

and removing the user from the first abnormal users after the user passes the verification.

Optionally, the step 3) includes:

31) acquiring original data corresponding to a user to be detected; denoising the original data to obtain denoised original data; associating the original data according to specific data in the original data, and taking each set of associated original data as one sample, thereby obtaining a plurality of samples, wherein the specific data comprises one or a combination of a mobile phone number, a user ID and an IP address, and the original data comprises: the device attribute information, the risk control data and the service data of the user;

32) determining the number of clustering central points by using a weighted probability distribution model, and performing K-means clustering processing on the samples for a plurality of times based on the central points;

33) fitting the SSE (sum of squared errors) values obtained for the different k values to a function curve, calculating the minimum extreme point of the SSE values according to the second derivative of the function curve, and taking the k value corresponding to the minimum extreme point as the target k value;

34) calculating the distribution average of the sample points in each peer group and taking the average as the baseline of the peer group; calculating the deviation degree of each sample point according to the distance between the sample point and the baseline of its peer group; and taking the points whose deviation degree is larger than a fifth preset threshold as second abnormal users.
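Steps 33) and 34) can be illustrated with a short sketch. It rests on assumptions that are not in the patent: instead of fitting a continuous curve, the second derivative is approximated by the discrete second difference of SSE over evenly spaced k values, and the deviation degree is taken as the Euclidean distance to the group mean. The function names are hypothetical.

```python
import math


def pick_k_by_second_difference(sse_by_k):
    """Return the k where the discrete second difference of SSE is largest.

    sse_by_k maps each candidate k to the SSE of its clustering result.
    Assumes at least three evenly spaced k values.
    """
    ks = sorted(sse_by_k)
    best_k, best_curvature = None, -math.inf
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        curvature = sse_by_k[prev_k] - 2 * sse_by_k[k] + sse_by_k[next_k]
        if curvature > best_curvature:
            best_k, best_curvature = k, curvature
    return best_k


def peer_group_outliers(points, threshold):
    """Return the indices of points farther than threshold from the group mean."""
    dim = len(points[0])
    baseline = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    deviations = [math.dist(p, baseline) for p in points]
    return [i for i, d in enumerate(deviations) if d > threshold]
```

In practice `peer_group_outliers` would be applied separately to each cluster produced with the selected k, with the threshold playing the role of the fifth preset threshold.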

Optionally, the step 32) includes:

A. randomly selecting a sample from input samples as a first central point according to a current k value, taking the central point as a current central point, and adding the current central point into a central point set M;

B. calculating the distance between the current central point and other sample points, and adding the other sample points with the minimum distance into the current cluster corresponding to the current central point;

C. using a weighted probability distribution model, randomly taking one sample point from the sample points outside the current cluster as the current central point, and returning to execute step B until k central points are obtained, wherein k is a preset integer larger than two;

D. taking a k value different from the current k value as the current k value, and returning to execute step A until a plurality of k values are obtained.
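One common way to realize a weighted random choice of central points is the k-means++ seeding rule, in which a sample is drawn with probability proportional to its squared distance from the nearest central point already chosen. The patent does not state which weighting it uses, so the sketch below swaps in that rule as an assumption; `seed_centers` and `seeds_for_k_values` are hypothetical names.

```python
import math
import random


def seed_centers(samples, k, rng=None):
    """Pick k initial central points; assumes at least k distinct samples."""
    if rng is None:
        rng = random.Random(0)
    # Step A: the first central point is chosen uniformly at random.
    centers = [rng.choice(samples)]
    while len(centers) < k:
        # Step B: squared distance from every sample to its nearest chosen center.
        weights = [min(math.dist(s, c) ** 2 for c in centers) for s in samples]
        # Step C: draw the next center with probability proportional to that weight
        # (assumed k-means++ weighting; already chosen samples have weight zero).
        centers.append(rng.choices(samples, weights=weights, k=1)[0])
    return centers


def seeds_for_k_values(samples, k_values, rng=None):
    """Step D: repeat the seeding for several candidate k values."""
    if rng is None:
        rng = random.Random(0)
    return {k: seed_centers(samples, k, rng) for k in k_values}
```

Each set of seeds would then be handed to an ordinary K-means run, and the resulting SSE per k feeds the selection of the target k value.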

Optionally, the step 4) includes:

mapping the original data corresponding to the first abnormal user and the original data corresponding to the second abnormal user as data points into an n-dimensional space;

uniformly dividing the n-dimensional space into a grid structure, and acquiring the maximum connected clusters formed by connected dense grids in the n-dimensional space, wherein a dense grid is a grid containing more data points than a sixth preset threshold;

for each maximum connected cluster, reducing the dimension of the maximum connected cluster to a low-dimensional space for clustering again to obtain the minimum description of each maximum connected cluster;

and carrying out risk rating on the corresponding minimum description according to the number of the data points in the minimum description, and taking the data points contained in the minimum description with the risk rating higher than a preset level as the users to be detected with high abnormal risk.

Optionally, acquiring the maximum connected clusters formed by connected dense grids in the n-dimensional space includes:

counting the data points in the current grid, taking the current grid as a dense grid if the number of data points it contains is larger than the sixth preset threshold, and adding the current grid to the current cluster;

obtaining the dense grids adjacent to each grid in the current cluster, adding the adjacent dense grids into the current cluster, and returning to execute the step of obtaining the dense grids adjacent to each grid in the current cluster until no dense grids adjacent to the current cluster exist;

taking other dense grids in the n-dimensional space except the current grid as the current grid, and returning to execute the step of adding the current grid into the current cluster until all the dense grids in the n-dimensional space are added into the cluster;

and respectively taking the obtained clusters as corresponding maximum connected clusters.
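The search for maximum connected clusters can be sketched as a breadth-first traversal over dense grid cells. This is a minimal illustration, assuming equal-width cells of size `cell_size` in every dimension and treating two cells as adjacent when they differ by one step along exactly one dimension; both choices, and the function name, are assumptions rather than details given in the patent.

```python
from collections import Counter, deque


def max_connected_clusters(points, cell_size, density_threshold):
    """Group dense grid cells into maximal connected clusters.

    A cell is dense when it holds more than density_threshold points.
    """
    # Map each point to the integer coordinates of the cell that contains it.
    counts = Counter(tuple(int(x // cell_size) for x in p) for p in points)
    dense = {cell for cell, n in counts.items() if n > density_threshold}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        cluster, queue = [], deque([start])
        seen.add(start)
        while queue:
            cell = queue.popleft()
            cluster.append(cell)
            # Visit the two neighbors along every dimension.
            for axis in range(len(cell)):
                for step in (-1, 1):
                    neighbor = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if neighbor in dense and neighbor not in seen:
                        seen.add(neighbor)
                        queue.append(neighbor)
        clusters.append(sorted(cluster))
    return clusters
```

Cells that are not dense are ignored, so isolated points never join a cluster.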

Optionally, for each maximum connected cluster, reducing the dimension of the maximum connected cluster to a low-dimensional space for clustering again, to obtain a minimum description of each maximum connected cluster, including:

reducing the dimensions of the data points in the maximum connected cluster into a low-dimensional space, and regarding each target dimension reduction result in the target dimension reduction results obtained after dimension reduction, taking a dense unit in the target dimension reduction results as an initial unit and taking the initial unit as a current unit, wherein the low-dimensional space is a space with a dimension lower than that of n, and the initial unit is a grid unit with the maximum number of data points contained in the target dimension reduction results;

taking the current unit as a starting point, acquiring adjacent units communicated with the current unit in each dimension of a low-dimensional space, adding a set of the current unit and the adjacent units into a current area, taking the adjacent units adjacent to the current unit as the current unit, returning to execute the step of acquiring the adjacent units communicated with the current unit in each dimension of the low-dimensional space by taking the current unit as the starting point until all adjacent units corresponding to the current unit are acquired;

taking other dense units outside the current area as initial units, and returning to execute the step of taking the initial units as current units until no isolated grid units exist in the low-dimensional space;

for each current area of each low-dimensional space, judging whether the number of times that all dense units of the current area also appear in current areas of other low-dimensional spaces is larger than a sixth preset threshold, and if so, deleting the current area from the other low-dimensional spaces, until all current areas have been compared;

and taking the cluster formed by the compared grid units as the minimum description of the corresponding maximum connected cluster.

The embodiment of the invention provides an abnormal behavior detection device, which comprises:

a first obtaining module, configured to obtain original data corresponding to a user to be detected, wherein the original data comprises: the device attribute information, the risk control data and the service data of the user;

an identification module, configured to identify a first abnormal user among the users to be detected by using an ARIMA model based on the stationary sequence corresponding to the original data;

a second obtaining module, configured to acquire a second abnormal user among the users to be detected by using a clustering algorithm based on the original data;

and a rating module, configured to carry out risk rating on the first abnormal user and the second abnormal user by using a density- and grid-based clustering algorithm to obtain the users to be detected with high abnormal risk.

Optionally, the identification module is configured to:

Optionally, the second obtaining module is configured to:

Optionally, the rating module is configured to:

41) mapping the original data corresponding to the first abnormal user and the original data corresponding to the second abnormal user as data points into an n-dimensional space;

42) uniformly dividing the n-dimensional space into a grid structure, and acquiring the maximum connected clusters formed by connected dense grids in the n-dimensional space, wherein a dense grid is a grid containing more data points than a sixth preset threshold;

43) for each maximum connected cluster, reducing the dimension of the maximum connected cluster to a low-dimensional space for clustering again to obtain the minimum description of each maximum connected cluster;

44) carrying out risk rating on the corresponding minimum description according to the number of data points included in the minimum description, and taking the data points included in the minimum descriptions whose risk rating is higher than a preset level as the users to be detected with high abnormal risk.

Optionally, the rating module is configured to:

The invention has the advantages that:

In addition, for the first abnormal users identified from the scene sequences and the second abnormal users identified by running the K-Means clustering algorithm multiple times, a clustering algorithm combining data point density and grids is used to map the data to a low-dimensional space for comprehensive evaluation. Because a low-dimensional space shows the clustering relationships among the data points more clearly, the model results can be more accurate.

Drawings

Fig. 1 is a schematic flow chart of a method for detecting abnormal behavior according to an embodiment of the present invention;

Fig. 2 is a schematic diagram illustrating the principle of detecting a first abnormal user in an abnormal behavior detection method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram illustrating the principle of detecting a second abnormal user in an abnormal behavior detection method according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an abnormal behavior detection apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Fig. 1 is a schematic flow chart of a method for detecting abnormal behavior according to an embodiment of the present invention, where the method includes:

S101 (not shown in the figure): acquiring original data corresponding to a user to be detected, wherein the original data comprises: the device attribute information, the risk control data and the service data of the user.

Illustratively, data is first extracted from the service system of the platform, the text logs of the platform and other related data sources. Noise and abnormal data are then removed, such as data unrelated to user access behavior, test data, or access data of platforms other than the platform to be monitored, so that only the user click data generated when users access the platform to be monitored is retained. This user click data is the original data. The original data covers the following three aspects:

First: device attribute information, which is mainly used to identify whether a device is legitimate or compliant. For example, when a user operates an e-commerce APP, tracking points are embedded at important steps in the operation flow. Each time a user triggers a preset tracking point, the scene information of the user and the device attribute information are generated and recorded as one piece of data; fields are separated by commas, users are separated by line breaks, and the file is stored in csv format. In general, the fields of the device attribute information include: device ID (deviced_ID), device model number (product_names), scene information, Mac address (Mac_addresses), APP name (label), version number (versioning), APP size (appsize), first installation time (firstinstaltime), battery health (health), state of charge (gained), current state of charge (power), power standard (scale), state of charge (status), voltage (voltage), battery configuration (technology), screen resolution (Density), screen physical size (physical), screen resolution (resolution), memory size (multimedia), current cpu number (cpu), cpu frequency (bootmips), cpu architecture (processor), cpu total number (cpu_area), cpu attribute 1 (cpu attribute), cpu attribute 2 (product attribute), camera module (product_2), camera module (product_security_module 2), camera module (product_security_1), camera module (security_security), and camera module (security_module) including the camera module (product), and the like

(cydiaresult), root authority (root), sandbox (sandbox), simulator (simulator), static (static), maximum available sound volume (maxvolulaceavailability), maximum sound volume (maxvolulaceralearm), sound card information (maxvolumeddtmf), sound card information (maxvolurememusic), maximum notification sound volume (maxvolumentnotification), sound card information (maxvoluuse), maximum alarm sound volume (maxvolumering), sound card information (maxvolumesystem), sound card information (maxvoluvevolvacearability), sound card information (maxvolulacesound), sound card information (ringing), bluetooth history connection number (hasconnect), bluetooth information (bluetooth-visible or not), bluetooth information (discourse) or not, whether bluetooth information (bluetooth-available) is obtained, bluetooth information (bluetooth-supported function (2 bluetooth) or not, bluetooth information (bluetooth-supported by bluetooth), bluetooth information (2 bluetooth-supported by bluetooth function) or not (bluetooth information (bluetooth-supported by bluetooth) Bluetooth information-whether or not advertisement extension (isLeExtendedVertingSupported), bluetooth information-whether or not regular advertisement (isLePeriodicAdVertingSupported) is supported, bluetooth information-whether or not hybrid advertisement (isMultipleAdvermentSupported) is supported, bluetooth information-whether or not offload filtering (isOffloaddFilterSupported) is supported, bluetooth information-whether or not scan offload batch processing (isOffloaddScanBatchIngSupported) is supported, application number (APPLIST _ COUNT), System application number (Sypplist _ COUNT), Security Module Attribute (sensor _ COUNT), sim card information (sim _ mes), International Mobile Subscriber Identity (IMSI), International Mobile Equipment Identity (IMEI)

The security module attribute 1 and the security module attribute 2 are both self-contained information in the mobile phone system.

Second: risk control data, which comprises all request information and personal information of the user. Each operation of the user on the e-commerce APP is recorded as one piece of data with src_user as the primary key; fields are separated by commas, users are separated by line breaks, and the file is stored in csv format. The fields of the risk control data include: user name (src_user), timestamp (event_timestamp), browser allocation ID (browser_client_ID), business link (business_hierarchy), cell phone number (cellphone_no), cookie_ID (), time channel (ch_event_channel), event type (ch_event_type), system (ch_system), IP address (ipaddr), city where the IP is located (ipip_city), province where the IP is located (ipip_provision), digital identity recognition frame (openid), user agent (user agent), hit rule number (count), login channel (login_channel), application version information (APP_version), openname (openada_name), hit rule number (count), event (event_ID), rule group flag (flag), validity message (APP_version), whether the device is a virtual device (event_device), whether the device is a virtual device (virtual device), authentication method (login_way), login channel (login_channel)

Third: business data, which comprises all information about the user's orders, returned orders, order details and the like. Each operation of the user on an order is recorded as one piece of data with src_user as the primary key; fields are separated by commas, users are separated by line breaks, and the file is stored in csv format. The fields of the business data include: user name (src_user), timestamp (eval_timestamp), order number (order_id), telephone number (cellphone_no), order scene (ch_business_hierarchy), system (ch_system), IP address (ipaddr), IP city (ipip_city), IP province (ipip_province), digital identification frame (openid), user agent (user), commodity set (goods_set), coupon name (), order channel (event_channel), ordering channel (order_channel), order commodity amount (order_amount), consignee mobile phone number (order_cell_no), order number (order_no), order commodity amount (order_qty), order type (order_type), consignee address (receipt_address), restaurant name (reserve_name), commodity name (goods_name), business ring (ch_business_security), business system (ch_system), event evaluation status code (event_code), login time (login_time), authentication mode (login_way), order time (order_time), request time (SSd), and coupon (OID)

Then, the scene information of each user to be detected is extracted. The scene information in the original data comprises: registration (001); registration, acquiring a verification code (002); login (003); login, acquiring a verification code (004); logout (005); placing an order for a commodity (006); order submission verification (007); payment verification (008); refund verification (009); receipt completion (010); sign-in verification (011); benefit code acquisition verification (012); and benefit code use verification (013).

S102 (not shown in the figure): identifying a first abnormal user among the users to be detected by using an ARIMA model based on the stationary sequence corresponding to the original data;

Fig. 2 is a schematic diagram illustrating the principle of detecting the first abnormal user in the abnormal behavior detection method according to the embodiment of the present invention. As shown in Fig. 2:

21) Scene information is scattered throughout the collected original data. The records can be associated by the same device ID to gather the scene information, and the scenes are then ordered according to the scene corresponding to each operation. For example, after a user normally completes an order, the operation scenes in the same operation session can be arranged in time order, such as the sequence of one order placement by a customer (003-. In this step, the second-degree information of the original data is extracted and added to the feature engineering as an "individual feature value", which enriches the information in the features and improves the prediction accuracy of the model.
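Gathering the scattered scene information into per-device scene sequences can be sketched as follows. This is only an illustration: it assumes each record has already been reduced to a (device ID, timestamp, scene code) triple and that all records of one device belong to a single operation session. The function name and record layout are hypothetical.

```python
from collections import defaultdict


def build_scene_sequences(records):
    """Group records by device ID and order the scene codes by timestamp.

    records: iterable of (device_id, timestamp, scene_code) triples.
    Returns a dict mapping each device ID to its time-ordered scene codes.
    """
    by_device = defaultdict(list)
    for device_id, timestamp, scene in records:
        by_device[device_id].append((timestamp, scene))
    return {
        device_id: [scene for _, scene in sorted(events)]
        for device_id, events in by_device.items()
    }
```

A real pipeline would additionally split each device's events into sessions, for example by a maximum gap between consecutive timestamps, before building the sequences.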

A histogram can then be used for data visualization, and the autocorrelation plot and partial autocorrelation plot of each scene sequence are obtained. The stationarity of each scene sequence is then judged from these two plots using an existing algorithm: if both plots are tailing, that is, they show a decaying trend but do not reach 0, the scene sequence is non-stationary, and the non-stationary scene sequence data can be differenced to obtain the stationary sequence corresponding to the scene sequence. It will be appreciated that a sequence that is already stationary does not require differencing.
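Differencing itself is a simple operation, sketched below for a numeric series. The function name `difference` and the `order` parameter are illustrative and not taken from the patent; how scene codes are turned into numeric values is left open here.

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times to a numeric series."""
    out = list(series)
    for _ in range(order):
        # Each pass replaces the series with the changes between neighbors.
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out
```

A series with a linear trend such as 2, 4, 6, 8 becomes the constant series 2, 2, 2 after one pass, which is why differencing is the usual route from a trending sequence to a stationary one. Each pass shortens the series by one element.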

22) For the part that does not meet the data-length requirement, that is, the stationary sequences whose data length is smaller than the first preset threshold, a simple-average baseline prediction is used. Such sequences are too short to be modeled, but for the completeness of the data, the actual scene values at each previous moment of the stationary sequence of that ID are averaged with a simple average, such as the arithmetic mean, geometric mean or weighted mean, to obtain a fixed baseline value. So that the whole data set is fully used, the threshold is set to the length that covers 90% of the whole data, namely 7.
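The three simple averages and the length check can be sketched as follows. The default threshold of 7 comes from the text above; the function names and the choice of returning `None` for sequences that are long enough for the ARIMA branch are assumptions made for illustration.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean; only defined here for strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def short_sequence_baseline(values, min_length=7, method=arithmetic_mean):
    """Return a fixed baseline for sequences shorter than min_length.

    Sequences of length min_length or more return None, meaning they are
    left to the model-based branch instead.
    """
    if len(values) >= min_length:
        return None
    return method(values)
```

For example, a three-point sequence 2, 4, 6 receives the fixed baseline 4.0, while an eight-point sequence is passed on to model training.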

23) For the stationary sequences whose data length is not less than the first preset threshold, a target ARIMA (Autoregressive Integrated Moving Average) model is trained, and the target ARIMA model is used to predict the predicted value of the user to be detected.

The autocorrelation coefficient plot and the partial autocorrelation plot of each stationary sequence are acquired. The autocorrelation function compares an ordered sequence of random variables with itself and reflects the correlation between values of the same sequence at different time lags. The partial autocorrelation function calculates the strict correlation between two variables, that is, the degree of correlation between them after the interference of intermediate variables has been removed. The required parameters are: dataset, the time series data; p_values, the range of p values, of list or array type; d_values, the range of d values, of list or array type; q_values, the range of q values, of list or array type; and train_pro, the proportion of training data.
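The sample autocorrelation function described here can be computed directly, as in the sketch below. It uses the common estimator that divides the lagged covariance sum by the lag-0 sum; the function name and `max_lag` parameter are illustrative. The partial autocorrelation function is not covered by this sketch.

```python
def autocorrelation(series, max_lag):
    """Return the sample autocorrelation coefficients for lags 0..max_lag."""
    n = len(series)
    mu = sum(series) / n
    centered = [x - mu for x in series]
    variance_sum = sum(c * c for c in centered)
    if variance_sum == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / variance_sum
        for lag in range(max_lag + 1)
    ]
```

Plotting these coefficients against the lag gives the autocorrelation plot from which the moving-average order q is usually read, while the partial autocorrelation plot plays the same role for the autoregressive order p.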

Then, the autocorrelation function and the partial autocorrelation function can be calculated and plotted with tools such as SPSS or MATLAB, and the plots are used to judge which values p and q should take. The ordinate of each plot is the correlation coefficient and the abscissa is the order; a certain periodic relationship can be seen between the order and the correlation coefficient, and p and q are taken as the minimum number of cycles. If these values of p and q fail the model test in subsequent calculation, p and q are readjusted and the calculation is repeated. The current ARIMA model is iterated in this way to obtain the target ARIMA model, which is then used to predict the predicted value of the user to be detected.

24) The baseline prediction function for the time series data is the exponential-smoothing baseline prediction model. The required parameters are: the time series data; thld, the deviation function; p_max, the maximum value of p; d_max, the maximum value of d; q_max, the maximum value of q; and train_pro, the proportion of training samples.

The exponential-smoothing baseline prediction model is then used to predict the predicted value of the user to be detected.

It should be emphasized that the exponential-smoothing baseline prediction model is an existing model.
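For reference, simple (single) exponential smoothing can be sketched in a few lines. The patent does not say which variant of exponential smoothing or which smoothing factor it uses, so the single-parameter form, the default `alpha` of 0.5 and the function name below are assumptions.

```python
def exponential_smoothing_forecasts(series, alpha=0.5):
    """One-step-ahead forecasts from simple exponential smoothing.

    forecasts[t] is the prediction for series[t]; the final element is the
    prediction for the next, not yet observed, value. The level is
    initialized with the first observation.
    """
    level = series[0]
    forecasts = [level]
    for x in series:
        # New level = weighted blend of the latest value and the old level.
        level = alpha * x + (1 - alpha) * level
        forecasts.append(level)
    return forecasts
```

For the series 10, 20, 30 with `alpha=0.5` the forecasts are 10, 10, 15 and then 22.5 for the next step, which shows how the baseline lags behind a rising series.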

25) Selecting the better of the target ARIMA model and the exponential smoothing prediction baseline model as the baseline model, and identifying a first abnormal user among the users to be detected by using the baseline model.

Then, the mean square error of the exponential smoothing prediction baseline model is obtained and compared with the mean square error of the target ARIMA model; the model with the smaller mean square error is taken as the baseline model.
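The selection by mean square error can be sketched as follows. This is a minimal illustration under stated assumptions: simple exponential smoothing with a fixed smoothing factor `alpha` stands in for the exponential smoothing prediction baseline model, the ARIMA predictions are assumed to be supplied by some other routine, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence


def exp_smoothing_forecast(series: Sequence[float], alpha: float = 0.5) -> List[float]:
    """One-step-ahead forecasts: forecasts[t] predicts series[t] from series[:t]."""
    forecasts = [series[0]]      # no history yet, so seed with the first value
    level = series[0]
    for value in series[1:]:
        forecasts.append(level)  # predict before the new value is seen
        level = alpha * value + (1 - alpha) * level
    return forecasts


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def select_baseline(actual: Sequence[float], candidates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the candidate model with the smallest mean square error."""
    return min(candidates, key=lambda name: mean_squared_error(actual, candidates[name]))
```

For example, `select_baseline(actual, {"arima": arima_pred, "exp_smoothing": exp_smoothing_forecast(actual)})` returns the key of whichever prediction series tracks `actual` more closely.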

And finally, detecting abnormal users by using the baseline model.

During detection, the ratio of the actual value to the baseline value is calculated for each user to be detected, and if the ratio is greater than a second preset threshold, the user to be detected is judged to be an abnormal user.
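The ratio test can be sketched in a few lines. This is a minimal illustration; the function name, the dictionary layout and the handling of non-positive baselines are assumptions, not part of the described method.

```python
from typing import Dict, List


def flag_abnormal_users(
    actual: Dict[str, float], baseline: Dict[str, float], threshold: float
) -> List[str]:
    """Return the users whose actual/baseline ratio exceeds the threshold."""
    flagged = []
    for user, value in actual.items():
        base = baseline[user]
        if base <= 0:
            continue  # assumption: the ratio is undefined, so the user is skipped
        if value / base > threshold:
            flagged.append(user)
    return flagged
```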

By applying the embodiment of the invention, stationary sequences are generated from the original data of all users to be detected and the target ARIMA model is trained on them; the optimal baseline model is then selected by comparison with the exponential smoothing prediction baseline model. Combined with a simple averaging algorithm, the baseline model covers all users to be detected, so the detection range is wider and more abnormal behaviors can be detected.

The embodiment of the invention starts from the user's access habits and finds abnormalities through the user's individual behavior: the currently input behavior data are compared with the user's previous regular behavior data, so that abnormalities are output in time.

Moreover, internal threats have long been the most prevalent type of network attack. Internal abnormal events are mostly small-probability events whose detection must be very accurate; detection has long relied on known rules, with the rules and experience of many experts built into the detection engine. The UEBA technology used in the prior art has long depended on expert experience, its coverage is limited, and its rule thresholds are determined by known rules, so the recall rate is low and the accuracy needs to be improved.

In the embodiment of the invention, the mean square error is taken as the evaluation standard, and the optimal one of the ARIMA model and the moving average model with optimal parameters is taken as the baseline. The ratio of the actual value to the baseline value is used as the deviation value, and a dynamic behavior baseline is established to discover deviations of an internal user from his or her personal normal pattern. The risk level of the user is judged from the accumulated risk value, and by moving from a focus on data content alone to context and behavior analysis, abnormalities are found more accurately.

Further, when the scene sequence is extracted, some data are always missing because of limitations such as hardware and privacy, so a scene missing rate can be calculated from the number of missing scenes in the scene sequence, for example, in the 003-. The scene missing rates of all users to be detected are then collected, and a confidence interval for the missing rate is calculated, for example by the quartile method. When the missing rate falls outside the confidence interval, the information loss is considered serious, the mark is 0, and a risk exists; within the interval, the loss is tolerable and the mark is 1. If too many scenes are missing, the scene sequence can be marked as an abnormal sequence.
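One way to read the quartile method is as an interquartile-range fence around the missing rates. The sketch below follows that reading; the fence multiplier of 1.5, the use of `None` for a missing scene and the function names are assumptions, since the text does not specify them.

```python
from statistics import quantiles
from typing import Dict, Optional, Sequence


def scene_missing_rate(scenes: Sequence[Optional[str]]) -> float:
    """Fraction of scenes in the sequence that are missing (represented as None)."""
    return sum(scene is None for scene in scenes) / len(scenes)


def missing_marks(rates: Dict[str, float], k: float = 1.5) -> Dict[str, int]:
    """Mark 1 if a user's missing rate lies inside the quartile fence, else 0."""
    q1, _, q3 = quantiles(list(rates.values()), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr  # assumed interval: Tukey-style fences
    return {user: 1 if low <= rate <= high else 0 for user, rate in rates.items()}
```

A user marked 0 has a missing rate far from that of the other users and would be treated as carrying risk.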

In the model feature engineering, a time guide advance condition is added to the sequence processing to counteract the meaningless influence of out-of-order data on the model.

The dynamic baseline form is creatively applied to the processing of the source data: the mean square error is taken as the evaluation standard, and the optimal one of the ARIMA model and the basic model with optimal parameters is taken as the baseline, so that the result is more readable and reasonable.

Still further, the method further comprises:

for a user whose order placing rate or order return rate exceeds a third preset threshold, sending a verification code to the user for verification; or,

calculating a weighted probability value of the current operation session according to the occurrence probability of each scene in the historical operation sessions of each user and the scene category of the trunk in the current operation session of the user, and sending a verification code to the user for verification when the weighted probability value exceeds a fourth preset threshold.

The embodiment of the invention combines the service data to establish a "whitening mechanism" that immediately identifies users who cannot be accurately classified, thereby reducing the number of user complaints. Because the data sources are diversified and the whitening mechanism is in place, the probability of wrongly flagging a "good person" is greatly reduced and the system is more intelligent.

S103 (not shown in the figure): acquiring a second abnormal user from the users to be detected by using a clustering algorithm based on the original data.

For example, fig. 3 is a schematic diagram of the principle of detecting the first abnormal user in the abnormal behavior detection method according to an embodiment of the present invention. As shown in fig. 3, the raw data may first be denoised and then associated according to specific data in the raw data; each set of associated raw data is used as one sample, so that a plurality of samples are obtained. The specific data includes one or a combination of a mobile phone number, a user ID and an IP address.

For example, the set of device attribute information, wind control data and service data (one or a combination of these) that use or correspond to the same mobile phone number is taken as one sample; the set of data corresponding to the same user ID may likewise be taken as one sample. This avoids the large processing load and low efficiency that arise when multiple pieces or types of data of the same user are processed as separate data.

31) Determining the number of clustering central points by using a weighted probability distribution model, and carrying out K-means clustering processing on the samples for a plurality of times based on the central points.

A: in the iteration corresponding to the current k value, randomly selecting a sample A from the 1000 input samples as the first central point, taking this central point as the current central point, and adding it to the central point set M;

B: calculating the distance between the current central point and the other sample points, and adding the other sample points with the minimum distance to the current cluster corresponding to the current central point to obtain a first cluster;

C: using a weighted probability distribution model, randomly taking one sample point from the sample points outside the current cluster as a second central point, and then repeating steps A and B to obtain the cluster corresponding to the second central point, with the density of the sample points taken as evidence that they belong to the same cluster. The sample points are sorted by their probability of belonging to the same cluster: the closer the distance, the higher the probability that a sample point belongs to the same cluster. The sample points whose distance is less than the set distance are then collected into the cluster corresponding to the central point, and these steps are repeated until k central points are obtained, where k is a preset integer greater than two;

D: taking a k value different from the current k value as the new current k value, for example k+1, k-1, or a randomly selected different value, and then repeating steps A to D until a plurality of k values are obtained.
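The text does not spell out the weighting used when the next central point is drawn. One well-known weighted probability scheme is k-means++ seeding, in which each candidate is drawn with probability proportional to its squared distance from the nearest central point already chosen. The sketch below uses that scheme as a stand-in, which is an assumption and not necessarily the weighting intended here; the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_seed_centers(points: Sequence[Point], k: int, rng: random.Random) -> List[Point]:
    """Pick k initial central points; the first uniformly, the rest by weighted draw."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight of a point = squared distance to its nearest chosen central point,
        # so points that are already central points have weight 0.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            break  # every point coincides with a chosen central point
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Running the seeding once per candidate k value, followed by ordinary K-means iterations, gives one clustering per k value for the comparison in the next step.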

The embodiment of the invention emphasizes group analysis. For example, a hacker may register a large number of accounts to pull wool by collecting birthday or festival coupons. Such accounts show quite abnormal group access: many of them skip the login step, have no coupon code verification step, or skip the payment verification code; some even skip the goods receiving step in the background and enter the refund step directly, with no goods returned after the refund. An early warning is raised for such behavior. The embodiment of the invention takes the behavior characteristics or physical attribute characteristics of a user as input data and takes the deviation degree, abnormal date, peer group baseline, group ranking and the like as output targets, which can further serve as indexes that directly show how abnormal an abnormal user is. The K-means algorithm is then selected for clustering. Because the K-means algorithm needs randomly selected initial central points, an improperly chosen central point can lead to a poor clustering result or slow convergence.

In a P2P network environment, multiple computers connected to each other are in a peer-to-peer relationship, and each computer has the same function and no master-slave relationship, so the computers in the network are called peer-to-peer computers, and in a peer-to-peer computer, one computer can be used as a server to set shared resources for use by other computers in the network, or as a workstation, and the entire network does not generally rely on a dedicated centralized server or a dedicated workstation. Therefore, in the embodiment of the present invention, the peer group refers to sample points with similar properties.

32) Determining a target k value according to the minimum value of the SSE values after each clustering process.

The SSE (sum of squared errors) corresponding to each k value is calculated, and the SSE of each k value is then mapped into a two-dimensional coordinate system with the k value as the horizontal axis and the SSE value as the vertical axis. A curve is fitted to the points in this coordinate system to obtain a function curve, the minimum extreme point of the SSE value is calculated from the second derivative of the function curve, and the k value corresponding to that minimum extreme point is taken as the target k value.

It is understood that a point where the first derivative is equal to 0 and the second derivative is greater than 0 is a minimum point, and a point where the first derivative is equal to 0 and the second derivative is less than 0 is a maximum point.
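The derivative test can be approximated on the discrete (k, SSE) points with finite differences instead of a fitted curve. The sketch below makes that substitution, which is an assumption, and also assumes evenly spaced k values; the function name `target_k` and the fallback to the global minimum are hypothetical.

```python
from typing import Dict


def target_k(sse_by_k: Dict[int, float]) -> int:
    """Return the k at the first discrete local minimum of the SSE values."""
    ks = sorted(sse_by_k)
    for prev, cur, nxt in zip(ks, ks[1:], ks[2:]):
        slope_left = sse_by_k[cur] - sse_by_k[prev]   # first difference before cur
        slope_right = sse_by_k[nxt] - sse_by_k[cur]   # first difference after cur
        curvature = slope_right - slope_left          # second difference at cur
        # The slope changes from negative to non-negative and the curvature is positive.
        if slope_left < 0 <= slope_right and curvature > 0:
            return cur
    # Assumed fallback when no interior minimum exists: the smallest SSE overall.
    return min(ks, key=lambda k: sse_by_k[k])
```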

34) Taking each cluster obtained by the clustering algorithm corresponding to the target k value as a peer group; for each peer group, obtaining the deviation degree of each sample point according to the ratio of the sample point to the other sample points in the peer group, and obtaining abnormal points according to the deviation degree.

Specifically, one cluster is used as a peer group, the distribution average value of the sample points in the peer group is calculated for each peer group, the average value is used as the base line of the peer group, the ratio of each sample point in the peer group to the base line of the peer group is calculated, the ratio is used as the deviation degree of the sample point from the base line of the peer group, and the point of which the absolute value of the deviation degree is greater than a fifth preset threshold value is used as a second abnormal point.
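The peer group baseline and deviation degree can be sketched for a single numeric feature as follows. This is a minimal illustration; the nested dictionary layout and the function name are assumptions, and a real sample point would usually carry several features.

```python
from typing import Dict


def peer_group_outliers(
    groups: Dict[str, Dict[str, float]], threshold: float
) -> Dict[str, float]:
    """Return {user: deviation} for users whose |value / group mean| exceeds the threshold."""
    outliers = {}
    for members in groups.values():
        baseline = sum(members.values()) / len(members)  # peer group baseline (mean)
        if baseline == 0:
            continue  # assumption: the ratio is undefined, so the group is skipped
        for user, value in members.items():
            deviation = value / baseline
            if abs(deviation) > threshold:
                outliers[user] = deviation
    return outliers
```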

The deviation threshold is user-configurable, and the model in the embodiment of the invention provides a default value.

By applying the embodiment of the invention, the peer-to-peer group division is carried out by adopting the clustering algorithm according to the equipment attribute information, the wind control data and the service data of the user, and then the abnormal points are found according to the deviation degree of each sample point in the peer-to-peer group.

S104 (not shown in the figure): performing risk rating on the first abnormal user and the second abnormal user by using a clustering algorithm based on density and grids to obtain the users to be detected with high abnormal risk.

For example, when users are analyzed, a score is assigned according to the degree of deviation, and all users are treated uniformly at this stage. For an efficient implementation, the CLIQUE algorithm, a clustering algorithm that combines density and grids, is used; it can find clusters of arbitrary shape and can also process high-dimensional data.

The algorithm is implemented by the following steps:

41) Mapping the original data corresponding to the first abnormal user and the original data corresponding to the second abnormal user, as data points, into an n-dimensional space.

In general, n can be taken as the number of data items contained in the original data of the user to be detected. For example, if the original data in step S101 includes tens of items of data, n takes the corresponding number of items.

Each user to be detected includes a number of items of data that together constitute the user's data point, which can then be mapped into an n-dimensional space.

42) Dividing the n-dimensional space: each dimension is divided equally, so that the full space is divided into mutually disjoint grid units.

According to a given sixth preset threshold, the density of each grid is calculated from the number of data points it contains: grids with a density greater than 4 are classed as dense grids and grids with a density less than or equal to 4 as non-dense grids, and the initial state of all grids is set to "unprocessed". (4 is taken as the density threshold; repeated tests show that this value is reasonable: if the value is too high, clusters are lost, and if it is too low, clusters are merged.)

For the input data, the dense units of a low-dimensional space, such as a p-dimensional space, are determined first in a bottom-up manner. All grids are traversed; if the current grid is not in the unprocessed state, the next grid is processed; if it is in the unprocessed state, the maximum connected cluster corresponding to the grid is obtained by the following steps until all adjacent grids have been processed:
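The equal division of each dimension and the density test can be sketched as follows. This is a minimal illustration; the number of divisions `bins`, the use of the observed minimum and maximum as the extent of each dimension, and the function names are assumptions.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

Cell = Tuple[int, ...]


def grid_cell(point: Sequence[float], lows: List[float], highs: List[float], bins: int) -> Cell:
    """Index of the grid unit containing `point` when each dimension has `bins` equal parts."""
    cell = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / bins
        # Points on the upper boundary fall into the last unit; a flat dimension maps to 0.
        index = 0 if width == 0 else min(int((x - lo) / width), bins - 1)
        cell.append(index)
    return tuple(cell)


def dense_cells(points: Sequence[Sequence[float]], bins: int, density_threshold: int = 4) -> Set[Cell]:
    """Grid units whose point count is greater than the density threshold."""
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    counts = Counter(grid_cell(p, lows, highs, bins) for p in points)
    return {cell for cell, count in counts.items() if count > density_threshold}
```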

A: changing the mark of the current grid to "processed"; if the current grid is not a dense grid, processing the next grid;

B: if the current grid is a dense grid, giving it a new current cluster mark, creating a queue, and placing the dense grid in the queue;

C: judging whether the queue is empty; if the queue is empty, processing the next grid; if the queue is not empty, performing the following processing:

taking the grid at the head of the queue and checking all of its unprocessed adjacent grids;

changing the mark of each such adjacent grid to "processed";

if the adjacent grid is a dense grid, giving it the current cluster mark and adding it to the queue;

repeating step C until the queue no longer grows;

D: after the check of the density-connected region is finished, the dense grids with the same mark form the density-connected region, which is a maximum connected cluster;

E: changing to a new current cluster mark and repeating steps A to D to search for the next cluster; this is repeated until the data set formed by all the original data has been traversed and every data element has been marked with the cluster mark value of its grid.
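The queue-based search in steps A to E is a breadth-first traversal over dense grids. The sketch below is a minimal illustration that iterates over the dense grids only, so non-dense grids never enter the queue; treating two grids as adjacent when they differ by one in exactly one dimension is an assumption, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterator, Set, Tuple

Cell = Tuple[int, ...]


def neighbours(cell: Cell) -> Iterator[Cell]:
    """Grids that share a face with `cell` (differ by 1 in exactly one dimension)."""
    for d in range(len(cell)):
        for step in (-1, 1):
            yield cell[:d] + (cell[d] + step,) + cell[d + 1:]


def connected_dense_clusters(dense: Set[Cell]) -> Dict[Cell, int]:
    """Give every dense grid the mark of its maximum connected cluster."""
    marks: Dict[Cell, int] = {}
    processed: Set[Cell] = set()
    cluster_mark = 0
    for start in sorted(dense):
        if start in processed:
            continue
        processed.add(start)
        cluster_mark += 1                 # new current cluster mark
        marks[start] = cluster_mark
        queue = deque([start])
        while queue:                      # step C: runs until the queue is empty
            cell = queue.popleft()
            for nb in neighbours(cell):
                if nb in dense and nb not in processed:
                    processed.add(nb)
                    marks[nb] = cluster_mark
                    queue.append(nb)
    return marks
```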

The marking result of the p-dimensional space is then used as input and the same process is carried out in the (p+1)-dimensional space, and so on until the n-dimensional space has been processed. That is, the embodiment of the present invention first confirms cluster membership from small dense units, recursing step by step from the low-dimensional space to the n-dimensional space.

43) In a clustering task, considering all attributes of the data set sometimes reduces clustering efficiency and effectiveness, because the samples may have a true cluster structure on some attributes while being randomly distributed on others. Clustering in low-dimensional subspaces of the high-dimensional data space is therefore very important, and different subspaces can yield different cluster partitions.

The data points in the maximum connected cluster are reduced into two-dimensional spaces. For each target dimension reduction result obtained after dimension reduction, the dense unit containing the largest number of data points, that is, the dense unit with the highest density, is taken as the initial unit, and the initial unit is taken as the current unit.

Starting from the current unit, the adjacent units connected to the current unit are acquired in each dimension of the two-dimensional space, and the set of the current unit and its adjacent units is added to the current area. Each adjacent unit is then taken as the current unit, and the step of acquiring connected adjacent units in each dimension of the low-dimensional space is executed again, until all adjacent units have been acquired. In other words, by continued iteration, all dense units that can be connected together form the current area.

Then, another dense unit outside the current area is taken as the initial unit, and the step of taking the initial unit as the current unit is executed again to obtain further current areas, until no isolated grid unit remains in the low-dimensional space. In this way the current areas of the maximum connected cluster are obtained in each two-dimensional space.

Then, for each current area of each low-dimensional space, it is judged whether the number of times that all the dense units in the current area also appear in current areas of other low-dimensional spaces is greater than a sixth preset threshold; if so, the current area is deleted from the other low-dimensional spaces, and this continues until all current areas have been compared. The general process is as follows: the maximal region containing the fewest dense units is found first; if every unit in this region is already covered by other regions, the region is removed from the cluster's set of maximal regions; this is repeated until no further such region can be found.

Obtaining the "minimal description" of each maximum connected cluster is the core of the algorithm and determines whether the resulting clustering model is interpretable. The inventors found that the prior art usually combines a large number of rules for abnormal user detection, but after the rules are combined it is not known which rule contributes more, which rules are useful and which are not, and no reasonable explanation is available. This is a major obstacle to abnormal user detection: a great deal of time is spent testing combined rules, and work efficiency is greatly reduced. After mapping the high-dimensional data into low-dimensional spaces, the embodiment of the invention can delete the areas that are in contact with other current areas in other low-dimensional spaces. This avoids interference from poorly correlated data points and improves accuracy. It also indicates that the low-dimensional mapping direction corresponding to a deleted area has poor discriminating power, so the features corresponding to that mapping direction can be deleted; which features are useful and which are not can thus be judged accurately, improving the interpretability of the model.

The process of the embodiment of the present invention takes as input a plurality of mutually exclusive clusters (dense grid cell sets) in a certain k-dimensional subspace S, and outputs a "minimum description" of the clusters, which is a set of regions, wherein each region must be included in the dense grid cell sets, and each dense cell in the dense cell sets should belong to at least one of the regions, which is obviously an NP-hard problem. Therefore, the embodiment of the invention is improved by a method of firstly obtaining the maximum area covering each cluster and then obtaining the minimum description by discarding the repeatedly covered grid unit, thereby obtaining the minimum description of each maximum connected cluster.

For example, the number of abnormal points in each minimum description may be normalized and the normalized value used as the risk level. Risk levels can thus be given directly; the algorithm is highly efficient, and near-real-time judgment can be achieved.
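The normalization is not specified further; min-max scaling to the range 0 to 1 is one simple choice and is used in the sketch below as an assumption. The function name and the dictionary keyed by description name are hypothetical.

```python
from typing import Dict


def risk_levels(anomaly_counts: Dict[str, int]) -> Dict[str, float]:
    """Scale the abnormal-point count of each minimum description to [0, 1]."""
    low = min(anomaly_counts.values())
    span = max(anomaly_counts.values()) - low
    if span == 0:
        # Assumption: identical counts carry no ranking information, so all map to 0.0.
        return {name: 0.0 for name in anomaly_counts}
    return {name: (count - low) / span for name, count in anomaly_counts.items()}
```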

And reducing the dimensions of the data points in the maximum connected cluster into a two-dimensional space to obtain n-1 dimension reduction results, and taking the dimension reduction result with the minimum distance between the data points in all the dimension reduction results as a target dimension reduction result.

Corresponding to embodiment 1 of the present invention, the present invention also provides embodiment 2.

Example 2

Fig. 4 is a schematic structural diagram of an abnormal behavior detection apparatus according to an embodiment of the present invention, and as shown in fig. 4, the abnormal behavior detection apparatus includes:

a first obtaining module 401, configured to obtain raw data corresponding to a user to be detected, where the raw data includes: the device attribute information, the wind control data and the service data of the user;

an identifying module 402, configured to identify a first abnormal user in the users to be detected by using an ARIMA model based on the stationary sequence corresponding to the original data;

a second obtaining module 403, configured to obtain, based on the original data, a second abnormal user from the users to be detected by using a clustering algorithm;

and the rating module 404 is configured to perform risk rating on the first abnormal user and the second abnormal user by using a clustering algorithm of density and grid, so as to obtain a user to be detected with a high abnormal risk.

In a specific implementation manner of the embodiment of the present invention, the identifying module 402 is configured to:

In a specific implementation manner of the embodiment of the present invention, the second obtaining module 403 is configured to:

In a specific implementation manner of the embodiment of the present invention, the rating module 404 is configured to:

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method of abnormal behavior detection, the method comprising:

2. The abnormal behavior detection method according to claim 1, wherein the step 2) comprises:

3. The abnormal behavior detection method according to claim 2, wherein before step 23), the method further comprises:

4. The abnormal behavior detection method according to claim 2, wherein before step 25), the method further comprises:

5. The abnormal behavior detection method according to claim 1, wherein the step 3) comprises:

6. The abnormal behavior detection method according to claim 5, wherein the step 32) comprises:

C. b, randomly selecting one sample point from other sample points except the sample point in the current cluster as a current central point by using a weighted probability distribution model, and returning to execute the step A until k central points are obtained, wherein k is a preset integer larger than two;

7. The abnormal behavior detection method according to claim 1, wherein the step 4) comprises:

8. The abnormal behavior detection method according to claim 7, wherein the obtaining n is a maximum connected cluster corresponding to a dense grid connected in space, and comprises:

9. The abnormal behavior detection method according to claim 7, wherein for each maximum connected cluster, reducing the dimension of the maximum connected cluster into a low-dimensional space for re-clustering to obtain the minimum description of each maximum connected cluster, comprises:

10. An abnormal behavior detection apparatus, the apparatus comprising: