WO2017124942A1 - Method and apparatus for abnormal access detection

Info

Publication number: WO2017124942A1
Application number: PCT/CN2017/070798
Authority: WO
Inventors: 付子豪; 张凯; 蔡宁; 杨旭; 褚崴
Original assignee: 阿里巴巴集团控股有限公司; 付子豪; 张凯; 蔡宁; 杨旭; 褚崴
Priority date: 2016-01-19
Filing date: 2017-01-10
Publication date: 2017-07-27
Also published as: CN106982196B; TW201730766A; CN106982196A

Abstract

The present application discloses an abnormal access detection method. A label value is obtained for each sample access request by extracting features from the time-series data of that request, and a detection parameter is generated from the label values and the attribute data of the sample access requests. After the attribute data of an access request to be detected is acquired, an abnormal probability corresponding to that access request is generated from the attribute data and the detection parameter. The abnormal probability is compared with a preset abnormal threshold, and the comparison result determines whether the access request is an abnormal access request. Abnormal access requests can therefore be identified and processed among a huge number of access requests, ensuring the stability and security of the network.

Description

Abnormal access detection method and device

Technical field

The present application relates to the field of Internet technologies, and in particular, to an abnormal access detection method. The application also relates to an abnormal access detecting device.

Background art

Data mining is the process of extracting latent, implicit, and valuable knowledge, patterns, or rules from large data sets. The patterns mined from large-scale data sets can generally be divided into five categories: association rules, classification and prediction, clustering, evolution analysis, and outlier detection. Outlier mining comprises two parts: outlier detection and outlier analysis. Outliers are data that are inconsistent with the general behavior or model of the data; they stand out within the data set. Such data are not random deviations but are generated by entirely different mechanisms. Outlier mining has a wide range of applications, such as fraud detection (for example, detecting unusual credit card usage or unusual use of telecommunication services), market trend forecasting, analysis of abnormal behavior such as customer churn in market analysis, and discovery of unusual responses to treatments in medical analysis. Studying these data reveals abnormal behaviors and patterns.

FIG. 1 is a schematic diagram of how existing abnormal point monitoring technology is applied to the service response problem, one of its many applications. In this problem, multiple users submit service requests to a server. Some of these requests are normal and some are abnormal. If the server accepts an abnormal request, the operation of the server is seriously affected, and other normal requests are affected as well.

To solve this technical problem, in the prior art the system decides whether to respond to a user request according to the request and the user's information record. Machine learning algorithms are introduced into this decision. Commonly used methods include constructing the Mahalanobis distance from user attributes to find users that are outliers, and discriminating abnormal points by the frequency at which a user submits requests. The discrimination proceeds as follows:

(1) In the process of discriminating outliers based on Mahalanobis distance, the covariance matrix between user attributes is first calculated, which is defined as follows:

$$\Sigma = E\{(X - E[X])(X - E[X])^{T}\}$$

The Mahalanobis distance is then calculated from the covariance matrix, which is defined as follows:

$$M_a = (X - \mu)^{T} \Sigma^{-1} (X - \mu)$$

Finally, points whose distance is too large are judged to be outliers.
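The prior-art calculation above can be sketched as follows. This is a minimal illustration only, assuming two user attributes per point so that the covariance matrix can be inverted in closed form; the function names and the cut-off `max_distance` are hypothetical and do not come from the application.

```python
from statistics import mean


def mahalanobis_2d(points):
    """Return M_a = (x - mu)^T Sigma^{-1} (x - mu) for each 2-D point."""
    n = len(points)
    mu_x = mean(p[0] for p in points)
    mu_y = mean(p[1] for p in points)
    # Entries of the covariance matrix Sigma = E[(X - E[X])(X - E[X])^T].
    sxx = sum((p[0] - mu_x) ** 2 for p in points) / n
    syy = sum((p[1] - mu_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mu_x) * (p[1] - mu_y) for p in points) / n
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of the symmetric 2x2 covariance matrix.
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    distances = []
    for x, y in points:
        dx, dy = x - mu_x, y - mu_y
        distances.append(dx * (ixx * dx + ixy * dy) + dy * (ixy * dx + iyy * dy))
    return distances


def mahalanobis_outliers(points, max_distance):
    """Indices of points whose distance exceeds the (assumed) cut-off."""
    return [i for i, d in enumerate(mahalanobis_2d(points)) if d > max_distance]
```

Note that the cut-off still has to be chosen by hand, which is one of the limitations discussed below.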

(2) In the method that discriminates abnormal points by request frequency, a user is directly determined to be an abnormal point once the number of requests the user submits per unit time exceeds a certain threshold.

How to use existing access data and user information to identify abnormal requests more accurately and take corresponding measures bears on the stability and economy of service resource allocation, and is therefore a very important issue in service response strategy.

However, in the course of implementing the present application, the inventors found that existing abnormal point detection algorithms for time-series data either cluster using only the feature data of the accessing user, which reflects only the characteristics of the user's attributes, or use only the time-series data of the accesses and rely on a manually set threshold to find abnormal points (that is, to confirm that the current access is abnormal). Neither approach fully exploits the value of the data, and the results are often neither accurate nor effective.

Summary of the invention

The present application provides an abnormal access detection method for improving the efficiency and accuracy of abnormal access detection. The method includes the following steps:

Obtaining attribute data of the access request to be detected;

Generating, according to the attribute data and the detection parameter, an abnormal probability corresponding to the access request, where the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data;

Determining whether the abnormal probability is greater than a preset abnormal threshold;

If yes, confirm that the access request is an abnormal access request;

If not, confirm that the access request is a normal access request.

Preferably, before acquiring the attribute data of the access request to be detected, the method further includes:

Determining, according to the access frequency information of each sample access request, whether each of the sample access requests is abnormal;

Assigning labels with different values to normal sample access requests and abnormal sample access requests;

Generating original detection parameters according to the values of the labels corresponding to each sample access request and the attribute data;

Generating the detection parameter according to the original detection parameter.

Preferably, the access frequency information includes a user identifier and an access time corresponding to the sample access request, and determining whether each of the sample access requests is abnormal according to the access frequency information of each sample access request specifically comprises:

Acquiring, according to the user identifier, a first number of sample access requests submitted by the same user within a time window before the access time, and acquiring a second number of sample access requests submitted by the same user within the time window after the access time;

Determining whether the sum of the first quantity and the second quantity is greater than a preset number of times threshold;

If yes, confirm that the sample access request is an abnormal sample access request;

If not, confirm that the sample access request is a normal sample access request.

Preferably, the original detection parameters are generated according to the following formula:

Where argmin _w denotes the value of w that minimizes the summation, w is the original detection parameter, N is the number of the sample access requests, and V _i is the value of the label of each sample access request.

Preferably, the abnormal threshold is specifically generated by:

Obtaining the percentage of abnormal sample access requests among all sample access requests;

Acquiring an abnormal probability corresponding to each of the sample access requests according to the detection parameter;

Sorting the abnormal probabilities corresponding to the sample access requests in ascending order;

Determining, according to the sorting result, the abnormal probability corresponding to the percentage, and using that abnormal probability as the abnormal threshold.

Correspondingly, the present application further provides an abnormal access detecting device, comprising:

An obtaining module, configured to acquire attribute data of an access request to be detected;

A first generation module, configured to generate an abnormal probability corresponding to the access request according to the attribute data and a detection parameter, where the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data;

A determining module, configured to determine whether the abnormal probability is greater than a preset abnormal threshold;

If yes, the determining module confirms that the access request is an abnormal access request;

If not, the determining module confirms that the access request is a normal access request.

Preferably, the device further comprises:

A determining module, configured to determine, according to the access frequency information of each sample access request, whether each of the sample access requests is abnormal;

An allocation module, configured to assign labels with different values to normal sample access requests and abnormal sample access requests;

A second generation module, configured to generate an original detection parameter according to the value of the label corresponding to each sample access request and the attribute data;

A third generation module, configured to generate the detection parameter according to the original detection parameter.

Preferably, the access frequency information includes a user identifier (ID) and an access time corresponding to the sample access request, and the determining module is specifically configured to:

Acquire, according to the user ID, a first number of sample access requests submitted by the same user within a time window before the access time, and acquire a second number of sample access requests submitted by the same user within the time window after the access time;

Where argmin _w denotes the value of w that minimizes the summation, w is the original detection parameter, N is the number of sample access requests, and V _i is the value of the label of each sample access request.

Preferably, the abnormal threshold is specifically generated by:

It can be seen that, by applying the technical solution of the present application, after the attribute data of the access request to be detected is acquired, the abnormal probability corresponding to the access request is generated according to the attribute data and the detection parameter, where the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data. Therefore, once it is determined whether the abnormal probability is greater than the preset abnormal threshold, whether the access request is an abnormal access request can be confirmed from that comparison. Abnormal access requests can thus be accurately identified and processed among a large number of access requests, ensuring the stability and security of the network.

Drawings

FIG. 1 is a schematic diagram of the application of anomaly detection in a service response in the prior art;

FIG. 2 is a schematic flowchart of an abnormal access detection method according to the present application;

FIG. 3 is a flowchart of abnormal point detection based on time series feature extraction in a specific embodiment of the present application;

FIG. 4 is a schematic diagram of feature extraction of time series data in a specific embodiment of the present application;

FIG. 5 is a schematic diagram of a threshold calculation process in a specific embodiment of the present application;

FIG. 6 is a schematic structural diagram of an abnormal access detecting apparatus according to the present application.

Detailed description

As described in the background art, further improving the accuracy and effectiveness of abnormal point detection for time-series request data is a key issue for the accurate and efficient operation of the system, and is the technical problem to be solved by the present application.

To solve the above technical problem, the present application proposes an abnormal point detection method that combines user statistics with time-series access data: preliminary labels are assigned from the time-series data according to rules, and logistic regression is trained on the preliminary labels and the user attributes to produce the final result, so that the accuracy of abnormal point determination is further improved.

FIG. 2 is a schematic flowchart of the abnormal access detection method proposed by the present application, which includes the following steps:

S201: Obtain attribute data of the access request to be detected.

In the embodiment of the present application, once the model and the detection parameter have been generated, predicting whether a new access request is abnormal depends only on the attributes of that access request, so the abnormality detection problem is transformed into a classification problem. For the classification problem, only the attribute data of the access request to be detected needs to be obtained, yielding its full attribute vector; the time series data of the new access request does not need to be acquired in this step.

Therefore, before predicting whether a new access request is abnormal, the embodiment of the present application performs logistic regression training on the preliminary label and the user attributes corresponding to each sample access request to obtain the classification model and the detection parameter, thereby combining user data with time-series access data. The logistic regression training and the acquisition of the detection parameter proceed as follows:

a) determining, according to the access frequency information of each sample access request, whether each of the sample access requests is abnormal;

b) assigning different values to the normal sample access request and the abnormal sample access request respectively;

c) generating original detection parameters according to the values of the labels corresponding to the sample access requests and the attribute data;

d) generating the detection parameter according to the original detection parameter.

In addition, it can be seen from the above steps that accurately determining whether each sample access request is abnormal is an important factor in the accuracy of the classification model and of the detection parameter. Therefore, the specific implementation of the present application proposes the following steps for determining whether each of the sample access requests is abnormal:

a) obtaining, according to the user identifier, a first number of sample access requests submitted by the same user within a time window before the access time, and obtaining a second number of sample access requests submitted by the same user within the time window after the access time;

b) determining whether the sum of the first quantity and the second quantity is greater than a preset number of times threshold;

c) if yes, confirm that the sample access request is an abnormal sample access request;

d) If no, confirm that the sample access request is a normal sample access request.
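Steps a) to d) can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `label_samples`, the parameters `window` and `count_threshold`, and the label values +1 (abnormal) and -1 (normal) are assumptions, since the text only requires that the two kinds of sample receive different values.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def label_samples(requests, window, count_threshold):
    """Label each sample access request from its access frequency information.

    requests: list of (user_id, access_time) pairs.
    Returns one label per request: +1 for abnormal, -1 for normal (assumed values).
    """
    # Group access times by user and sort them so counts can be found by bisection.
    times_by_user = defaultdict(list)
    for user_id, access_time in requests:
        times_by_user[user_id].append(access_time)
    for times in times_by_user.values():
        times.sort()

    labels = []
    for user_id, t in requests:
        times = times_by_user[user_id]
        # a) first number: same-user requests in [t - window, t), before the access time.
        first = bisect_left(times, t) - bisect_left(times, t - window)
        #    second number: same-user requests in (t, t + window], after the access time.
        second = bisect_right(times, t + window) - bisect_right(times, t)
        # b) to d) compare the sum with the preset count threshold.
        labels.append(1 if first + second > count_threshold else -1)
    return labels
```

Requests by the same user at exactly the same time as the request being labelled are not counted in either number in this sketch; whether they should be is not specified in the text.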

In an embodiment of the present application, the access frequency information includes a user identifier and an access time corresponding to the sample access request. The user identifier serves as a credential for distinguishing users; it may take any form and content as long as different users have different user identifiers. For example, the user identifier may be the MAC address of the user's terminal or the user's registration ID on the server side. The access time is the time point of the access request as recorded by the server.

It should be noted that the foregoing user identifiers are only examples provided by the preferred embodiment of the present application. Other types of user identifiers may be selected on this basis so that the application is applicable to more fields, and all such improvements fall within the scope of protection of the present application.

It should be noted that the foregoing method for determining whether a sample access request is abnormal is only a preferred solution proposed by the specific embodiment of the present application. Those skilled in the art may also use other methods, provided that a certain accuracy is ensured. These all fall within the scope of protection of the present application.

S202: Generate, according to the attribute data and the detection parameter, an abnormal probability corresponding to the access request, where the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data.

In embodiments of the present application, the abnormal threshold should be adjusted based on long-term experience to reach a suitable value. If the abnormal threshold is too large, some abnormal points will be judged as normal accesses, so many abnormal points may be missed. Conversely, if the abnormal threshold is too small, some normal points will be judged as abnormal points, which affects normal users. Choosing an appropriate abnormal threshold is therefore crucial to the accuracy of abnormal point detection, and the present application generates the abnormal threshold as follows:

a) obtaining the percentage of abnormal sample access requests among all sample access requests;

b) acquiring an abnormal probability corresponding to each of the sample access requests according to the detection parameter;

c) sorting the abnormal probability corresponding to each sample access request from small to large;

d) determining an abnormal probability corresponding to the percentage according to the sorting result, and using the abnormal probability as the abnormal threshold.
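Steps a) to d) can be sketched as follows, under one reading of step d): if a given fraction of the samples is labelled abnormal, the threshold is taken as the value in the ascending list above which that same fraction of probabilities lies. The function name and the label convention (+1 abnormal, -1 normal) are assumptions made for illustration.

```python
def threshold_from_samples(probabilities, labels):
    """Derive the abnormal threshold from sample probabilities and labels.

    probabilities: abnormal probability of each sample access request.
    labels: +1 for abnormal samples, -1 for normal samples (assumed values).
    """
    n = len(labels)
    # a) fraction of abnormal sample access requests among all samples.
    abnormal_fraction = sum(1 for v in labels if v == 1) / n
    # c) sort the abnormal probabilities from small to large.
    ordered = sorted(probabilities)
    # d) pick the value just below the top `abnormal_fraction` share of the list.
    cut = n - round(abnormal_fraction * n) - 1
    cut = min(max(cut, 0), n - 1)
    return ordered[cut]
```

With this choice, and absent ties, the number of samples whose probability exceeds the threshold equals the number of samples labelled abnormal.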

In a specific embodiment of the present application, a reference formula for generating the original detection parameters is as follows:

Solving the above reference formula yields the parameter w, which is the original detection parameter. Subsequently, every new access request can be evaluated using the original detection parameter w, and the result compared with the abnormal threshold, to determine whether the new access request is abnormal.
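As an illustration only, one objective consistent with the symbols used in this description (argmin over w, a sum over N sample access requests, labels V _i, and the transpose w ^T) is the standard logistic regression loss. The sketch below assumes labels V _i in {+1, -1} and writes x _i for the attribute vector of the i-th sample access request; both are assumptions rather than statements of the application.

```latex
w = \arg\min_{w} \sum_{i=1}^{N} \log\left(1 + \exp\left(-V_i \, w^{T} x_i\right)\right),
\qquad
P(\text{abnormal} \mid x) = \frac{1}{1 + \exp\left(-w^{T} x\right)}
```

Under this assumption, the second expression is the abnormal probability that is later compared with the abnormal threshold.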

It should be noted that the above formula is only a preferred solution proposed by the specific embodiment of the present application. Those skilled in the art may modify or adapt the formula, provided that the calculation result can still be used as the original detection parameter. These all fall within the scope of protection of the present application.

S203: Determine whether the abnormal probability is greater than a preset abnormal threshold.

In an embodiment of the present application, when a new access request arrives, the classification model predicts whether it is an abnormal access request. Specifically, the attribute data of the new access request is first substituted into the classification model to obtain the probability that the access is an abnormal access request, that is, the abnormal probability. The abnormal probability of the new access request is then compared with the preset abnormal threshold. If the abnormal probability is greater than the abnormal threshold, the request is determined to be an abnormal access request and S204 is executed; otherwise, it is determined to be a normal access request and S205 is executed.

S204: If yes, confirm that the access request is an abnormal access request.

S205: If no, confirm that the access request is a normal access request.
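Steps S201 to S205 can be sketched as a single function. This is a minimal illustration that assumes the abnormal probability is the logistic function of the inner product of the detection parameter and the attribute vector, as is usual for logistic regression; the function and argument names are hypothetical.

```python
import math


def is_abnormal(attributes, detection_parameter, abnormal_threshold):
    """Return True for an abnormal access request, False for a normal one."""
    # S201: `attributes` is the attribute vector of the access request to be detected.
    # S202: abnormal probability from the attribute data and the detection parameter.
    score = sum(w * x for w, x in zip(detection_parameter, attributes))
    if score >= 0:
        probability = 1.0 / (1.0 + math.exp(-score))
    else:
        # Equivalent form that avoids overflow for large negative scores.
        e = math.exp(score)
        probability = e / (1.0 + e)
    # S203 to S205: compare with the preset abnormal threshold.
    return probability > abnormal_threshold
```

No time series data is needed at this stage; only the attribute vector of the new request is used.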

It can be seen that, by applying the above technical solution, after the attribute data of the access request to be detected is obtained, the abnormal probability corresponding to the access request is generated according to the attribute data and the detection parameter, where the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data. Therefore, once it is determined whether the abnormal probability is greater than the preset abnormal threshold, whether the access request is an abnormal access request can be confirmed from that comparison. Abnormal access requests can thus be accurately identified and processed among a large number of access requests, ensuring the stability and security of the network.

To further illustrate the technical idea of the present application, the technical solution is described below in conjunction with the specific application scenario shown in FIG. 3. The outlier detection process based on time series feature extraction detects abnormal points in three steps: time series analysis, linear classifier training, and prediction. The three steps are described as follows:

(1) Generate labels through time series

According to the characteristics of the time series, all user access data in the training set are first sorted in chronological order. After the sorting is completed, the user ID of each access is compared: a sliding window is moved backward over the sorted data so that each access is traversed in order. For each access, if the number of accesses submitted by the same user in the half-window before it and the half-window after it is greater than a certain threshold, the access is marked as abnormal. The set of labels for the abnormal points can then be written as:

Where V _i represents the label of the i-th access, w is the window size parameter, and t _h is a threshold parameter. A schematic diagram is shown in FIG. 4.
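As a hedged illustration, the labelling rule described above can be written as follows, where u _i and t _i denote the user ID and the access time of the i-th access (symbols introduced here for illustration) and the label values +1 and -1 are an assumption:

```latex
V_i =
\begin{cases}
+1, & \left|\{\, j \neq i : u_j = u_i,\ |t_j - t_i| \le w \,\}\right| > t_h \\
-1, & \text{otherwise}
\end{cases}
```

The count on the first line is the number of accesses by the same user that fall within the window on either side of the i-th access.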

(2) Linear classifier training

After all the access labels are generated, whether each access is abnormal is taken to be completely determined by the attributes of that access, so the problem is transformed into a classification problem. For the classification problem, it is not necessary to use time series data. Logistic regression training is performed on the other attribute features and the label of each access to obtain a classification model. The result of this model is the parameter w, which satisfies:

Where argmin _w denotes the value of w that minimizes the summation on the right-hand side, N is the total number of training samples, V _i is the abnormal point label from the previous step, and w ^T is the transpose of w. In the actual logistic regression training, the L-BFGS algorithm is used for acceleration.
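The training step can be sketched as follows. This minimal example assumes labels in {+1, -1} and the standard logistic loss, and it uses plain gradient descent instead of the L-BFGS algorithm mentioned above, purely to keep the sketch self-contained; the function names, step count, and learning rate are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(xs, vs, steps=5000, lr=0.1):
    """Minimise sum_i log(1 + exp(-v_i * w^T x_i)) by gradient descent.

    xs: attribute vectors of the sample access requests.
    vs: labels, +1 for abnormal and -1 for normal (assumed values).
    Returns the parameter vector w.
    """
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, v in zip(xs, vs):
            margin = v * sum(wj * xj for wj, xj in zip(w, x))
            # Gradient of log(1 + exp(-margin)) with respect to w.
            coeff = -v * sigmoid(-margin)
            for j in range(dim):
                grad[j] += coeff * x[j]
        # Step along the negative mean gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def abnormal_probability(w, x):
    """Probability that an access with attribute vector x is abnormal."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
```

In this sketch a constant attribute of 1.0 can be included in each vector to act as a bias term.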

(3) New access prediction

When a new access arrives, the classification model can be used to predict whether the new access is abnormal. After the new access data is substituted into the classification model, the probability that the access is an abnormal point is obtained, and a threshold is set. When the probability that the access is abnormal is greater than the threshold, the access is determined to be an abnormal point. The set of all abnormal new accesses is expressed as:

$$\{V_i \mid w^{T} x_i > p_t\}$$

Where V _i represents the i-th access, x _i represents the full attribute vector of the access, and p _t is the threshold for determining an abnormal point. The threshold should be adjusted based on long-term experience until a suitable value is reached. If the threshold is too large, many abnormal points will be missed and judged as normal accesses. If the threshold is too small, many normal points will be determined to be abnormal points, which affects normal users. A suitable threshold is therefore needed, and it can be set according to a percentage. First, find the percentage of abnormal points in the total training data. Then substitute the training data into the model to calculate the probabilities and sort them. Finally, find the probability at the position corresponding to that percentage and set it as the threshold. A schematic diagram is shown in FIG. 5.

The technical solution of the foregoing application scenario provides training labels for the classification model by using the time series features of the sample data, and then generates the detection parameter according to the value of the label corresponding to each sample access request and the attribute data. After the attribute data of the access request to be detected is acquired, the abnormal probability corresponding to the access request is generated according to the attribute data and the detection parameter. Therefore, once it is determined whether the abnormal probability is greater than the preset abnormal threshold, whether the access request is an abnormal access request can be confirmed from that comparison. Abnormal access requests can thus be accurately identified and processed among a large number of access requests, ensuring the stability and security of the network.

To achieve the above technical purpose, the present application also proposes an abnormal access detecting device, as shown in FIG. 6, comprising the following modules:

The obtaining module 610 is configured to obtain attribute data of the access request to be detected.

The first generation module 620 generates an abnormal probability corresponding to the access request according to the attribute data and the detection parameter, and the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data;

The determining module 630 determines whether the abnormal probability is greater than a preset abnormal threshold;

If yes, the determining module 630 confirms that the access request is an abnormal access request;

If not, the determining module 630 confirms that the access request is a normal access request.

In a specific application scenario, the device further includes:

A determining module, configured to determine, according to the access frequency information of each sample access request, whether each sample access request is abnormal;

In a specific application scenario, the access frequency information includes a user identifier ID and an access time corresponding to the sample access request, and the determining module is specifically configured to:

In a specific application scenario, the original detection parameters are generated according to the following formula:

In a specific application scenario, the abnormal threshold is specifically generated by:

After the attribute data of the access request to be detected is obtained, the abnormal probability corresponding to the access request is generated according to the attribute data and the detection parameter, where the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data. Therefore, once it is determined whether the abnormal probability is greater than the preset abnormal threshold, whether the access request is an abnormal access request can be confirmed from that comparison. Abnormal access requests can thus be accurately identified and processed among a large number of access requests, ensuring the stability and security of the network.

Through the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for causing a computer device (such as a personal computer, a server, or a network device) to perform the methods described in the various implementation scenarios of the present application.

A person skilled in the art can understand that the drawings are only a schematic diagram of a preferred implementation scenario, and the modules or processes in the drawings are not necessarily required to implement the application.

A person skilled in the art may understand that the modules of the apparatus in an implementation scenario may be distributed in the apparatus as described for that scenario, or may be correspondingly changed and located in one or more apparatuses different from that of the scenario. The modules of the above implementation scenarios may be combined into one module, or may be further split into multiple sub-modules.

The above serial numbers are for description only and do not indicate the relative merits of the implementation scenarios.

The above discloses only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any changes that can be conceived by those skilled in the art should fall within the protection scope of the present application.

Claims

An abnormal access detection method, comprising:

Obtaining attribute data of the access request to be detected;

Generating, according to the attribute data and the detection parameter, an abnormal probability corresponding to the access request, where the detection parameter is generated according to the value of the label corresponding to each sample access request and the attribute data;

Determining whether the abnormal probability is greater than a preset abnormal threshold;

If yes, confirm that the access request is an abnormal access request;

If not, confirm that the access request is a normal access request.
The method of claim 1, wherein before acquiring the attribute data of the access request to be detected, the method further comprises:

Determining, according to the access frequency information of each sample access request, whether each of the sample access requests is abnormal;

Assigning labels with different values to normal sample access requests and abnormal sample access requests;

Generating original detection parameters according to the values of the labels corresponding to each sample access request and the attribute data;

Generating the detection parameter according to the original detection parameter.
The method according to claim 2, wherein the access frequency information includes a user identifier and an access time corresponding to the sample access request, and determining whether each sample access request is abnormal according to the access frequency information of each sample access request specifically comprises:

Acquiring, according to the user identifier, a first number of sample access requests submitted by the same user within a time window before the access time, and acquiring a second number of sample access requests submitted by the same user within the time window after the access time;

Determining whether the sum of the first quantity and the second quantity is greater than a preset number of times threshold;

If yes, determining that the sample access request is an abnormal sample access request;

If not, determining that the sample access request is a normal sample access request.
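
The time-window labeling of this claim can be sketched as follows. Several details are assumptions made for illustration: the request being labeled is not counted in either window, the window boundaries are inclusive at the far end, and the label values 1 (abnormal) and 0 (normal) are hypothetical, since the claims only require that the two values differ. The function name label_samples is also hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def label_samples(samples, window, count_threshold):
    """Label each (user_id, access_time) sample as 1 (abnormal) or 0 (normal)."""
    # Group access times by user and sort them so windows can be counted by bisection.
    times_by_user = defaultdict(list)
    for user_id, access_time in samples:
        times_by_user[user_id].append(access_time)
    for times in times_by_user.values():
        times.sort()

    labels = []
    for user_id, access_time in samples:
        times = times_by_user[user_id]
        # First number: same-user requests in [access_time - window, access_time).
        before = bisect_left(times, access_time) - bisect_left(times, access_time - window)
        # Second number: same-user requests in (access_time, access_time + window].
        after = bisect_right(times, access_time + window) - bisect_right(times, access_time)
        labels.append(1 if before + after > count_threshold else 0)
    return labels
```

For example, a user who submits four requests at times 0, 1, 2 and 3 has three neighbouring requests around each of them within a window of 5, so with a count threshold of 2 all four are labeled abnormal, while a user whose two requests are 100 time units apart has both labeled normal.
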

The method of claim 2, wherein the original detection parameter is generated according to the following formula:

where w is the original detection parameter, argmin over w denotes the value of w that minimizes the summation term, N is the number of sample access requests, and V_i is the value of the label of the i-th sample access request.

The method according to any one of claims 1 to 4, wherein the abnormal threshold is specifically generated by:

Obtaining the percentage of abnormal sample access requests among all sample access requests;

Acquiring an abnormal probability corresponding to each of the sample access requests according to the detection parameter;

Sorting the abnormal probabilities corresponding to the sample access requests in ascending order;

Determining an abnormal probability corresponding to the percentage according to the sorting result, and using the abnormal probability as the abnormal threshold.
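
The threshold selection of this claim can be sketched as follows. The claim does not say which position in the sorted list "corresponds to" the percentage, so this sketch assumes the threshold is chosen so that the share of samples strictly above it matches the share of abnormal samples. It works with the count of abnormal samples directly, which is equivalent to multiplying the percentage by the number of samples. The function name abnormal_threshold and the label value 1 for abnormal samples are hypothetical.

```python
def abnormal_threshold(probabilities, labels, abnormal_label=1):
    """Pick a threshold from the sorted sample probabilities.

    Assumption: if k of the N samples are labeled abnormal, the threshold is
    the (N - k)-th smallest probability, so k samples lie strictly above it
    when all probabilities are distinct.
    """
    if not probabilities or len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must be non-empty and of equal length")
    n_abnormal = sum(1 for v in labels if v == abnormal_label)
    ordered = sorted(probabilities)  # ascending order, as in the claim
    # Clamp so that an all-abnormal sample set still yields a valid index.
    index = max(len(ordered) - n_abnormal - 1, 0)
    return ordered[index]
```

With ten samples of which two are labeled abnormal, the threshold is the eighth smallest probability, so the two highest-scoring samples exceed it.
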

An abnormal access detection device, comprising:

An obtaining module, configured to obtain attribute data of an access request to be detected;

A first generation module, configured to generate an abnormal probability corresponding to the access request according to the attribute data and a detection parameter, wherein the detection parameter is generated according to the value of the label and the attribute data corresponding to each sample access request;

A determining module, configured to determine whether the abnormal probability is greater than a preset abnormal threshold;

If yes, the determining module determines that the access request is an abnormal access request;

If not, the determining module determines that the access request is a normal access request.

The device of claim 6, further comprising:

A determining module, configured to determine, according to the access frequency information of each sample access request, whether each of the sample access requests is abnormal;

An allocation module, configured to assign labels with different values to normal sample access requests and abnormal sample access requests;

A second generation module, configured to generate an original detection parameter according to the value of the label and the attribute data corresponding to each sample access request;

A third generation module, configured to generate the detection parameter according to the original detection parameter.

The device according to claim 7, wherein the access frequency information includes a user identifier (ID) and an access time corresponding to the sample access request, and the determining module is specifically configured to:

Acquire, according to the user ID, a first number of sample access requests submitted by the same user within a time window before the access time, and acquire a second number of sample access requests submitted by the same user within the time window after the access time;

Determine whether the sum of the first number and the second number is greater than a preset count threshold;

If yes, determine that the sample access request is an abnormal sample access request;

If not, determine that the sample access request is a normal sample access request.

The device according to claim 7, wherein the original detection parameter is generated according to the following formula:

where w is the original detection parameter, argmin over w denotes the value of w that minimizes the summation term, N is the number of sample access requests, and V_i is the value of the label of the i-th sample access request.

The device according to any one of claims 6 to 9, wherein the abnormal threshold is specifically generated by:

Obtaining the percentage of abnormal sample access requests among all sample access requests;

Acquiring an abnormal probability corresponding to each of the sample access requests according to the detection parameter;

Sorting the abnormal probabilities corresponding to the sample access requests in ascending order;

Determining an abnormal probability corresponding to the percentage according to the sorting result, and using the abnormal probability as the abnormal threshold.