CN108829715B - Method, apparatus, and computer-readable storage medium for detecting abnormal data - Google Patents

Method, apparatus, and computer-readable storage medium for detecting abnormal data

Info

Publication number
CN108829715B
CN108829715B CN201810423903.9A
Authority
CN
China
Prior art keywords
data
features
acquired
determining
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810423903.9A
Other languages
Chinese (zh)
Other versions
CN108829715A (en)
Inventor
黄铃
向诗阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huianjinke Beijing Technology Co ltd
Original Assignee
Huianjinke Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huianjinke Beijing Technology Co ltd filed Critical Huianjinke Beijing Technology Co ltd
Priority to CN201810423903.9A priority Critical patent/CN108829715B/en
Publication of CN108829715A publication Critical patent/CN108829715A/en
Application granted granted Critical
Publication of CN108829715B publication Critical patent/CN108829715B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F 18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosed embodiments provide a method, apparatus, and computer-readable storage medium for detecting anomalous data. The method comprises the following steps: determining a plurality of candidate features according to an abnormal behavior pattern; determining one or more valid features from among the plurality of candidate features based on a training data set; and determining abnormal data in the data to be detected according to the one or more valid features.

Description

Method, apparatus, and computer-readable storage medium for detecting abnormal data
Technical Field
The present disclosure relates generally to the field of data mining, and more particularly to methods, apparatus, and computer-readable storage media for detecting anomalous data.
Background
As one of the important components of the modern internet industry chain, data providers supply essential data support services to many fields, such as financial information, electronic commerce, and academic research, by offering web-based query services and/or Application Programming Interface (API)-based query services. Among their data consumers, however, there are data resellers that abuse the data query API for profit, colloquially called "data vendors". These malicious users download data in bulk through the query API provided by a data provider and then resell the downloaded data, either disguised or as-is, to end users, thereby making a profit. Such malicious behavior harms the interests of the data provider and infringes its copyrights. A solution is therefore needed that can accurately identify such malicious users.
Disclosure of Invention
To at least partially solve or mitigate the above-described problems, methods, apparatuses, and computer-readable storage media for detecting anomalous data in accordance with the present disclosure are provided.
According to a first aspect of the present disclosure, a method for detecting anomalous data is provided. The method comprises the following steps: determining a plurality of candidate features according to an abnormal behavior pattern; determining one or more valid features from among the plurality of candidate features based on a training data set; and determining abnormal data in the data to be detected according to the one or more valid features.
In some embodiments, the data to be detected is log data recording users' data acquisitions, and the log data includes at least one of: a user identifier of each user; the acquisition time of each data acquisition by each user; a database identifier of the database accessed in each data acquisition by each user; and the index accessed in the database in each data acquisition by each user. In some embodiments, the abnormal behavior pattern includes at least one of: an abnormal amount of acquired data; an abnormal variety of acquired data; and an abnormal time of data acquisition. In some embodiments, the plurality of candidate features comprises at least one of: features related to the uniformity of the acquired data; features related to the amount of acquired data; and features related to the time at which data is acquired. In some embodiments, the features related to the time at which data is acquired include features related to the time at which data is acquired in periods of different time units. In some embodiments, the training data set is a training data set with accurate classification labels. In some embodiments, the step of determining one or more valid features from among the plurality of candidate features based on the training data set comprises: determining the importance of the plurality of candidate features using a supervised learning algorithm on a training data set with accurate labels; and determining the one or more valid features based on the importance of each candidate feature. In some embodiments, the supervised learning algorithm is at least one of an L1-penalized Logistic Regression (LR) algorithm and a Random Forest (RF) algorithm. In some embodiments, the step of determining abnormal data in the data to be detected according to the one or more valid features comprises: detecting the data to be detected using an unsupervised outlier detection algorithm according to the one or more valid features, so as to determine the abnormal data. In some embodiments, the unsupervised outlier detection algorithm is a One-Class Support Vector Machine (SVM) algorithm. In some embodiments, after the step of detecting the data to be detected using the unsupervised outlier detection algorithm, the method further comprises: filtering the abnormal data based on a predetermined threshold to filter out data whose features related to the amount of acquired data are normal. In some embodiments, the step of determining abnormal data in the data to be detected according to the one or more valid features further comprises: determining a classifier from the supervised learning algorithm trained with the training data set; classifying the data to be detected using the classifier so as to determine additional abnormal data belonging to an abnormal data class; and supplementing the abnormal data with the additional abnormal data.
According to a second aspect of the present disclosure, there is provided an apparatus for detecting anomalous data. The apparatus comprises: a processor; a memory having instructions stored thereon, which when executed by the processor, cause the processor to perform the method according to the first aspect of the disclosure.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method according to the first aspect of the present disclosure.
By using the method, the apparatus, and/or the computer-readable storage medium of the present disclosure, abnormal user behavior data can be accurately and automatically detected within massive behavior data, helping a data provider pinpoint the abnormal users that require attention, thereby avoiding possible losses and saving substantial operation and maintenance costs.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of preferred embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which:
fig. 1A and 1B are schematic diagrams illustrating an example application scenario before and after, respectively, using a malicious user identification scheme according to an embodiment of the present disclosure.
FIG. 2 is a general flow diagram illustrating an example method for detecting anomalous data in accordance with an embodiment of the disclosure.
FIG. 3 is a diagram illustrating an example relationship between elements for determining candidate features from abnormal behavior patterns according to an embodiment of the disclosure.
FIG. 4 is an example flow diagram illustrating a method for determining valid features using a training data set in accordance with an embodiment of the present disclosure.
FIG. 5 is an example flow diagram illustrating an example method for identifying anomalous data from valid features in accordance with an embodiment of the present disclosure.
Fig. 6 is a hardware arrangement diagram showing an apparatus for identifying abnormal data according to an embodiment of the present disclosure.
Detailed Description
In the following detailed description of some embodiments of the disclosure, reference is made to the accompanying drawings, in which details and functions that are not necessary for the disclosure are omitted so as not to obscure the understanding of the disclosure. In this specification, the various embodiments described below which are used to describe the principles of the present disclosure are by way of illustration only and should not be construed in any way to limit the scope of the disclosure. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for the same or similar functions, devices, and/or operations. Moreover, in the drawings, the parts are not necessarily drawn to scale. In other words, the relative sizes, lengths, and the like of the respective portions in the drawings do not necessarily correspond to actual proportions. Moreover, all or a portion of the features described in some embodiments of the present disclosure may be applied to other embodiments to form new embodiments that still fall within the scope of the present application.
Furthermore, the disclosure is not limited to each specific communication protocol of the involved devices, including (but not limited to) 2G, 3G, 4G, 5G networks, WCDMA, CDMA2000, TD-SCDMA systems, etc., and different devices may employ the same communication protocol or different communication protocols. In addition, the present disclosure is not limited to a specific operating system of a device, and may include (but is not limited to) iOS, Windows Phone, Symbian, Android, Linux, Unix, Windows, MacOS, and the like, and different devices may employ the same operating system or different operating systems.
Although the scheme for detecting anomalous data according to embodiments of the present disclosure will be described below primarily in conjunction with a specific scenario, namely data reselling, the present disclosure is not limited thereto. In fact, with appropriate adjustment and modification, the embodiments of the present disclosure can also be applied to detecting other kinds of data exhibiting a specific pattern, such as identifying high-value customers. In other words, the scheme according to the embodiments of the present disclosure may be used in any scenario in which pattern differences between data need to be determined.
Fig. 1A and 1B are schematic diagrams illustrating an example application scenario 10 before and after, respectively, using a malicious user identification scheme in accordance with an embodiment of the present disclosure. As shown in fig. 1A and 1B, a data provider 100 collects and compiles public and/or private data into searchable databases and provides services in various fields, such as finance and academic research, to its users 110-1, 110-2, etc. (hereinafter collectively referred to as users 110 when no distinction is required). The key added value of the data provider 100 lies in providing, for example, data integration, data cleansing, data updating, and structured query interfaces. For example, Bloomberg users may query real-time financial data by using manual commands or a scripted API.
It is therefore important for the data provider 100 to protect its data assets by allowing only users 110 that use its data properly. Although different data providers 100 may define "proper use" differently, "data reselling" is generally considered an unacceptable manner of data use. Generally, a Data Reseller (hereinafter sometimes abbreviated as DR) 120 can grab a large amount of data from the data provider 100 through the query API and resell it, or provide its own data query service based on it. For example, as shown in FIG. 1A, the data reseller 120 downloads large amounts of data from the data provider 100, for example by paying a fee, and resells it at a low price to other users 110-2 and possibly a large number of further users. As previously mentioned, such data resale causes the data provider 100 to lose actual or potential revenue and infringes the copyrights of the data provider 100. In current common practice, the data provider 100 generally cannot prevent data resale by the data reseller 120 in advance, but only occasionally discovers and stops it through various possible channels after the data reseller 120 has already resold a large amount of data. In such a case, the loss to the data provider 100 has usually already occurred and is difficult to recover. Therefore, there is a need to provide data providers with a solution that can accurately identify such malicious behavior. Such a scheme should enable the data provider 100 to discover the problem and take measures while the data reseller 120 is still downloading its data. For example, as shown in fig. 1B, if the data provider 100 employs the scheme for identifying abnormal data according to an embodiment of the present disclosure, it can promptly discover the abnormal download records of the data reseller 120 and take corresponding measures, such as closing the account of the data reseller 120, issuing an attorney letter, or filing a lawsuit, before further loss is caused.
However, it should be noted that the problem of identifying data resellers 120 (which may also be referred to as the "anti-data-reselling (ADR)" problem) is different from the anti-web-crawling (AWC) problem. Existing AWC schemes focus on analyzing network access patterns in order to distinguish automated web crawlers from normal human users, or to detect malicious crawlers that exploit security vulnerabilities (e.g., for defending against distributed denial of service (DDoS) attacks).
However, these techniques for the AWC problem are of little practical use for the ADR problem. One key goal of AWC is to distinguish between machines and humans, whereas many data providers 100 offer API-based query services precisely to support machine-executed scripts. In other words, the data provider 100 should allow various kinds of machine query services to support automated operations such as high-frequency trading and quantitative trading in the stock market. Therefore, the key to solving the ADR problem is how to distinguish a particular type of abnormal data acquisition behavior, not whether the behavior is performed by a programmed machine or by a human. In this respect, the ADR problem cares much more about "why the data is acquired", whereas the AWC problem cares about "how the data is acquired", which makes the ADR problem harder to solve than the AWC problem. Technically, the AWC problem often relies on analyzing short-term behavior to identify robots, while the ADR problem focuses more on longer-term behavior patterns.
In practice, it is difficult to accurately identify the data reseller 120. For this reason, some data providers 100 may attempt to limit such malicious behavior with simple rules, such as capping the amount of data queried per unit time. However, modern applications may legitimately require large amounts of data. For example, some automated trading programs use large numbers of data queries to track real-time and historical prices for a series of stocks. If a user 110 pays for the query service but is constrained by a query quota, the user experience may be very poor. Thus, more finely designed features and/or models are needed to distinguish the DR 120 from normal heavy data users 110. Furthermore, as is common in practice, few data providers 100 can provide properly labeled model training data. In other words, unsupervised learning techniques have to be used as much as possible.
In order to recognize the DR 120 more accurately and make it difficult for the DR 120 to evade detection by changing its behavior pattern, it is necessary to find essential characteristics of the DR 120 that it cannot change, or can change only with great difficulty. According to the teachings of the present disclosure, the DR 120 generally has the following essential characteristics or essential behavior patterns.
First, the DR 120 inevitably needs to download massive amounts of data from the data provider 100. In other words, even a DR 120 does little harm to the data provider 100 if it downloads only a very small amount of data. It should be noted, however, that although the DR 120 has to download massive data, a user 110 who downloads massive data (i.e., a heavy data user) is not necessarily a DR 120. For example, users 110 who engage in high-frequency trading, machine trading, and the like also need to acquire large amounts of real-time stock data from the data provider 100 through computers, and this should be allowed for legitimate users 110.
Second, the DR 120 necessarily acquires a wide variety of data, rather than focusing on repeatedly acquiring a few classes or items of data as a general user 110 does. This is because a general user 110 is often interested in certain specific data items (e.g., a specific series of stocks or a specific subject), while the DR 120 generally needs many more kinds of data to run a resale business. It should be noted, however, that a wide data span or range by itself does not necessarily make a user 110 a suspected DR 120. For example, a general user 110 may be conducting data research that requires acquiring a wide variety of data.
Third, the DR 120 typically needs to query the data periodically to update its database for resale, so there will be a more or less periodic pattern in its data acquisition behavior.
It should be noted that the above analyses of DR 120 behavior patterns do not limit the present disclosure; they merely illustrate behavior patterns that the DR 120 may exhibit. In other words, the features described in detail below may equally be constructed from other behavior patterns of the DR 120 and/or combinations with the behavior characteristics described above. In addition, the behavior patterns above are not listed in order of importance and should not be limited by that order; rather, the importance of these behavior patterns should be determined from the actual data, as described below.
In some embodiments, a systematic approach for identifying the DR 120 may be proposed based on the behavior patterns described above, combining feature engineering techniques with supervised/unsupervised learning techniques to solve the aforementioned ADR problem. Without loss of generality, this method for identifying the DR 120 can be generalized to a method for identifying abnormal data. In the DR identification setting, abnormal data may refer to the download records left when the DR 120 downloads data from the data provider 100, such as the data acquisition logs kept by the data provider 100. By identifying the anomalous data of the DR 120 among the data of all users of the data provider 100, the abnormal download behavior of the DR 120 can be determined.
FIG. 2 is a general flow diagram illustrating an example method 200 for detecting anomalous data in accordance with an embodiment of the disclosure. In the specific context of the DR identification problem, the method 200 for identifying the DR 120 (abnormal data) according to an embodiment of the present disclosure may include the following steps.
(1) Step S210: feature creation
For example, in some embodiments, a number of features (hereinafter referred to as "candidate features") covering DR behavior patterns (e.g., query data volume, distribution, time, periodicity, and burstiness) may first be constructed based at least on the three behavior patterns of the DR 120 described above (and possibly other DR behavior patterns). The goal of this step is to cover the aforementioned DR behavior patterns from as many dimensions as possible and to provide as much redundancy as possible at as low a computational complexity as possible, so that the DR 120 finds it difficult to circumvent at least some of these features. This step will be described in detail below in conjunction with fig. 3.
(2) Step S220: feature selection
However, it should be noted that, in current practice, data providers 100 have no clear or mature definition of DR behavior patterns. Thus, in some embodiments, a learn-by-example algorithm may be used to automatically select, based on a small number of observed DR samples, the subset of the features constructed in the preceding stage (hereinafter referred to as "valid features") that can be used to identify the DR 120. The goal of this step is therefore to automatically learn the definition of the DR 120 from that small number of DR samples. This step will be described in detail below in conjunction with fig. 4.
(3) Step S230: DR 120 (anomaly) identification
In this step, the behavior data of the users 110 to be monitored (including the behavior data of the DR 120) may be processed with an unsupervised outlier detection algorithm (or a supervised outlier detection algorithm when feedback labels are available), using the feature subset (i.e., the valid features) selected in the preceding step, so as to separate the small number of DRs 120 from the large number of general users 110 and thereby accurately identify the DRs 120. This step will be described in detail below in conjunction with fig. 5.
In addition, since the data provider 100 does not possess a large amount of DR-related feedback, and it is therefore difficult to quantify the correctness of the detection results, the feature data behind the detection results can be interpreted and the security experts of the data provider 100 can be asked to evaluate the results, thereby forming feedback on the correctness of the detection.
In the following, a scheme for identifying the DR 120 according to an embodiment of the present disclosure will be described in connection with a specific usage scenario. In this scenario, a financial information service provider acting as the data provider 100 is taken as an example; it has tens of thousands of paying users 110 and provides data relating to, for example, stocks, bonds, foreign exchange, and economic indices. It should be noted, however, that the disclosed embodiments are not limited to this particular application scenario, but are applicable to any application scenario in which the DR 120 needs to be identified or, more generally, in which abnormal user behavior needs to be identified. For example, in some embodiments, a high-value user (e.g., a user from whom the profit per unit of service is relatively high) may be treated and identified as an "abnormal" user.
As an example, the data provider 100 provides five databases D_A through D_E; Table 1 shows normalized statistics for them.
Table 1 database details
It should be noted that the table above is merely an example, and the numbers therein are normalized by the number of users of D_E. In the example shown in Table 1, D_A contains the query history over a one-month period. Furthermore, for brevity and without loss of generality of the following description, the corresponding data items in a database will be referred to by their "key values". In addition, although each data item in a database includes many fields, in particular embodiments of the present application only the following fields are of interest: account ID, query time, queried DB, and queried indices. It should be noted that the present disclosure is not so limited; different features may be designed according to other DR behavior patterns, which may in turn involve different data item fields.
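Purely as a non-limiting illustration, a single log record restricted to these four fields of interest could be represented as follows; the Python class and field names below are hypothetical and merely mirror the fields just listed:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class QueryLogRecord:
    # Hypothetical field names mirroring the four fields of interest listed above.
    account_id: str       # account ID of the querying user
    query_time: datetime  # time of the query
    query_db: str         # identifier of the queried database (e.g., "DA" ... "DE")
    query_index: str      # index (key value) queried within that database

record = QueryLogRecord("u0001", datetime(2018, 3, 12, 23, 47), "DA", "stock_600000")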
Among the five databases, a security expert at the data provider 100 has partially labeled D_A, clarifying the identities of about 1/6 of its users; of these 1/6 users, 3% are labeled as DR and the remaining 97% as normal users. Next, how to identify DRs based on this limited labeled data together with the other unlabeled data will be described in detail.
As previously mentioned, a number of candidate features may be designed based on the DR behavior patterns so as to cover as many aspects of those patterns as possible and to maintain a high degree of redundancy. Fig. 3 is a diagram illustrating an example relationship between elements for determining candidate features from abnormal behavior patterns 300, according to an embodiment of the disclosure. In the embodiment shown in fig. 3, 29 behavior features are designed around at least the three DR behavior patterns described above, namely the amount of data acquired 310 (high data volume), the kind of data acquired 320 (large data span), and the time of data acquisition 330 (strong periodicity), as shown in Table 2 below. As shown in fig. 3, candidate features may be designed based on any one, two, or all three of these behavior patterns, for example the kinds of data acquired per unit time (320+330), the amount of data acquired per unit time (310+330), the distribution of the acquired data volume over data kinds (310+320), the distribution of the acquired data volume over data kinds per unit time (310+320+330), and so on.
It should be noted that, in this context, saying a user is "active" on a certain day means that the user has acquired data from the data provider 100 at least once on that day.
TABLE 2 example candidate features
In Table 2, three groups of features are designed for the three DR behavior patterns described above, respectively. The first three are entropy features designed to capture the uniformity of the DR's access pattern; they measure the "uniformity" of a particular distribution. Specifically, given n discrete probabilities p_i (i = 1, 2, ..., n) such that

∑_i p_i = 1,

their entropy is

H(p_1, ..., p_n) = -∑_i p_i log p_i.
Given a user u, a database T with m key values, and T_i denoting the i-th key value in T, one can count the number of days in the month on which user u acquired T_i at least once (recall that database D_A contains one month of query data), as well as the total number of times user u acquired T_i in that month, and denote these two quantities by q_i and w_i, respectively. The entropies of these two quantities can then be defined as:

index_day_entropy = H(q_i / ∑_i q_i) (i = 1, 2, ..., m)    (1)

index_num_entropy = H(w_i / ∑_i w_i) (i = 1, 2, ..., m)    (2)
In addition, index_day_entropy may be modified to create the entropy of the moving average, over seven active days, of the number of key values acquired (not counting duplicate queries): index_avg_entropy.
These entropy features thus capture the uniformity of a user's queries across different data indices (i.e., data items). That is, a general user's repeated queries for a relatively small number of specific data items will yield relatively low values of the above entropy features, while a DR's uniform queries across a relatively large number of data items will yield relatively high values. Thus, although the total amount of data acquired by a typical heavy user may also be high, that volume is concentrated on the particular items of interest; the access pattern is less "uniform", and the entropy feature values will be lower. In contrast, a DR typically acquires many more distinct key values and exhibits strong periodicity, and therefore typically has much higher entropy feature values than the average user.
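A minimal Python sketch of how the three entropy features above might be computed from one user's one-month query log is given below; the pandas-based implementation and the exact handling of the seven-active-day moving average are assumptions, not the patent's reference implementation:

import numpy as np
import pandas as pd

def entropy(counts):
    """Shannon entropy H of the distribution obtained by normalizing the counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_features(log: pd.DataFrame):
    """log: one user's records with columns 'query_time' (datetime64) and 'query_index'."""
    log = log.copy()
    log["day"] = log["query_time"].dt.date

    q = log.groupby("query_index")["day"].nunique()  # q_i: active days on which key i was acquired
    w = log.groupby("query_index").size()            # w_i: total acquisitions of key i in the month

    index_day_entropy = entropy(q.values)            # equation (1)
    index_num_entropy = entropy(w.values)            # equation (2)

    # Assumed variant: entropy of the 7-active-day moving average of the number of
    # distinct key values acquired per active day (duplicate queries not counted).
    per_day = log.groupby("day")["query_index"].nunique().sort_index()
    index_avg_entropy = entropy(per_day.rolling(7, min_periods=1).mean().values)

    return index_day_entropy, index_num_entropy, index_avg_entropy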
The following 23 features capture the acquired data volume from different angles (e.g., magnitude, diversity, activity, and temporal density); they are defined in Table 2. The last 3 features relate to different periods of the day. The rationale is that a typical user 110 usually uses the services of the data provider 100 only during daytime or evening working hours, whereas many DRs 120 sometimes choose to grab large amounts of data late at night when the network is idle. Thus, acquiring data items during unusual hours (e.g., from about 10 p.m. to 8 a.m.) makes the user behavior more suspicious.
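As an illustrative sketch of this last group of features, the fraction of a user's queries issued during unusual hours could be computed as follows; the 22:00 to 08:00 window is taken from the example above and is otherwise an assumption:

def late_night_ratio(query_times):
    """Fraction of queries issued between 22:00 and 08:00 (assumed 'unusual hours' window)."""
    hours = [t.hour for t in query_times]
    late = sum(1 for h in hours if h >= 22 or h < 8)
    return late / len(hours) if hours else 0.0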
It should be noted that the various features set forth above are merely examples, and the disclosure is not limited thereto. In other words, more, fewer, additional, or alternative features may well be designed based on the three user behavior patterns described above or on other possible user behavior patterns.
After the possible candidate features have been determined, they may be assembled into a feature vector, e.g.

(f_1(h_i), f_2(h_i), ..., f_n(h_i)),

where f_k is the k-th feature and h_i is user u_i's query history for database D_T (where D_T ∈ {D_A, ..., D_E}). Thus, one such vector can be created for each user of each database; hereinafter, this feature vector is also called user u_i's profile in database D_T.
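Illustratively, a user's profile for one database can be assembled by evaluating each feature function on that user's query history; the small registry below is a hypothetical sketch rather than the full list of 29 candidate features of Table 2:

import numpy as np

# Hypothetical registry of feature functions f_k; each maps a user's query history
# h_i (a pandas DataFrame of that user's log records) to a single number.
FEATURES = {
    "total_num":     lambda h: len(h),                              # total queries in the month
    "total_indices": lambda h: h["query_index"].nunique(),          # distinct key values queried
    "days_num":      lambda h: h["query_time"].dt.date.nunique(),   # number of active days
    # ... the entropy, volume, and time-of-day features would be registered here ...
}

def profile(history):
    """Profile of user u_i in database D_T: the vector of feature values on history h_i."""
    return np.array([f(history) for f in FEATURES.values()], dtype=float)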
After the above features have been designed, the labeled portion of database D_A (hereinafter referred to as D_A(L)) can then be used to select a useful subset of these features. In other words, one attempts to find, from this limited training sample, the features that are important for identifying the DR 120. FIG. 4 is an example flow diagram illustrating a method 400 for determining valid features using a training data set in accordance with an embodiment of the present disclosure.
As shown in FIG. 4, in step S410, L1-Penalized Logistic Regression (hereinafter abbreviated as LR) may be used. Logistic regression is a very widely used machine learning classification algorithm that fits data to a logistic function to predict the probability of an event occurring. Adding an L1 penalty term to logistic regression sparsifies the model parameters: even if the training data has many dimensions, the trained model has non-zero parameters only in a limited number of important dimensions. The most important parameter of the LR algorithm is the L1-penalty parameter C, which directly determines the number of non-zero parameters in the trained model. Because of this property, LR is widely used as a feature selection technique that learns a weight for each feature; in some embodiments, only features with non-zero weights are ultimately selected as features for identifying DRs. Without loss of generality, the selected n′ valid features are denoted f_0, ..., f_{n′−1} (n′ ≤ 29).
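A minimal scikit-learn sketch of this selection step is shown below, assuming a labeled feature matrix X of shape (number of labeled users, 29) and labels y (1 = DR, 0 = normal); these variable names, as well as the value C = 15 taken from the specific embodiment described later, are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # standardize the 29 candidate features
lr = LogisticRegression(penalty="l1", C=15, solver="liblinear")  # L1 penalty sparsifies the weights
lr.fit(X_scaled, y)

valid_idx = np.flatnonzero(lr.coef_[0])        # indices of features with non-zero learned weight
print("selected valid features:", valid_idx)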
Although LR alone can pick out some features at the outset, the following procedure may additionally be used to adjust the weight of each feature: run a Random Forest (hereinafter abbreviated as RF) algorithm on all labeled users in D_A(L) to train a classifier G, as shown in step S420. Random forest is a highly flexible, nonlinear machine learning algorithm that, following the idea of ensemble learning, combines multiple decision trees into one model: it trains on data samples to generate multiple decision trees and uses them to predict and classify new samples. When using a random forest algorithm, the parameters to be determined mainly include the number of decision trees, their depth, the minimum subtree size, and the like. During training, the relative importance of each candidate feature can also be determined according to its depth in the learned decision trees. Further, assuming that the importance values of the n′ features are c_0, ..., c_{n′−1}, these importance values may then be rescaled (re-weighted) using a scaling parameter k. In some embodiments, k may have a default value of 0.5. The scaling parameter mainly adjusts how strongly the importance of each candidate (or valid) feature affects the final recognition result, and it may be determined experimentally, for example.
The labeled data is used to learn the feature weights rather than to apply the trained classifier G directly to other databases, because the feature distributions of different databases may vary, whereas the essential behavior patterns of DRs are more stable. In other words, instead of learning a data-dependent detection model from the labels, one learns "rules" about which features matter more.
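A corresponding sketch of the random-forest re-weighting step (continuing from the previous sketch, so X_scaled, y, and valid_idx carry over) is given below; because the exact rescaling formula is reproduced only as an image in the original publication, the power-law form importance**k with the default k = 0.5 is an assumption:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=64, random_state=0)   # classifier G
rf.fit(X_scaled[:, valid_idx], y)                              # trained on the selected valid features

importances = rf.feature_importances_                          # c_0, ..., c_{n'-1}
k = 0.5                                                        # scaling parameter (default per the text)
weights = importances ** k                                     # assumed power-law re-weighting
weights = weights / weights.sum()                              # normalize the adjusted weights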
FIG. 5 is an example flow diagram illustrating an example method 500 for identifying anomalous data from valid features in accordance with an embodiment of the present disclosure. Assuming that most users are legitimate and that there is a significant difference between them and DRs in the feature space, a DR can be regarded as an outlier in the data. As shown in step S510 in fig. 5, outlier detection may therefore be performed to identify the abnormal data.
In some embodiments, among the numerous outlier detection algorithms (e.g., density-based algorithms, algorithms insensitive to dimensional scaling, One-Class Support Vector Machines (hereinafter sometimes referred to as One-Class SVM), etc.), a One-Class SVM algorithm (a distance-based algorithm) together with the valid features determined in method 400 above may be selected to make a preliminary determination of the abnormal data among all the data.
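A hedged scikit-learn sketch of this preliminary outlier detection is given below; X_all_scaled denotes the (assumed) standardized candidate-feature matrix of all users to be monitored, valid_idx and weights carry over from the previous sketches, and the hyperparameters are those of the specific embodiment described later:

import numpy as np
from sklearn.svm import OneClassSVM

X_valid = X_all_scaled[:, valid_idx] * weights     # re-weighted valid features of all users
ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.004)
flags = ocsvm.fit_predict(X_valid)                 # -1 = outlier (suspected DR), +1 = inlier

suspected_dr = np.flatnonzero(flags == -1)         # row indices of users flagged as outliers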
In addition, for database D_A, since a small portion of its data is labeled, the predictions of the aforementioned classifier G on the unlabeled data may in some embodiments be used to supplement the results of the outlier detection algorithm, as described in optional step S520. In other words, the union of the two results (outlier detection and classifier G) can be taken.
Furthermore, it has been found in practice that many users are detected as outliers simply because their effective activity is very low. Therefore, in some embodiments, some post-processing may be applied in optional step S530, such as filtering out these low-activity users by thresholding features such as days_num, total_num, and index_day_entropy. In some embodiments, the filtering threshold may be set to the median over all users. The exact value of this threshold does not affect the final result much, since the activity level of a DR is usually much higher than the median.
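A sketch of these two optional post-processing steps (the union with classifier G of step S520 and the low-activity filter of step S530) might look as follows; df_users is an assumed per-user table of feature values, row-aligned with X_all_scaled, and the feature names follow Table 2:

# Step S520 (optional): supplement the outlier result with classifier G's predictions (union).
clf_flags = rf.predict(X_all_scaled[:, valid_idx]) == 1          # 1 = predicted DR class
candidates = set(suspected_dr) | set(np.flatnonzero(clf_flags))

# Step S530 (optional): filter out low-activity users using per-feature median thresholds.
activity = df_users[["days_num", "total_num", "index_day_entropy"]]
keep = (activity > activity.median()).all(axis=1)                # above the median on all three features
final_dr = sorted(u for u in candidates if keep.iloc[u])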
In one specific embodiment, C = 15 may be used for the L1-penalized LR, n_estimators = 64 for the random forest, and kernel = rbf, gamma = 0.1, and nu = 0.004 for the One-Class SVM. The outlier fraction parameter nu is an important trade-off parameter between precision and recall; in this embodiment it may be set to 0.004. Here, "precision" refers to the percentage of identified DRs that are indeed DRs, and "recall" refers to the percentage of all DRs that are correctly identified. Therefore, when nu is too large, recall rises but precision drops, meaning that although more DRs are caught, more innocent users are mistakenly identified as DRs; when nu is too small, precision rises but recall drops, meaning that although fewer innocent users are misidentified, fewer DRs are caught as well.
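On the labeled subset D_A(L), the effect of the trade-off parameter nu can be checked with the usual precision and recall metrics, as in the sketch below; X_labeled_valid and y_labeled are assumed names for the re-weighted valid features and labels of the labeled users:

from sklearn.metrics import precision_score, recall_score

y_pred = (ocsvm.fit_predict(X_labeled_valid) == -1).astype(int)  # 1 = flagged as suspected DR
print("precision:", precision_score(y_labeled, y_pred))          # share of flagged users that are DRs
print("recall:", recall_score(y_labeled, y_pred))                # share of DRs that were flagged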
In this embodiment, taking the above-mentioned data provider 100 as a specific example, the importance estimates of the candidate features obtained after performing the above method steps are shown in Table 3 below.
TABLE 3 feature importance estimation
After training, the importance of the remaining 20 features, other than those listed in the table, is 0, and the features index_day_entropy, total_indices, and index_avg_entropy have significantly higher importance than the other features. This finding is generally consistent with the DR behavior characteristics described above: a DR acquires relatively more key values and does so in a relatively uniform manner.
In addition, the detection results for the users in database D_A show that 24 DRs are detected using outlier detection alone and 21 DRs are detected using classifier G alone, with 6 DRs in common between the two. Table 4 below shows some of these DRs.
TABLE 4 DR test results
The 24 DRs obtained by the outlier detection algorithm all had very high values for the first three features shown in Table 3, with percentile ranks of at least 91%. Among the 24 DRs, three can be separated from the rest by using a k-means clustering algorithm (k = 2); two of these three DRs are shown in the two bottom rows of Table 4 above.
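The clustering step mentioned above can be reproduced with a standard k-means call on the profiles of the detected outliers, as in the following sketch (X_valid and suspected_dr carry over from the earlier sketches):

from sklearn.cluster import KMeans

dr_profiles = X_valid[suspected_dr]                      # re-weighted valid features of the detected DRs
cluster = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(dr_profiles)
# The smaller of the two clusters corresponds to the few DRs that stand apart from the rest.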
The classifier G, by contrast, does not detect any of these three DRs. This is because, given a limited training set, a classifier tends to overfit a single DR behavior pattern, whereas in practice many new patterns appear. For example, the three DRs have feature values as high as the other DRs, but their other features (e.g., minutes_rate_rank, Sum_day_indices, and days_num) are also abnormally high. In other words, they not only acquire data uniformly across key values, but also do so at extremely high frequency and intensity. They are therefore in fact highly suspect DRs.
For several other databases beyond D_A, the first three features identified above are likewise very high for DRs, which indicates that these features also transfer to some degree to other data. In the actual measurements, a user of D_B (who had never used database D_A) had very high values for the first five features of Table 3, all with percentile ranks above 99%. Manual inspection showed that this user 1) acquired data for close to 9000 key values within two hours, 130 times the average; and 2) repeatedly acquired data for a set of nearly 500 key values over a period of about 20 days, always within roughly the same time window each day. The user thus exhibits all three of the aforementioned DR behavior patterns, namely high volume, wide span, and strong periodicity, making this user highly suspicious. Furthermore, in communication with the data provider, the data provider's security experts manually confirmed most of the detected DRs. It can be seen that the solution for identifying DRs according to the embodiments of the present disclosure can identify DRs automatically and accurately.
Fig. 6 is a diagram illustrating an example hardware arrangement of an apparatus 600 for detecting abnormal data according to an embodiment of the present disclosure. As shown in fig. 6, the electronic device 600 may include: a processor 610, a memory 620, an input/output module 630, a communication module 640, and other modules 650. It should be noted that the embodiment shown in fig. 6 is merely illustrative and does not limit the disclosure. Indeed, the electronic device 600 may include more, fewer, or different modules, and may be a stand-alone device or a distributed device spread across multiple locations. For example, the electronic device 600 may include (but is not limited to): a personal computer (PC), server, server cluster, computing cloud, workstation, terminal, tablet, laptop, smartphone, media player, wearable device, and/or home appliance (e.g., television, set-top box, DVD player), and the like.
The processor 610 may be the component responsible for the overall operation of the electronic device 600; it may be communicatively coupled to the other modules/components to receive data and/or instructions to be processed from them and to transmit processed data and/or instructions back to them. The processor 610 may be, for example, a general-purpose processor such as a central processing unit (CPU), digital signal processor (DSP), or application processor (AP). In this case, it may perform one or more of the steps of the method for detecting anomalous data according to the embodiments of the present disclosure above, under the direction of instructions/programs/code stored in the memory 620. The processor 610 may also be, for example, a special-purpose processor such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA). In this case, it may perform one or more of the above steps of the method for detecting abnormal data according to the embodiments of the present disclosure in accordance with its circuit design. Further, the processor 610 may be any combination of hardware, software, and/or firmware. Moreover, although only one processor 610 is shown in FIG. 6, in practice the processor 610 may include multiple processing units distributed across multiple locations.
The memory 620 may be configured to temporarily or persistently store computer-executable instructions that, when executed by the processor 610, may cause the processor 610 to perform one or more steps of the methods described in the present disclosure. Further, the memory 620 may also be configured to temporarily or persistently store data related to these steps, such as user behavior data, candidate feature data, valid feature data, abnormal data, and the like. The memory 620 may include volatile memory and/or non-volatile memory. Volatile memory may include, for example (but not limited to): dynamic random access memory (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), cache, and the like. Non-volatile memory may include, for example (but not limited to): one-time programmable read-only memory (OTPROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), masked ROM, flash memory (e.g., NAND flash, NOR flash), a hard disk drive or solid-state drive (SSD), CompactFlash (CF), Secure Digital (SD), micro SD, mini SD, extreme digital (xD), multimedia card (MMC), memory stick, and the like. Further, the memory 620 may also be a remote storage device, such as network attached storage (NAS). The memory 620 may also include distributed storage devices, such as cloud storage, distributed across multiple locations.
The input/output module 630 may be configured to receive input from the outside and/or provide output to the outside. Although the input/output module 630 is shown as a single module in the embodiment of fig. 6, in practice it may be a module dedicated to input, a module dedicated to output, or a combination of both. For example, the input/output module 630 may include (but is not limited to): a keyboard, mouse, microphone, camera, display, touch-screen display, printer, speaker, headphones, or any other device that can be used for input/output. In addition, the input/output module 630 may also be an interface configured to connect to the above devices, such as a headset interface, microphone interface, keyboard interface, or mouse interface. In this case, the electronic device 600 may be connected to external input/output devices through the interface and thus implement input/output functions.
The communication module 640 may be configured to enable the electronic device 600 to communicate with other electronic devices and exchange various data. The communication module 640 may be, for example: an Ethernet interface card, USB module, serial-line interface card, fiber interface card, telephone-line modem, xDSL modem, Wi-Fi module, Bluetooth module, 2G/3G/4G/5G communication module, or the like. In the sense of data input/output, the communication module 640 may also be regarded as part of the input/output module 630.
Further, the electronic device 600 may also include other modules 650, including (but not limited to): a power module, a GPS module, a sensor module (e.g., a proximity sensor, an illumination sensor, an acceleration sensor, a fingerprint sensor, etc.), and the like.
However, it should be noted that: the above-described modules are only some examples of modules that may be included in the electronic device 600, and the electronic device according to an embodiment of the present disclosure is not limited thereto. In other words, electronic devices according to other embodiments of the present disclosure may include more modules, fewer modules, or different modules.
In some embodiments, the electronic device 600 shown in FIG. 6 may perform the steps of the methods described in conjunction with FIGS. 2-5. In some embodiments, the memory 620 has stored therein instructions that, when executed by the processor 610, may cause the processor 610 to perform various steps according to the various methods described in conjunction with FIGS. 2-5.
The disclosure has thus been described in connection with the preferred embodiments. It should be understood that various other changes, substitutions, and additions may be made by those skilled in the art without departing from the spirit and scope of the present disclosure. Accordingly, the scope of the present disclosure is not to be limited by the specific embodiments described above, but only by the appended claims.
Furthermore, functions described herein as being implemented by pure hardware, pure software, and/or firmware may also be implemented by special purpose hardware, combinations of general purpose hardware and software, and so forth. For example, functions described as being implemented by dedicated hardware (e.g., Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), etc.) may be implemented by a combination of general purpose hardware (e.g., Central Processing Unit (CPU), Digital Signal Processor (DSP)) and software, and vice versa.

Claims (7)

1. A method for detecting anomalous data, comprising:
determining a plurality of candidate features according to the abnormal behavior pattern;
determining one or more valid features of a plurality of candidate features from a training data set; and
determining abnormal data in the data to be detected according to the one or more effective characteristics,
wherein the data to be detected is log data recording data acquisitions by users,
wherein the plurality of candidate features includes a feature related to an evenness of the acquired data, a feature related to an amount of the acquired data, and a feature related to a time period during which the data is acquired,
wherein the abnormal behavior pattern includes an abnormality in the amount of data acquired, an abnormality in the kind of data of the acquired data, and an abnormality in the time of acquiring the data,
wherein the one or more valid features include the following:
-entropy features indicating the evenness of access patterns of respective ones of the data to be detected;
-entropy features indicating a running average over seven active days of the evenness degree of the access pattern of individual ones of the data to be detected without calculating repeated queries; and
-a feature indicating a total number of data indices queried without computing duplicate queries,
wherein the step of determining one or more valid features of the plurality of candidate features from the training data set comprises: determining weights of the plurality of candidate features using an L1 penalty Logistic Regression (LR) algorithm and adjusting the weights of the plurality of candidate features and determining a classifier using a Random Forest (RF) algorithm, from a training data set with accurate labels; and determining the one or more valid features based on the adjusted weights of the respective candidate features,
wherein the step of determining abnormal data in the data to be detected according to the one or more valid features comprises: detecting the data to be detected by using a Support Vector Machine (SVM) algorithm according to the one or more effective characteristics so as to determine abnormal data; classifying the data to be detected by using the classifier so as to determine additional abnormal data belonging to an abnormal data class; and supplementing the exception data with the additional exception data.
2. The method of claim 1, wherein the log data comprises at least one of:
a user identifier for each user;
the acquisition time of each user for acquiring data each time;
a database identifier of a database accessed each time each user acquires data; and
an index in the database that each user accesses each time data is acquired.
3. The method of claim 1, wherein the characteristics relating to the time at which the data is acquired comprise characteristics relating to the time at which the data is acquired in cycles of different units of time.
4. The method of claim 1, wherein the training data set is a training data set with accurate classification labels.
5. The method of claim 1, wherein after the step of detecting the data to be detected using an unsupervised outlier detection algorithm, the method further comprises:
the anomalous data is filtered based on a predetermined threshold to filter out data having normal characteristics related to the amount of data acquired.
6. An apparatus for detecting anomalous data comprising:
a processor;
a memory having instructions stored thereon, which when executed by the processor, cause the processor to perform the method of any of claims 1-5.
7. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-5.
CN201810423903.9A 2018-05-04 2018-05-04 Method, apparatus, and computer-readable storage medium for detecting abnormal data Active CN108829715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810423903.9A CN108829715B (en) 2018-05-04 2018-05-04 Method, apparatus, and computer-readable storage medium for detecting abnormal data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810423903.9A CN108829715B (en) 2018-05-04 2018-05-04 Method, apparatus, and computer-readable storage medium for detecting abnormal data

Publications (2)

Publication Number Publication Date
CN108829715A CN108829715A (en) 2018-11-16
CN108829715B true CN108829715B (en) 2022-03-25

Family

ID=64147436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810423903.9A Active CN108829715B (en) 2018-05-04 2018-05-04 Method, apparatus, and computer-readable storage medium for detecting abnormal data

Country Status (1)

Country Link
CN (1) CN108829715B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111224919B (en) * 2018-11-23 2022-05-13 中移(杭州)信息技术有限公司 DDOS (distributed denial of service) identification method and device, electronic equipment and medium
CN112685735B (en) * 2018-12-27 2024-04-12 慧安金科(北京)科技有限公司 Method, apparatus and computer readable storage medium for detecting abnormal data
CN109587008B (en) * 2018-12-28 2020-11-06 华为技术服务有限公司 Method, device and storage medium for detecting abnormal flow data
CN111523012B (en) * 2019-02-01 2024-01-09 慧安金科(北京)科技有限公司 Method, apparatus and computer readable storage medium for detecting abnormal data
CN110059749B (en) * 2019-04-19 2020-05-19 成都四方伟业软件股份有限公司 Method and device for screening important features and electronic equipment
CN112016934B (en) * 2019-05-31 2023-12-29 慧安金科(北京)科技有限公司 Method, apparatus and computer readable storage medium for detecting abnormal data
CN112016927B (en) * 2019-05-31 2023-10-27 慧安金科(北京)科技有限公司 Method, apparatus and computer readable storage medium for detecting abnormal data
CN112861895B (en) * 2019-11-27 2023-11-03 北京京东振世信息技术有限公司 Abnormal article detection method and device
CN111064719B (en) * 2019-12-09 2022-02-11 绿盟科技集团股份有限公司 Method and device for detecting abnormal downloading behavior of file
CN113516162A (en) * 2021-04-26 2021-10-19 湖南大学 OCSVM and K-means algorithm based industrial control system flow abnormity detection method and system
CN113949527B (en) * 2021-09-07 2024-09-06 中云网安科技有限公司 Abnormal access detection method and device, electronic equipment and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090835A (en) * 2014-06-27 2014-10-08 中国人民解放军国防科学技术大学 eID (electronic IDentity) and spectrum theory based cross-platform virtual asset transaction audit method
CN105915555A (en) * 2016-06-29 2016-08-31 北京奇虎科技有限公司 Method and system for detecting network anomalous behavior
CN106649527A (en) * 2016-10-20 2017-05-10 重庆邮电大学 Detection system and detection method of advertisement clicking anomaly based on Spark Streaming
CN106777024A (en) * 2016-12-08 2017-05-31 北京小米移动软件有限公司 Recognize the method and device of malicious user
CN106886518A (en) * 2015-12-15 2017-06-23 国家计算机网络与信息安全管理中心 A kind of method of microblog account classification
CN107528832A (en) * 2017-08-04 2017-12-29 北京中晟信达科技有限公司 Baseline structure and the unknown anomaly detection method of a kind of system-oriented daily record

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005258599A (en) * 2004-03-09 2005-09-22 Nippon Telegr & Teleph Corp <Ntt> Method for visualization of data, apparatus for visualization of data, program for visualization of data, and storage medium
US20100284293A1 (en) * 2007-12-28 2010-11-11 Nec Corporation Communication network quality analysis system, quality analysis device, quality analysis method, and program
US10191956B2 (en) * 2014-08-19 2019-01-29 New England Complex Systems Institute, Inc. Event detection and characterization in big data streams
CN105187411B (en) * 2015-08-18 2018-09-14 福建省海峡信息技术有限公司 A kind of method of distribution abnormality detection network data flow
US20170206462A1 (en) * 2016-01-14 2017-07-20 International Business Machines Corporation Method and apparatus for detecting abnormal contention on a computer system
CN107784322B (en) * 2017-09-30 2021-06-25 东软集团股份有限公司 Abnormal data detection method, abnormal data detection device, abnormal data detection storage medium, and program product

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090835A (en) * 2014-06-27 2014-10-08 中国人民解放军国防科学技术大学 eID (electronic IDentity) and spectrum theory based cross-platform virtual asset transaction audit method
CN106886518A (en) * 2015-12-15 2017-06-23 国家计算机网络与信息安全管理中心 A kind of method of microblog account classification
CN105915555A (en) * 2016-06-29 2016-08-31 北京奇虎科技有限公司 Method and system for detecting network anomalous behavior
CN106649527A (en) * 2016-10-20 2017-05-10 重庆邮电大学 Detection system and detection method of advertisement clicking anomaly based on Spark Streaming
CN106777024A (en) * 2016-12-08 2017-05-31 北京小米移动软件有限公司 Recognize the method and device of malicious user
CN107528832A (en) * 2017-08-04 2017-12-29 北京中晟信达科技有限公司 Baseline structure and the unknown anomaly detection method of a kind of system-oriented daily record

Also Published As

Publication number Publication date
CN108829715A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108829715B (en) Method, apparatus, and computer-readable storage medium for detecting abnormal data
US11848760B2 (en) Malware data clustering
US11005872B2 (en) Anomaly detection in cybersecurity and fraud applications
US10367841B2 (en) Method and system for learning representations for log data in cybersecurity
US10033694B2 (en) Method and device for recognizing an IP address of a specified category, a defense method and system
US10257211B2 (en) Method, apparatus, and computer-readable medium for detecting anomalous user behavior
US9661010B2 (en) Security log mining devices, methods, and systems
US9479518B1 (en) Low false positive behavioral fraud detection
US8966036B1 (en) Method and system for website user account management based on event transition matrixes
US11062413B1 (en) Automated secondary linking for fraud detection systems
US20130097704A1 (en) Handling Noise in Training Data for Malware Detection
Giatsoglou et al. Nd-sync: Detecting synchronized fraud activities
WO2017205936A1 (en) Classification of log data
US20210136120A1 (en) Universal computing asset registry
CN115238815A (en) Abnormal transaction data acquisition method, device, equipment, medium and program product
JP2024525288A (en) Method, system and program for detecting anomalies in high-dimensional spaces
JP7170689B2 (en) Output device, output method and output program
US20230333720A1 (en) Generating presentation information associated with one or more objects depicted in image data for display via a graphical user interface
CN113705625A (en) Method and device for identifying abnormal life guarantee application families and electronic equipment
CN112801783A (en) Entity identification method and device based on digital currency transaction characteristics
Chaudhary et al. Machine Learning Techniques for Anomaly Detection Application Domains
Smrithy et al. A statistical technique for online anomaly detection for big data streams in cloud collaborative environment
US10977374B1 (en) Method to assess internal security posture of a computing system using external variables
CN118196821B (en) Intelligent management method for optical disc, intelligent management system for optical disc and computer equipment
Kermansaravi et al. Intrusion detection system in computer networks using decision tree and svm algorithms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method, apparatus and computer readable storage medium for detecting abnormal data

Effective date of registration: 20220427

Granted publication date: 20220325

Pledgee: Beijing Zhongguancun bank Limited by Share Ltd.

Pledgor: HUIANJINKE (BEIJING) TECHNOLOGY Co.,Ltd.

Registration number: Y2022990000246