CN109947814B - Method and apparatus for detecting anomalous data groups in a data collection

Method and apparatus for detecting anomalous data groups in a data collection

Info

Publication number
CN109947814B
Authority
CN
China
Prior art keywords
data
connected subgraph
graph
user
vertices
Prior art date
Legal status
Active
Application number
CN201810957760.XA
Other languages
Chinese (zh)
Other versions
CN109947814A (en)
Inventor
黄铃
段亦涛
徐葳
班义琨
Current Assignee
Huianjinke Beijing Technology Co ltd
Tsinghua University
Original Assignee
Huianjinke Beijing Technology Co ltd
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Huianjinke Beijing Technology Co ltd, Tsinghua University
Priority to CN201810957760.XA
Publication of CN109947814A
Application granted
Publication of CN109947814B
Legal status: Active
Anticipated expiration



Abstract

Embodiments of the disclosure provide a method, an apparatus, and a computer-readable storage medium for detecting anomalous data groups in a data collection. The method comprises the following steps: removing, from a graph corresponding to the data collection, edges whose weight is below a first threshold; determining the connected subgraphs formed by the vertices and the remaining edges of the graph; adjusting each connected subgraph respectively so that its respective anomaly density value is maximized; and determining a data group corresponding to an adjusted connected subgraph as an anomalous data group.

Description

Method and apparatus for detecting anomalous data groups in a data collection
Technical Field
The present disclosure relates generally to the field of data mining, and more particularly to a method, apparatus, and computer-readable storage medium for detecting anomalous data groups in a data collection.
Background
Network fraud has become one of the serious threats to the contemporary internet. The purposes of fraud are diverse, ranging from minor attempts to gain public attention to serious financial fraud (e.g., credit card theft). For example, on social networking sites or media-sharing sites, people want to increase the value of their own accounts by adding more fans (followers). As another example, on an e-commerce website, a fraudster registers many accounts to abuse the new-user offers provided by the website, or to push spam about fake services, goods, and the like to normal users. Therefore, a solution is needed to detect such network fraud.
Disclosure of Invention
To at least partially solve or mitigate the above network fraud problem, methods, apparatuses, and computer-readable storage media for detecting anomalous data groups in a data collection according to the present disclosure are provided.
According to a first aspect of the present disclosure, a method for detecting anomalous data groups in a data collection is provided. The method comprises the following steps: removing, from a graph corresponding to the data collection, edges whose weight is below a first threshold; determining the connected subgraphs formed by the vertices and the remaining edges of the graph; adjusting each connected subgraph respectively so that its respective anomaly density value is maximized; and determining a data group corresponding to an adjusted connected subgraph as an anomalous data group.
In some embodiments, the data collection includes data of a plurality of users in one or more modes, and the graph corresponding to the data collection is determined as follows: the vertices in the graph correspond one-to-one to the users; and if there is similarity between the data of two users corresponding to two vertices in the graph, there is an edge between the two vertices. In some embodiments, the weight of an edge is given by:
$$S_{i,j}=\sum_{k=1}^{K} s_k(i,j)$$

wherein

$$s_k(i,j)=\sum_{x} I_k(i,j,x)$$

wherein

$$I_k(i,j,x)=\begin{cases}-\log p_k(x), & \text{if } \exists\, a,b \text{ such that } u_i^{(k,a)}\doteq x \text{ and } u_j^{(k,b)}\doteq x\\ 0, & \text{otherwise}\end{cases}$$

wherein S_{i,j} represents the weight of the edge between vertex i and vertex j in the graph, K represents the total number of modes, the subscript k denotes association with the k-th mode, p_k(x) represents the probability of the value x among all possible values in the k-th mode, u_i^{(k,a)} represents the a-th value of user u_i in the k-th mode, u_j^{(k,b)} represents the b-th value of user u_j in the k-th mode, the operator ≐ represents a customizable equality function, and log is a logarithmic function.
In some embodiments, the first threshold is given by:
$$\theta=\frac{C}{N}\log\frac{N^{2}}{C}$$
where θ is the first threshold, C is the total number of data records in the data collection, and N is the total number of the plurality of users. In some embodiments, before determining the connected subgraphs formed by the vertices and the remaining edges of the graph, the method further comprises: removing vertices in the graph that do not have any edges. In some embodiments, after determining the connected subgraphs formed by the vertices and the remaining edges of the graph, and before adjusting each connected subgraph respectively so that its respective anomaly density value is maximized, the method further comprises: removing connected subgraphs from the graph whose number of vertices is below a second threshold.
In some embodiments, the second threshold is given by:
$$\psi=\sum_{k=1}^{K}\frac{N}{q_k}$$
where ψ is the second threshold, K is the total number of modes, N is the total number of the plurality of users, and q_k is the number of unique values in mode k. In some embodiments, the anomaly density value of a connected subgraph is given by:
$$\mathcal{F}(\hat g)=\frac{\sum_{u_i,u_j\in\hat v} S_{i,j}}{|\hat v|}$$

wherein F(ĝ) represents the anomaly density value of the connected subgraph ĝ, u_i and u_j respectively represent the vertices corresponding to user i and user j, v̂ represents the vertex set of the current connected subgraph ĝ, S_{i,j} represents the weight of the edge between the vertices corresponding to user i and user j in the graph, and |v̂| represents the number of vertices of the current connected subgraph. In some embodiments, adjusting each connected subgraph respectively so that its respective anomaly density value is maximized comprises performing the following for each connected subgraph: determining a first vertex of the current connected subgraph such that the anomaly density value of the current connected subgraph without the first vertex and its associated edges is greater than or equal to the anomaly density value of the current connected subgraph without any other single vertex and its associated edges; determining the current connected subgraph without the first vertex and its associated edges to be an intermediate connected subgraph; repeating the above steps on the intermediate connected subgraph one or more times until the last intermediate connected subgraph has only one vertex, thereby forming a series of intermediate connected subgraphs; and determining, among the original current connected subgraph with no vertices removed and the series of intermediate connected subgraphs, the connected subgraph with the highest anomaly density value to be the adjusted connected subgraph whose anomaly density value is maximized.
According to a second aspect of the present disclosure, there is provided an apparatus for detecting anomalous data groups in a data set. The apparatus comprises: a processor; a memory having instructions stored thereon, which when executed by the processor, cause the processor to perform the method according to the first aspect of the disclosure.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method according to the first aspect of the present disclosure.
By using the method, the apparatus, and/or the computer-readable storage medium of the embodiments of the disclosure, anomalous data groups in massive data can be detected accurately and automatically, helping the service provider pinpoint the anomalous user groups that require attention, thereby avoiding potential losses and saving substantial operation and maintenance costs.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of preferred embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic diagram illustrating an example application scenario suitable for using an outlier data set detection scheme according to an embodiment of the present disclosure.
FIG. 2 is a flow diagram illustrating an example method for detecting anomalous data groups in a data set in accordance with an embodiment of the present disclosure.
FIG. 3 is a diagram illustrating stages in applying a method for detecting anomalous data groups in a data set in accordance with an embodiment of the present disclosure to an example data set.
Fig. 4 is a diagram showing an example result in the case of using the abnormal data group detection method according to the embodiment of the present disclosure.
FIG. 5 is a schematic diagram illustrating comparison of detection results of an anomalous data set detection scheme in accordance with an embodiment of the present disclosure with detection results of other methods.
Fig. 6 is a hardware arrangement diagram illustrating an apparatus for detecting an abnormal data group in a data set according to an embodiment of the present disclosure.
Detailed Description
In the following detailed description of some embodiments of the disclosure, reference is made to the accompanying drawings, in which details and functions that are not necessary for the disclosure are omitted so as not to obscure the understanding of the disclosure. In this specification, the various embodiments described below which are used to describe the principles of the present disclosure are by way of illustration only and should not be construed in any way to limit the scope of the disclosure. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for the same or similar functions, devices, and/or operations. Moreover, in the drawings, the parts are not necessarily drawn to scale. In other words, the relative sizes, lengths, and the like of the respective portions in the drawings do not necessarily correspond to actual proportions. Moreover, all or a portion of the features described in some embodiments of the present disclosure may be applied to other embodiments to form new embodiments that still fall within the scope of the present application.
Furthermore, the disclosure is not limited to each specific communication protocol of the involved devices, including (but not limited to) 2G, 3G, 4G, 5G networks, WCDMA, CDMA2000, TD-SCDMA systems, etc., and different devices may employ the same communication protocol or different communication protocols. In addition, the present disclosure is not limited to a specific operating system of a device, and may include (but is not limited to) iOS, Windows Phone, Symbian, Android, Linux, Unix, Windows, MacOS, and the like, and different devices may employ the same operating system or different operating systems.
Although the scheme for detecting anomalous data groups in a data collection according to embodiments of the present disclosure will be described below primarily in the specific scenario of preventing cyber-fraud, the present disclosure is not so limited. In fact, with appropriate adjustment and modification, the embodiments of the present disclosure can also be applied to various other situations that require detecting data groups with a specific pattern, such as identifying high-value customers. In other words, the scheme according to the embodiments of the present disclosure may be used in any scenario where differences between data groups need to be determined.
As mentioned previously, network fraud has become one of the serious threats to the contemporary internet. Therefore, fraud detection has become a fundamental issue in the fields of computer security and data mining. Although there are many types of methods for capturing/detecting cyber-fraud, some embodiments of the present disclosure are primarily concerned with distinguishing fraudulent users from legitimate users by analyzing user logs, also known as click streams. The logs referred to herein may be multi-modal and may contain at least two types of data: (1) user profiles, such as geographic location, age, phone number, and gender; and (2) user actions, such as sign-up, login, page views and purchases, and social actions such as following other users. Depending on the particular application, most logs contain only a subset of these attributes. However, it should be noted that the present disclosure is not limited thereto. In fact, in other embodiments of the present disclosure, anomalous data group detection may also be performed on data sets other than user logs.
Related methods for detecting fraud from logs can be broadly divided into two categories. The first category models the action sequences of individual users and uses rule-based or machine-learning-based approaches to detect anomalies. However, this type of solution has at least two drawbacks: (1) it must wait for multiple actions to occur, so a loss may already have been incurred by the time of detection; and (2) fraudsters can change their behavior to circumvent detection, so the sequence patterns used for detection do not persist.
A second, more popular type of detection technique focuses on detecting group behavior across multiple users. The basic assumption is that normal users take a wide variety of actions, whereas fraudsters exhibit unusual synchronized behavior. This synchronized behavior exists for at least two reasons. First, in order to reduce the cost spent on resources such as proxy servers, telephone numbers, fake accounts, and/or fraud scheme designs, fraudsters reuse these resources as much as possible, so many fraudulent activities share attributes such as source IP addresses or telephone numbers. Second, to achieve an "economies of scale" effect, fraudsters often use many (fake) accounts to carry out the same fraud (e.g., selling fake followers, also known as "zombie fans"), resulting in synchronized actions between these accounts (e.g., following a certain person at the same time).
Cluster-based fraud detection is challenging. First, because fraud schemes change rapidly over time, there is typically no significant amount of labeled fraud data with which to train a detection model. In addition, fraudsters often perform camouflaging actions to disguise the similarities between them. Therefore, in most cases one can only rely on unsupervised learning methods. Second, log data is multi-modal in nature, with a different data type for each mode. Many related methods encode these data types as feature vectors. Given multiple discrete data types (e.g., IP addresses and telephone numbers), typical one-hot encoding results in very high-dimensional and sparse feature vectors, which are difficult to use with unsupervised algorithms. Furthermore, different fraudsters exhibit synchronized behavior on different subsets of modes, so the data cannot be analyzed with conventional clustering algorithms. Third, different modes may have widely different distributions across different data sets. For example, IP addresses typically follow a Poisson distribution, while purchased items follow a power-law distribution, which further complicates modeling. Last but not least, it is important for the operator that the detection results be interpretable, since false detections significantly affect the user experience; interpretability enables the operator to make reliable decisions about fraud.
Therefore, in order to at least partially solve or mitigate the above problems, a scheme according to embodiments of the present disclosure is proposed for detecting fraudulent groups from log data or, more generally, anomalous data groups, based on a graph representation of the data collection. It determines the similarity between two users by using information that captures how abnormal it is for them to be similar on a specific value (e.g., an IP address). In addition, this information is combined with graph-level attributes (e.g., subgraph size and density) to measure the suspiciousness of a group of users. However, it should be noted that a fraud group is only one type of anomalous data group, and the disclosed embodiments are not limited thereto. In other words, the approach according to embodiments of the present disclosure is applicable to detecting anomalous data groups having any particular pattern, not just fraudulent groups. Therefore, in the following embodiments, wherever the term "fraud" or the like is used, it can be read interchangeably with "anomaly".
In addition, the anomaly data set detection scheme designed according to the disclosed embodiments is linear in both average run-time and space costs versus the number of users. Therefore, in practical implementation, it can be easily scaled according to the size of data without decreasing the operation efficiency as the data size becomes larger.
FIG. 1 is a schematic diagram illustrating an example application scenario 10 suitable for using the anomalous data group detection scheme according to an embodiment of the present disclosure. As shown in FIG. 1, a fraudulent user 110 may conduct various fraudulent activities by registering a large number of fake accounts with a service provider 100. Fraudulent activities may include (but are not limited to): exploiting new-user offers provided by the service provider 100 to make large profits, providing various illicit services to normal users 120 based on the fake accounts (e.g., buying and selling fake followers), defrauding normal users 120, and so on. Thus, the fraudulent user 110 poses a serious threat both to the service provider 100 and to the normal users 120. For this reason, the service provider 100 needs a scheme capable of detecting abnormal users (or, more generally, abnormal data) from various user data (e.g., user information, behavior data, etc.).
In particular, in an abnormal data set detection scheme according to an embodiment of the present disclosure, the following concept is proposed. As a core of the scheme, the embodiments of the present disclosure propose two metrics: collusion score (or collusion value) and anomalous density score. These two metrics catch well the graph attributes and information theory attributes of rogue user groups. These two metrics distinguish the rogue group better from normal users than existing metrics, especially when the rogue group exhibits synchronized behavior for different subsets of patterns. Hereinafter, these two metrics will be described in detail in connection with specific embodiments.
Since most fraud schemes are designed to capture financial gains, understanding the economic principles behind them helps detect them. In fact, as with most economic activities, a fraudster is motivated to commit fraud only if the profit is greater than the cost. For example, fake disposable accounts (sometimes referred to as "Sybils") are a key enabler of many types of fraud. To make the creation of such fake accounts more expensive, most websites require the email address or cell phone number of the associated user. For example, in the embodiment shown in FIG. 1, the service provider 100, in order to increase the fraud cost of the fraudulent user 110, would likely require the user to provide a series of information resources such as an identification, a phone number, a contact address, an email address, and the like at registration.
Unfortunately, however, fraudsters have formed a professional chain of fraud services that is widely disseminated through the darknet. These services greatly reduce the cost to the fraudster. There are reports indicating that $25 can purchase 150 IP addresses, whereas approximately 1000 mobile SIM cards cost $140 to $420. To maximize profits, fraudsters often reuse these different resources (e.g., cell phone numbers, accounts, fake followers, etc.) across multiple frauds.
This resource-sharing phenomenon has become a key element of fraud detection. Given the inevitable sharing of resources, fraudulent users often exhibit unusually similar characteristics compared to legitimate users. For example, studies have found that many cell phone numbers are reused and that many spam agents and zombie hosts have IP addresses that fall within a few specific ranges.
In addition, there are other reasons for a fraudster's synchronized behavior. Fraudsters use a large number of accounts to complete a few tasks in a short period of time, such as following the same paying user or setting up many new accounts. This is because the economic benefit of a single task is small, and the initial investment of a fraudster (e.g., developing software for the fraud) can only be recouped if the volume is large enough.
Thus, shared resources and common tasks are more fundamental behavioral characteristics that are harder for fraudsters to avoid. While a fraudster may quickly change its page-view sequence to avoid detection of a particular pattern, changing its resource- and task-sharing patterns would have a serious impact on the fraudster's economic model.
Thus, significant clustering behavior of fraudsters due to resource reuse and common tasks may be considered. In this regard, according to the embodiments of the present disclosure, the following are proposed as some basic facts of a scheme for detecting fraud.
(1) Similarity does not make a user suspicious, but unusual similarity does. In particular, two users may be similar in many modes, but similarity in some of the modes is more suspicious than in others. For example, if two users share an IP address and both follow the same random "nobody" (as opposed to a celebrity), then the two users are very suspicious. However, if the two users have the same gender or city, or follow the same celebrity, they are less suspicious. In other words, two users are more likely to be fraudulent colluders not because they are similar, but because the probability of their being the same in a certain mode or on a certain value is very low. Such unusually similar behavior between a pair of users is referred to herein as a "collusion".
(2) A fraud group (or fraud organization) contains an unusually large number of collusions. A fraudster has to perform the same activity many times to achieve economies of scale. Thus, it is expected that many pairwise collusions can be found among fraudsters. Many documents indicate that a larger group size is a key indicator of fraudsters. It may be normal for a few family members to share a phone number, but it is not normal when several tens of users share one phone number. In some specific examples, fraud groups can be seen with sizes from 70 to 667, whereas a normal group rarely has more than 20 users.
(3) A group is suspicious not because it has many edges, but because it contains many unusually similar edges. Fraudsters typically use a large number of fraudulent accounts for the same task, and thus users within the same fraud group are similar to each other in an unusual but consistent manner. In contrast, even if a genuine legitimate user happens to be similar to some fraudsters (e.g., is in the same IP subnet as the fraud organization), he or she is unlikely to be similar to fraudulent users in many modes. In order to avoid being detected, fraudsters often generate camouflage activities, such as having each fake user follow some random person to reduce their consistency. A good fraud detection algorithm should be able to counter this camouflage behavior.
In the following embodiments of the present disclosure, log data containing a large number of user profiles and user behaviors is used as input. It is therefore one of the purposes of some embodiments of the present disclosure to distinguish anomalous groups from legitimate users in an unsupervised manner. However, the present disclosure is not so limited and is applicable to any scenario in which a suspicious data group needs to be detected from a large amount of data (and is not limited to log data).
A method for detecting an abnormal data group in a data set according to an embodiment of the present disclosure will be described in detail below in conjunction with fig. 2 and 3.
FIG. 2 is a flow diagram illustrating an example method 200 for detecting anomalous data groups in a data set in accordance with an embodiment of the present disclosure. FIG. 3 is a diagram illustrating stages in applying a method 200 for detecting anomalous data groups in a data set in accordance with an embodiment of the present disclosure to an example data set.
In general, a solution according to an embodiment of the present disclosure works as follows. First, the log data is scanned and an initial graph is constructed. During construction, the number of edges that need further consideration is kept to a minimum. Edge weights (i.e., collusion scores) are then computed for each edge, and a threshold for filtering out less suspicious edges is determined dynamically; in other words, only very unusual similarities are retained. Next, subgraph anomaly density scores are computed and used for filtering. For example, all connected subgraphs can be found and an anomaly density score computed for each of them. The subgraphs can then be ranked by this score and the subgraphs with lower scores deleted. Finally, subgraph consistency is refined. After the above steps, the remaining subgraphs exhibit significant clustering behavior and are therefore highly suspicious; however, some individual legitimate users may still be included (e.g., by accidentally sharing an IP subnet). The clusters are therefore refined by removing individual vertices to increase internal consistency and thereby reduce false positives.
Specifically, the raw data set may be converted into a graph before the method 200 of FIG. 2 is performed. For example, as shown in FIG. 3, the entire data set, containing N users in K modes, may first be represented as a weighted undirected graph G = (V, E), where V is the set of vertices of graph G and E is the set of edges of graph G. Each user (whose data may be regarded as a tensor with K modes) is modeled as a vertex of graph G. In other words, the users are in one-to-one correspondence with the vertices V of graph G. Furthermore, if there is some similarity between two users, e.g., they share the same value in a certain mode (such as the same IP address or the same phone number), then an edge may be used to connect the two users. In this context, the term "mode" may be understood as one dimension of a user's data, such as the user name, email address, mobile phone number, home address, or IP address, or it may refer to behavioral data of the user, such as following a certain person or purchasing a certain item. In other words, any information or class of information about a user that can be reflected in data can be regarded as a dimension or mode of that user.
In addition, u_i ∈ V represents a user, and u_i^k represents the k-th mode of u_i. Note that u_i^k may comprise a set of values, and u_i^{(k,m)} may be used to represent its m-th value (e.g., the m-th person the user has followed, the m-th item purchased, and so on). The set of values may be used to capture multiple actions of the user; for example, if the user purchases multiple items, each individual item may be represented by one u_i^{(k,m)}. The edge e_{i,j} between u_i and u_j may carry an edge weight S_{i,j}, i.e., the collusion score, which will be described in detail later. Thus, by the above conversion, the "raw data" can be transformed into the graph of vertices and edges shown in FIG. 3, as indicated by the step "format conversion" in FIG. 3. In the embodiment shown in FIG. 3, the goal is then to find the subgraphs of the graph that represent anomalous groups. As will be described in detail below, a single group or subgraph may be defined as ĝ = (v̂, ε̂), where v̂ ⊆ V and ε̂ ⊆ E. Furthermore, given a subgraph ĝ whose anomaly density score is defined as F(ĝ), a group with a higher F(ĝ) is more likely to consist of anomalous members. Thus, the objective becomes: given graph G, compute a list of anomaly groups ranked by F(ĝ).
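For illustration only, the following is a minimal Python sketch of how a multi-mode log could be turned into such a weighted undirected graph of users. The record layout, mode names (e.g. "ip", "followee"), and helper names are illustrative assumptions rather than part of the disclosed method; the edge weights are filled in later by the collusion score S_{i,j}.

```python
from collections import defaultdict
from itertools import combinations

# Illustrative multi-mode log: one dict of mode -> set of values per user.
# The mode names ("ip", "followee") are assumptions for this example only.
users = {
    "u1": {"ip": {"10.0.0.1"}, "followee": {"alice"}},
    "u2": {"ip": {"10.0.0.1"}, "followee": {"bob"}},
    "u3": {"ip": {"10.0.0.2"}, "followee": {"alice", "bob"}},
}

def build_graph(users):
    """Vertices are users; an edge (i, j) exists if the two users share
    at least one value in at least one mode. Weights are added later."""
    # Index users by (mode, value) so users sharing a value are found directly.
    sharers = defaultdict(set)
    for uid, modes in users.items():
        for mode, values in modes.items():
            for v in values:
                sharers[(mode, v)].add(uid)
    edges = set()
    for group in sharers.values():
        for i, j in combinations(sorted(group), 2):
            edges.add((i, j))
    return set(users), edges

vertices, edges = build_graph(users)
print(vertices, edges)
```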
Accordingly, the method 200 for detecting an abnormal data set according to an embodiment of the present disclosure may include the following steps S210 to S240.
S210: Removing low-weight edges
The term "collusion score" used herein will first be described in detail. The collusion score may be a metric defined to capture the degree of unusual similarity between a pair of users. Denote the probability of a value x in mode k as p_k(x). The measure I_k(i, j, x) is defined to capture how unusual it is for u_i and u_j to share the same value x in mode k; in other words, it is the information of the event that both u_i and u_j take the value x. It is specifically defined by the following formula (1):

$$I_k(i,j,x)=\begin{cases}-\log p_k(x), & \text{if } \exists\, a,b \text{ such that } u_i^{(k,a)}\doteq x \text{ and } u_j^{(k,b)}\doteq x\\ 0, & \text{otherwise}\end{cases}\tag{1}$$
wherein ≐ is a customizable equality operator that defaults to the natural equality function for the corresponding data type. For example, for numerical data it may be arithmetic equality; for character data it may be text equality, and so on. As mentioned above, however, the equality operator may also be defined as desired; for example, all values within a specified interval around a certain value may be regarded as equal to one another.
Intuitively, if the two users do not take the same value x (which is considered the normal case), the resulting information is 0. Furthermore, the information gained from two users sharing the same value x depends on the overall probability of that value. For example, if both users follow a celebrity on a microblog, this is not surprising; but if they both follow a "nobody", they are more suspicious.
Thus, by summing this information over all possible x, the collusion score between u_i and u_j in mode k (i.e., the collusion score of a single mode) is defined as:

$$s_k(i,j)=\sum_{x} I_k(i,j,x)\tag{2}$$

To compute the collusion score across all K modes (the overall collusion score, or S-score), following the usual additivity of information, one can define:

$$S_{i,j}=\sum_{k=1}^{K} s_k(i,j)\tag{3}$$
intuitively, Si,jThe higher u isiAnd ujThe more similar. In practice, the S-score shows a large variance. For example, a pair of users sharing an IP subnet and device ID may get a higher S-score. In contrast, a normal user may not share these values with anyone, and thus Si,jClose to zero.
In the embodiment shown in FIG. 3, the higher the edge weight S_{i,j}, the thicker the edge is drawn; conversely, the lower the edge weight S_{i,j}, the thinner the edge. It can be seen that in FIG. 3 the edges between vertices u1 and u2, between u2 and u3, between u4 and u7, and between u2 and u4 all have relatively low weights, while the edges between vertices u1 and u4, among vertices u5, u6, u8, u9, and between vertices u7 and u8 have relatively high weights.
In addition, collusion scores can be extended to accommodate different data types and distributions.
First, note that ≐ represents a customizable "equality" definition for each mode. For example, if the first 24 bits of two users' IP addresses are the same, the two users are typically regarded as sharing the same IP subnet. As another example, for timestamps, two timestamps within a range Δ of each other are typically treated as the same timestamp.
Next, determining p_k(x) is cumbersome because we do not always know the distribution of mode k. In this case, in some embodiments, for modes with discrete values (e.g., categorical modes), we can assume a "uniform distribution" and simply set p_k(x) to 1/q_k for all x, where q_k is the number of unique values of mode k. For example, if all possible values of mode k are {1, 2, 3}, then even though the data of mode k appearing in the data set may be {1, 1, 1, 2, 2, 2, 1, 1}, q_k is still 3. This approximation is applicable to many fraud-related attributes, such as IP subnets and telephone numbers, which typically follow a Poisson distribution.
However, the uniformity assumption is not applicable to low-entropy distributions, such as long-tail distributions, which are common for modes such as purchased items or followed users. Low entropy means that many users behave very similarly regardless of fraud. Intuitively, for such a distribution it is not surprising that users follow celebrities (the head of the distribution), but more information is provided if they all follow someone in the tail. For example, in a social network, 20% of users receive more than 80% of the attention. A dense subgraph between celebrities and their fans is unlikely to be fraudulent. If mode k has a long-tail distribution, its entropy is very low. For example, the entropy of a uniform distribution over more than 50 values is 3.91, but the entropy of a long-tail distribution with 90% of the probability mass concentrated on one value is only 0.71. When a mode is found to have low entropy, an empirical distribution (i.e., a histogram) can be computed and used to calculate p_k(x). In addition, in other embodiments, other custom p_k(x) functions may also be used.
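The following sketch illustrates one possible way, consistent with the paragraph above, to pick p_k(x): use the uniform 1/q_k approximation when the observed entropy of the mode is high, and fall back to the empirical histogram when the entropy is low. The entropy cut-off used here is an arbitrary illustrative choice, not a value specified by the disclosure.

```python
import math
from collections import Counter

def estimate_p(values, low_entropy_threshold=1.0):
    """Return a dict value -> p_k(x) for one mode.

    If the empirical (natural-log) entropy is above the threshold, assume a
    uniform distribution p_k(x) = 1/q_k over the q_k unique values; otherwise
    use the empirical histogram. The threshold 1.0 is an illustrative choice.
    """
    counts = Counter(values)
    total = sum(counts.values())
    empirical = {x: c / total for x, c in counts.items()}
    entropy = -sum(px * math.log(px) for px in empirical.values())
    if entropy > low_entropy_threshold:
        q_k = len(counts)                    # number of unique values of mode k
        return {x: 1.0 / q_k for x in counts}
    return empirical                         # low entropy: long-tail mode

# A long-tail mode (90% of the mass on one value) keeps its empirical histogram.
print(estimate_p(["celebrity"] * 9 + ["nobody"]))
```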
Next, a threshold θ for removing normal edges (herein sometimes referred to as the "first threshold") may be determined. Specifically, one may iterate over all edges in the graph and compute their S-scores. If S_{i,j} < θ, the corresponding edge may be removed, where θ is the threshold. θ may be determined using the following equation:
$$\theta=\frac{C}{N}\log\frac{N^{2}}{C}$$

where C is the total number of event records in the data set. Intuitively, θ is the product of two terms: the average number of events per user, C/N, and the average information per user, log(N²/C).
Therefore, θ can be understood as the average score over all edges. Furthermore, in some embodiments, a vertex may be removed if its degree in the graph becomes zero. On the one hand, this can significantly reduce the amount of subsequent processing; on the other hand, we are simply not concerned with users that are dissimilar to all other users.
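A minimal sketch of this pruning step follows. It assumes the edge weights S_{i,j} have already been computed (e.g. with a collusion-score routine such as the one sketched earlier), and it uses θ = (C/N)·log(N²/C) only as the reconstruction of the threshold adopted above; the exact form of θ can be substituted as needed. All names and numbers are illustrative.

```python
import math

def prune_graph(weights, num_records, num_users):
    """Remove edges whose S-score is below theta, then drop isolated vertices.

    weights: dict (i, j) -> S_{i,j} for the initial graph
    Returns (kept_edges, kept_vertices).
    Assumes theta = (C/N) * log(N^2 / C), per the threshold adopted above.
    """
    theta = (num_records / num_users) * math.log(num_users ** 2 / num_records)
    kept_edges = {e: s for e, s in weights.items() if s >= theta}
    kept_vertices = {v for e in kept_edges for v in e}   # drop zero-degree vertices
    return kept_edges, kept_vertices

weights = {("u1", "u2"): 0.4, ("u1", "u4"): 9.3, ("u5", "u6"): 12.0}
edges, vertices = prune_graph(weights, num_records=30, num_users=10)
print(edges, vertices)   # only the two high-weight edges and their endpoints remain
```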
Returning to FIG. 3, after the step "edge/vertex removal" is performed, it can be seen that the relatively low-weight edges between vertices u1 and u2, between u2 and u3, between u4 and u7, and between u2 and u4 are all removed, leaving the relatively high-weight edges between vertices u1 and u4, among vertices u5, u6, u8, u9, and between vertices u7 and u8. In addition, after these edges are removed, vertices u2 and u3 become vertices without edges, i.e., of degree zero, so both vertices may be removed as well. However, in other embodiments, these two vertices may not be deleted, but may each be treated as a connected subgraph for subsequent processing.
S220: determining connectivity subgraphs
Then, candidate anomalous data groups can be found. Since many edges have been filtered out, the graph G falls apart into a plurality of connected components (i.e., connected subgraphs). Instead of using other clustering algorithms, these connected subgraphs can be used as candidates for the anomalous data groups. This is a reasonable choice not only because connectivity indicates similarity, but also because connected subgraphs can be computed efficiently. As shown in FIG. 3, after the step "connected subgraph determination", two connected subgraphs can be separated out, one over vertices u1, u4 and one over vertices u5, u6, u7, u8, u9, as shown by the dashed boxes.
Furthermore, in some embodiments, according to the aforementioned basic fact (2), only sufficiently large connected subgraphs may be retained and all smaller connected subgraphs considered normal. First, in some embodiments, the size threshold (herein sometimes referred to as the "second threshold") may be determined heuristically as:

$$\psi=\sum_{k=1}^{K}\frac{N}{q_k}$$

Intuitively, ψ is the sum over the modes k of the average number of users per value of mode k, and can thus be understood as the average size of a connected subgraph. Therefore, in some embodiments, all connected subgraphs whose size is smaller than ψ can be removed. As shown in FIG. 3, after the step "connected subgraph filtering", the connected subgraph (u1, u4), whose number of vertices is below ψ, is removed from the graph, leaving only the larger connected subgraph (u5, u6, u7, u8, u9).
However, it should be noted that this filtering step is optional. In other words, in other embodiments, smaller connected subgraphs may be retained for subsequent processing. For example, in a scenario of identifying high-value customers, smaller connected subgraphs can be kept entirely so as not to miss a particular high-value customer.
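The following sketch finds the connected subgraphs of the pruned graph with a simple union-find and optionally drops those smaller than the size threshold ψ = Σ_k N/q_k; as noted above, the filtering step can simply be skipped. All names and example numbers are illustrative.

```python
def connected_components(vertices, edges):
    """Group vertices into connected subgraphs using union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())

def size_threshold(num_users, unique_values_per_mode):
    """psi = sum over modes k of N / q_k (average users per value of mode k)."""
    return sum(num_users / q_k for q_k in unique_values_per_mode)

vertices = {"u1", "u4", "u5", "u6", "u7", "u8", "u9"}
edges = [("u1", "u4"), ("u5", "u6"), ("u6", "u8"), ("u8", "u9"),
         ("u7", "u8"), ("u5", "u9")]
psi = size_threshold(num_users=9, unique_values_per_mode=[3, 9])   # 9/3 + 9/9 = 4.0
candidates = [c for c in connected_components(vertices, edges) if len(c) >= psi]
print(candidates)   # only the 5-vertex component {u5, u6, u7, u8, u9} survives
```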
S230: regulating connectivity pattern
As previously mentioned, depending on the data distribution and the equality operator ≐, some legitimate users may occasionally have edges with non-negligible weight S connecting them to one or more anomalous users, so that these legitimate users are also included in a connected subgraph. For example, as shown in FIG. 3, user u7 is connected only to user u8, for instance because they share the same IP subnet. It is clear, however, that this user is not similar to the other users in the connected subgraph, so he is likely not a fraudster but only a normal user. Therefore, based on the above basic fact (3), such false detections can be eliminated by increasing the consistency among all members of a group. Since fraudulent users share common resources and tasks, they are more similar to each other than to legitimate users who are accidentally included.
For this reason, the anomaly density score needs to be calculated. Legitimate users may be excluded by considering the anomaly density. In some embodiments, the anomaly density score of a candidate subgraph ĝ may be computed by dividing the sum of its edge weights by its number of vertices:

$$\mathcal{F}(\hat g)=\frac{\sum_{u_i,u_j\in\hat v} S_{i,j}}{|\hat v|}\tag{5}$$
The F(ĝ) score captures three metrics of a connected subgraph (or group) ĝ: the edge weights S, the group size |v̂|, and the edge density. The first metric takes into account information-theoretic similarity, while the latter two take into account graph-based clustering features.

Specifically, with all other parameters held constant in each case, the anomaly density score satisfies the following three conditions: (1) collusion: edges with higher weights result in a higher F(ĝ); (2) size: larger groups have a higher F(ĝ); and (3) consistency: the denser the cluster (dense meaning having a higher total S-score), the higher F(ĝ). Appendix A gives a proof of this.
In contrast, a metric having only graph features or only information features cannot satisfy all three of the above conditions. For example, edge density alone is not a good metric because it does not satisfy condition (1). The numerator of formula (5), i.e., the sum of the edge weights S_{i,j}, emphasizes only the information and thus does not satisfy condition (3).
For a group ĝ, increasing F(ĝ) means increasing the group size and increasing the consistency between all vertices. Thus, the problem reduces to finding, within each candidate connected subgraph, the subgraph ĝ that maximizes F(ĝ). This is a typical densest-subgraph problem, which can be solved using a flow network. However, that approach is difficult to scale to data sets with millions of vertices. Thus, in some embodiments, a greedy algorithm that runs in near-linear time is proposed. Algorithm 1 gives the pseudo-code of the algorithm.

Algorithm 1: finding the subgraph ĝ that maximizes F(ĝ)
Specifically, for each connected subgraph, a first vertex of the current connected subgraph is determined such that the anomaly density value of the current connected subgraph without the first vertex and its associated edges is greater than or equal to the anomaly density value of the current connected subgraph without any other single vertex and its associated edges. The current connected subgraph without the first vertex and its associated edges is then taken as an intermediate connected subgraph. The above steps are repeated one or more times on the intermediate connected subgraph until the last intermediate connected subgraph has only one vertex, thereby forming a series of intermediate connected subgraphs. Among the original current connected subgraph (with no vertices removed) and the series of intermediate connected subgraphs, the one with the highest anomaly density value is determined to be the adjusted connected subgraph whose anomaly density value is maximized.
Intuitively, removing a vertex u_i (and all of its associated edges) reduces the numerator of formula (5) (denoted t) by the total weight of the edges incident to u_i, and it also reduces the denominator |v̂| by 1. At each iteration, the vertex u* with the minimum total incident edge weight is removed, which greedily maximizes the gain in F(ĝ). This process is repeated until all vertices have been removed. At the end, the algorithm returns the set of remaining vertices at the point where F(ĝ) was largest. This greedy algorithm is a 2-approximation algorithm; the proof is given in Appendix B.
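Since the original pseudo-code of Algorithm 1 is available only as a figure, the following is a minimal Python sketch consistent with the greedy refinement described in the two paragraphs above; it is not the patent's original pseudo-code, and all names are illustrative. It repeatedly removes the vertex with the smallest total incident edge weight, tracks F(ĝ) after every removal, and returns the vertex set at which F(ĝ) was largest. The plain loop used here rescans the edge list on each removal; a priority queue would give the O(|ε̂| log |v̂|) cost mentioned below.

```python
def refine_subgraph(vertices, weights):
    """Greedy refinement of one connected subgraph (sketch of Algorithm 1).

    vertices: set of vertex ids
    weights:  dict (i, j) -> S_{i,j} for edges inside the subgraph
    Returns the vertex subset with the highest anomaly density F = sum(S) / |v|.
    """
    remaining = set(vertices)
    incident = {v: 0.0 for v in remaining}       # total incident edge weight per vertex
    total = 0.0
    for (i, j), s in weights.items():
        incident[i] += s
        incident[j] += s
        total += s

    best_density = total / len(remaining)        # density of the unmodified subgraph
    best_set = set(remaining)

    while len(remaining) > 1:
        u = min(remaining, key=lambda v: incident[v])   # least-connected vertex
        remaining.remove(u)
        total -= incident[u]
        for (i, j), s in weights.items():               # update neighbours of u
            if i == u and j in remaining:
                incident[j] -= s
            elif j == u and i in remaining:
                incident[i] -= s
        density = total / len(remaining)
        if density > best_density:
            best_density = density
            best_set = set(remaining)
    return best_set, best_density

vertices = {"u5", "u6", "u7", "u8", "u9"}
weights = {("u5", "u6"): 9.0, ("u5", "u8"): 8.0, ("u5", "u9"): 9.5,
           ("u6", "u8"): 8.5, ("u6", "u9"): 9.0, ("u8", "u9"): 8.0,
           ("u7", "u8"): 1.5}
print(refine_subgraph(vertices, weights))   # u7 is dropped, raising the density
```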
Returning to FIG. 3, after the step "connected subgraph adjustment", it can be seen that the legitimate user u7 is excluded from the anomalous user group, thereby obtaining the final anomalous connected subgraph (u5, u6, u8, u9).
S240: Determining the anomalous data groups
Finally, the anomalous data group corresponding to the anomalous connected subgraph can be determined from the determined anomalous connected subgraph. For example, in the embodiment shown in FIG. 3, the data associated with users (u5, u6, u8, u9) in the original data set is all determined to be anomalous data, and the user group (u5, u6, u8, u9) is determined to be an anomalous user group.
Therefore, by using the method for detecting anomalous data groups according to the embodiments of the disclosure, anomalous data groups in massive data can be detected accurately and automatically, helping the service provider pinpoint the anomalous user groups that require attention, thereby avoiding potential losses and saving substantial operation and maintenance costs.
Theoretically, a graph with N vertices has O(N²) edges, so performing graph initialization and traversal in O(KN²) time is non-trivial. However, since not that many users are similar to each other, the graph is very sparse. Let M be the initial number of edges; the cost of graph traversal is then O(KM). In practice, it can be observed that M is roughly of the same order as N, so the average graph traversal time is linear in N. In terms of memory, the complexity is likewise linear in N + M. Furthermore, the time to compute all collusion scores S is linear in the number of user pairs sharing the same value in each mode. Assuming M is the number of user pairs having the same value in the k-th mode, the computational complexity of S per mode is O(M + N), and the total cost is therefore O(K(M + N)). Furthermore, the cost of finding the connected subgraphs of the graph is O(N), and for each group it takes O(|ε̂|) to compute the density. Improving the consistency of an anomalous data group using Algorithm 1 costs O(|ε̂| log |v̂|) per anomalous data group. Since very likely |ε̂| ≪ |E| and |v̂| ≪ |V|, this cost is small.
In summary, the overall complexity of the above scheme is linear with respect to the number of vertices in the graph. More importantly, however: the algorithms used are highly parallelizable and therefore the scheme can be easily implemented on parallel computing platforms, such as Apache Spark.
Further, in some embodiments, the information across all modes and all values may be used as the collusion score for a user pair. This information is a good metric for capturing unusual similarities. In one particular embodiment, 50 normal users and 50 fraudulent users were first randomly sampled from a single group of a real data set with ground-truth labels, and their pairwise collusion scores were plotted as a heat map, for example as shown in FIG. 4. Users 1-50 in FIG. 4 are normal users, while users 51-100 are fraudulent users. It can be observed that the collusion scores between normal-normal user pairs and normal-fraudulent user pairs are very low. In contrast, the scores of fraudulent-fraudulent pairs are much higher, confirming the basic fact observed above that fraudulent users are more similar to each other. Second, by definition the score is a non-negative number. Thus each mode contributes only non-negatively to the overall fraud score, i.e., introducing a new mode does not cancel the contribution of the other modes, which is very useful for tolerance to camouflage. Third, the different scores can be combined by simply adding the information, which also keeps the computation simple.
It is very common for a fraudulent user to try to behave like a normal user. Consider the following: a fraudulent user u_f, in addition to following the paying users, also follows a celebrity or a random user as camouflage. As mentioned previously, existing algorithms do not distinguish such camouflage behavior well, and their final results suffer. The algorithm of the disclosed embodiments is resistant to such camouflage. First, the camouflage activity does not reduce u_f's own S-score (in fact, the S-score never decreases). Second, it may increase the S-score of the followed user, but only by the same amount as the increase in u_f's own score, so that u_f is still the more suspicious one. In other words, the algorithm for detecting anomalous data groups according to embodiments of the present disclosure is highly resistant to camouflaged fraudulent (anomalous) activities and can still find the anomalous data groups or fraudulent users.
FIG. 5 is a schematic diagram illustrating comparison of detection results of an anomalous data set detection scheme in accordance with an embodiment of the present disclosure with detection results of other methods.
The term "accuracy" as used herein refers to the percentage of the data in the identified anomalous data groups that is indeed anomalous; the term "recall" refers to the percentage of all anomalous data groups that are correctly identified. Thus, as the recall increases, the accuracy generally decreases, meaning that while more anomalous data groups are caught, more innocent users are misidentified as anomalous. Conversely, as the accuracy increases, the recall typically decreases, meaning that while fewer innocent users are misidentified, fewer anomalous data groups are caught.
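As a small illustration of how these two quantities relate for a detector's output, the following sketch computes them from a flagged user set and a ground-truth set; it is purely illustrative and unrelated to the data behind FIG. 5.

```python
def accuracy_and_recall(flagged, true_anomalous):
    """'Accuracy' here = fraction of flagged users that are truly anomalous;
    recall = fraction of truly anomalous users that were flagged."""
    flagged, true_anomalous = set(flagged), set(true_anomalous)
    hits = flagged & true_anomalous
    accuracy = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(true_anomalous) if true_anomalous else 0.0
    return accuracy, recall

print(accuracy_and_recall({"u5", "u6", "u7", "u8", "u9"}, {"u5", "u6", "u8", "u9"}))
# (0.8, 1.0): one innocent user (u7) lowers accuracy while recall stays 1.0
```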
As shown in FIG. 5, as the recall increases, the accuracy of both compared algorithms, CrossSpot and SynchroTrap, drops dramatically, especially below a recall of 0.9, whereas the anomalous data group detection scheme according to embodiments of the present disclosure maintains high accuracy up to a recall of 0.92.
Fig. 6 is a hardware arrangement diagram illustrating an apparatus 600 for detecting anomalous data groups in a data set in accordance with an embodiment of the present disclosure. As shown in fig. 6, the electronic device 600 may include: a processor 610, a memory 620, an input/output module 630, a communication module 640, and other modules 650. It should be noted that: the embodiment shown in fig. 6 is merely illustrative for the purpose of this disclosure and therefore does not impose any limitation on the disclosure. Indeed, the electronic device 600 may include more, fewer, or different modules, and may be a stand-alone device or a distributed device distributed over multiple locations. For example, the electronic device 600 may include (but is not limited to): personal Computers (PCs), servers, server clusters, computing clouds, workstations, terminals, tablets, laptops, smart phones, media players, wearable devices, and/or home appliances (e.g., televisions, set-top boxes, DVD players), and the like.
The processor 610 may be a component responsible for the overall operation of the electronic device 600 that may be communicatively coupled to the other various modules/components to receive data and/or instructions to be processed from the other modules/components and to transmit processed data and/or instructions to the other modules/components. The processor 610 may be, for example, a general purpose processor such as a Central Processing Unit (CPU), a signal processor (DSP), an Application Processor (AP), or the like. In that case, it may perform one or more of the various steps of the method for detecting anomalous data in accordance with embodiments of the present disclosure above, under the direction of instructions/programs/code stored in memory 620. Further, the processor 610 may also be, for example, a special purpose processor, such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or the like. In this case, it may exclusively perform one or more of the above respective steps of the method for detecting abnormal data according to the embodiment of the present disclosure, according to its circuit design. Further, processor 610 may be any combination of hardware, software, and/or firmware. Furthermore, although only one processor 610 is shown in FIG. 6, in practice, processor 610 may include multiple processing units distributed across multiple locations.
The memory 620 may be configured to temporarily or persistently store computer-executable instructions that, when executed by the processor 610, may cause the processor 610 to perform one or more of the various steps of the methods described in the present disclosure. Additionally, the memory 620 may be configured to temporarily or persistently store data associated with those steps, such as raw data sets, their corresponding graph-representation data, anomalous data groups, and the like. The memory 620 may include volatile memory and/or non-volatile memory. Volatile memory may include, for example (but not limited to): dynamic random access memory (DRAM), static RAM (SRAM), synchronous DRAM (SDRAM), cache, and the like. Non-volatile memory may include, for example (but not limited to): one-time programmable read-only memory (OTPROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), masked ROM, flash memory (e.g., NAND flash, NOR flash, etc.), a hard disk drive or solid state drive (SSD), CompactFlash (CF), Secure Digital (SD), micro SD, mini SD, extreme digital (xD), MultiMediaCard (MMC), memory stick, and the like. Further, the memory 620 may also be a remote storage device, such as network attached storage (NAS) or the like. The memory 620 may also include distributed storage devices, such as cloud storage, distributed across multiple locations.
The input/output module 630 may be configured to receive input from the outside and/or provide output to the outside. Although the input/output module 630 is shown as a single module in the embodiment shown in FIG. 6, in practice it may be a module dedicated to input, a module dedicated to output, or a combination thereof. For example, the input/output module 630 may include (but is not limited to): a keyboard, a mouse, a microphone, a camera, a display, a touch-screen display, a printer, a speaker, headphones, or any other device that can be used for input/output, and the like. In addition, the input/output module 630 may also be an interface configured to connect to the above-described devices, such as a headset interface, a microphone interface, a keyboard interface, a mouse interface, and the like. In this case, the electronic device 600 may be connected with external input/output devices through the interface and implement input/output functions.
The communication module 640 may be configured to enable the electronic device 600 to communicate with other electronic devices and exchange various data. The communication module 640 may be, for example: an Ethernet interface card, a USB module, a serial-line interface card, a fiber interface card, a telephone-line modem, an xDSL modem, a Wi-Fi module, a Bluetooth module, a 2G/3G/4G/5G communication module, and the like. In the sense of data input/output, the communication module 640 may also be considered to be part of the input/output module 630.
Further, the electronic device 600 may also include other modules 650, including (but not limited to): a power module, a GPS module, a sensor module (e.g., a proximity sensor, an illumination sensor, an acceleration sensor, a fingerprint sensor, etc.), and the like.
However, it should be noted that: the above-described modules are only some examples of modules that may be included in the electronic device 600, and the electronic device according to an embodiment of the present disclosure is not limited thereto. In other words, electronic devices according to other embodiments of the present disclosure may include more modules, fewer modules, or different modules.
In some embodiments, the electronic device 600 shown in FIG. 6 may perform the steps of the methods described in connection with FIGS. 2-3. In some embodiments, the memory 620 has stored therein instructions that, when executed by the processor 610, may cause the processor 610 to perform various steps according to the various methods described in conjunction with FIGS. 2-3.
The disclosure has thus been described in connection with the preferred embodiments. It should be understood that various other changes, substitutions, and additions may be made by those skilled in the art without departing from the spirit and scope of the present disclosure. Accordingly, the scope of the present disclosure is not to be limited by the specific embodiments described above, but only by the appended claims.
Furthermore, functions described herein as being implemented purely in hardware, purely in software, and/or in firmware may also be implemented in dedicated hardware, in a combination of general-purpose hardware and software, and so on. For example, functions described as being implemented by dedicated hardware (e.g., a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), etc.) may be implemented by a combination of general-purpose hardware (e.g., a central processing unit (CPU), a digital signal processor (DSP)) and software, and vice versa.

Claims (11)

1. A method for detecting anomalous data groups in a data collection, comprising:
removing edges from a graph corresponding to the set of data that have a weight below a first threshold;
determining connected subgraphs that can be formed by the vertices and the remaining edges in the graph;
adjusting each connected subgraph separately so as to maximize its respective abnormal density value; and
determining a data group corresponding to the adjusted connected subgraph as the abnormal data group.
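By way of illustration only, the overall flow of claim 1 might be sketched in Python roughly as follows; the helper callables edge_weight and peel_subgraph (sketched after claims 3 and 9 below), the use of networkx as a graph container, and all parameter names are assumptions of this sketch rather than part of the claimed method:

import itertools
import networkx as nx

def detect_anomalous_groups(user_data, edge_weight, theta, peel_subgraph):
    """Sketch of claim 1: keep only edges whose weight reaches the first
    threshold, split the graph into connected subgraphs, then shrink each
    subgraph so that its abnormal density value is maximized."""
    graph = nx.Graph()
    users = list(user_data)
    graph.add_nodes_from(users)

    # Edges below the first threshold are effectively removed by never adding them.
    for u, v in itertools.combinations(users, 2):
        w = edge_weight(user_data[u], user_data[v])
        if w >= theta:
            graph.add_edge(u, v, weight=w)

    groups = []
    for component in nx.connected_components(graph):
        if len(component) < 2:              # isolated vertices carry no signal
            continue
        subgraph = graph.subgraph(component).copy()
        adjusted = peel_subgraph(subgraph)  # adjust to maximize the density value
        groups.append(set(adjusted.nodes))  # the corresponding abnormal data group
    return groups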
2. The method of claim 1, wherein the data set comprises data for a plurality of users in one or more modes, and the graph corresponding to the data set is determined by:
the vertices in the graph correspond one-to-one to the users; and
if there is similarity between the data of two users corresponding to two vertices in the graph, then there is an edge between the two vertices.
3. The method of claim 2, wherein the weight of an edge is given by:

S_{i,j} = \sum_{k=1}^{K} s_k(i, j)

wherein

s_k(i, j) = \sum_{a} \sum_{b} \sigma_k\left( x_{i,a}^{k}, x_{j,b}^{k} \right)

wherein

\sigma_k(x, y) = \left( x \odot y \right) \cdot \left( -\log p_k(x) \right)

wherein S_{i,j} represents the weight of the edge between vertex i and vertex j in the graph, K represents the total number of modes, the subscript k denotes quantities associated with the k-th mode, p_k(x) represents the probability of value x among all possible values in the k-th mode, x_{i,a}^{k} represents the a-th value of user u_i in the k-th mode, x_{j,b}^{k} represents the b-th value of user u_j in the k-th mode, the operator \odot represents a customizable equality function, and log is a logarithmic function.
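As an informal illustration of this weighting scheme (a sketch only, not the claimed implementation; the empirical frequency estimate and the default equality operator below are assumptions), the edge weight between two users could be computed along these lines:

import math
from collections import Counter

def estimate_value_probs(all_user_data, num_modes):
    """Empirical p_k(x): the frequency of each value within mode k."""
    probs = []
    for k in range(num_modes):
        counts = Counter(v for user in all_user_data for v in user[k])
        total = sum(counts.values())
        probs.append({v: c / total for v, c in counts.items()})
    return probs

def edge_weight(data_i, data_j, value_probs, equal=lambda x, y: x == y):
    """data_i / data_j: per-user lists of values, one list per mode.
    Shared values contribute -log p_k(x), so rarer shared values weigh more."""
    weight = 0.0
    for k, (vals_i, vals_j) in enumerate(zip(data_i, data_j)):
        for x in vals_i:
            for y in vals_j:
                if equal(x, y):
                    weight += -math.log(value_probs[k][x])
    return weight

To plug this into the earlier sketch, value_probs can be bound in advance, e.g. with functools.partial(edge_weight, value_probs=probs).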
4. The method of claim 3, wherein the first threshold is given by:

[formula image FDA0002893232140000017]

where θ is the first threshold, C is the total number of data records in the data set, and N is the total number of the plurality of users.
5. The method of claim 1, wherein, prior to determining the connected subgraphs that can be formed by the vertices and the remaining edges in the graph, the method further comprises:
removing vertices in the graph that do not have any edges.
6. The method of claim 2, wherein, after determining the connected subgraphs that can be formed by the vertices and the remaining edges in the graph, and before adjusting each connected subgraph separately so that its respective abnormal density value is maximized, the method further comprises:
removing connected subgraphs from the graph that have a number of vertices below a second threshold.
7. The method of claim 6, wherein the second threshold is given by:

[formula image FDA0002893232140000021]

where ψ is the second threshold, K represents the total number of modes, N represents the total number of the plurality of users, and q_k represents the number of unique values in mode k.
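Claims 5-7 describe pruning steps applied before the density-maximizing adjustment. Purely as an illustration (the thresholds below are placeholder parameters; the claimed formulas for them appear only as images), such pruning might look like:

import networkx as nx

def prune_graph(graph: nx.Graph, min_component_size: int):
    """Drop vertices without any edges (claim 5) and connected subgraphs
    whose vertex count is below the second threshold (claims 6-7)."""
    graph = graph.copy()
    graph.remove_nodes_from(list(nx.isolates(graph)))
    return [c for c in nx.connected_components(graph) if len(c) >= min_component_size]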
8. The method of claim 1, wherein the abnormal density value of a connected subgraph is given by:

\rho(\hat{G}) = \frac{ \sum_{u_i, u_j \in \hat{V}} S_{i,j} }{ |\hat{V}| }

wherein \rho(\hat{G}) represents the abnormal density value of connected subgraph \hat{G}, u_i and u_j respectively represent the vertices corresponding to user i and user j, \hat{V} represents the vertex set of the current connected subgraph \hat{G}, S_{i,j} represents the weight of the edge in the graph between the vertices corresponding to user i and user j, and |\hat{V}| represents the number of vertices of the current connected subgraph.
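A minimal sketch of this density value, assuming it is the total weight of the edges inside the subgraph divided by the number of its vertices as set out above, might be:

import networkx as nx

def anomaly_density(subgraph: nx.Graph) -> float:
    """Abnormal density value: sum of internal edge weights divided by
    the number of vertices of the subgraph."""
    if subgraph.number_of_nodes() == 0:
        return 0.0
    total_weight = sum(w for _, _, w in subgraph.edges(data="weight", default=0.0))
    return total_weight / subgraph.number_of_nodes()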
9. The method of claim 8, wherein adjusting each connected subgraph separately so that its respective abnormal density value is maximized comprises:
performing the following for each connected subgraph:
step (a): determining a first vertex in the current connected subgraph such that the abnormal density value of the current connected subgraph with the first vertex and its associated edges removed is greater than or equal to the abnormal density value of the current connected subgraph with any other single vertex and its associated edges removed;
step (b): taking the current connected subgraph with the first vertex and its associated edges removed as an intermediate connected subgraph;
repeating the above steps (a) and (b) one or more times on the resulting intermediate connected subgraph, until the last intermediate connected subgraph has only one vertex, thereby forming a series of intermediate connected subgraphs; and
determining, from among the original connected subgraph with no vertices removed and the series of intermediate connected subgraphs, the connected subgraph with the highest abnormal density value as the adjusted connected subgraph whose abnormal density value is maximized.
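A rough sketch of this greedy peeling procedure (illustrative only; it reuses the anomaly_density helper sketched after claim 8, and the tie-breaking and data structures are assumptions rather than the claimed implementation) could be:

import networkx as nx

def peel_subgraph(subgraph: nx.Graph) -> nx.Graph:
    """Repeatedly remove the vertex whose removal keeps the abnormal density
    value highest, and return the best subgraph encountered."""
    current = subgraph.copy()
    best = current.copy()
    best_density = anomaly_density(best)

    while current.number_of_nodes() > 1:
        # Step (a): pick the vertex whose removal leaves the densest subgraph.
        def density_without(v):
            trimmed = current.copy()
            trimmed.remove_node(v)   # removing a node also drops its associated edges
            return anomaly_density(trimmed)

        victim = max(current.nodes, key=density_without)
        # Step (b): the trimmed graph becomes the next intermediate subgraph.
        current.remove_node(victim)
        density = anomaly_density(current)
        if density > best_density:
            best, best_density = current.copy(), density
    return best

This recomputes the density from scratch after each trial removal, which keeps the sketch short; an actual implementation would more likely update vertex contributions incrementally.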
10. An apparatus for detecting anomalous data groups in a data collection, comprising:
a processor;
a memory having instructions stored thereon that, when executed by the processor, cause the processor to perform the method of any of claims 1-9.
11. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform the method of any of claims 1-9.
CN201810957760.XA 2018-08-21 2018-08-21 Method and apparatus for detecting anomalous data groups in a data collection Active CN109947814B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810957760.XA CN109947814B (en) 2018-08-21 2018-08-21 Method and apparatus for detecting anomalous data groups in a data collection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810957760.XA CN109947814B (en) 2018-08-21 2018-08-21 Method and apparatus for detecting anomalous data groups in a data collection

Publications (2)

Publication Number Publication Date
CN109947814A CN109947814A (en) 2019-06-28
CN109947814B true CN109947814B (en) 2021-03-30

Family

ID=67005795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810957760.XA Active CN109947814B (en) 2018-08-21 2018-08-21 Method and apparatus for detecting anomalous data groups in a data collection

Country Status (1)

Country Link
CN (1) CN109947814B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110633734B (en) * 2019-08-22 2022-08-19 成都信息工程大学 Method for anomaly detection based on graph theory correlation theory
CN111291229B (en) * 2020-01-21 2023-10-31 中国科学院计算技术研究所 Method and system for detecting dense multi-part subgraphs
CN112422574A (en) * 2020-11-20 2021-02-26 同盾控股有限公司 Risk account identification method, device, medium and electronic equipment
CN115185920B (en) * 2022-09-13 2023-04-18 云智慧(北京)科技有限公司 Method, device and equipment for detecting log type

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335653A (en) * 2014-07-21 2016-02-17 华为技术有限公司 Abnormal data detection method and apparatus
US9563495B1 (en) * 2014-08-13 2017-02-07 Sprint Communications Company, L.P. Detecting upset conditions in channel instances
CN106919650A (en) * 2017-01-20 2017-07-04 北京航空航天大学 A kind of textural anomaly detection method of increment parallel type Dynamic Graph
CN107730262A (en) * 2017-10-23 2018-02-23 阿里巴巴集团控股有限公司 One kind fraud recognition methods and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on techniques for discovering information and communication fraud events based on big data; 陶冶 et al.; 《信息通信技术》; 2017-06-30; full text *

Also Published As

Publication number Publication date
CN109947814A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109947814B (en) Method and apparatus for detecting anomalous data groups in a data collection
US11856021B2 (en) Detecting and mitigating poison attacks using data provenance
US20210051169A1 (en) Thwarting model poisoning in federated learning
US9027134B2 (en) Social threat scoring
US20180183815A1 (en) System and method for detecting malware
US10999130B2 (en) Identification of vulnerability to social phishing
US9509688B1 (en) Providing malicious identity profiles from failed authentication attempts involving biometrics
US20140337973A1 (en) Social risk management
CN111523012B (en) Method, apparatus and computer readable storage medium for detecting abnormal data
Krishnaveni et al. Ensemble approach for network threat detection and classification on cloud computing
CN112016927B (en) Method, apparatus and computer readable storage medium for detecting abnormal data
CN112241530A (en) Malicious PDF document detection method and electronic equipment
CN107231383B (en) CC attack detection method and device
CN108234454B (en) Identity authentication method, server and client device
CN112329012A (en) Detection method for malicious PDF document containing JavaScript and electronic equipment
Raghavendra et al. Extended multispectral face presentation attack detection: An approach based on fusing information from individual spectral bands
US11403642B2 (en) Techniques to improve fraud detection at financial terminals
Nguyen et al. Generative adversarial networks and image-based malware classification
Queiroz et al. Eavesdropping hackers: Detecting software vulnerability communication on social media using text mining
US11658987B2 (en) Dynamic fraudulent user blacklist to detect fraudulent user activity with near real-time capabilities
Tang et al. Android malware detection based on deep learning techniques
CN112016934B (en) Method, apparatus and computer readable storage medium for detecting abnormal data
Chen et al. Fraud analysis and detection for real-time messaging communications on social networks
WO2023000792A1 (en) Methods and apparatuses for constructing living body identification model and for living body identification, device and medium
Dissanayake et al. “Trust Pass”-Blockchain-Based Trusted Digital Identity Platform Towards Digital Transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant