CN109886016B - Method, apparatus, and computer-readable storage medium for detecting abnormal data - Google Patents


Info

Publication number
CN109886016B
Authority
CN
China
Prior art keywords
data
detected
detector model
analysis
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811609044.9A
Other languages
Chinese (zh)
Other versions
CN109886016A (en)
Inventor
黄铃 (Huang Ling)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huianjinke Beijing Technology Co ltd
Original Assignee
Huianjinke Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huianjinke Beijing Technology Co ltd filed Critical Huianjinke Beijing Technology Co ltd
Priority to CN202110092703.1A (published as CN112685735B)
Priority to CN201811609044.9A (granted as CN109886016B)
Publication of CN109886016A
Application granted
Publication of CN109886016B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure propose methods, apparatuses, and computer-readable storage media for detecting anomalous data. The method comprises the following steps: sending at least a portion of the data to be detected to an external data analysis resource as query data according to a first anomaly index of the data to be detected determined using a current detector model; receiving analysis results for the query data from the external data analysis resource; updating the detector model based at least in part on the analysis results and the data to be detected; and determining, as anomalous data, data whose second anomaly index is higher than a predetermined threshold among the data to be detected, according to the second anomaly index of the data to be detected determined using the updated detector model.

Description

Method, apparatus, and computer-readable storage medium for detecting abnormal data
Technical Field
The present disclosure relates to the field of data processing, and more particularly, to a method, apparatus, and computer-readable storage medium for detecting anomalous data.
Background
With the popularization of computers and the Internet, and in particular of portable electronic devices, software, networks, and the like have become an indispensable part of people's production and daily life. Naturally, data security has become one of the important research areas. Malware is one of the major security threats to software and includes, for example, local and network-oriented computer viruses, worms, Trojans, ransomware, malicious scripts, and the like, which cause losses of money, time, effort, and so on to users by stealing or hijacking users' private information. Similarly, fraud and the like are among the major security threats to networks and include inappropriate user behavior such as abusing new-user rewards by registering large numbers of new accounts, or reducing a website's legitimate revenue by downloading data in bulk and reselling it.
To circumvent or mitigate these threats, software developers have created various kinds of software that attempt to detect, avoid, or at least mitigate threats, such as antivirus software and registration authentication systems. In response, attackers keep evolving to evade these detection systems, and the detection mechanisms must in turn respond. According to recent research findings, only 66% of malware can be detected within 24 hours after it first appears, only 72% within 1 week, and only 93% within 1 month. In fact, to evade detection, attackers often produce vast quantities of distinct malicious binary files; McAfee, for example, receives submissions of over 300,000 binary files per day. Similarly, in terms of inappropriate user behavior, malicious users often continually change their attack patterns, for example by using different registration addresses or registration phone numbers, or by registering and interacting through different IP addresses, so that detection of such malicious network behavior is circumvented.
Disclosure of Invention
To at least partially solve or mitigate the above-described problems, methods, apparatuses, and computer-readable storage media for detecting anomalous data in accordance with embodiments of the present disclosure are provided.
According to a first aspect of the present disclosure, a method of detecting anomalous data is provided. The method comprises the following steps: sending at least a portion of the data to be detected to an external data analysis resource as query data according to a first anomaly index of the data to be detected determined using a current detector model; receiving analysis results for the query data from the external data analysis resource; updating the detector model based at least in part on the analysis results and the data to be detected; and determining, as anomalous data, data whose second anomaly index is higher than a predetermined threshold among the data to be detected, according to the second anomaly index of the data to be detected determined using the updated detector model.
In some embodiments, determining the first anomaly index or the second anomaly index of the data to be detected using the current detector model or the updated detector model comprises: extracting a feature vector of the data to be detected; and applying the current detector model or the updated detector model to the feature vector to determine the first anomaly index or the second anomaly index of the data to be detected, respectively. In some embodiments, extracting the feature vector of the data to be detected includes performing, for each of one or more attribute data of the data to be detected, one of the following: if the attribute data is categorical data, the attribute value corresponds to a particular element in the feature vector; if the attribute data is ordinal data, the partition in which the attribute value falls corresponds to a particular element in the feature vector; if the attribute data is plain string data, each 3-gram of the attribute corresponds to a particular element in the feature vector; and if the attribute data is sequential data, each n-gram of the attribute corresponds to a particular element in the feature vector. In some embodiments, the external data analysis resource is a third-party detector and/or expert review. In some embodiments, sending at least a portion of the data to be detected as query data to an external data analysis resource according to a first anomaly index of the data to be detected determined using a current detector model comprises: determining first data whose first anomaly index is lower than the predetermined threshold among the data to be detected; determining, among the first data, one or more second data whose first anomaly indexes rank highest as the query data; and sending the query data to the external data analysis resource. In some embodiments, the number of the one or more second data is the product of the number of the first data and a fixed ratio, or an integer rounded therefrom. In some embodiments, the number of the one or more second data is a fixed number. In some embodiments, updating the detector model based at least in part on the analysis results and the data to be detected comprises: updating anomaly labels of the query data in the data to be detected based on the analysis results; and retraining the detector model using the updated data to be detected. In some embodiments, retraining the detector model using the updated data to be detected comprises: retraining the detector model with the updated query data given an increased weight relative to other data in the data to be detected. In some embodiments, the data to be detected is user behavior data relating to user behavior. In some embodiments, the user behavior data comprises at least one of: registration information of the user, operation information of the user, and social information of the user.
According to a second aspect of the present disclosure, an electronic device is provided. The electronic device includes: a processor; a memory storing instructions that, when executed by the processor, cause the processor to perform the method according to the first aspect of the disclosure.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform the method according to the first aspect of the present disclosure.
By using the method, the apparatus, and the computer-readable storage medium, the detection rate of abnormal data can be greatly improved while introducing only limited external resources, thereby realizing a low-cost and highly efficient abnormal-data detection mechanism.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following description of preferred embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is an architecture and flow diagram illustrating an example system for detecting anomalous data in accordance with an embodiment of the present disclosure.
Fig. 2A and 2B are graphs illustrating a distribution of example data for a detection system according to an embodiment of the present disclosure.
Fig. 3A is a schematic diagram illustrating three training schemes for cross-validation, time consistent samples, and time consistent labeling according to an embodiment of the present disclosure.
Fig. 3B is a graph illustrating a comparison of performance for three training schemes of cross-validation, time consistent samples, and time consistent labeling according to an embodiment of the present disclosure.
FIG. 4 is a graph illustrating a comparison of performance of a system for detecting anomalous data in accordance with an embodiment of the present disclosure in different configurations.
Fig. 5A and 5B are graphs showing performance comparison of a system for detecting abnormal data according to an embodiment of the present disclosure under the influence of different influencing factors.
FIG. 6 is an example flow diagram illustrating an example method for detecting anomalous data in accordance with an embodiment of the present disclosure.
Fig. 7 is a hardware arrangement diagram illustrating an example apparatus for detecting abnormal data according to an embodiment of the present disclosure.
Detailed Description
In the following detailed description of some embodiments of the disclosure, reference is made to the accompanying drawings, in which details and functions that are not necessary for the disclosure are omitted so as not to obscure the understanding of the disclosure. In this specification, the various embodiments described below which are used to describe the principles of the present disclosure are by way of illustration only and should not be construed in any way to limit the scope of the disclosure. The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of exemplary embodiments of the present disclosure as defined by the claims and their equivalents. The following description includes various specific details to aid understanding, but such details are to be regarded as illustrative only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Moreover, descriptions of well-known functions and constructions are omitted for clarity and conciseness. Moreover, throughout the drawings, the same reference numerals are used for the same or similar functions, devices, and/or operations. Moreover, in the drawings, the parts are not necessarily drawn to scale. In other words, the relative sizes, lengths, and the like of the respective portions in the drawings do not necessarily correspond to actual proportions.
Furthermore, the disclosure is not limited to each specific communication protocol of the involved devices, including (but not limited to) 2G, 3G, 4G, 5G networks, WCDMA, CDMA2000, TD-SCDMA systems, etc., and different devices may employ the same communication protocol or different communication protocols. In addition, the present disclosure is not limited to a specific operating system of a device, and may include (but is not limited to) iOS, Windows Phone, Symbian, Android, Linux, Unix, Windows, MacOS, and the like, and different devices may employ the same operating system or different operating systems.
Although the scheme for detecting anomalous data according to embodiments of the present disclosure will be described below primarily in connection with software applications and/or user behavior data, the present disclosure is not so limited. In fact, embodiments of the present disclosure, with appropriate adjustment and modification, may also be applicable to a variety of other anomaly data detection fields, such as identifying a population of users with a particular behavioral pattern (e.g., high-value users). In other words, the scheme according to the embodiments of the present disclosure may be used as long as it is a scene in which the abnormal data needs to be distinguished.
As described above, there is a need for a scheme capable of detecting abnormal data in order to quickly detect abnormal data such as malware, malicious user behavior, and the like. Machine learning offers the possibility of large-scale, timely detection, but the field of malware or malicious user behavior differs from common machine learning applications. Unlike applications such as voice and/or text recognition, where pronunciation and character shapes remain relatively constant over time, malware and malicious user behavior evolve as adversaries attempt to evade or deceive the detecting party. Indeed, malware detection or malicious user behavior detection has become an online process in which a software/network service provider must continually update its detectors in response to new threats, which requires accurate labeling of new data. Unfortunately, labeling (sometimes also referred to as "tagging" or "marking") malware or malicious user behavior presents a distinctive challenge. The deceptive and technical nature of malware and/or malicious user behavior requires expert analysis, which typically consumes considerable resources (time, effort, money, etc.); by contrast, in the field of text recognition, simply reading the text is sufficient to label it correctly.
To this end, some embodiments according to the present disclosure propose a large-scale detection mechanism that can combine machine learning with expert review so that the detection mechanism keeps pace with the evolution of the latest threats. Since expert review labels are costly, the experts are modeled as being able to provide labels for only a limited selection of samples. A limited supply of expert review labels is therefore combined with more widely available noisy labels. Expert labels may be regarded here as expensive resources of relatively higher accuracy, such as more accurate results computed by time-consuming, high-precision algorithms, or even accurate results given by manual computation, inspection, prediction, and the like; in contrast, noisy labels can be regarded as relatively less accurate, inexpensive resources, such as a less accurate algorithm with short computation time, or faster labeling results produced by large numbers of unskilled personnel.
As described in some embodiments below relating to software applications (e.g., binary files or binaries), the scheme of embodiments of the present disclosure may be examined using samples submitted to VirusTotal (a malware analysis and detection website, http://www.virustotal.com/). The data set includes a timestamp and antivirus labels for each submitted binary, which captures the binary's occurrence and prevalence, as well as label knowledge, over a 2.5-year span. The scheme according to some embodiments of the present disclosure employs a customized approach combining accurate expert review labels and noisy antivirus labels to train a new model weekly and evaluate each model on the following week. Furthermore, for large-scale evaluation, in some embodiments, expert review labels are simulated by revealing the results of automated scans performed at least 8 months after the first appearance of a sample, which provides the possibility for an automated detector to update itself and detect new viruses.
In fact, it can be noted that accurate training labels are generally not available for all data at the first moment, and therefore the effect of training labels on performance measurement is examined below. Prior work introduced the concept of "time consistent samples," which requires that the binary files used to train a machine learning model be earlier than the binary files used to evaluate the trained model. In some embodiments of the present disclosure, however, the concept of "time consistent labels" is further introduced, which not only requires that the training binaries be earlier than the evaluation binaries, but also requires that the training labels be earlier than the evaluation binaries. It should be noted, however, that time consistent labels limit label quality. For example, in the early stage after a binary file appears, the corresponding label is not always correct, and more accurate labels are likely to emerge over time and across various detectors, so a model trained with "time consistent labels" will not necessarily achieve a high detection rate. In contrast, in common practice, labels collected long after a binary file first appears are typically used for training and evaluation, resulting in an artificially high detection rate for the trained model. Similarly, for abnormal user behavior, when a new fraud pattern emerges, training data without correct labels likewise makes it impossible, or very unlikely, for a detection model trained in practice to correctly detect that fraud pattern. Thus, more generally, in some embodiments, for any training data set, a training mode with time consistent labels should be employed to ensure that the trained detector better reflects the real scenario.
Thus, the anomaly data detection scheme provided by some embodiments of the present disclosure actually makes the following contributions:
a detection system is proposed that incorporates limited external resources (e.g., expert review) that can greatly increase the rate of correct detection of anomalous data. For example, in the field of malware detection, the malicious binary (or anomalous binary) detection rate can be boosted from 72% at 0.5% false positive rate (comparable to the best antivirus software on VirusTotal) to 77% and 89% using an average of 10 and 80 expert reviews per day, respectively. In addition, the detection system may also detect 42% of malicious binary files that have not been previously detected by any other software.
Furthermore, the effect of time-inconsistent labels on performance measurement is demonstrated: such labels artificially inflate the detection rate, raising the correct detection rate of some detectors at a 0.5% false alarm rate from 72% to 91%.
Further, as will be mentioned below, the evaluation also includes several additional tests that provide a more comprehensive understanding of detection performance. Although the design of the detection system according to some embodiments of the present disclosure includes static and/or dynamic features, since the VirusTotal detectors used for comparison must operate statically, the performance of the detector according to some embodiments of the present disclosure is in some cases also compared to VirusTotal using only static features. Note that limiting the detector to static features actually places the present scheme at a disadvantage, because the VirusTotal detectors can operate on the complete file, whereas the present scheme constrains itself to the static attributes available in VirusTotal. The performance of the present scheme degrades slightly, yielding an 84% detection rate at 0.5% false positives with 80 queries per day, yet it still outperforms the best detector on VirusTotal. In addition, the influence of inaccurate human labelers, or of not entirely accurate algorithms, on the detection performance of the system is explored by adding random noise to the simulated expert labels. It is thus found that the scheme according to some embodiments of the present disclosure is robust in the presence of imperfect labels: if expert review with 90% accuracy and 5% false positives is used, the present scheme is still able to achieve 82% detection at 0.5% false positives (although slightly lower than the 89% detection achieved with accurate expert review).
Several related anomaly data detection schemes typically employ a "weak detector" design, i.e., some instances can be cheaply marked as benign, while marking any instance as malicious requires costly validation. In contrast to weak detectors, the approach according to some embodiments of the present disclosure treats expensive resources (e.g., expert review labels) as an integral part of a periodic retraining system, rather than as a final step in the detection flow. Rather than attempting to pass the entire set of suspected malicious instances to expensive resources for verification, the present scheme identifies, from the full instance set, a smaller set of instances that maximizes the benefit to automated detection, passes it to the more accurate but limited resource for labeling, and uses the resulting high-accuracy labels in conjunction with the other data, thereby training a detection model with high accuracy (or detection rate) at relatively low cost.
Furthermore, in related detection schemes, samples are typically divided randomly into several groups for training and validation, owing to the lack of timestamps for the samples. Such an approach cannot accurately assess detector performance or the expert review workload in the face of a new type of attack. This is because random grouping ensures that virus types encountered during validation have generally already been encountered during training, thereby artificially inflating the measured detection performance. In practice, however, such detectors do not provide this detection accuracy when exposed to new types of malware that were not present during training. In contrast, in the detection scheme according to the embodiments of the present disclosure, the expert review integration improves the detection rate by 17 percentage points relative to uncertainty sampling.
Next, the architecture and flow diagrams of an example system for detecting anomalous data in accordance with some embodiments of the present disclosure will be described in detail in conjunction with fig. 1.
FIG. 1 is an architecture and flow diagram illustrating an example system 10 for detecting anomalous data in accordance with an embodiment of the present disclosure. The exemplary system 10 is generally divided into two parts, and the part above the dashed line across the figure may be generally considered the "detection flow" of the system 10, while the part below the dashed line may be considered the "training flow".
When data to be detected 110 (e.g., binary files, user behavior data, etc.) is input, the detection flow may extract a feature vector 120 of the data to be detected 110 and apply the current model 130 to the feature vector 120 to obtain a determination result 140, e.g., classify the data to be detected 110 as abnormal or normal (i.e., tag, label, or mark it). At the same time, or before or after this, the training flow stores the data to be detected 110 in the database 115 together with all other data accumulated so far (detected, undetected, labeled, unlabeled, etc.). During each retraining cycle, in some embodiments, database 115 may provide the data that already has labels directly as training data 170, while the data that does not have labels is provided to a third party 150 (e.g., VirusTotal or any other external resource) for data analysis. It should be noted that the third party 150 may be a low-accuracy labeling resource, such as a free antivirus detection service (e.g., VirusTotal), which in this embodiment is not required to provide high-accuracy data analysis results (or labels). In addition, in other embodiments, database 115 may also provide labeled data stored therein to the third party 150 for data analysis, for example to update the labels of such data using the free resources that the third party 150 periodically updates (e.g., some data previously marked as normal may be found to be anomalous data after such an update).
Returning to the data analysis by the third party 150, in some embodiments, data deemed abnormal by the third party 150 (e.g., malware or malicious user behavior) may be used as training data 170, while data deemed normal by the third party 150 (e.g., benign software or benign user behavior) may be provided to the feature extraction 120 of the detection flow of the system 10 for further detection by the current model. As described in detail below, this is primarily because third parties are generally very cautious in declaring data abnormal but relatively lax in declaring data normal, so data deemed normal requires further testing. However, the present disclosure is not so limited, and in other embodiments, data that is deemed normal may also be used as training data 170.
Further, as shown in FIG. 1, data that has been processed by the detectors of the third party 150 and/or by the current model 130 may be considered for submission to the integrated expert review 160. In some embodiments, data that the current model 130 determines to be abnormal (e.g., malicious binaries, malicious user behavior data, etc.) may be included directly in the training data 170. Further, a portion of the remaining data purported to be normal (e.g., benign binary files, benign user behavior data, etc.), as allowed by the limited review budget, is submitted to the integrated expert review 160 in accordance with the query policy 135. Further, in some embodiments, the last remaining unsubmitted data is included in the training data 170 as normal or benign data. At the end of the retraining period, the next model 195 generated in the training flow (e.g., by feature extraction 180 and model training 190) replaces the current model 130, and the process is repeated.
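To make the flow of FIG. 1 concrete, the following Python-style sketch outlines one retraining cycle. It is a minimal illustration under assumed interfaces: the names database, current_model, third_party, expert_review, and train_model are hypothetical placeholders for components 115, 130, 150, 160, and the feature extraction 180 / model training 190 stage, and do not appear in the patent itself.

```python
# Hypothetical sketch of one retraining cycle from FIG. 1; all interfaces are assumed.
def retraining_cycle(database, current_model, third_party, expert_review,
                     train_model, budget_B, threshold_M):
    labeled, unlabeled = database.split_by_label()            # database 115
    training_data = list(labeled)                             # reuse previously labeled data

    # Third-party analysis (150): trust "anomalous" verdicts, re-check "normal" ones.
    flagged, not_flagged = third_party.analyze(unlabeled)
    training_data += [(x, +1) for x in flagged]

    # Current model (130): automatically relabel confident detections above threshold M.
    scored = [(x, current_model.score(x)) for x in not_flagged]
    training_data += [(x, +1) for x, s in scored if s >= threshold_M]

    # Query policy (135): send the B highest-scoring remaining items to expert review (160).
    remaining = sorted((t for t in scored if t[1] < threshold_M),
                       key=lambda t: t[1], reverse=True)
    training_data += [(x, expert_review.label(x)) for x, _ in remaining[:budget_B]]

    # Everything not flagged, relabeled, or reviewed is treated as normal this cycle.
    training_data += [(x, -1) for x, _ in remaining[budget_B:]]

    return train_model(training_data)                          # next model 195 replaces 130
```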
Next, the operations of the feature extractions 120 and 180 will be briefly described.
Many machine learning algorithms learn most effectively from numerical features, but not all attributes of the data (e.g., binary files, user behavior data, etc.) are in this format. Four common techniques for converting static and dynamic attributes of binary file or user behavior data into numerical feature vectors are discussed herein. Which of the four techniques to apply may be determined according to the attribute. For each technique, it is discussed how the technique should be applied to maximize robustness against adversarial evasion.
Categorical: The categorical mapping associates each possible attribute value with a respective dimension. For example, in the software application domain, a DeviceIoControl API call may correspond to index i in a feature vector x, where x_i = 1 if and only if the binary file issues the DeviceIoControl API call. Similarly, in the field of user behavior, each attribute, such as the gender filled in at the time of user registration, may correspond to an index i in a feature vector x, where, for example, x_i = 1 if and only if the user to which the user behavior data corresponds has the corresponding attribute value. Since the absence of an attribute also reveals information about the data, a special "null" index may be included to indicate that the attribute has no value; for example, a binary file may not generate any network traffic or may not be signed. Where possible, categorical feature extraction is structured so as to constrain the attacker within a finite set of values. For example, a subnet mask is applied to the IP addresses accessed by a binary file to effectively reduce the IP space and to associate accesses to similar IP addresses with the same feature index, so that adversaries can be effectively prevented from circumventing the feature extraction step.
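Purely as an illustration of the categorical technique, the following sketch maps (attribute, value) pairs to feature indices, includes the special "null" index, and applies a /24 subnet mask to IP addresses; the vocabulary layout and function names are assumptions made for this example.

```python
from collections import defaultdict

vocabulary = defaultdict(lambda: len(vocabulary))   # (attribute, value) -> feature index

def categorical_features(attribute, values):
    """Return {feature_index: 1.0} for each observed value of the attribute."""
    if not values:                                   # absence of a value is itself informative
        return {vocabulary[(attribute, "null")]: 1.0}
    return {vocabulary[(attribute, v)]: 1.0 for v in values}

def mask_ip(ip):                                     # constrain the adversary's value space
    return ".".join(ip.split(".")[:3]) + ".0/24"     # e.g. "10.1.2.3" -> "10.1.2.0/24"
```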
Ordinal: Ordinal attributes take a specific value within an ordered range of possibilities, such as the size of a binary file or the size of user behavior data. To remain robust against fluctuations caused by an adversary attempting to evade detection, a partitioning (binning) scheme may be used to vectorize ordinal values, rather than associating each distinct quantity with a unique index. The partitioning scheme works as follows: for a given attribute value, the index of the partition into which the value falls is returned, and the corresponding dimension is set to 1. Furthermore, for widely varying attributes, a non-linear scheme may be used to prevent larger values from overwhelming smaller values during training. For example, the number of times v that a file is written may be discretized into a value i such that 3^i ≤ v ≤ 3^(i+1), where the exponential partitions accommodate the large dynamic range of this quantity.
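A minimal sketch of the exponential binning just described, using base-3 partitions; the function name and the handling of values below 1 are illustrative assumptions.

```python
def ordinal_bin(v, base=3):
    """Return the partition index i such that base**i <= v < base**(i + 1)."""
    if v < 1:
        return 0                 # assumed convention: reserve bin 0 for values below 1
    i = 0
    while v >= base ** (i + 1):  # integer arithmetic avoids floating-point edge cases
        i += 1
    return i                     # the corresponding feature dimension is then set to 1
```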
Free-form string: Many important attributes appear as unbounded strings, such as the comment field in a software signature verification, content posted by a user, and so on. If these attributes were represented as categorical features, an attacker might be able to evade detection by changing a single character in the attribute so that the attribute maps to a different dimension. To increase robustness, 3-grams of these strings may be captured, where each sequence of three contiguous characters constitutes a different 3-gram and each 3-gram is treated as a different dimension. Since this scheme is still sensitive to variations that change a 3-gram, an additional string simplification is introduced.
In some embodiments, to reduce sensitivity to 3-gram variations, a set of equivalence classes over characters is defined, and each character is replaced with its canonical representation. For example, in some embodiments, the string 3PUe5f may be canonicalized as 0BAa0b, where uppercase and lowercase vowels are mapped to "A" and "a," uppercase and lowercase consonants are mapped to "B" and "b," respectively, and numeric characters are mapped to "0." Similarly, the string 7SEi2d is also canonicalized as 0BAa0b. Sometimes the characters of a 3-gram are additionally sorted to further control variation and better capture the morphology of the string. Mapping portable executable resource names (which sometimes exhibit long sequences of random-looking bytes) is one application of this string reduction technique.
Sequential: The value of some attributes is a token sequence, in which each token takes a limited range of values. These sequential attributes are closely related to free-form string attributes, except that the individual tokens are not limited to single characters. In the software domain, sequential feature extraction may be used to capture, for example, API call information, because there is a limited set of API calls and the calls occur in a particular order. Similarly, in the field of user behavior, sequential feature extraction may be used to capture, for example, the sequence of commands with which a user triggers website actions, which again involves a limited set of website actions occurring in a particular order. Similar to the free-form string features, an n-gram scheme may be used, where every sequence of n adjacent tokens corresponds to a single feature. Sequential vectorization may be vulnerable to evasion, since an adversary can introduce tokens with no effect or tokens of different meaning. To increase robustness, n-gram vectorization with n = 1, n = 2, and n = 3 may be applied in some embodiments to reduce the number of unique n-grams that an adversary can generate.
It should be noted that the n-gram mentioned above (for example, n = 3 gives a 3-gram) is a commonly used technique in machine learning for languages; a detailed description is omitted here since it is not the subject of the present disclosure, and specific details can be found in related papers and websites. This does not affect the ability of one skilled in the art to implement the embodiments of the present disclosure in light of the description herein.
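As an illustration of the free-form string and sequential techniques above, the following sketch canonicalizes characters and extracts n-grams. The character classes and the optional sorting of 3-gram characters follow the description above; the function names and the treatment of non-alphanumeric characters are assumptions.

```python
def canonical_char(c):
    """Map a character to its assumed canonical class: digit -> '0', vowel -> 'A'/'a', consonant -> 'B'/'b'."""
    if c.isdigit():
        return "0"
    if c.lower() in "aeiou":
        return "A" if c.isupper() else "a"
    if c.isalpha():
        return "B" if c.isupper() else "b"
    return c                                           # other characters are kept as-is

def string_3grams(s, sort_chars=False):
    canon = "".join(canonical_char(c) for c in s)      # e.g. "3PUe5f" -> "0BAa0b"
    grams = [canon[i:i + 3] for i in range(len(canon) - 2)]
    return ["".join(sorted(g)) for g in grams] if sort_chars else grams

def token_ngrams(tokens, ns=(1, 2, 3)):                # sequential attributes, e.g. API call traces
    return [tuple(tokens[i:i + n]) for n in ns for i in range(len(tokens) - n + 1)]
```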
In the following, various attributes that may be involved in vector extraction are roughly described by taking a binary file as an example. In contrast to static properties obtained by analyzing the binary itself, dynamic properties may be obtained by executing the binary in a virtual machine such as a Cuckoo sandbox. Table 1 below provides an overview of static attributes, dynamic attributes, and associated vectorization techniques.
[Table 1 is provided as images in the original publication and lists the static and dynamic attributes together with their associated vectorization techniques.]
Table 1: Categorical vectorization is applied to all attributes; free-form string vectorization is applied to attributes marked with #; ordinal vectorization and sequential vectorization are applied to attributes marked with the corresponding symbols shown in the original table.
Further, the available static attributes may consist of: direct properties of the executable code itself, properties associated with or derived from the executable code, and the results of heuristic tools applied to the executable code. The attributes extracted directly from the code may include any statically imported library functions as well as aspects of the portable executable format, such as resource language, section attributes (e.g., entropy), and resource attributes (e.g., type). The metadata associated with the code may include the output of the MAGIC and EXIFTOOL tools, which infer attributes such as the file type, and any digital signatures associated with the file. The verification status, the identity of each entity in the certificate chain, comments, product name, description, copyright, internal name, and publisher may be collected from each digital signature. Heuristic tools applied to the executable file may include PEID and tools from ClamAV, which check for packers, network tools, or administration tools associated with malware or potentially unwanted applications.
In addition, the available dynamic properties capture interactions with the host operating system, magnetic/optical disks, and network resources, among others. Interaction with the operating system may include: dynamic library imports, mutex (mutex) activity, and manipulation of other processes running on the system. In addition, the execution trace of all Windows API calls accessed by the binary file may be captured, including the parameters, parameter values, and return values of any system call. The summary of the disk/disc activity may include file system and registry operations that capture any persistent effects of the binary file. In addition, full and/or partial paths that operate with the file system, types and/or numbers of operations directed to the file system may also be captured during feature extraction; but also the specific registry key that the binary accessed or modified. Finally, features can be extracted from the network activity of the binary file, including HTTP and DNS traffic and IP addresses accessed via TCP and UDP.
Similarly, feature vectors representing various static and/or dynamic attributes of user behavior data may also be extracted. In addition, for the user behavior data, the corresponding relationship between the corresponding static attribute and dynamic attribute and the associated vectorization technology can also be set. In some embodiments, user behavior data may include (but is not limited to): registration information of the user (e.g., username, nickname, avatar, signature, address, contact phone, email, etc.), operational information of the user (e.g., login time, location, IP address, frequency, software name used, version, consumption status, etc. of the user), social data (e.g., forum posting information, friend information, interactions with friends, etc.), and the like.
Returning to FIG. 1, during each retraining cycle, the training process must assign labels to all available binary files for training. In some embodiments, the process of assigning training labels is a unified collaboration of four different sources of information: the decision results from the third party 150, the decision results from the current model 130 (i.e., the results not submitted in the selection by the query policy 135), any prior reviews (i.e., the decision results from the database 115), and additional fresh decision results for the small number of binaries selected by the query policy 135 for review 160. However, the present disclosure is not limited thereto, and for example, in other embodiments, some of the determination results described above may be employed, or determination results from other resources may be additionally employed, and the like.
Returning to FIG. 1, the labeling process may begin with the third-party data analysis results 150 and the application of the current model 130, both of which prune the data set to be submitted by the query policy 135 to the integrated expert review 160. Application of the third-party data analysis results 150 exploits the following intuition: the determinations provided by the third-party data analysis 150 are more prone to false negatives than to false positives. In other words, the third-party data analysis 150 is very cautious in declaring certain data malicious and relatively lax in declaring certain data benign. Accordingly, during the training process, an indication by the third-party data analysis 150 that certain data is anomalous is deemed sufficient to mark that data as anomalous, whereas data that has not been detected as anomalous is typically not marked as normal before further analysis. This heuristic may be referred to as a "no detection" filter, because only binary files not detected by the third-party data analysis 150 are considered candidates for expert review 160.
Next, the current detection model 130 may be applied to all data for which no anomalies are detected, and any data with a score above the threshold M may be assigned an anomaly label. This heuristic may be referred to as "auto relabeling" because some data that is not detected as anomalous is automatically relabeled, similar to the self-training concept from semi-supervised learning. If the data is neither detected by the third party data analysis 150 nor automatically relabeled by the current detector model 130, the data may be considered for submission to the query policy 135.
In some embodiments, from the data that cannot be confidently labeled as anomalous, the query policy 135 may select a subset for expert review 160 in order to improve its training labels. An uncertainty sampling query policy would select the data closest to the decision boundary (e.g., the aforementioned threshold M), the intuition being that model training benefits from learning the labels of the data about which it is uncertain. Since existing uncertainty sampling strategies do not account for how the two aforementioned heuristics use noisy labels from the antivirus scanners to filter the data under consideration, a new type of query strategy that cooperates with these heuristics is proposed to increase the effectiveness of the integrated expert review.
Since the heuristics can only identify data as anomalous, any data not identified by them and not selected for expert review will be marked as benign. Therefore, only expert review results that flag data as anomalous will affect the final training data labels. Accordingly, an anomaly query strategy was developed that selects, for expert review, data that receives a higher score from the present detection model, but not so high as to be automatically relabeled. More generally, in some embodiments, the query policy 135 may have a submission budget B, where B is determined as a fixed percentage of the total number of new training data during the retraining period. In other embodiments, budget B may instead be a fixed value independent of that total. The anomaly query policy 135 may then submit to the integrated expert review 160 the B remaining data items with the highest anomaly scores below the automatic relabeling threshold M. The remaining data beyond B that are not submitted to the integrated expert review are marked as benign. By selecting data that is likely to be anomalous but would otherwise be marked as benign, the anomaly scheme makes the expert review more likely to effect changes in the training labels than uncertainty sampling.
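A minimal sketch of the anomaly query policy described above, combining the "no detection" filter, automatic relabeling above the threshold M, and the submission budget B; the interfaces and names are hypothetical.

```python
def select_for_review(undetected_data, current_model, M, B):
    """Split data that passed the "no detection" filter into auto-relabeled, queried, and benign sets."""
    auto_relabeled, candidates = [], []
    for x in undetected_data:
        score = current_model.score(x)
        if score >= M:
            auto_relabeled.append(x)              # automatically relabeled as anomalous
        else:
            candidates.append((score, x))
    candidates.sort(key=lambda t: t[0], reverse=True)
    queries = [x for _, x in candidates[:B]]      # most anomalous below M: send to expert review
    benign = [x for _, x in candidates[B:]]       # everything else is labeled benign this cycle
    return auto_relabeled, queries, benign
```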
Furthermore, in some embodiments, after several forms of learning were considered (including decision tree and nearest-neighbor based approaches), logistic regression may be selected as the basis for the anomaly data detection model according to embodiments of the present disclosure. As a linear classifier, logistic regression assigns a weight to each feature and computes a prediction as a linear function of the feature vector, resulting in a real-valued quantity. Scoring each data item as a real-valued quantity allows a balance to be struck between correct reports and false reports by adjusting the threshold above which data is flagged as anomalous. Linear classification adapts well to a variety of data sizes because the size of the model is a function of the data dimensionality, not of the training data size. Furthermore, the clear relationship between weights and features allows an analyst to easily understand what the detector is doing and why, which is quite difficult for complex tree ensembles. Finally, logistic regression scales well during training, with many available implementations able to accommodate high-dimensional feature spaces and large amounts of training data.
Returning to the training flow, it integrates the high-quality labels from the expert review 160 with the noisy labels from the third-party data analysis 150 and the current model 130. Since the expert review 160 labels only a relatively small amount of data, the noisy labels from the third-party data analysis 150 would overwhelm the expert review 160 labels during training unless the expert review 160 labels receive special processing. Thus, the standard logistic regression training process is described first, followed by a description of the special processing of the expert review 160 labels. Given a training set {(x_1, y_1), ..., (x_n, y_n)}, where y_i ∈ {-1, +1} is the label of data x_i, the logistic regression training process finds the weight vector w that minimizes the following loss function:

L(w) = C_{-} \sum_{i: y_i = -1} \ell(-\langle w, x_i \rangle) + C_{+} \sum_{i: y_i = +1} \ell(\langle w, x_i \rangle) + \tfrac{1}{2} \lVert w \rVert_2^2    (1)

where C_- > 0 and C_+ > 0 are distinct hyper-parameters that control regularization and the relative importance of each class, and \ell(x) = log(1 + exp(-x)) is the logistic loss function. The first and second terms in formula (1) correspond to the misclassification loss on negative and positive examples, respectively, and the last term is a regularization term that discourages models with many large non-zero weights. To enhance the effectiveness of the expert review 160 labels, any data marked as benign by the expert review 160 is assigned a higher weight W during training. Weighting only the data that the expert review 160 labels as benign is particularly effective because the anomaly query policy 135 tends to select, for expert review, data that falls on the anomalous side of the decision boundary; when such data is labeled normal during training, a particularly high weight is required to exert a corrective effect on the model and force the instance to receive a normal classification despite its anomalous score.
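The weighted training described above can be approximated, for example, with scikit-learn's LogisticRegression by passing per-sample weights; mapping C_-, C_+, and W onto sample_weight as shown here (with the regularization strength fixed at C=1.0) is an assumption made for illustration and is not the patent's exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_detector(X, y, expert_benign, C_neg=0.16, C_pos=0.0048, W=10.0):
    """Approximate loss (1): per-class weights C_-/C_+, boosted by W for expert-reviewed benign data."""
    weights = np.where(y == 1, C_pos, C_neg)                  # class importance weights
    weights = np.where(expert_benign, W * weights, weights)   # emphasize expert benign labels
    model = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
    model.fit(X, (y == 1).astype(int), sample_weight=weights)
    return model
```

Under this assumed setup, the model's decision_function would then provide the real-valued score that is compared against the relabeling threshold M.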
In the example for binary files, the data set used for evaluation includes a variety of binary files so as to reflect the appearance and prevalence of binary files over time and to record changes in the best available label knowledge for each binary over time. Thus, in some embodiments, the evaluation data set consists of approximately 1.1 million distinct binaries submitted to VirusTotal between January 2012 and June 2014, which satisfy the criteria described above. VirusTotal receives submissions from end users, researchers, and companies, resulting in a diverse binary sample containing thousands of malware families (abnormal data) as well as benign instances (normal data). To randomize interactions with daily and hourly batch submission jobs, VirusTotal provided the hash values of the binaries submitted during a randomized segment of each hour of the aforementioned collection period, which reflects about 1% of all binaries during the collection period. The data set includes every submission of every binary so as to accurately represent the popularity and label knowledge of each binary over time.
Finally, the first year of the data set, from January 2012 to December 2012, is reserved for obtaining the initial model, and a full rolling-window evaluation of the detector is performed using data from January 2013 to June 2014. FIG. 2A shows the scans as a function of time; it can be seen that scans continue to occur throughout the entire period over which performance is measured, with approximately the first 200 days containing fewer scans. In addition to being well distributed over time, the scans are also distributed across different binary files. FIG. 2B illustrates resubmissions in the data set, with the horizontal axis ordering the binary files from most to least frequently submitted. The data set includes resubmissions to ensure that the distribution of the evaluation data mirrors the distribution of the actual data submitted to VirusTotal by incorporating the popularity of each individual file, effectively balancing any effect of polymorphism in the data set. Furthermore, including rescan events in the analysis provides more timely labels during evaluation.
The evaluation of the detection system demonstrates the potential of the integrated expert review technique to improve performance relative to current antivirus vendors, as well as the influence of expert review errors, the marginal effect of additional expert reviews, and the influence of different expert review integration strategies.
As previously described, in some embodiments, instead of using actual human expert reviews, the expert reviews 160 may be provided using algorithms with high accuracy and high cost. Further, in some embodiments, the expert review 160 can be simulated using existing high accuracy tags, for example, when evaluating a detection system according to some embodiments of the present disclosure. For example, the integration expert review 160 is modeled by using gold tags (gold label) associated with binary files. For experiments that take into account imperfect expert reviews, simulation expert reviews may be assigned a correct reporting rate and a false reporting rate, allowing the probability of an expert review providing the correct label to be dependent on the gold label of the sample. By adjusting the likelihood of a correct response to the gold tag of the sample, errors of actual expert reviews (which are highly likely to correctly identify benign binaries as benign, but less likely to correctly identify malicious binaries as malicious) can be modeled more accurately.
Further, in some embodiments, the various system parameters described above may be managed, including the expert review submission budget B, the automatic relabeling confidence threshold M, and the learning parameters C_-, C_+, and W. The effect of changing the submission budget B is described below. In some embodiments, the trial is performed with an average of 80 queries per day. In some embodiments, the remaining parameters are tuned to maximize the detection rate, for a set of binary files obtained from an industry partner, at false positive rates between 0.01 and 0.001. In some embodiments, the following values may be used: M = 1.25, C_- = 0.16, C_+ = 0.0048, and W = 10.
The primary motivation for measuring the performance of a detection system in a research or development setting is to understand how the system will behave in a production setting. Measurement techniques should therefore seek to minimize the differences from the production setting. In practice, knowledge of the data to be detected and of the labels changes over time, as new data to be detected appears and the detectors respond with updated labels. Performance measurement techniques that fail to recognize the emergence of data and label knowledge over time effectively utilize knowledge from the future, rendering the measurement largely meaningless. For example, consider malware that evades detection at first but can be easily detected once it has been identified. Inserting such correctly labeled data into the training data sidesteps the difficult task of identifying the anomalous data for the first time, producing an artificially inflated appearance of performance.
Three schemes for measuring detector performance are analyzed below, each recognizing to a different degree that binary files and labels emerge over time. "Cross-validation" is a common scheme for machine learning evaluation when the data to be detected are independent and identically distributed (i.i.d.). In the malware/malicious user behavior detection scenario, however, the i.i.d. assumption does not hold, since malware and malicious user behavior change over time to evade detection. Cross-validation, which randomly partitions the data and applies evaluation-quality labels to all data, does not take time into account at all. An evaluation that maintains time consistent samples recognizes the chronological ordering of the data, but not the emergence of labels over time; instead, it applies the gold labels from future scan results to all binaries. Using these gold-quality labels during training effectively assumes that accurate detection occurs immediately. An evaluation that maintains time consistent labels fully respects the progressive nature of knowledge, ordering the data in time and constraining the training process to the data and labels available at training time. To make a measurement with both time consistent samples and time consistent labels, the data may be divided into periods, and the first n-1 periods are used to train the model that detects the content of period n. Hereinafter, a period length of one week is used unless explicitly indicated otherwise, although the disclosure is not limited thereto. FIG. 3A shows the details of the three schemes described above.
Fig. 3A is a schematic diagram illustrating the three training schemes of cross-validation, time consistent samples, and time consistent labels according to an embodiment of the present disclosure. As shown in FIG. 3A, the upper left corner shows data A-G, which may have different labels at different times. For example, data C appears at t = 0, and its label at the time of appearance is a negative result (i.e., normal data), whereas by time t = 2 its label has become a positive result (i.e., abnormal data), and the finally confirmed gold label is positive. In the "cross-validation" scheme in the upper right corner of FIG. 3A, there is no requirement on the chronological order in which data and labels appear, and thus data used in validation may also appear during training. For example, data E' with a known gold label (e.g., data E submitted a second time with a positive label) is used in training, so the model thus trained can, unsurprisingly, accurately identify the first submission of data E, for which the label is unknown. In other words, because training data is used without distinguishing when it appeared, the measured detection performance of the detector is artificially inflated. In the "time consistent samples" scheme in the lower left corner, although the samples (or data) appear in chronological order, the corresponding labels are always the final gold labels, which also inflates detector performance. Taking sample C as an example: although it is a negative result at t = 0, as shown in the upper left corner of FIG. 3A, in the time consistent samples scheme it is trained with its positive gold label, so that the detector can accurately detect the sample, unlike what would actually occur. In the "time consistent labels" scheme in the lower right corner, both data and labels appear in chronological order, which matches the actual situation. In other words, the detector performance obtained by training and evaluating with this scheme is the closest to reality.
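To illustrate the difference, the following sketch assembles training and test sets with both time consistent samples and time consistent labels; the data structures (a list of (first_seen_period, features) tuples and a per-period label history) are assumptions made for this example.

```python
def time_consistent_splits(samples, label_history, num_periods):
    """For each period n, train only on samples that appeared before n, using labels known at n - 1."""
    for n in range(1, num_periods):
        train = [(x, label_history[i][n - 1])            # labels as known before period n
                 for i, (t, x) in enumerate(samples) if t < n]
        test = [(x, label_history[i][num_periods - 1])   # evaluate against the final gold labels
                for i, (t, x) in enumerate(samples) if t == n]
        yield n, train, test
```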
This experiment demonstrates that the measurement technique materially affects the measured results. FIG. 3B shows the results of the analysis. Note that cross-validation 310 and time consistent samples 320 behave similarly, with detection rates respectively 20 and 19 percentage points higher than the time consistent labels scheme 330 at a 0.5% false positive rate. These experiments were conducted without any expert review queries, because expert review effectively reduces the impact of time consistent labels by revealing future labels. Note also that these conclusions apply only to the malware/user behavior detection environment, not to general classification scenarios, because in a general classification scenario the classes themselves may change (e.g., classes may be added or removed) as the data to be classified changes; in other words, there is no so-called gold label.
FIG. 4 is a graph illustrating a comparison of the performance of a system for detecting anomalous data according to an embodiment of the present disclosure in different configurations. Without support from the integrated expert review (curve 410), the detector achieves 72% detection at a 0.5% false positive rate, which is roughly equivalent to the best detector performance on VirusTotal. With support from expert review (curve 420), the detection rate increases to 89% at a 0.5% false positive rate, using, for example, an average of 80 queries per day.
Furthermore, VirusTotal invokes the scanners from the command line rather than in an execution environment, which allows the scanners to examine the file arbitrarily but not to observe its dynamic behavior. Since the present detection system includes the analysis of dynamic attributes, its performance when constrained to the static attributes provided by VirusTotal was also observed, with curves 430 and 440 corresponding to operation with and without expert review, respectively. Note that this constraint is more unfavorable to the present detector than to the third-party detectors, which can access the binary itself and apply signatures derived from dynamic analysis. FIG. 4 demonstrates that the performance of the present detector degrades when constrained to static features, but with support from the integrated expert review it still exceeds the third-party detectors, achieving 84% detection at 0.5% false positives.
In addition to providing detection performance superior to the third party labels across the entire data set, the present detector achieves a further success: it is capable of detecting new malware that the detectors on VirusTotal missed. Of the approximately 1.1 million samples included in one actual dataset, 6873 samples carried a malicious gold label but were not detected by any vendor at the time the sample first appeared. Using 80 expert review queries per day, the present detector was able to detect 44% and 32% of these new samples at false positive rates of 1% and 0.1%, respectively.
Furthermore, to provide a corresponding analysis of false positives, performance was measured on 61213 samples that carried a benign gold label and were not detected as malware by any software vendor when the sample first appeared. Of these 61213 benign samples, the present detector labeled 2.0% and 0.2% as malicious when operating at false positive rates of 1% and 0.1% over all data, respectively. Since these samples had not yet been included as training data, a somewhat higher false positive rate on the initial scan of benign samples is to be expected.
In addition, the query strategy used for expert review offers a number of advantages over existing work.
Fig. 5A shows the effect of each of the three improvements introduced and discussed above. When all three improvements are employed, that is, when the query policy 135 selects samples to submit to expert review based on maliciousness (rather than uncertainty), applies a "no detection" filter, and applies automatic re-labeling, the detection rate (correct report rate) is highest. When only two, or even one, of the improvements is used, the detection rate decreases. For a fixed label budget B of 80, selecting uncertain samples yields a detection rate 17 percentage points lower at a 0.1% false positive rate than the combination of the above improvements.
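As an illustrative sketch of such a query policy (the function and field names are assumptions for illustration, not the claimed implementation), selecting by maliciousness with a "no detection" filter and a fixed daily budget might look as follows:

```python
def select_queries(candidates, budget=80):
    """Pick up to `budget` samples to submit for expert review.

    Each candidate is a dict with:
        'score'     - anomaly index assigned by the current detector model
        'threshold' - detection threshold of the current model
    """
    # "No detection" filter: only consider samples the current model does not yet flag.
    undetected = [c for c in candidates if c['score'] < c['threshold']]
    # Rank by maliciousness (highest anomaly index first) rather than by uncertainty.
    undetected.sort(key=lambda c: c['score'], reverse=True)
    return undetected[:budget]
```

Automatic re-labeling would then feed the expert verdicts for the selected samples back into the training data, as described for step S630 below.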
Further, fig. 5B shows a comparison for varying accuracy of the expert review 160 itself. As the true positive rate (TPR) of the expert review increases, the TPR of the finally trained detector model also increases gradually. Even when the TPR of the expert review is only 0.80, however, the TPR of the finally trained detector model remains above 0.8.
Therefore, by using a detection mechanism combined with expert review, the detection rate for abnormal data can be greatly improved while introducing only limited external resources, thereby realizing a low-cost, high-efficiency abnormal data detection mechanism.
Next, a method for detecting abnormal data according to an embodiment of the present disclosure will be described in detail with reference to fig. 6. FIG. 6 is a flow diagram illustrating an example method 600 for detecting anomalous data in accordance with an embodiment of the present disclosure. As shown in fig. 6, the method 600 may include steps S610, S620, S630, and S640. According to some embodiments of the present disclosure, some of the steps of method 600 may be performed separately or in combination, and may be performed in parallel or sequentially; the method is not limited to the specific order of operations shown in fig. 6.
The method 600 begins at step S610, in which at least a portion of the data to be detected may be sent to an external data analysis resource as query data, based on a first anomaly index of the data to be detected determined using the current detector model.
In step S620, analysis results for the query data may be received from the external data analysis resource.
In step S630, the detector model may be updated based at least in part on the analysis results and the data to be detected.
In step S640, according to a second anomaly index of the data to be detected determined using the updated detector model, data whose second anomaly index is higher than a predetermined threshold may be determined to be abnormal data.
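Putting steps S610-S640 together, a minimal sketch of one detection round might look as follows (the object names and the `score`, `analyze`, and `retrain` methods are assumptions for illustration only):

```python
def detect_abnormal(data, model, analysis_resource, threshold, budget=80):
    # S610: score with the current model and select query data for external analysis
    scores = {d.id: model.score(d) for d in data}                  # first anomaly index
    undetected = [d for d in data if scores[d.id] < threshold]
    queries = sorted(undetected, key=lambda d: scores[d.id], reverse=True)[:budget]

    # S620: receive analysis results (e.g., labels) from the external resource
    results = analysis_resource.analyze(queries)

    # S630: update labels of the queried data and retrain the detector model
    for d, label in zip(queries, results):
        d.label = label
    model = model.retrain(data)

    # S640: re-score with the updated model and report data above the threshold
    return [d for d in data if model.score(d) > threshold]         # second anomaly index
```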
In some embodiments, determining the first anomaly index or the second anomaly index of the data to be detected using the current detector model or the updated detector model may include: extracting a feature vector of the data to be detected; and applying the current detector model or the updated detector model to the feature vector to determine the first anomaly index or the second anomaly index of the data to be detected, respectively. In some embodiments, extracting the feature vector of the data to be detected includes, for each of one or more attribute data of the data to be detected, performing one of the following operations: if the attribute data is categorical data, the attribute data corresponds to a particular element in the feature vector; if the attribute data is ordinal data, the partition in which the attribute data falls corresponds to a particular element in the feature vector; if the attribute data is plain string data, the 3-grams of the attribute data correspond to particular elements in the feature vector; and if the attribute data is sequential data, the n-grams of the attribute data correspond to particular elements in the feature vector. In some embodiments, the external data analysis resource may be a third party detector and/or expert review. In some embodiments, sending at least a portion of the data to be detected to the external data analysis resource as query data according to the first anomaly index of the data to be detected determined using the current detector model may include: determining first data whose first anomaly index is lower than a predetermined threshold among the data to be detected; determining, as the query data, one or more second data whose first anomaly indexes rank highest among the first data; and sending the query data to the external data analysis resource. In some embodiments, the number of the one or more second data may be the product of the number of the first data and a fixed ratio, or a rounded value thereof. In some embodiments, the number of the one or more second data may be a fixed number. In some embodiments, updating the detector model based at least in part on the analysis results and the data to be detected may include: updating the anomaly labels of the query data in the data to be detected based on the analysis results; and retraining the detector model using the updated data to be detected. In some embodiments, retraining the detector model using the updated data to be detected may include: retraining the detector model with the updated query data given an increased weight relative to other data in the data to be detected. In some embodiments, the data to be detected may be user behavior data relating to user behavior. In some embodiments, the user behavior data may include at least one of: registration information of the user, operation information of the user, and social information of the user.
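As an illustrative sketch of the feature extraction just described (the attribute typing, hashing-based indexing, and constants below are assumptions for illustration rather than the disclosed implementation), categorical, ordinal, string, and sequential attributes might be mapped to elements of a sparse feature vector as follows:

```python
# Illustrative attribute typing; a real embodiment would derive this from the data schema.
CATEGORICAL = {"country", "device_type"}
ORDINAL = {"file_size", "account_age_days"}
BUCKET_WIDTH = {"file_size": 1024, "account_age_days": 30}

def extract_features(record, dim=2**20):
    """Map one record's attribute data to indices of a sparse binary feature vector."""
    def idx(token):                                   # hash each token into a fixed-size space
        return hash(token) % dim

    features = set()
    for name, value in record.items():
        if name in CATEGORICAL:                       # categorical data: one element per value
            features.add(idx(f"{name}={value}"))
        elif name in ORDINAL:                         # ordinal data: one element per partition
            bucket = int(value // BUCKET_WIDTH[name])
            features.add(idx(f"{name}~{bucket}"))
        elif isinstance(value, str):                  # plain string data: character 3-grams
            for i in range(len(value) - 2):
                features.add(idx(f"{name}:3g:{value[i:i+3]}"))
        elif isinstance(value, (list, tuple)):        # sequential data: n-grams over the sequence
            n = 3
            for i in range(len(value) - n + 1):
                features.add(idx(f"{name}:ng:{tuple(value[i:i+n])}"))
    return sorted(features)
```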
Fig. 7 is a diagram illustrating an example hardware arrangement of an apparatus 700 for detecting anomalous data in accordance with an embodiment of the present disclosure. As shown in fig. 7, the apparatus, implemented here as an electronic device 700, may include: a processor 710, a memory 720, an input/output module 730, a communication module 740, and other modules 750. It should be noted that the embodiment shown in fig. 7 is intended only to illustrate the present disclosure and does not impose any limitation on it. Indeed, the electronic device 700 may include more, fewer, or different modules, and may be a stand-alone device or a distributed device spread over multiple locations. For example, the electronic device 700 may include (but is not limited to): a Personal Computer (PC), server, server cluster, computing cloud, workstation, terminal, tablet, laptop, smart phone, media player, wearable device, and/or home appliance (e.g., a television, set-top box, or DVD player), and the like.
The processor 710 may be a component responsible for the overall operation of the electronic device 700; it may be communicatively coupled to the other modules/components to receive data and/or instructions to be processed from them and to transmit processed data and/or instructions back to them. The processor 710 may be, for example, a general purpose processor such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Processor (AP), or the like. In this case, it may perform one or more of the steps of the method for detecting abnormal data according to the embodiments of the present disclosure described above, under the direction of instructions/programs/code stored in the memory 720. The processor 710 may also be, for example, a special purpose processor, such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or the like. In this case, it may exclusively perform one or more of the above steps of the method for detecting abnormal data according to the embodiments of the present disclosure, according to its circuit design. Further, the processor 710 may be any combination of hardware, software, and/or firmware. Moreover, although only one processor 710 is shown in FIG. 7, in practice the processor 710 may comprise multiple processing units distributed across multiple sites.
The memory 720 may be configured to temporarily or persistently store computer-executable instructions that, when executed by the processor 710, may cause the processor 710 to perform one or more of the steps of the methods described in the present disclosure. In addition, the memory 720 may also be configured to temporarily or persistently store data related to these steps, such as user behavior data to be processed, feature vectors, anomaly degree data, and so forth. The memory 720 may include volatile memory and/or non-volatile memory. Volatile memory may include, for example (but not limited to): Dynamic Random Access Memory (DRAM), Static RAM (SRAM), Synchronous DRAM (SDRAM), cache, etc. Non-volatile memory may include, for example (but not limited to): One-Time Programmable Read Only Memory (OTPROM), Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), masked ROM, flash memory (e.g., NAND flash, NOR flash, etc.), a hard disk drive or Solid State Drive (SSD), CompactFlash (CF), Secure Digital (SD), micro SD, mini SD, extreme Digital (xD), MultiMediaCard (MMC), memory stick, and the like. Further, the memory 720 may also be a remote storage device, such as Network Attached Storage (NAS) or the like. The memory 720 may also comprise distributed storage, such as cloud storage, spread across multiple locations.
The input/output module 730 may be configured to receive input from the outside and/or provide output to the outside. Although input/output module 730 is shown as a single module in the embodiment shown in fig. 7, in practice it may be a module dedicated to input, a module dedicated to output, or a combination thereof. For example, input/output module 730 may include (but is not limited to): a keyboard, mouse, microphone, camera, display, touch screen display, printer, speaker, headphones, or any other device that can be used for input/output, etc. In addition, the input/output module 730 may also be an interface configured to connect with the above-described devices, such as a headset interface, a microphone interface, a keyboard interface, a mouse interface, and the like. In this case, the electronic device 700 may be connected with an external input/output device through the interface and implement an input/output function.
The communication module 740 may be configured to enable the electronic device 700 to communicate with other electronic devices and exchange various data. The communication module 740 may be, for example: an Ethernet interface card, USB module, serial line interface card, fiber interface card, telephone line modem, xDSL modem, Wi-Fi module, Bluetooth module, 2G/3G/4G/5G communication module, etc. In the sense of data input/output, the communication module 740 may also be considered part of the input/output module 730.
Further, electronic device 700 may also include other modules 750, including (but not limited to): a power module, a GPS module, a sensor module (e.g., a proximity sensor, an illumination sensor, an acceleration sensor, a fingerprint sensor, etc.), and the like.
It should be noted, however, that the above-described modules are only some examples of modules that may be included in the electronic device 700, and electronic devices according to embodiments of the present disclosure are not limited thereto. In other words, electronic devices according to other embodiments of the present disclosure may include more modules, fewer modules, or different modules.
The disclosure has thus been described in connection with the preferred embodiments. It should be understood that various other changes, substitutions, and additions may be made by those skilled in the art without departing from the spirit and scope of the present disclosure. Accordingly, the scope of the present disclosure is not to be limited by the specific embodiments described above, but only by the appended claims.
Furthermore, functions described herein as being implemented by pure hardware, pure software, and/or firmware may also be implemented by special purpose hardware, combinations of general purpose hardware and software, and so forth. For example, functions described as being implemented by dedicated hardware (e.g., Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), etc.) may be implemented by a combination of general purpose hardware (e.g., Central Processing Unit (CPU), Digital Signal Processor (DSP)) and software, and vice versa.

Claims (12)

1. A method of detecting anomalous data comprising:
sending at least a portion of the data to be detected to an external data analysis resource as query data according to a first abnormality index of the data to be detected determined using a current detector model;
receiving analysis results for the query data from the external data analysis resource;
updating the detector model based at least in part on the analysis results and the data to be detected; and
determining data having a second abnormality index higher than a predetermined threshold value among the data to be detected as abnormal data according to the second abnormality index of the data to be detected determined using the updated detector model,
wherein sending at least a portion of the data to be detected to the external data analysis resource as query data according to the first abnormality index of the data to be detected determined using the current detector model comprises:
determining first data whose first abnormality index is lower than the predetermined threshold value among the data to be detected;
determining, as the query data, one or more second data whose first abnormality indexes rank in the top n among the first data, wherein n is a natural number; and
sending the query data to the external data analysis resource.
2. The method of claim 1, wherein determining the first abnormality index or the second abnormality index of the data to be detected using a current detector model or an updated detector model comprises:
extracting a characteristic vector of the data to be detected; and
applying the current detector model or the updated detector model to the feature vector to determine a first anomaly index or a second anomaly index of the data to be detected, respectively.
3. The method of claim 2, wherein extracting the feature vector of the data to be detected comprises performing, for each of one or more attribute data of the data to be detected, one of:
if the attribute data is categorical data, the attribute data corresponds to a particular element in the feature vector;
if the attribute data is ordinal type data, the partition in which the attribute data is located corresponds to a specific element in the feature vector;
if the attribute data is plain string data, a 3-gram corresponding to the attribute data corresponds to a particular element in the feature vector; and
if the attribute data is sequential data, an n-gram corresponding to the attribute data corresponds to a particular element in the feature vector.
4. The method of claim 1, wherein the external data analysis resource is a third party detector and/or expert review.
5. The method of claim 1, wherein the number of the one or more second data is a product of the number of the first data and a fixed ratio or a rounded value thereof.
6. The method of claim 1, wherein the number of the one or more second data is a fixed number.
7. The method of claim 1, wherein updating the detector model based at least in part on the analysis results and the data to be detected comprises:
updating the abnormal label of the query data in the data to be detected based on the analysis result; and
retraining the detector model using the updated data to be detected.
8. The method of claim 7, wherein retraining the detector model using the updated data to be detected comprises:
retraining the detector model with the updated query data given an increased weight relative to other data in the data to be detected.
9. The method of claim 1, wherein the data to be detected is user behavior data relating to user behavior.
10. The method of claim 9, wherein the user behavior data comprises at least one of: the user's registration information, the user's operational information, and the user's social information.
11. An apparatus for detecting anomalous data comprising:
a processor;
a memory storing instructions that, when executed by the processor, cause the processor to perform the method of any of claims 1-10.
12. A computer readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out a method according to any one of claims 1 to 10.
CN201811609044.9A 2018-12-27 2018-12-27 Method, apparatus, and computer-readable storage medium for detecting abnormal data Active CN109886016B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110092703.1A CN112685735B (en) 2018-12-27 2018-12-27 Method, apparatus and computer readable storage medium for detecting abnormal data
CN201811609044.9A CN109886016B (en) 2018-12-27 2018-12-27 Method, apparatus, and computer-readable storage medium for detecting abnormal data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811609044.9A CN109886016B (en) 2018-12-27 2018-12-27 Method, apparatus, and computer-readable storage medium for detecting abnormal data

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110092703.1A Division CN112685735B (en) 2018-12-27 2018-12-27 Method, apparatus and computer readable storage medium for detecting abnormal data

Publications (2)

Publication Number Publication Date
CN109886016A CN109886016A (en) 2019-06-14
CN109886016B true CN109886016B (en) 2021-01-12

Family

ID=66925368

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201811609044.9A Active CN109886016B (en) 2018-12-27 2018-12-27 Method, apparatus, and computer-readable storage medium for detecting abnormal data
CN202110092703.1A Active CN112685735B (en) 2018-12-27 2018-12-27 Method, apparatus and computer readable storage medium for detecting abnormal data

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110092703.1A Active CN112685735B (en) 2018-12-27 2018-12-27 Method, apparatus and computer readable storage medium for detecting abnormal data

Country Status (1)

Country Link
CN (2) CN109886016B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347151B (en) * 2019-08-08 2023-03-14 浙江宇视科技有限公司 Suspicion degree determination method and data analysis equipment
CN110619528A (en) * 2019-09-29 2019-12-27 武汉极意网络科技有限公司 Behavior verification data processing method, behavior verification data processing device, behavior verification equipment and storage medium
CN111277459A (en) * 2020-01-16 2020-06-12 新华三信息安全技术有限公司 Equipment anomaly detection method and device and machine-readable storage medium
CN111611591B (en) * 2020-05-22 2024-05-07 中国电力科学研究院有限公司 Firmware bug detection method and device, storage medium and electronic equipment
CN111723133A (en) * 2020-06-16 2020-09-29 武汉光谷联合医学检验所股份有限公司 Nucleic acid detection result query method, device, storage medium and device
CN118035653B (en) * 2024-04-10 2024-06-18 电子科技大学 Method for enhancing cooking action sequence data set

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034725A (en) * 2012-12-19 2013-04-10 中国科学院深圳先进技术研究院 Data acquisition, analysis and pre-warning system and method thereof
CN103226737A (en) * 2013-04-15 2013-07-31 清华大学 Chemical abnormal condition trend prediction method based on kalman filtering and grey prediction
CN103839080A (en) * 2014-03-25 2014-06-04 上海交通大学 Video streaming anomalous event detecting method based on measure query entropy

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4635194B2 (en) * 2004-12-02 2011-02-16 財団法人石油産業活性化センター Anomaly detection device
KR100927240B1 (en) * 2008-12-29 2009-11-16 주식회사 이글루시큐리티 A malicious code detection method using virtual environment
JP5424338B2 (en) * 2010-03-18 2014-02-26 日本電気株式会社 Abnormal value detection device, abnormal value detection method and abnormal value detection program for satellite positioning system
US9120365B2 (en) * 2013-12-09 2015-09-01 Ford Global Technologies, Llc Automatic temperature override pattern recognition system
CN104504400B (en) * 2015-01-08 2018-07-17 中国科学院重庆绿色智能技术研究院 A kind of driver's anomaly detection method based on online behavior modeling
KR101568872B1 (en) * 2015-05-11 2015-11-12 주식회사 블랙포트시큐리티 Method and apparatus for detecting unsteadyflow in program
CN105373894A (en) * 2015-11-20 2016-03-02 广州供电局有限公司 Inspection data-based power marketing service diagnosis model establishing method and system
CN105574416A (en) * 2015-12-16 2016-05-11 北京神州绿盟信息安全科技股份有限公司 Detection method and device of browser bug
CN108108615A (en) * 2016-11-24 2018-06-01 阿里巴巴集团控股有限公司 Using detection method, device and detection device
RU2651196C1 (en) * 2017-06-16 2018-04-18 Акционерное общество "Лаборатория Касперского" Method of the anomalous events detecting by the event digest popularity
CN108229171B (en) * 2018-02-11 2023-05-12 腾讯科技(深圳)有限公司 Driver processing method, device and storage medium
CN108829715B (en) * 2018-05-04 2022-03-25 慧安金科(北京)科技有限公司 Method, apparatus, and computer-readable storage medium for detecting abnormal data
CN108875366A (en) * 2018-05-23 2018-11-23 四川大学 A kind of SQL injection behavioral value system towards PHP program
CN108846283B (en) * 2018-06-15 2021-11-02 北京航空航天大学 Hardware trojan real-time detection system and design method thereof
CN108920958A (en) * 2018-07-13 2018-11-30 深圳市联软科技股份有限公司 Detect method, apparatus, medium and the equipment of pe file abnormal behaviour

Also Published As

Publication number Publication date
CN109886016A (en) 2019-06-14
CN112685735B (en) 2024-04-12
CN112685735A (en) 2021-04-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Method, apparatus and computer readable storage medium for detecting abnormal data

Effective date of registration: 20220427

Granted publication date: 20210112

Pledgee: Beijing Zhongguancun bank Limited by Share Ltd.

Pledgor: HUIANJINKE (BEIJING) TECHNOLOGY Co.,Ltd.

Registration number: Y2022990000246
