WO2020078059A1 - Interpretation feature determination method and device for anomaly detection - Google Patents

Interpretation feature determination method and device for anomaly detection

Info

Publication number
WO2020078059A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
feature
anomaly detection
interpretation
model
Prior art date
Application number
PCT/CN2019/097171
Other languages
French (fr)
Chinese (zh)
Inventor
方文静
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020078059A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present disclosure relates to the field of big data technology, and in particular, to an anomaly detection interpretation feature determination method and device.
  • Anomaly detection is an important part of data mining and can be applied to various fields such as intrusion detection, fraud detection, fault detection, system health detection, sensor network event detection, and ecosystem interference detection.
  • one of the algorithms is an unsupervised anomaly detection model.
  • the anomaly detection model is often a black box, and users cannot perceive its internal working state.
  • model interpretation is crucial. By interpreting the model, you can further understand the output of the model, such as which features of the input sample have the greatest impact on the model output. Model interpretation can provide an analysis direction for the cause of the output of the anomaly detection model.
  • one or more embodiments of the present specification provide a method and apparatus for determining an interpretation feature of anomaly detection, so as to improve the accuracy of acquiring the interpretation feature of anomaly detection.
  • A method for determining an interpretation feature for anomaly detection includes:
  • For a sample input to an anomaly detection model, where the sample includes at least one sample feature, determining the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
  • According to the degree of offset of each sample feature in the sample, determining at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
  • An interpretation feature determination apparatus for anomaly detection includes:
  • An offset calculation module, configured to, for a sample input to an anomaly detection model, where the sample includes at least one sample feature, determine the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
  • A feature determination module, configured to determine, according to the degree of offset of each sample feature in the sample, at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
  • An interpretation feature determination device for anomaly detection includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • When executing the program, the processor implements the following steps:
  • For a sample input to an anomaly detection model, where the sample includes at least one sample feature, determining the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
  • According to the degree of offset of each sample feature in the sample, determining at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
  • The method and apparatus for determining interpretation features of anomaly detection in one or more embodiments of this specification find the interpretation features of an anomaly according to distribution parameters, that is, according to the data distribution of the feature values of the sample features themselves. The approach is independent of the model and does not rely on it, so imperfect model-related information, such as sample imbalance, does not affect the identification of interpretation features. Moreover, identifying interpretation features by distribution parameters matches how outliers are distributed in anomaly detection, so the interpretation features are obtained with high accuracy.
  • FIG. 1 is a schematic diagram of the principle of anomaly detection provided by one or more embodiments of this specification.
  • FIG. 2 shows a method for determining an interpretation feature of anomaly detection provided by one or more embodiments of this specification.
  • FIG. 3 is a schematic structural diagram of an apparatus for determining an interpretation feature of anomaly detection provided by one or more embodiments of this specification.
  • FIG. 4 is a schematic structural diagram of another apparatus for determining an interpretation feature of anomaly detection provided by one or more embodiments of this specification.
  • Anomaly detection is also called outlier detection.
  • An outlier is an object that deviates significantly from other data points. Outliers differ from most of the data and make up only a small part of the overall data. Anomaly detection needs to identify these outliers in the data; for example, it can be used to identify abnormal transactions.
  • At least one embodiment of this specification provides an interpretation feature determination method for anomaly detection that can be applied to interpreting an unsupervised anomaly detection model. The interpretation scheme need not introduce an additional interpretation model and does not rely on the anomaly detection model itself.
  • A sample is the input of the anomaly detection model and corresponds to a model output of that model. For example, if A is input into the anomaly detection model and the model outputs B, then A is the sample.
  • A sample may have at least one sample feature, which describes an attribute of the sample in some respect.
  • For example, the sample may be a user whose user ID is 1100, and the sample features of that sample may include the user's age, address, and working years. Age is one sample feature, and address is another.
  • the interpretation feature may be a partial feature determined from the above-mentioned sample features.
  • the sample feature may include F1, F2, and F3, and the interpretation feature may be F1 and F2 therein.
  • Anomaly detection involves two stages: "training" and "prediction".
  • In the "training" stage, the anomaly detection model can be trained on the training set data.
  • In the "prediction" stage, a sample in the test set data can be used as the input of the anomaly detection model to predict whether the input sample is abnormal.
  • Model interpretation and model training/prediction are two independent parts.
  • FIG. 2 describes a method for determining an explanation feature of anomaly detection.
  • this method uses local model interpretation when interpreting the anomaly detection model, that is, to provide a corresponding interpretation for the prediction of a specific sample.
  • the method may include:
  • In step 200, the distribution parameter of each sample feature in the training set data is obtained from the training set data of the anomaly detection model.
  • the anomaly detection model may be an unsupervised model.
  • the training set data may be data for training an anomaly detection model.
  • the training set data may include multiple samples, and each sample may include at least one sample feature.
  • the sample may be a user whose user ID is 1100, and at least one sample characteristic included in the sample may include: the user's age, address, working years, and annual income.
  • Each sample feature can obtain a corresponding distribution parameter.
  • the sample feature "age” corresponds to a distribution parameter S1
  • the sample feature "working years” corresponds to a distribution parameter S2.
  • The distribution parameter of each sample feature can be obtained by collecting the same sample feature from every sample in the training set data.
  • This shared sample feature can be called the target sample feature; collecting it from every sample yields multiple target sample feature values.
  • For example, suppose the training set data includes three samples: the users identified as 1100, 1101, and 1102. The sample features of each user include "annual income".
  • The "annual income" sample feature can be obtained from each sample; this feature is the target sample feature.
  • This gives a target feature set containing the "annual income" of the three users. The distribution parameter corresponding to "annual income" can then be determined from the feature values in the target feature set.
  • Distribution parameters can be used to represent the distribution characteristics of sample features in the training set data of the anomaly detection model.
  • The multivariate Gaussian model is a classic algorithm. It assumes that the data follows a normal distribution in each feature dimension. Under this assumption, the well-known 3-sigma rule applies: the region within three standard deviations of the mean contains about 99.7% of the data, and points outside this region can be considered outliers. A 2-sigma or 1-sigma rule can be used in the same way.
  • The above description illustrates one data distribution characteristic.
  • The outliers that anomaly detection aims to identify are usually points that deviate, in terms of distribution, from the region where most of the data lies. That region characterizes the data distribution; for example, it may be the range within three standard deviations of the mean.
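The coverage figures behind the sigma rules can be checked numerically. The following is a minimal sketch, assuming a feature that is exactly normally distributed; the helper name `fraction_within` is hypothetical and not part of the source.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # For a normal variable X with mean u and standard deviation sigma:
    # P(|X - u| <= k * sigma) = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"{k}-sigma: {fraction_within(k):.4f}")
# prints 0.6827, 0.9545 and 0.9973 for k = 1, 2, 3
```

Real feature distributions are rarely exactly normal, so these fractions are best read as a rough guide to how rare a given deviation is.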
  • the distribution parameters calculated in this step may include: the mean and variance of the sample features.
  • the mean can be represented by u
  • the variance can be represented by s.
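Step 200 can be sketched as follows. This is an illustration only: the function name `fit_distribution_params`, the dictionary layout, and the feature values are all made up for the example. The source calls s the "variance"; the sketch assumes s is meant as the standard deviation, in line with the 3-sigma rule, and `statistics.pvariance` could be substituted if the literal variance is intended.

```python
from statistics import fmean, pstdev


def fit_distribution_params(training_set):
    """Return {feature_name: (u, s)} computed over all samples of the training set."""
    params = {}
    for name in training_set[0]:
        # The "target feature set": the same feature collected from every sample.
        target_feature_set = [sample[name] for sample in training_set]
        u = fmean(target_feature_set)             # mean of the feature
        s = pstdev(target_feature_set, mu=u)      # assumed: population standard deviation
        params[name] = (u, s)
    return params


# Hypothetical training set with three users (IDs 1100, 1101, 1102 in the text).
training_set = [
    {"age": 30, "annual_income": 50_000},
    {"age": 40, "annual_income": 60_000},
    {"age": 50, "annual_income": 70_000},
]
params = fit_distribution_params(training_set)
print(params["age"])  # mean 40.0 and a standard deviation of roughly 8.16
```

Non-numeric features such as an address would need to be encoded numerically before a mean can be taken; the source does not say how, so the sketch leaves them out.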
  • In step 202, for a sample input to the anomaly detection model, where the input sample includes at least one sample feature, the degree of offset of each sample feature is determined according to the distribution parameter of that sample feature.
  • the sample is a sample in the test set data.
  • the test set data may include multiple samples, and each sample may include at least one sample feature.
  • this method's interpretation scheme for anomaly detection is applied to local model interpretation, that is, to explain the anomaly detection of each specific sample.
  • For example, inputting sample Y1 into the trained anomaly detection model yields model output result D1, and inputting sample Y2 yields model output result D2.
  • The model interpretation of this method is used to explain the association between Y1 and D1 and the association between Y2 and D2: for example, which features of Y1 contribute more to the result D1, and which features of Y2 contribute more to the result D2. Therefore, step 202 and step 204 may be performed on any one of the samples in the test set data.
  • each sample in the test set data may also include multiple sample features.
  • A corresponding degree of offset is calculated for each sample feature; the degree of offset is an index measuring whether the sample feature lies in the above-mentioned "region where most data is located".
  • The degree of offset can be calculated on the following principle: for each feature dimension, calculate how many multiples of the variance a new sample deviates from the training-set mean. The greater the deviation, the more abnormal the data. Taking the mean and variance as the distribution parameters, the following formula (1) can be used to calculate the degree of offset:
  • n = |v - u| / s    (1)
  • n is the degree of offset; it provides a uniform abnormality measure across different sample features.
  • v is the actual feature value of a sample feature in the sample;
  • u is the mean of that sample feature, computed from the training set;
  • s is the variance of that sample feature, computed from the training set. According to formula (1), the number of multiples of the variance by which the actual value deviates from the mean is taken as the degree of offset.
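A minimal sketch of step 202 follows, assuming formula (1) takes the form n = |v - u| / s, which is how the definitions of n, v, u, and s read. The function names, the parameter values, and the handling of s = 0 are assumptions made for illustration and do not come from the source.

```python
import math


def offset_degree(v, u, s):
    """Degree of offset n: how many multiples of s the value v lies from the mean u."""
    if s == 0:
        # Assumed convention: a constant training feature only flags values that differ.
        return 0.0 if v == u else math.inf
    return abs(v - u) / s


def offset_degrees(sample, params):
    """Compute the degree of offset for every feature of one input sample."""
    return {name: offset_degree(sample[name], u, s) for name, (u, s) in params.items()}


# Hypothetical distribution parameters (u, s) obtained from a training set.
params = {"age": (40.0, 10.0), "annual_income": (60_000.0, 10_000.0)}
sample_y1 = {"age": 45, "annual_income": 100_000}

print(offset_degrees(sample_y1, params))
# {'age': 0.5, 'annual_income': 4.0}
```

Because n is expressed in multiples of s, features measured in very different units (years, currency) become directly comparable.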
  • In step 204, at least one sample feature is determined, according to the degree of offset of each sample feature in the sample, as the interpretation feature of this anomaly detection for the sample.
  • The interpretation feature explains the association between the sample input in this anomaly detection and the model output result. For example, if sample Y1 is input to the anomaly detection model to obtain model output result D1, and the determined interpretation features are t1 and t2, then t1 and t2 are features of Y1 whose contribution to the output D1 is relatively high; it may be these two sample features that led the model to output D1. On the basis of the interpretation features, the cause of the output D1 for Y1 can then be analyzed further.
  • The interpretation features may be obtained by sorting the sample features of the input sample in descending order of degree of offset and taking the sample features ranked within a preset number of top positions as the interpretation features.
  • This approach selects the several sample features with the highest degree of offset as the interpretation features. Specific implementations are not limited to this approach.
  • For example, a threshold on the degree of offset may be set, and the sample features whose degree of offset exceeds the threshold are used as interpretation features.
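The two selection strategies of step 204 can be sketched as below. The function names and the numeric degrees of offset are hypothetical; the feature names F1, F2, and F3 follow the example used earlier in the text.

```python
def top_k_features(degrees, k):
    """Sort features by degree of offset, descending, and keep the first k."""
    ranked = sorted(degrees, key=degrees.get, reverse=True)
    return ranked[:k]


def features_above_threshold(degrees, threshold):
    """Keep every feature whose degree of offset exceeds the threshold."""
    return [name for name, n in degrees.items() if n > threshold]


# Hypothetical degrees of offset for one input sample.
degrees = {"F1": 4.0, "F2": 3.2, "F3": 0.5}

print(top_k_features(degrees, 2))              # ['F1', 'F2']
print(features_above_threshold(degrees, 3.0))  # ['F1', 'F2']
```

A fixed k always returns an explanation, even for an unremarkable sample, whereas a threshold may return no features at all; which behavior is preferable depends on how the interpretation is consumed.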
  • Step 200 can be performed on one device and belongs to the training phase. That is, the training phase of the anomaly detection model can include two parts: conventional training of the anomaly detection model, and obtaining the distribution parameters from the training set data.
  • Steps 202 and 204 can be executed on another device (or on the same device) and belong to the prediction phase. That is, the prediction phase also includes two parts: conventional use of the model to predict whether a sample is abnormal, and obtaining the interpretation features according to the distribution parameters.
  • In both the training phase and the prediction phase, the model interpretation scheme can run independently of the model's training and prediction scheme. It is also possible to calculate the distribution parameters while training, or to calculate the interpretation features for an input sample while predicting.
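The split across devices can be sketched as two functions that share only the serialized distribution parameters. Passing the parameters as JSON, the function names, and the sample values are assumptions for illustration; the source does not specify how the parameters are transferred. The sketch also treats s as a standard deviation.

```python
import json
import math
from statistics import fmean, pstdev


def training_phase(training_set):
    """Step 200: compute per-feature u and s, and serialize them for another device."""
    params = {}
    for name in training_set[0]:
        values = [sample[name] for sample in training_set]
        params[name] = {"u": fmean(values), "s": pstdev(values)}
    return json.dumps(params)


def prediction_phase(serialized_params, sample, top_k=1):
    """Steps 202 and 204: score each feature and return the top_k most deviant names."""
    params = json.loads(serialized_params)
    degrees = {}
    for name, p in params.items():
        distance = abs(sample[name] - p["u"])
        if p["s"] > 0:
            degrees[name] = distance / p["s"]
        else:
            # Assumed convention for a feature that never varied in training.
            degrees[name] = 0.0 if distance == 0 else math.inf
    return sorted(degrees, key=degrees.get, reverse=True)[:top_k]


# Hypothetical data: three training users and one test sample.
training_set = [
    {"age": 30, "working_years": 5},
    {"age": 40, "working_years": 10},
    {"age": 50, "working_years": 15},
]
blob = training_phase(training_set)
print(prediction_phase(blob, {"age": 41, "working_years": 40}))  # ['working_years']
```

The anomaly detection model itself never appears in either function, which reflects the point that the interpretation scheme runs independently of the model.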
  • The method for determining interpretation features of anomaly detection in at least one embodiment of this specification finds the interpretation features of an anomaly according to distribution parameters, that is, according to the data distribution of the feature values of the sample features themselves. The approach is independent of the model and does not rely on it, so imperfect model-related information, such as sample imbalance, does not affect the identification of interpretation features. Moreover, identifying interpretation features by distribution parameters matches how outliers are distributed in anomaly detection, so the interpretation features are obtained with high accuracy.
  • FIG. 3 shows an apparatus for determining interpretation features of anomaly detection provided by one or more embodiments of this specification. As shown in FIG. 3, the apparatus may include an offset calculation module 31 and a feature determination module 32.
  • The offset calculation module 31 is configured to, for a sample input to the anomaly detection model, where the sample includes at least one sample feature, determine the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
  • The feature determination module 32 is configured to determine, according to the degree of offset of each sample feature in the sample, at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
  • FIG. 4 shows another apparatus for determining interpretation features of anomaly detection provided by one or more embodiments of this specification. As shown in FIG. 4, on the basis of the structure shown in FIG. 3, the apparatus may further include a distribution calculation module 33.
  • The distribution calculation module 33 is configured to obtain the target sample feature from each sample of the training set data, yielding a target feature set that includes multiple target sample feature values, and to determine the distribution parameter of the target sample feature according to the target feature set;
  • The training set data includes multiple samples, and each sample includes at least one sample feature.
  • The offset calculation module 31 is specifically configured to: for one sample feature of a sample in the test set data of the anomaly detection model, determine the actual value of the sample feature in the sample; obtain the mean and variance of the sample feature in the training set data; and take the number of multiples of the variance by which the actual value deviates from the mean as the degree of offset. The distribution parameters include the mean and variance of the sample feature.
  • At least one embodiment of this specification also provides an interpretation feature determination device for anomaly detection.
  • The device includes a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • When executing the program, the processor implements the following steps:
  • For a sample input to an anomaly detection model, where the sample includes at least one sample feature, determining the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
  • According to the degree of offset of each sample feature in the sample, determining at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
  • Each step can be implemented in software, hardware, or a combination of the two.
  • For example, those skilled in the art can implement a step as software code, that is, as computer-executable instructions that realize the logical function corresponding to the step.
  • The executable instructions can be stored in the memory and executed by the processor in the device.
  • The devices or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product with a certain function.
  • A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.
  • One or more embodiments of this specification may be provided as a method, system, or computer program product. Therefore, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device; the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • One or more embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • One or more embodiments of this specification can also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network.
  • program modules may be located in local and remote computer storage media including storage devices.

Abstract

Embodiments of the description provide an interpretation feature determination method and device for anomaly detection. The method comprises: for a sample input to an anomaly detection model and comprising at least one sample feature, determining, according to a distribution parameter of each sample feature, the degree of deviation of the sample feature, wherein the distribution parameter is used to represent distribution characteristics of the sample feature in training set data of the anomaly detection model, and the anomaly detection model is an unsupervised model; and determining, according to the degree of deviation of all of the sample features of the sample, at least one sample feature to be an interpretation feature corresponding to the sample, wherein the interpretation feature is used to interpret the association between the sample and a model output result of the corresponding anomaly detection model.

Description

Anomaly detection interpretation feature determination method and device
Technical field
The present disclosure relates to the field of big data technology, and in particular to a method and device for determining interpretation features for anomaly detection.
Background
Anomaly detection is an important part of data mining and can be applied in many fields, such as intrusion detection, fraud detection, fault detection, system health detection, sensor network event detection, and ecosystem interference detection. In practical anomaly detection applications, one class of algorithm is the unsupervised anomaly detection model. Such a model is often a black box whose internal working state is invisible to users, so model interpretation is crucial to making the model more trustworthy. Interpreting the model helps users understand its output, for example which features of the input sample have the greatest impact on the model output, and provides a direction for analyzing why the anomaly detection model produced that output.
Summary of the invention
In view of this, one or more embodiments of this specification provide a method and apparatus for determining interpretation features of anomaly detection, so as to improve the accuracy with which the interpretation features are obtained.
Specifically, one or more embodiments of this specification are implemented by the following technical solutions:
In a first aspect, a method for determining an interpretation feature for anomaly detection is provided. The method includes:
For a sample input to an anomaly detection model, where the sample includes at least one sample feature, determining the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
According to the degree of offset of each sample feature in the sample, determining at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
In a second aspect, an interpretation feature determination apparatus for anomaly detection is provided. The apparatus includes:
An offset calculation module, configured to, for a sample input to an anomaly detection model, where the sample includes at least one sample feature, determine the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
A feature determination module, configured to determine, according to the degree of offset of each sample feature in the sample, at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
In a third aspect, an interpretation feature determination device for anomaly detection is provided. The device includes a memory, a processor, and a computer program stored on the memory and executable on the processor. When executing the program, the processor implements the following steps:
For a sample input to an anomaly detection model, where the sample includes at least one sample feature, determining the degree of offset of each sample feature according to the distribution parameter of that sample feature; the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
According to the degree of offset of each sample feature in the sample, determining at least one sample feature as the interpretation feature corresponding to the sample, where the interpretation feature is used to interpret the association between the sample and the corresponding model output result of the anomaly detection model.
The method and apparatus for determining interpretation features of anomaly detection in one or more embodiments of this specification find the interpretation features of an anomaly according to distribution parameters, that is, according to the data distribution of the feature values of the sample features themselves. The approach is independent of the model and does not rely on it, so imperfect model-related information, such as sample imbalance, does not affect the identification of interpretation features. Moreover, identifying interpretation features by distribution parameters matches how outliers are distributed in anomaly detection, so the interpretation features are obtained with high accuracy.
Brief Description of the Drawings
To explain the technical solutions in one or more embodiments of this specification or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in one or more embodiments of this specification; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the principle of anomaly detection provided by one or more embodiments of this specification;
FIG. 2 shows a method for determining interpretation features for anomaly detection provided by one or more embodiments of this specification;
FIG. 3 is a schematic structural diagram of an apparatus for determining interpretation features for anomaly detection provided by one or more embodiments of this specification;
FIG. 4 is a schematic structural diagram of another apparatus for determining interpretation features for anomaly detection provided by one or more embodiments of this specification.
Detailed Description
To help those skilled in the art better understand the technical solutions in one or more embodiments of this specification, those technical solutions are described clearly and completely below with reference to the drawings in one or more embodiments of this specification. The described embodiments are only some of the embodiments, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on one or more embodiments of this specification, without creative effort, shall fall within the scope of protection of this application.
Anomaly detection is also called outlier detection. An outlier is an object that deviates markedly from the other data points; outliers differ from most of the data and make up only a small part of the whole. Anomaly detection needs to separate these outliers from the rest of the data. For example, it can be used to identify abnormal transactions.
At least one embodiment of this specification provides a method for determining interpretation features for anomaly detection. The method can be applied to interpreting an unsupervised anomaly detection model. The interpretation scheme requires no additional interpretation model and does not depend on the anomaly detection model itself.
Some of the terms used in the description of the method are explained as follows:
Sample: a sample is used as the input of the anomaly detection model and corresponds to one model output of the anomaly detection model. For example, if A is input to the anomaly detection model and the model outputs B, then A is the sample.
Sample feature: a sample may have at least one sample feature, and the sample features describe the attributes of the sample in different respects. For example, the sample may be the user whose user ID is 1100, and the at least one sample feature of the sample may include the user's age, address, years of employment, and so on. Age is one sample feature, and address may be another.
Interpretation feature: in machine learning tasks, different models are proposed to model a problem. Beyond the direct output of a model, the result also needs to be understood further, for example which features influence the model output most and what factors determine the corresponding output; this requires interpreting the model. In the embodiments of this specification, "interpretation feature" denotes a feature that can explain the model output of the anomaly detection model; it can be used to explain the association between a sample input to the anomaly detection model and the model output. For example, if sample Y1 is input to the anomaly detection model to obtain model output D1, and the determined interpretation features are t1 and t2, then the features t1 and t2 in sample Y1 contribute most to the output D1, and D1 may have been obtained because of these two sample features. The interpretation features may be a subset of the sample features described above; for example, the sample features may include F1, F2, and F3, and the interpretation features may be F1 and F2.
Based on the terms explained above, the method for determining interpretation features in the embodiments of this specification is described below.
As shown in FIG. 1, anomaly detection includes two processes: "training" and "prediction". In the "training" stage, the anomaly detection model is trained on the training set data. In the "prediction" stage, a sample in the test set data is used as the input of the anomaly detection model to predict whether the input sample is abnormal. The scheme for interpreting the anomaly detection model provided by at least one embodiment of this specification is independent of both the training of the anomaly detection model and its use for prediction; that is, model interpretation and model training/prediction are two parts that run independently.
Continuing with FIG. 1 and referring also to FIG. 2, FIG. 2 describes a method for determining interpretation features for anomaly detection. Note first that the method interprets the anomaly detection model by local model interpretation, that is, it provides an explanation for the prediction of one specific sample.
As shown in FIG. 2, the method may include:
In step 200, the distribution parameters of each sample feature in the training set data are obtained from the training set data of the anomaly detection model.
In this step, the anomaly detection model may be an unsupervised model.
The training set data may be the data used to train the anomaly detection model. It may include multiple samples, and each sample may include at least one sample feature.
For example, the sample may be the user whose user ID is 1100, and the at least one sample feature of the sample may include the user's age, address, years of employment, annual income, and so on.
A corresponding distribution parameter can be obtained for every sample feature. For example, the sample feature "age" corresponds to a distribution parameter S1, and the sample feature "years of employment" corresponds to a distribution parameter S2.
The distribution parameters of each sample feature may be obtained as follows: the same sample feature, which may be called the target sample feature, is extracted from each sample of the training set data to obtain a target feature set containing multiple values of the target sample feature; the distribution parameters of the target sample feature are then determined from the target feature set.
Take the sample feature "annual income" as an example. The training set data may include multiple samples, say the users identified as 1100, 1101, and 1102. The sample features of every user include "annual income". The "annual income" feature, which is the target sample feature here, is extracted from each sample, giving a target feature set that contains the annual incomes of the three users. The distribution parameters of the feature "annual income" can then be determined from the feature values in the target feature set.
The distribution parameters represent the distribution characteristics of a sample feature in the training set data of the anomaly detection model. For example, the multivariate Gaussian model is a classic algorithm in anomaly detection. It assumes that each feature dimension follows a normal distribution. Under this assumption there is the well-known 3-sigma rule: the region within three sigma of the mean, which this specification describes as three variances, contains about 99.7% of the data, and a point outside this region can be regarded as an outlier. A 2-sigma rule, a 1-sigma rule, and so on may also be used.
The above describes one kind of data distribution characteristic. In terms of distribution, the outliers that anomaly detection is meant to identify are usually points that deviate from the region where most of the data lies, and that region has certain characteristics, for example lying within three variances of the mean.
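The coverage figure quoted for the 3-sigma rule can be checked numerically. The following is a minimal sketch, assuming a normally distributed feature and taking sigma to mean the standard deviation, which is how the rule is conventionally stated; the function name `fraction_within` is hypothetical.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # For a normal variable X, P(|X - mean| <= k * sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Show the coverage of the 1-sigma, 2-sigma and 3-sigma regions.
    print(k, round(fraction_within(k), 4))
```

The value for k = 3 is about 0.9973, which matches the 99.7% stated above; the 2-sigma and 1-sigma regions cover correspondingly less of the data.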
Based on the above, the distribution parameters computed in this step may include, for example, the mean and the variance of the sample feature. The mean may be denoted by u and the variance by s.
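Step 200 can be sketched as follows. This is an illustrative sketch rather than the implementation of the specification: samples are assumed to be dictionaries of numeric features, the function name `fit_distribution_params` and the feature names are hypothetical, and the spread s is assumed to be the population standard deviation (the text calls s the variance, while the 3-sigma rule is conventionally stated in standard deviations).

```python
from statistics import fmean, pstdev


def fit_distribution_params(training_samples):
    """Return {feature_name: (u, s)} computed per feature over the training set."""
    params = {}
    for name in training_samples[0].keys():
        # The "target feature set": the same feature taken from every training sample.
        target_feature_set = [sample[name] for sample in training_samples]
        u = fmean(target_feature_set)   # mean of the feature
        s = pstdev(target_feature_set)  # assumed spread: population standard deviation
        params[name] = (u, s)
    return params


# Hypothetical training set with three users and two numeric features.
training_samples = [
    {"age": 30, "annual_income": 50_000},
    {"age": 40, "annual_income": 60_000},
    {"age": 50, "annual_income": 70_000},
]
params = fit_distribution_params(training_samples)
```

Because the parameters are computed one feature at a time, no information about the anomaly detection model itself is needed at this stage.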
In step 202, for a sample input to the anomaly detection model, the input sample including at least one sample feature, the offset degree of each sample feature is determined according to the distribution parameters of that sample feature.
In this step, the sample is one sample in the test set data. The test set data may include multiple samples, and each sample may include at least one sample feature. As stated above, the interpretation scheme of this method is a local model interpretation, that is, it explains the anomaly detection result of each specific sample.
For example, sample Y1 is input to the trained anomaly detection model to obtain model output D1, and sample Y2 is input to obtain model output D2. The model interpretation of this method is applied to explain the association between Y1 and D1 and the association between Y2 and D2 separately, for example which features of Y1 contribute most to the result D1 and which features of Y2 contribute most to D2. Steps 202 and 204 may therefore be performed on one sample of the test set data at a time.
As with the training set data, each sample in the test set data may include multiple sample features. In this step, an offset degree is computed for each sample feature. The offset degree is an indicator of whether the sample feature lies within the "region where most of the data lies" described above.
For example, the offset degree may be computed on the following principle: for each feature dimension, compute how many multiples of the variance the new sample lies from the training-set mean; the larger the deviation, the more abnormal the data. Taking the mean and variance as the distribution parameters, formula (1) below can be used to compute the offset degree:
n = (v - u) / s        (1)
In formula (1), n is the offset degree, which provides a uniform measure of abnormality across different sample features; v is the actual value of a sample feature in the sample; u is the mean of that sample feature computed from the training set data; and s is the variance of that sample feature computed from the training set data. According to formula (1), the distance by which the actual value deviates from the mean, expressed in multiples of the variance, is taken as the offset degree.
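Formula (1) can be applied to every feature of one input sample as sketched below. This is a minimal illustration under stated assumptions: `params` maps each feature name to a hypothetical (u, s) pair obtained from the training set, s is assumed to be a nonzero spread value (treated here as a standard deviation), and the function names are not from the specification.

```python
def offset_degree(v, u, s):
    """Formula (1): n = (v - u) / s."""
    if s == 0:
        # Assumption of this sketch: a feature that is constant in training is rejected.
        raise ValueError("spread s must be nonzero")
    return (v - u) / s


def sample_offsets(sample, params):
    """Compute the offset degree of every feature of one sample."""
    return {name: offset_degree(value, *params[name]) for name, value in sample.items()}


# Hypothetical distribution parameters (u, s) and one test sample.
params = {"age": (40.0, 10.0), "annual_income": (60_000.0, 8_000.0)}
sample = {"age": 45, "annual_income": 100_000}
offsets = sample_offsets(sample, params)
print(offsets)
```

In this example the age is half a unit of s above its mean while the annual income is five units above, so the two features become comparable on one scale even though their raw units differ.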
In step 204, according to the offset degrees of the sample features in the sample, at least one sample feature is determined as the interpretation feature of this anomaly detection for the sample.
The interpretation feature is used to explain the association between the sample input in this anomaly detection and the model output. For example, if sample Y1 is input to the anomaly detection model to obtain model output D1, and the determined interpretation features are t1 and t2, then sample Y1 includes the features t1 and t2, these features contribute most to the output D1, and the model output D1 may have been obtained because of these two sample features. The cause of the anomaly detection output D1 for Y1 can of course be analyzed in further detail on the basis of the interpretation features.
For example, the interpretation features may be obtained as follows: the sample features of the input sample are sorted in descending order of offset degree, and the sample features ranked within a preset number of top positions are taken as the interpretation features. This approach selects the few sample features with the highest offset degrees. Implementations are not limited to this approach; for example, an offset degree threshold may be set instead, and the sample features whose offset degree exceeds the threshold are taken as the interpretation features.
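The two selection strategies described above, top-ranked features and a threshold, can be sketched as follows. The function names and the example values are hypothetical. Ranking by the absolute value of the offset degree is an assumption of this sketch, made so that features far below the mean also count as strongly offset; the text itself only speaks of sorting by offset degree.

```python
def top_k_features(offsets, k):
    """Sort features by offset magnitude in descending order and keep the first k."""
    ranked = sorted(offsets, key=lambda name: abs(offsets[name]), reverse=True)
    return ranked[:k]


def features_above_threshold(offsets, threshold):
    """Keep every feature whose offset magnitude exceeds the threshold."""
    return [name for name, n in offsets.items() if abs(n) > threshold]


# Hypothetical offset degrees for one sample.
offsets = {"age": 0.5, "annual_income": 5.0, "working_years": -3.2}
print(top_k_features(offsets, 2))
print(features_above_threshold(offsets, 3.0))
```

The top-k variant always returns a fixed number of interpretation features, while the threshold variant may return none for a sample whose features all lie close to their training means.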
The steps above may be executed on the same device or on different devices. For example, step 200 may be executed on one device and belongs to the training stage; the training stage of the anomaly detection model then has two parts, the conventional training of the anomaly detection model and the computation of the distribution parameters from the training set data. Steps 202 and 204 may be executed on another device (or on the same device) and belong to the prediction stage; the prediction stage likewise has two parts, the conventional use of the model to predict whether a sample is abnormal and the determination of the interpretation features from the distribution parameters. In each stage, training or prediction, the model interpretation scheme and the model training/prediction scheme can run independently. Alternatively, the distribution parameters may be computed while the model is being trained, or the interpretation features may be computed from the input sample while the prediction is being made.
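When step 200 runs on one device and steps 202 and 204 run on another, the distribution parameters have to be handed over between them. The following is one possible sketch of that handover, assuming the parameters are exchanged as a JSON string; the function names and the field names "u" and "s" are hypothetical, and the specification does not prescribe any exchange format.

```python
import json


def export_params(params):
    """Training side: serialize {feature: (u, s)} to a JSON string."""
    return json.dumps({name: {"u": u, "s": s} for name, (u, s) in params.items()})


def import_params(payload):
    """Prediction side: rebuild {feature: (u, s)} from the JSON string."""
    return {name: (entry["u"], entry["s"]) for name, entry in json.loads(payload).items()}


# Hypothetical parameters computed during the training stage.
params = {"age": (40.0, 10.0), "annual_income": (60000.0, 8000.0)}
restored = import_params(export_params(params))
```

Only the per-feature parameters cross the boundary, which reflects the point made above that the interpretation scheme needs nothing from the trained model itself.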
The method for determining interpretation features for anomaly detection in at least one embodiment of this specification finds the anomalous interpretation features according to distribution parameters. Because the interpretation features are found from the data distribution characteristics of the feature values themselves, the approach is independent of the model. Consequently, imperfections in model-related information, such as sample imbalance, do not affect the identification of interpretation features. Moreover, identifying interpretation features through distribution parameters matches the data distribution characteristics of outliers in anomaly detection, so the interpretation features are obtained with relatively high accuracy.
FIG. 3 shows an apparatus for determining interpretation features for anomaly detection provided by one or more embodiments of this specification. As shown in FIG. 3, the apparatus may include an offset degree calculation module 31 and a feature determination module 32.
The offset degree calculation module 31 is configured to, for a sample input to an anomaly detection model, the sample including at least one sample feature, determine the offset degree of each sample feature according to the distribution parameters of that sample feature; the distribution parameters represent the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model.
The feature determination module 32 is configured to determine, according to the offset degrees of the sample features in the sample, at least one sample feature as an interpretation feature corresponding to the sample, the interpretation feature being used to explain the association between the sample and the corresponding model output of the anomaly detection model.
FIG. 4 shows another apparatus for determining interpretation features for anomaly detection provided by one or more embodiments of this specification. As shown in FIG. 4, in addition to the structure shown in FIG. 3, the apparatus may further include a distribution calculation module 33.
The distribution calculation module 33 is configured to extract the target sample feature from each sample of the training set data to obtain a target feature set containing multiple values of the target sample feature, and to determine the distribution parameters of the target sample feature from the target feature set; the training set data includes multiple samples, and each sample includes at least one sample feature.
In another example, the offset degree calculation module 31 is specifically configured to: for one sample feature of the sample in the test set data of the anomaly detection model, determine the actual value of the sample feature in the sample; obtain the mean of the sample feature in the training set data; and determine the distance by which the actual value deviates from the mean, in multiples of the variance, as the offset degree; the distribution parameters include the mean and the variance of the sample feature.
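The division into modules 31, 32, and 33 can be pictured as three small classes. This is an illustrative sketch only: the class and method names are hypothetical, samples are assumed to be dictionaries of numeric features, the spread s is assumed to be the population standard deviation, and the feature determination module is assumed to use the top-k strategy on offset magnitude.

```python
from statistics import fmean, pstdev


class DistributionCalculationModule:
    """Counterpart of module 33: per-feature (u, s) from the training set."""

    def fit(self, training_samples):
        params = {}
        for name in training_samples[0].keys():
            values = [sample[name] for sample in training_samples]
            params[name] = (fmean(values), pstdev(values))
        return params


class OffsetCalculationModule:
    """Counterpart of module 31: offset degree n = (v - u) / s for each feature."""

    def __init__(self, params):
        self.params = params

    def compute(self, sample):
        return {name: (v - self.params[name][0]) / self.params[name][1]
                for name, v in sample.items()}


class FeatureDeterminationModule:
    """Counterpart of module 32: keep the top_k features by offset magnitude."""

    def __init__(self, top_k):
        self.top_k = top_k

    def determine(self, offsets):
        ranked = sorted(offsets, key=lambda name: abs(offsets[name]), reverse=True)
        return ranked[:self.top_k]


# Hypothetical data: three training samples and one input sample.
training = [
    {"age": 30, "annual_income": 50_000},
    {"age": 40, "annual_income": 60_000},
    {"age": 50, "annual_income": 70_000},
]
params = DistributionCalculationModule().fit(training)
offsets = OffsetCalculationModule(params).compute({"age": 42, "annual_income": 120_000})
interpretation_features = FeatureDeterminationModule(top_k=1).determine(offsets)
print(interpretation_features)
```

Keeping the distribution calculation in its own class mirrors FIG. 4, where module 33 is an optional addition to the two modules of FIG. 3.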
At least one embodiment of this specification further provides a device for determining interpretation features for anomaly detection. The device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the program, the processor implements the following steps:
for a sample input to an anomaly detection model, the sample including at least one sample feature, determining an offset degree of each sample feature according to the distribution parameters of that sample feature; the distribution parameters represent the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
according to the offset degrees of the sample features in the sample, determining at least one sample feature as an interpretation feature corresponding to the sample, the interpretation feature being used to explain the association between the sample and the corresponding model output of the anomaly detection model.
The order in which the steps of the flows shown in the above method embodiments are executed is not limited to the order in the flowchart. In addition, the steps may be implemented in software, in hardware, or in a combination of the two. For example, those skilled in the art may implement them as software code, namely computer-executable instructions that realize the logical functions corresponding to the steps. When implemented in software, the executable instructions may be stored in a memory and executed by a processor in the device.
The apparatuses or modules described in the above embodiments may be implemented by a computer chip or an entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of separate functional modules. When one or more embodiments of this specification are implemented, the functions of the modules may of course be implemented in the same piece or in multiple pieces of software and/or hardware.
Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus; the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
It should also be noted that the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, an element defined by the phrase "includes a ..." does not exclude the presence of other identical elements in the process, method, product, or device that includes the element.
One or more embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of this specification may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the data collection device and data processing device embodiments are described relatively briefly because they are substantially similar to the method embodiments; for related details, refer to the corresponding description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The above are only preferred embodiments of one or more embodiments of this specification and are not intended to limit this disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of this disclosure shall fall within the scope of protection of this disclosure.

Claims (10)

  1. A method for determining interpretation features for anomaly detection, the method comprising:
    for a sample input to an anomaly detection model, the sample including at least one sample feature, determining an offset degree of each sample feature according to the distribution parameters of that sample feature; the distribution parameters represent the distribution characteristics of the sample feature in the training set data of the anomaly detection model; the anomaly detection model is an unsupervised model;
    according to the offset degrees of the sample features in the sample, determining at least one sample feature as an interpretation feature corresponding to the sample, the interpretation feature being used to explain the association between the sample and the corresponding model output of the anomaly detection model.
  2. The method according to claim 1, wherein before the determining of the degree of deviation of each sample feature according to the distribution parameter of that sample feature, the method further comprises:
    obtaining, from the training set data of the anomaly detection model, the distribution parameter of each sample feature in the training set data.
  3. The method according to claim 2, wherein the obtaining of the distribution parameter of each sample feature in the training set data comprises:
    the training set data comprising a plurality of samples, each sample comprising at least one sample feature;
    obtaining a target sample feature from each sample of the training set data, to obtain a target feature set comprising a plurality of target sample features; and
    determining the distribution parameter of the target sample feature according to the target feature set.
  4. The method according to claim 1, wherein
    the distribution parameter comprises the mean and the variance of the sample feature.
  5. The method according to claim 4, wherein the determining of the degree of deviation of each sample feature according to the distribution parameter of that sample feature comprises:
    for one sample feature of the sample in the test set data of the anomaly detection model, determining the actual value of the sample feature in the sample;
    obtaining the mean of the sample feature in the training set data; and
    determining, as the degree of deviation, the distance by which the actual value deviates from the mean, expressed as a multiple of the variance.
  6. The method according to claim 1, wherein the determining, according to the degrees of deviation of the sample features in the sample, of at least one sample feature as the interpretation feature corresponding to the sample comprises:
    sorting the sample features in descending order of their degrees of deviation, and taking the at least one sample feature ranked within a preset number of top positions as the interpretation feature.
  7. An apparatus for determining interpretation features for anomaly detection, the apparatus comprising:
    a deviation calculation module, configured to, for a sample input to an anomaly detection model, the sample comprising at least one sample feature, determine a degree of deviation of each sample feature according to a distribution parameter of that sample feature, wherein the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model, and the anomaly detection model is an unsupervised model; and
    a feature determination module, configured to determine, according to the degrees of deviation of the sample features in the sample, at least one sample feature as an interpretation feature corresponding to the sample, wherein the interpretation feature is used to explain the association between the sample and the corresponding model output result of the anomaly detection model.
  8. The apparatus according to claim 7, further comprising:
    a distribution calculation module, configured to obtain a target sample feature from each sample of the training set data, to obtain a target feature set comprising a plurality of target sample features, and to determine the distribution parameter of the target sample feature according to the target feature set, wherein the training set data comprises a plurality of samples and each sample comprises at least one sample feature.
  9. The apparatus according to claim 7, wherein
    the deviation calculation module is specifically configured to: for one sample feature of the sample in the test set data of the anomaly detection model, determine the actual value of the sample feature in the sample; obtain the mean of the sample feature in the training set data; and determine, as the degree of deviation, the distance by which the actual value deviates from the mean, expressed as a multiple of the variance; wherein the distribution parameter comprises the mean and the variance of the sample feature.
  10. A device for determining interpretation features for anomaly detection, the device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following steps:
    for a sample input to an anomaly detection model, the sample comprising at least one sample feature, determining a degree of deviation of each sample feature according to a distribution parameter of that sample feature, wherein the distribution parameter represents the distribution characteristics of the sample feature in the training set data of the anomaly detection model, and the anomaly detection model is an unsupervised model; and
    determining, according to the degrees of deviation of the sample features in the sample, at least one sample feature as an interpretation feature corresponding to the sample, wherein the interpretation feature is used to explain the association between the sample and the corresponding model output result of the anomaly detection model.
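
The steps recited in claims 1 to 6 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function names, the feature names `amount` and `logins`, the use of the population variance, and the handling of a zero-variance feature are all assumptions made for illustration. The degree of deviation is computed as |x − μ| / σ², following the wording of claim 5 (a multiple of the variance); dividing by the standard deviation instead would give the more familiar z-score.

```python
from statistics import fmean, pvariance


def fit_distribution_params(training_samples):
    """Return one (mean, variance) pair per feature column of the training set."""
    n_features = len(training_samples[0])
    params = []
    for j in range(n_features):
        # Collect the same feature from every training sample (the "target feature set").
        column = [sample[j] for sample in training_samples]
        # Assumption: population variance; the claims do not say which estimator is used.
        params.append((fmean(column), pvariance(column)))
    return params


def deviation_degrees(sample, params):
    """Distance of each actual value from its training mean, in multiples of the variance."""
    degrees = []
    for value, (mean, variance) in zip(sample, params):
        if variance == 0:
            # Assumption: a feature that was constant in training deviates either
            # not at all or without bound.
            degrees.append(0.0 if value == mean else float("inf"))
        else:
            degrees.append(abs(value - mean) / variance)
    return degrees


def interpretation_features(sample, params, feature_names, top_k=1):
    """Sort features by degree of deviation, descending, and keep the top_k."""
    degrees = deviation_degrees(sample, params)
    ranked = sorted(zip(feature_names, degrees), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical training set: two features per sample.
    training = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [2.0, 14.0]]
    params = fit_distribution_params(training)
    # A sample already flagged by some unsupervised anomaly detection model.
    print(interpretation_features([2.5, 20.0], params, ["amount", "logins"], top_k=1))
```

Because the ranking uses only per-feature statistics of the training set, this sketch works regardless of which unsupervised model produced the anomaly score; the trade-off is that it ignores interactions between features.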
PCT/CN2019/097171 2018-10-17 2019-07-23 Interpretation feature determination method and device for anomaly detection WO2020078059A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811208609.2A CN109583470A (en) 2018-10-17 2018-10-17 A kind of explanation feature of abnormality detection determines method and apparatus
CN201811208609.2 2018-10-17

Publications (1)

Publication Number Publication Date
WO2020078059A1 true WO2020078059A1 (en) 2020-04-23

Family

ID=65920123

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/097171 WO2020078059A1 (en) 2018-10-17 2019-07-23 Interpretation feature determination method and device for anomaly detection

Country Status (3)

Country Link
CN (1) CN109583470A (en)
TW (1) TWI723476B (en)
WO (1) WO2020078059A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767938A (en) * 2020-05-09 2020-10-13 北京奇艺世纪科技有限公司 Abnormal data detection method and device and electronic equipment
CN116304641A (en) * 2023-05-15 2023-06-23 山东省计算中心(国家超级计算济南中心) Anomaly detection interpretation method and system based on reference point search and feature interaction
CN116881724A (en) * 2023-09-07 2023-10-13 中国电子科技集团公司第十五研究所 Sample labeling method, device and equipment

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
CN109583470A (en) * 2018-10-17 2019-04-05 阿里巴巴集团控股有限公司 A kind of explanation feature of abnormality detection determines method and apparatus
CN112148763A (en) * 2019-06-28 2020-12-29 京东数字科技控股有限公司 Unsupervised data anomaly detection method and device and storage medium
CN111027607B (en) * 2019-11-29 2023-10-17 泰康保险集团股份有限公司 Unsupervised high-dimensional data feature importance assessment and selection method and device
CN111340102B (en) * 2020-02-24 2022-03-01 支付宝(杭州)信息技术有限公司 Method and apparatus for evaluating model interpretation tools
CN111262887B (en) * 2020-04-26 2020-08-28 腾讯科技(深圳)有限公司 Network risk detection method, device, equipment and medium based on object characteristics
CN116130095B (en) * 2023-04-04 2023-07-11 深圳市金瑞铭科技有限公司 State monitoring method and device based on sensing technology and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108038211A (en) * 2017-12-13 2018-05-15 南京大学 A kind of unsupervised relation data method for detecting abnormality based on context
CN108108743A (en) * 2016-11-24 2018-06-01 百度在线网络技术(北京)有限公司 Abnormal user recognition methods and the device for identifying abnormal user
CN108512827A (en) * 2018-02-09 2018-09-07 世纪龙信息网络有限责任公司 The identification of abnormal login and method for building up, the device of supervised learning model
US20180268005A1 (en) * 2015-11-24 2018-09-20 Huawei Technologies Co., Ltd. Data processing method and apparatus
CN109583470A (en) * 2018-10-17 2019-04-05 阿里巴巴集团控股有限公司 A kind of explanation feature of abnormality detection determines method and apparatus

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US9536361B2 (en) * 2012-03-14 2017-01-03 Autoconnect Holdings Llc Universal vehicle notification system
TW201831881A (en) * 2012-07-25 2018-09-01 美商提拉諾斯股份有限公司 Image analysis and measurement of biological samples
US20200333777A1 (en) * 2016-09-27 2020-10-22 Tokyo Electron Limited Abnormality detection method and abnormality detection apparatus

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20180268005A1 (en) * 2015-11-24 2018-09-20 Huawei Technologies Co., Ltd. Data processing method and apparatus
CN108108743A (en) * 2016-11-24 2018-06-01 百度在线网络技术(北京)有限公司 Abnormal user recognition methods and the device for identifying abnormal user
CN108038211A (en) * 2017-12-13 2018-05-15 南京大学 A kind of unsupervised relation data method for detecting abnormality based on context
CN108512827A (en) * 2018-02-09 2018-09-07 世纪龙信息网络有限责任公司 The identification of abnormal login and method for building up, the device of supervised learning model
CN109583470A (en) * 2018-10-17 2019-04-05 阿里巴巴集团控股有限公司 A kind of explanation feature of abnormality detection determines method and apparatus

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN111767938A (en) * 2020-05-09 2020-10-13 北京奇艺世纪科技有限公司 Abnormal data detection method and device and electronic equipment
CN111767938B (en) * 2020-05-09 2023-12-19 北京奇艺世纪科技有限公司 Abnormal data detection method and device and electronic equipment
CN116304641A (en) * 2023-05-15 2023-06-23 山东省计算中心(国家超级计算济南中心) Anomaly detection interpretation method and system based on reference point search and feature interaction
CN116304641B (en) * 2023-05-15 2023-09-15 山东省计算中心(国家超级计算济南中心) Anomaly detection interpretation method and system based on reference point search and feature interaction
CN116881724A (en) * 2023-09-07 2023-10-13 中国电子科技集团公司第十五研究所 Sample labeling method, device and equipment
CN116881724B (en) * 2023-09-07 2023-12-19 中国电子科技集团公司第十五研究所 Sample labeling method, device and equipment

Also Published As

Publication number Publication date
TWI723476B (en) 2021-04-01
CN109583470A (en) 2019-04-05
TW202044111A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
WO2020078059A1 (en) Interpretation feature determination method and device for anomaly detection
US11488055B2 (en) Training corpus refinement and incremental updating
US10210189B2 (en) Root cause analysis of performance problems
US11379845B2 (en) Method and device for identifying a risk merchant
US20100049686A1 (en) Methods and apparatus for visual recommendation based on user behavior
US10504028B1 (en) Techniques to use machine learning for risk management
US10311067B2 (en) Device and method for classifying and searching data
JP2020501232A (en) Risk control event automatic processing method and apparatus
US10754744B2 (en) Method of estimating program speed-up in highly parallel architectures using static analysis
Gupta et al. Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets
US11314616B2 (en) Machine learning models applied to interaction data for facilitating modifications to online environments
CN110633989A (en) Method and device for determining risk behavior generation model
US20190026108A1 (en) Recommendations based on the impact of code changes
US10354192B2 (en) Recommender system for exploratory data analysis
US9460393B2 (en) Inference of anomalous behavior of members of cohorts and associate actors related to the anomalous behavior based on divergent movement from the cohort context centroid
Panda et al. mgcpy: A comprehensive high dimensional independence testing python package
US9852371B2 (en) Using radial basis function networks and hyper-cubes for excursion classification in semi-conductor processing equipment
Saroha et al. Software effort estimation using enhanced use case point model
US20170004511A1 (en) Identifying Drivers for a Metric-of-Interest
Ribeiro et al. A method for assessing parameter impact on control-flow discovery algorithms
US20230161683A1 (en) Method and apparatus for detecting outliers in a set of runs of software applications
US20230195838A1 (en) Discovering distribution shifts in embeddings
US20170124177A1 (en) Method and system for locating underlying patterns in datasets using hierarchically structured categorical clustering
US11853751B2 (en) Indirect function call target identification in software
CN112363669B (en) Operation behavior determination method and device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19874149

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19874149

Country of ref document: EP

Kind code of ref document: A1