WO2021139249A1 - Data anomaly detection method, apparatus, device, and storage medium - Google Patents
Data anomaly detection method, apparatus, device, and storage medium
- Publication number
- WO2021139249A1 (application PCT/CN2020/118524)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- hypersphere
- classification model
- value
- distance
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24143—Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
Definitions
- This application relates to the technical field of data processing in artificial intelligence, and in particular to a data abnormality detection method, device, equipment and storage medium.
- AIOps intelligent operations
- the CPU and disks of the computer system will generate a large amount of index data, which will also contain some abnormal values.
- the cause of the system abnormality can be found, and suggestions can be provided for subsequent operations. Therefore, anomaly detection technology plays an important role in the field of intelligent operation.
- Traditional anomaly detection includes statistical-based methods and density-based methods.
- the proportion of abnormal data is relatively small, and it is more cumbersome to find potential abnormal points and their corresponding classifications from a large amount of data.
- Density-based methods belong to unsupervised learning, which can be completed without data labeling, but the detection accuracy is usually not high, and there is a lack of theoretical support for the classification results.
- a data abnormality detection method including:
- the primary abnormal data is identified and marked and stored in the marked first data set to form a second data set, and a pre-trained hypersphere classification model is trained through the second data set, wherein the hypersphere classification model is a classification model that can fit a hypersphere covering most sample values in a high-dimensional space to the currently labeled data, and identify abnormal data and normal data with the surface of the hypersphere as the boundary;
- the unlabeled data is input into the hypersphere classification model under the training termination condition for classification screening, so as to obtain target abnormal data.
- the training method of the hypersphere classification model includes:
- Different penalty coefficients are respectively set for abnormal data and normal data to generate a loss function, wherein each penalty coefficient is a constant within a preset range;
- the sphere center value representing the center position of the hypersphere and the radius value representing the distance between the sphere center value of the hypersphere and the surface of the hypersphere in the hypersphere classification model are calculated;
- a decision function for identifying normal values and abnormal values is generated according to the sphere center value and the radius value.
- inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data includes:
- the unmarked data is output and marked as target abnormal data.
- the query strategy screens the primary abnormal data based on the above-mentioned pre-trained hypersphere classification model, and the preset condition is that the weighted distance value from the surface of the hypersphere of the hypersphere classification model is the smallest.
- the extracting primary abnormal data from the unmarked data according to a preset query strategy includes:
- the nearest spherical distance and the nearest neighbor sample distance are normalized, and weighted by a preset coefficient to obtain a weighted distance value of each of the unmarked data.
- the method for normalizing the closest spherical distance includes:
- the difference between the closest spherical distance of each unmarked data and the first minimum value is divided by the first maximum value to obtain the normalized closest spherical distance corresponding to all the unmarked data.
- the method for normalizing the distance of the nearest neighbor sample includes:
- the difference between each of the unlabeled data and the second minimum is calculated separately, and these differences are divided by the second maximum to obtain the normalized nearest neighbor sample distance of all unlabeled data.
- a data abnormality detection device including:
- Acquisition module configured to execute acquisition of unmarked data
- Query module configured to extract primary abnormal data from the unmarked data according to a preset query strategy, where the primary abnormal data is the data among the unmarked data, filtered through the query strategy, that meets a preset condition
- Training module configured to identify and mark the primary abnormal data, store it in the marked first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, where the hypersphere classification model is a classification model that can fit a hypersphere covering most sample values in a high-dimensional space to the currently labeled data, and identify abnormal data and normal data with the surface of the hypersphere as the boundary;
- Recognition module configured to perform recognition of whether the hypersphere classification model meets the training termination condition
- Result output module configured to execute when the training termination condition is reached, input the unlabeled data into the hypersphere classification model under the training termination condition for classification screening, so as to obtain target abnormal data.
- a computer device including a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the processor, when executing the computer program, implements the steps of the data anomaly detection method:
- the primary abnormal data is identified and marked and stored in the marked first data set to form a second data set, and a pre-trained hypersphere classification model is trained through the second data set, wherein the hypersphere classification model is a classification model that can fit a hypersphere covering most sample values in a high-dimensional space to the currently labeled data, and identify abnormal data and normal data with the surface of the hypersphere as the boundary;
- the unlabeled data is input into the hypersphere classification model under the training termination condition for classification screening, so as to obtain target abnormal data.
- a computer-readable storage medium on which a computer program is stored, and the program is executed by a processor to implement the steps of the data abnormality detection method:
- the primary abnormal data is identified and marked and stored in the marked first data set to form a second data set, and a pre-trained hypersphere classification model is trained through the second data set, wherein the hypersphere classification model is a classification model that can fit a hypersphere covering most sample values in a high-dimensional space to the currently labeled data, and identify abnormal data and normal data with the surface of the hypersphere as the boundary;
- the unlabeled data is input into the hypersphere classification model under the training termination condition for classification screening, so as to obtain target abnormal data.
- the data anomaly detection method provided in the embodiment of the application uses a small amount of labeled data to train the hypersphere classification model; after the training termination condition is reached, the hypersphere classification model is used to classify the unlabeled data, and otherwise the updated labeled data is used to continue training the hypersphere classification model. This method combines unsupervised and supervised methods.
- the hypersphere classification model trained with a small amount of labeled data has no restrictions on the original distribution of the data, and has a wider range of use.
- the query strategy based on boundary distance and sample density can be used.
- FIG. 1 shows a flowchart of a data abnormality detection method according to an embodiment of the present application
- FIG. 2 shows a flowchart of steps involved in extracting primary abnormal data from the unmarked data according to a preset query strategy according to an embodiment of the present application
- FIG. 3 shows a flowchart of a method for normalizing the distance to the closest spherical surface according to an embodiment of the present application
- FIG. 4 shows a flowchart of a method for normalizing the distance of the nearest neighbor sample according to an embodiment of the present application
- FIG. 5 shows a flowchart of a training method of a hypersphere classification model according to an embodiment of the present application
- FIG. 6 shows a flowchart of steps involved in inputting the unlabeled data into the hypersphere classification model under training termination conditions for classification screening to obtain target abnormal data according to an embodiment of the present application
- FIG. 7 shows a structural block diagram of a data anomaly detection device according to an embodiment of the present application.
- FIG. 8 shows a schematic diagram of the hardware architecture of a computer device according to an embodiment of the present application.
- FIG. 9 shows a flowchart of a data abnormality detection method according to another embodiment of the present application.
- an embodiment of the present application provides a data abnormality detection method, including:
- the data generated by the computer system is often unbalanced, and most of the data is normal data. Therefore, the abnormal data detection in the operation process can be regarded as a single classification problem.
- the trained classification model needs to have the ability to distinguish whether the high-dimensional space data is normal or not.
- the data generated by the computer system is divided into marked data and unmarked data, the marked data is divided into marked data sets, and the unmarked data is divided into unmarked data sets.
- the classification model can also be called a classifier.
- monitoring index data of the computer system in a certain embodiment is shown in Table 1:
- the query strategy screens primary abnormal data based on the above-mentioned pre-trained hypersphere classification model, and the preset condition is that the weighted distance value from the surface of the hypersphere of the hypersphere classification model is the smallest.
- step S2 includes:
- the surface of the hypersphere in the classification model is the key to distinguishing whether the index data is normal, and it is also the most uncertain area in the high-dimensional space. Therefore, the distance from the data x to the surface of the hypersphere is used as the measurement standard, recorded as the closest spherical distance |f(x)|.
- the distance between a datum and its nearest datum is selected to measure the distribution density, recorded as the nearest neighbor sample distance d(x, NN_1(x)).
- the greater the distribution density the smaller the nearest neighbor sample distance. Therefore, if the distance between the two points and the boundary is the same, the sample with the highest density nearby (that is, the nearest neighbor sample distance is the smallest) is preferred.
- the query strategy selects the data with the smallest weighted distance each time.
- the data with the smallest weighted distance is the most representative data, that is, the primary abnormal data.
- the method of normalizing the closest spherical distance includes:
- the method for normalizing the distance of the nearest neighbor sample includes:
- the normalized nearest spherical distance and normalized nearest neighbor sample distance of each unlabeled data are weighted with a coefficient of 0.5 respectively, and the corresponding weighted distance can be obtained as:
- the hypersphere classification model is a classification model that can fit a hypersphere covering most sample values in a high-dimensional space for the currently labeled data, and identify abnormal data and normal data with the surface of the hypersphere as a boundary.
- Identify the primary abnormal data as normal data or abnormal data through a computer or manually, mark the primary abnormal data according to the recognition result, and use the marked primary abnormal data as new marked data.
- the judgment label for the most representative unmarked data is received to obtain new marked data.
- the unlabeled data x107048 with the smallest weighted distance is selected and handed over to the AI operator for judgment and labeling, and new labeled data is obtained.
- the most representative unlabeled data can be judged manually or by a computer to be normal or abnormal and marked accordingly; once marked, it becomes the new marked data.
- the new marked data is added to the marked data set to obtain an updated marked data set.
- the secondary abnormal data obtained by identifying and marking the primary abnormal data is stored in the marked first data set to form a second data set.
- the first marked data set is the marked data set.
- the second data set is an updated marked data set obtained after storing the primary abnormal data in the marked data set.
- the pre-trained hypersphere classification model is trained through the second data set.
- the hypersphere classification model is a classification model that can fit a hypersphere covering most sample values in a high-dimensional space to the currently labeled data, and identify abnormal data and normal data with the surface of the hypersphere as a boundary.
- the training method of the hypersphere classification model includes:
- S33 Generate a decision function for identifying normal values and abnormal values according to the sphere center value and the radius value.
- using the labeled data in the labeled data set to train the hypersphere classification model includes: for the labeled data in the current labeled data set, fitting a hypersphere model in a high-dimensional space.
- the hypersphere model contains a number of marked data, and that number meets a preset condition, which can be set according to actual needs: for example, the number of marked data located on the surface of or inside the hypersphere model is the largest, or the proportion of marked data located on the surface of or inside the hypersphere model reaches a preset threshold. The center and radius of the hypersphere model are then determined, giving a hypersphere classification model that classifies data with the surface of the hypersphere model as the interface: data on the surface of or inside the hypersphere are normal data, and data outside it are abnormal data.
- the constructed loss function is as follows:
- a is the center of the sphere
- R is the radius of the hypersphere model
- ξ_i, ξ_j are slack variables
- xi, xj are labeled data
- Lin is the labeled normal data set
- i and j are numbers used to label different data
- Lout is the marked abnormal data set
- the penalty coefficients C1 and C2 are constants, ranging from 0 to 1.
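The loss function itself did not survive extraction; only its symbol definitions remain. Those definitions match the standard support vector data description (SVDD) objective with labeled negative examples. Assuming that is the intended form (this is an assumption, not text recovered from the application, and ξ is an assumed symbol for the slack variables), it would read:

```latex
\min_{R,\,a,\,\xi}\; R^2 + C_1 \sum_{i \in L_{in}} \xi_i + C_2 \sum_{j \in L_{out}} \xi_j
\quad \text{s.t.} \quad
\|x_i - a\|^2 \le R^2 + \xi_i, \qquad
\|x_j - a\|^2 \ge R^2 - \xi_j, \qquad
\xi_i \ge 0,\; \xi_j \ge 0 .
```

Under this reading, C1 penalizes labeled normal points that fall outside the hypersphere and C2 penalizes labeled abnormal points that fall inside it.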
- the above problem is a non-convex optimization problem, and the Lagrangian multiplier method cannot find the global optimal solution.
- the constraint conditions including slack variables are expressed in the form of a risk function, so that the problem expressed by the above formula is transformed into an unconstrained optimization problem, as follows:
- ⁇ i l(R 2 -
- ⁇ j l(
- ⁇ (x) is the transformation mapping function, which is used to map the original data x to the new feature space after feature transformation;
- the function of the constant ⁇ is to constrain the value of t, so that the hazard function l(t) can be second-order derivation within a small range, and the value of the initial hazard function is small.
- ⁇ 0.5
- ⁇ 0.5
- (x_i · x_j) represents the inner product of the i-th and j-th sample vectors.
- the constrained problem solving process usually uses the Lagrangian method to obtain the optimal solution.
- step S5 includes:
- the query strategy screens primary abnormal data based on the above-mentioned pre-trained hypersphere classification model, and the preset condition is that the weighted distance value from the surface of the hypersphere of the hypersphere classification model is the smallest.
- the updated labeled data set is used to retrain the classification model, and this cycle is repeated.
- the center position, radius and decision function of the hypersphere classification model will be adjusted accordingly.
- the change of the classification model after each iteration is less than the preset threshold, and the training termination condition is reached.
- the trained classification model is obtained, and the final decision function is obtained.
- the decision function f(x) = ||x - a|| - R represents the difference between the distance from the data x to the center a and the radius R.
- the unlabeled data x_i is substituted into the decision function f(x) and the sign of the result is checked. If f(x) ≤ 0, the unlabeled system index data is considered normal; if f(x) > 0, the corresponding system index data is considered abnormal.
- when performing classification, calculate the distance between the unlabeled data x_i and the center of the hypersphere classification model and check whether it is greater than the radius of the hypersphere classification model; if the distance is less than or equal to the radius, the unlabeled data x_i is normal data, and if the distance is greater than the radius, x_i is abnormal.
- -R, f(x) -49.09 ⁇ 0, then the data xi is in the hypersphere classification model, so xi can be considered as normal data.
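The classification rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance in the original feature space (no feature mapping); the function names, the center, the radius, and the sample points are all made up for the example and do not come from the application.

```python
import numpy as np

def decision_function(points, center, radius):
    """f(x) = ||x - a|| - R: distance to the center minus the radius."""
    points = np.asarray(points, dtype=float)
    return np.linalg.norm(points - center, axis=-1) - radius

def is_abnormal(points, center, radius):
    """True where the point lies outside the hypersphere (f(x) > 0)."""
    return decision_function(points, center, radius) > 0

# Hypothetical model parameters and data, for illustration only.
center = np.array([0.0, 0.0])
radius = 5.0
points = np.array([[3.0, 3.0], [1.0, 1.0], [6.0, 8.0]])

scores = decision_function(points, center, radius)
labels = is_abnormal(points, center, radius)
```

The first two points lie inside the sphere and are treated as normal; the third is 10 units from the center, so its score is 5 and it is flagged as abnormal.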
- AI operators can continue to perform root cause analysis, etc., find out the cause of the system abnormality, and give repair suggestions.
- another embodiment of the present application further provides a data abnormality detection device, including:
- Obtaining module 100 configured to perform acquisition of unmarked data
- Query module 200 configured to extract primary abnormal data from the unmarked data according to a preset query strategy, where the primary abnormal data is the data among the unmarked data, filtered through the query strategy, that meets the preset condition
- the training module 300 is configured to identify and mark the primary abnormal data, store it in the marked first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, where the hypersphere classification model is a classification model capable of fitting a hypersphere covering most sample values in a high-dimensional space to the currently labeled data, and identifying abnormal data and normal data with the surface of the hypersphere as the boundary.
- Recognition module 400 configured to perform recognition of whether the hypersphere classification model meets the training termination condition
- the result output module 500 is configured to execute when the training termination condition is reached, input the unlabeled data into the hypersphere classification model under the training termination condition for classification screening, so as to obtain target abnormal data.
- a computer device 600 which includes a memory 601, a processor 602, and a computer program stored on the memory 601 and running on the processor 602,
- the processor 602 implements the aforementioned data abnormality detection method when the computer program is executed.
- the computer device is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
- the computer device 600 at least includes, but is not limited to, a memory 601, a processor 602, and a network interface 603 that can communicate with each other through a device bus, wherein:
- the memory 601 includes at least one type of computer-readable storage medium.
- the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory ( RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
- the memory 601 may be an internal storage unit of the computer device 600, such as a hard disk or memory of the computer device 600.
- the memory 601 may also be an external storage device of the computer device 600, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 600.
- the memory 601 may also include both the internal storage unit of the computer device 600 and its external storage device.
- the memory 601 is generally used to store operating devices and various application software installed in the computer equipment 600, such as the program code of the abnormal medical insurance group identification device 500.
- the memory 601 can also be used to temporarily store various types of data that have been output or will be output.
- the processor 602 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
- the processor 602 is generally used to control the overall operation of the computer device 600.
- the processor 602 is configured to run the program code or process data stored in the memory 601, such as the data anomaly detection device 500, to implement the data anomaly detection method in each of the foregoing embodiments.
- the network interface 603 may include a wireless network interface or a wired network interface.
- the network interface 603 is generally used to establish a communication connection between the computer device 600 and other electronic devices.
- the network interface 603 is used to connect the computer device 600 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 600 and the external terminal.
- the network may be an intranet, the Internet, a global system for mobile communications (GSM) network, a wideband code division multiple access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
- FIG. 8 only shows a computer device 600 with components 601-603, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
- the abnormal medical insurance group identification device 500 stored in the memory 601 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 601 and configured by One or more processors (processor 602 in this embodiment) are executed to complete the data abnormality detection method of the present application.
- This embodiment also provides a computer-readable storage medium.
- the above-mentioned storage medium may be a non-volatile storage medium or a volatile storage medium.
- Examples include flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, servers, app stores, etc., on which computer programs are stored; the corresponding functions are realized when the programs are executed by a processor.
- the computer-readable storage medium of this embodiment is used to store the abnormal medical insurance group identification device 500 to implement the data abnormality detection method of the present application when it is executed by a processor.
- Another embodiment of the present application also discloses a computer-readable storage medium on which a computer program is stored, and the program is executed by a processor to implement the above-mentioned data abnormality detection method.
- another embodiment of the present application provides a data abnormality detection method, including:
- step S60 Identify whether the hypersphere classification model meets the training termination condition; if the training termination condition is met, go to step S70; if the training termination condition is not met, go to step S20.
- S70 Use the hypersphere classification model to classify unlabeled data, and input the unlabeled data into the hypersphere classification model under training termination conditions for classification and screening, so as to obtain target abnormal data.
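The control flow of this embodiment (query, label, retrain, check for termination, then classify) can be sketched as below. This is only an outline under loud assumptions: `fit_hypersphere` is a simplified stand-in (mean of the labeled normal points as center, largest normal-point distance as radius) rather than the loss-function optimization the application describes; the query step uses the distance to the surface alone, without the density term; `oracle` is a hypothetical callback standing in for the operator; and `tol` and `max_rounds` are illustrative parameters.

```python
import numpy as np

def fit_hypersphere(points, abnormal):
    """Stand-in fit (assumption): enclose all labeled normal points."""
    normal = points[~abnormal]
    center = normal.mean(axis=0)
    radius = np.linalg.norm(normal - center, axis=1).max()
    return center, radius

def active_learning_loop(labeled, abnormal, unlabeled, oracle,
                         tol=1e-3, max_rounds=50):
    labeled = np.asarray(labeled, dtype=float)
    abnormal = np.asarray(abnormal, dtype=bool)
    unlabeled = np.asarray(unlabeled, dtype=float)
    center, radius = fit_hypersphere(labeled, abnormal)
    pool = unlabeled.copy()
    for _ in range(max_rounds):
        if len(pool) == 0:
            break
        # Query step: pick the pool point closest to the sphere surface.
        surface_dist = np.abs(np.linalg.norm(pool - center, axis=1) - radius)
        idx = int(np.argmin(surface_dist))
        x = pool[idx]
        # Labeling step: ask the oracle, then grow the labeled set.
        labeled = np.vstack([labeled, x])
        abnormal = np.append(abnormal, bool(oracle(x)))
        pool = np.delete(pool, idx, axis=0)
        # Retrain and measure how much the model moved.
        new_center, new_radius = fit_hypersphere(labeled, abnormal)
        change = np.linalg.norm(new_center - center) + abs(new_radius - radius)
        center, radius = new_center, new_radius
        # S60: stop once the model change falls below the threshold.
        if change < tol:
            break
    # S70: classify every unlabeled point with the final model.
    result = np.linalg.norm(unlabeled - center, axis=1) - radius > 0
    return center, radius, result

# Hypothetical data: five labeled normal points and four unlabeled points.
labeled = [[0, 0], [1, 0], [-1, 0], [0, 1], [0, -1]]
abnormal = [False] * 5
unlabeled = [[0.5, 0.5], [5, 5], [0.9, 0.0], [-4, 4]]
oracle = lambda x: np.linalg.norm(x) > 2  # toy ground truth
center, radius, result = active_learning_loop(labeled, abnormal, unlabeled, oracle)
```

In this toy run the loop stops after three queries, because labeling an abnormal point leaves the stand-in model unchanged, which is one way the "change below threshold" condition can trigger.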
- the data anomaly detection method provided by the embodiments of this application starts from the data and combines unsupervised and supervised learning. It uses a small amount of labeled data to build a hypersphere classification model that places no restriction on the original distribution of the data, so it applies more broadly. The query strategy based on boundary distance and sample density finds the most valuable data more accurately and reduces the impact of noise, greatly reducing the amount of data operators need to label. This both preserves the accuracy of data classification and saves the cost of artificial intelligence operations, making the method better suited to real industry scenarios and easier to deploy at scale.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
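A data anomaly detection method, apparatus, device, and storage medium, relating to the field of big data. The method includes: acquiring unlabeled data (S1); extracting primary abnormal data from the unlabeled data according to a preset query strategy (S2); identifying and labeling the primary abnormal data, storing it in a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set (S3); identifying whether the hypersphere classification model meets a training termination condition (S4); and, when the training termination condition is met, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening to obtain target abnormal data (S5). The method trains the classification model with a small amount of labeled data and, once the training termination condition is met, uses the model to classify unlabeled data. It places no restriction on the original distribution of the data, reduces the amount of data operators need to label, and yields highly accurate classification results.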
Description
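This application claims priority to Chinese patent application No. 202010468770.4, filed with the China Patent Office on May 28, 2020 and entitled "Data anomaly detection method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
This application relates to the technical field of data processing in artificial intelligence, and in particular to a data anomaly detection method, apparatus, device, and storage medium.
Monitoring computer systems is an important part of intelligent operations (AIOps). During monitoring, the CPU, disks, and other components of a computer system generate large amounts of metric data, which also contain some abnormal values. Analyzing the abnormal points makes it possible to find the cause of a system anomaly and to provide suggestions for subsequent operations. Anomaly detection therefore plays a major role in the field of intelligent operations.
Traditional anomaly detection includes statistics-based methods and density-based methods.
Statistics-based methods usually train on a large amount of labeled data to find suspected abnormal points, and belong to supervised learning. Experience shows that supervised learning has the following problems in practical anomaly detection:
1. Most of the massive data generated while programs run is unlabeled, and labeling usually has to be done by specialists, so obtaining enough data labels costs considerable manpower, material resources, and money.
2. Abnormal data makes up a small proportion of the data, and finding potential abnormal points and their corresponding classes in a large amount of data is cumbersome.
Density-based methods belong to unsupervised learning and need no data labeling, but their detection accuracy is usually low and the classification results lack theoretical support.
The purpose of this application is to provide a data anomaly detection method, apparatus, device, and storage medium. A brief summary is given below to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.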
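According to one aspect of the embodiments of this application, a data anomaly detection method is provided, including:
acquiring unlabeled data;
extracting primary abnormal data from the unlabeled data according to a preset query strategy, where the primary abnormal data is the data among the unlabeled data, screened by the query strategy, that meets a preset condition;
identifying and labeling the primary abnormal data, storing it in a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set, where the hypersphere classification model is a classification model that can fit, for the currently labeled data, a hypersphere covering most sample values in a high-dimensional space, and that identifies abnormal data and normal data with the hypersphere surface as the boundary;
identifying whether the hypersphere classification model meets a training termination condition;
when the training termination condition is met, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening, to obtain target abnormal data.
Further, the training method of the hypersphere classification model includes:
setting different penalty coefficients for abnormal data and normal data respectively to generate a loss function, where each penalty coefficient is a constant within a preset range;
after setting constraint conditions, calculating the sphere center value representing the center position of the hypersphere and the radius value representing the distance between the sphere center and the hypersphere surface in the hypersphere classification model;
generating, from the sphere center value and the radius value, a decision function for identifying normal values and abnormal values.
Further, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening when the training termination condition is met, to obtain target abnormal data, includes:
substituting each unlabeled datum into the decision function to generate a decision result value;
judging whether the decision result value is greater than or equal to zero;
when it is greater than or equal to zero, outputting the unlabeled datum and marking it as target abnormal data.
Further, the query strategy screens primary abnormal data based on the pre-trained hypersphere classification model described above, and the preset condition is that the weighted distance value to the hypersphere surface of the hypersphere classification model is the smallest.
Further, extracting primary abnormal data from the unlabeled data according to the preset query strategy includes:
substituting the unlabeled data into the decision function and taking the absolute value to obtain the closest spherical distance;
computing the distances between the unlabeled data and taking the minimum as the nearest-neighbor sample distance;
normalizing the closest spherical distance and the nearest-neighbor sample distance, and weighting them by preset coefficients to obtain the weighted distance value of each unlabeled datum.
Further, the method for normalizing the closest spherical distance includes:
selecting, from the closest spherical distances of all the unlabeled data, the smallest value as the first minimum and the largest value as the first maximum;
dividing the difference between each unlabeled datum's closest spherical distance and the first minimum by the first maximum, to obtain the normalized closest spherical distance of every unlabeled datum.
Further, the method for normalizing the nearest-neighbor sample distance includes:
selecting, from the nearest-neighbor sample distances of all the unlabeled data, the smallest value as the second minimum and the largest value as the second maximum;
computing the difference between each unlabeled datum and the second minimum, then dividing these differences by the second maximum, to obtain the normalized nearest-neighbor sample distance of every unlabeled datum.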
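According to another aspect of the embodiments of this application, a data anomaly detection apparatus is provided, including:
an acquisition module, configured to acquire unlabeled data;
a query module, configured to extract primary abnormal data from the unlabeled data according to a preset query strategy, where the primary abnormal data is the data among the unlabeled data, screened by the query strategy, that meets a preset condition;
a training module, configured to identify and label the primary abnormal data, store it in a labeled first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, where the hypersphere classification model is a classification model that can fit, for the currently labeled data, a hypersphere covering most sample values in a high-dimensional space, and that identifies abnormal data and normal data with the hypersphere surface as the boundary;
an identification module, configured to identify whether the hypersphere classification model meets a training termination condition;
a result output module, configured to, when the training termination condition is met, input the unlabeled data into the hypersphere classification model under the training termination condition for classification screening, to obtain target abnormal data.
According to another aspect of the embodiments of this application, a computer device is provided, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, where the processor, when executing the computer program, implements the steps of the data anomaly detection method:
acquiring unlabeled data;
extracting primary abnormal data from the unlabeled data according to a preset query strategy, where the primary abnormal data is the data among the unlabeled data, screened by the query strategy, that meets a preset condition;
identifying and labeling the primary abnormal data, storing it in a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set, where the hypersphere classification model is a classification model that can fit, for the currently labeled data, a hypersphere covering most sample values in a high-dimensional space, and that identifies abnormal data and normal data with the hypersphere surface as the boundary;
identifying whether the hypersphere classification model meets a training termination condition;
when the training termination condition is met, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening, to obtain target abnormal data.
According to another aspect of the embodiments of this application, a computer-readable storage medium is provided, on which a computer program is stored, the program being executed by a processor to implement the steps of the data anomaly detection method:
acquiring unlabeled data;
extracting primary abnormal data from the unlabeled data according to a preset query strategy, where the primary abnormal data is the data among the unlabeled data, screened by the query strategy, that meets a preset condition;
identifying and labeling the primary abnormal data, storing it in a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set, where the hypersphere classification model is a classification model that can fit, for the currently labeled data, a hypersphere covering most sample values in a high-dimensional space, and that identifies abnormal data and normal data with the hypersphere surface as the boundary;
identifying whether the hypersphere classification model meets a training termination condition;
when the training termination condition is met, inputting the unlabeled data into the hypersphere classification model under the training termination condition for classification screening, to obtain target abnormal data.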
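In the data anomaly detection method provided by the embodiments of this application, a hypersphere classification model is trained with a small amount of labeled data; once the training termination condition is reached, the model is used to classify the unlabeled data, and otherwise training continues with the updated labeled data. The method combines unsupervised and supervised approaches. A hypersphere classification model trained on a small amount of labeled data places no restriction on the original distribution of the data, so it applies more widely. The query strategy, based on boundary distance and sample density, can find the most valuable data fairly accurately while reducing the influence of noise, which greatly reduces the amount of data that operations staff need to label. This preserves the classification accuracy of the hypersphere classification model while lowering the cost of AI operations, making the method better suited to real industrial scenarios and easier to deploy at scale.
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are only some of the embodiments recorded in this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data anomaly detection method according to an embodiment of this application;
Fig. 2 is a flowchart of the steps included in extracting primary abnormal data from the unlabeled data according to a preset query strategy, according to an embodiment of this application;
Fig. 3 is a flowchart of a method of normalizing the nearest sphere-surface distance according to an embodiment of this application;
Fig. 4 is a flowchart of a method of normalizing the nearest-neighbor sample distance according to an embodiment of this application;
Fig. 5 is a flowchart of a training method of the hypersphere classification model according to an embodiment of this application;
Fig. 6 is a flowchart of the steps included in inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening to obtain target abnormal data, according to an embodiment of this application;
Fig. 7 is a structural block diagram of a data anomaly detection apparatus according to an embodiment of this application;
Fig. 8 is a schematic diagram of the hardware architecture of a computer device according to an embodiment of this application;
Fig. 9 is a flowchart of a data anomaly detection method according to another embodiment of this application.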
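As shown in Fig. 1, an embodiment of this application provides a data anomaly detection method, including:
S1. Acquire unlabeled data.
In actual intelligent operations, the data produced by a computer system is usually imbalanced: the vast majority is normal data, so detecting abnormal data during operations can be treated as a one-class classification problem. Because the monitoring-indicator data of a computer system is distributed in a high-dimensional space, the trained classification model must be able to tell whether data in a high-dimensional space is normal. The data produced by the computer system is divided into labeled data and unlabeled data; the labeled data is placed in a labeled data set and the unlabeled data in an unlabeled data set. The classification model may also be called a classifier.
For example, the monitoring-indicator data of a computer system in one implementation is shown in Table 1:
Table 1: System monitoring-indicator data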
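S2. Extract primary abnormal data from the unlabeled data according to a preset query strategy, wherein the primary abnormal data is the data among the unlabeled data that is selected by the query strategy and meets a preset condition.
Operations staff have limited time and the unlabeled data is plentiful, so it cannot all be labeled item by item. A query strategy is therefore used to decide which unlabeled data is handed to operations staff for labeling.
The query strategy selects the primary abnormal data on the basis of the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value to the hypersphere surface of the hypersphere classification model is the smallest.
In some implementations, as shown in Fig. 2, step S2 includes:
S21. Substitute the unlabeled data into the decision function and take the absolute value to obtain the nearest sphere-surface distance;
S22. Calculate the distances between the unlabeled data and take the minimum as the nearest-neighbor sample distance;
S23. Normalize the nearest sphere-surface distance and the nearest-neighbor sample distance, and weight them with preset coefficients to obtain the weighted distance value of each unlabeled data item.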
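The surface of the hypersphere in the classification model is the key to deciding whether indicator data is normal, and it is also the most uncertain region of the high-dimensional space. The distance from a data point $x$ to the hypersphere surface is therefore used as a measure, denoted the nearest sphere-surface distance $|f(x)|$.
In addition, the more concentrated the data distribution is in the region the hypersphere surface passes through, the more representative that data is. The distance between a data point and its single nearest data point is therefore used to measure distribution density, denoted the nearest-neighbor sample distance $d(x, NN_1(x))$. The greater the distribution density, the smaller the nearest-neighbor sample distance. Consequently, when two points are equally far from the boundary, the sample in the denser neighborhood (that is, with the smaller nearest-neighbor sample distance) is preferred.
The query strategy therefore selects the data with the smallest weighted distance each time. The data with the smallest weighted distance is the most representative data, that is, the primary abnormal data.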
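As shown in Fig. 3, the method of normalizing the nearest sphere-surface distance includes:
S231. Select, from the nearest sphere-surface distances of all the unlabeled data, the smallest value as a first minimum and the largest value as a first maximum;
S232. Divide the difference between each unlabeled data item's nearest sphere-surface distance and the first minimum by the first maximum, to obtain the normalized nearest sphere-surface distance of every unlabeled data item.
In practice, to compute the normalized nearest sphere-surface distances of all unlabeled data, each unlabeled data item is first substituted into the decision function $f(x)=\|x-a\|-R$ and the absolute value is taken, giving its nearest sphere-surface distance $|f(x)|$. The minimum and maximum over all $|f(x)|$ are then taken and denoted $\min_{x\in U}|f(x)|=|f(x_1)|$ and $\max_{x\in U}|f(x)|=|f(x_2)|$, respectively.
Here $U$ denotes the unlabeled data set; $|f(x)|$ attains its minimum at $x=x_1$ and its maximum at $x=x_2$. The decision function $f(x)=\|x-a\|-R$ is the difference between the distance from data point $x$ to the sphere center $a$ and the radius $R$. The distance between a data point and the sphere center of the classification model may be called that data point's distance to the sphere center.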
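Steps S231 and S232 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the toy unlabeled set `U`, and the sphere parameters are hypothetical, and the Euclidean norm is used for $\|x-a\|$.

```python
import numpy as np

def decision_function(X, center, radius):
    """f(x) = ||x - a|| - R for every row x of X."""
    return np.linalg.norm(X - center, axis=1) - radius

def normalized_sphere_distance(X, center, radius):
    """Steps S231-S232: subtract the minimum of |f(x)|, then divide by the maximum."""
    dist = np.abs(decision_function(X, center, radius))
    return (dist - dist.min()) / dist.max()

# Toy unlabeled set U (hypothetical values) and a sphere with a = (0, 0), R = 2.
U = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 5.0]])
norm_sphere = normalized_sphere_distance(U, center=np.zeros(2), radius=2.0)
print(norm_sphere)  # |f(x)| = [2, 1, 3], so the result is [1/3, 0, 2/3]
```

Note that the text divides by the maximum itself rather than by the range (maximum minus minimum), so the largest normalized value is below 1 whenever the minimum is positive. The sketch does not guard against a maximum of zero, which would occur only if every point lay exactly on the surface.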
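As shown in Fig. 4, the method of normalizing the nearest-neighbor sample distance includes:
S231'. Select, from the nearest-neighbor sample distances of all unlabeled data, the smallest value as a second minimum and the largest value as a second maximum;
S232'. Calculate the difference between each unlabeled data item's nearest-neighbor sample distance and the second minimum, and divide each difference by the second maximum, to obtain the normalized nearest-neighbor sample distances of all unlabeled data.
Specifically, to compute the normalized nearest-neighbor sample distances of all unlabeled data, for each data point $x$ the distances from $x$ to all other data points are computed and the smallest is taken as the nearest-neighbor sample distance; with the point nearest to $x$ found, this distance is denoted $d(x, NN_1(x))$. The minimum and maximum over the nearest-neighbor sample distances of all data are then taken and denoted $\min_{x\in U} d(x,NN_1(x)) = d(x_3,NN_1(x_3))$ and $\max_{x\in U} d(x,NN_1(x)) = d(x_4,NN_1(x_4))$, respectively.
Here $U$ denotes the unlabeled data set; $d(x,NN_1(x))$ attains its minimum at $x=x_3$ and its maximum at $x=x_4$.
Normalization is then performed: the minimum of all nearest-neighbor sample distances is subtracted from each data point's nearest-neighbor sample distance, and the difference is divided by the maximum of all nearest-neighbor sample distances, giving the normalized nearest-neighbor sample distance of every data point, $\frac{d(x,NN_1(x)) - d(x_3,NN_1(x_3))}{d(x_4,NN_1(x_4))}$.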
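The nearest-neighbor computation and its normalization can be illustrated as follows. This is a minimal sketch, assuming Euclidean distance and a brute-force pairwise comparison; the function names and the toy data are hypothetical.

```python
import numpy as np

def nearest_neighbor_distance(X):
    """d(x, NN_1(x)): distance from each row of X to its closest other row."""
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(pairwise, np.inf)  # a point is not its own neighbor
    return pairwise.min(axis=1)

def normalized_nn_distance(X):
    """Steps S231'-S232': subtract the minimum, then divide by the maximum."""
    d = nearest_neighbor_distance(X)
    return (d - d.min()) / d.max()

# Three hypothetical points on a line: neighbor distances are [1, 1, 3].
U = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
norm_nn = normalized_nn_distance(U)
print(norm_nn)  # [0, 0, 2/3]
```

The pairwise matrix needs memory quadratic in the number of unlabeled items, so a spatial index would be a natural substitute for large monitoring data sets; that choice is outside what the text specifies.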
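The normalized nearest sphere-surface distance and the normalized nearest-neighbor sample distance of each unlabeled data item are each weighted with a coefficient of 0.5, giving the corresponding weighted distance:
$$0.5\cdot\frac{|f(x)|-|f(x_1)|}{|f(x_2)|}+0.5\cdot\frac{d(x,NN_1(x))-d(x_3,NN_1(x_3))}{d(x_4,NN_1(x_4))}$$
The weighted distances of all data are sorted in ascending order, and the first five are as follows:
Table 2: The five smallest weighted distances among the unlabeled data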
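The weighting and ranking step can be sketched as below. The function names and input values are hypothetical; the two inputs are assumed to be the already-normalized distances, and both coefficients default to the 0.5 used in this embodiment.

```python
import numpy as np

def weighted_distance(norm_sphere, norm_nn, w_sphere=0.5, w_nn=0.5):
    """Weighted sum of the two normalized distances for each unlabeled item."""
    norm_sphere = np.asarray(norm_sphere, dtype=float)
    norm_nn = np.asarray(norm_nn, dtype=float)
    return w_sphere * norm_sphere + w_nn * norm_nn

def select_queries(norm_sphere, norm_nn, k=1):
    """Return the indices of the k smallest weighted distances and all scores."""
    scores = weighted_distance(norm_sphere, norm_nn)
    order = np.argsort(scores, kind="stable")  # ascending, ties keep input order
    return order[:k], scores

picked, scores = select_queries([0.2, 0.0, 0.6], [0.0, 0.4, 0.6], k=2)
print(picked, scores)  # indices [0 1], scores [0.1 0.2 0.6]
```

With `k=1` this matches the rule of querying the single item with the smallest weighted distance; a larger `k` corresponds to listing the first few candidates, as Table 2 does.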
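S3. Identify and label the primary abnormal data, store it in a labeled first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently labeled data, and that separates abnormal data from normal data at the hypersphere surface.
The primary abnormal data is identified and labeled to obtain newly labeled data.
The primary abnormal data is identified as normal or abnormal by a computer or by a person and labeled according to the result; the labeled primary abnormal data becomes the newly labeled data. In other words, a judgment label is received for the most representative unlabeled data, yielding new labeled data.
Thus, following the rule of the query strategy, the unlabeled data item with the smallest weighted distance, x107048, is selected and handed to the AI operations staff for judgment and labeling, yielding new labeled data.
Whether the most representative unlabeled data item is normal or abnormal may be judged manually or by a computer; once labeled, it has received its judgment label and becomes new labeled data.
The newly labeled data is added to the labeled data set to obtain an updated labeled data set.
The secondary abnormal data obtained by identifying and labeling the primary abnormal data is stored in the labeled first data set to form the second data set. The labeled first data set is the labeled data set. The second data set is the updated labeled data set obtained after the primary abnormal data has been stored in the labeled data set.
Adding the new labeled data to the labeled data set updates it. In this embodiment, the labeled x107048 is added to the labeled data set.
The hypersphere classification model is trained with the labeled data in the updated labeled data set.
The pre-trained hypersphere classification model is trained with the second data set. The hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently labeled data, and that separates abnormal data from normal data at the hypersphere surface.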
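In some implementations, as shown in Fig. 5, the training method of the hypersphere classification model includes:
S31. Set different penalty coefficients for abnormal data and normal data to generate a loss function, wherein each penalty coefficient is a constant within a preset range;
S32. Set constraint conditions and then calculate a sphere-center value, which characterizes the center position of the hypersphere in the hypersphere classification model, and a radius value, which characterizes the distance between the sphere center and the hypersphere surface;
S33. Generate, from the sphere-center value and the radius value, a decision function that distinguishes normal values from abnormal values.
In some implementations, training the hypersphere classification model with the labeled data in the labeled data set includes: fitting a hypersphere model in a high-dimensional space to the labeled data in the current labeled data set, where the hypersphere model contains a number of labeled data items and that number satisfies a preset condition. The preset condition can be set as needed; for example, the number of labeled data items on or inside the surface of the hypersphere model is maximal, or the proportion of labeled data items on or inside the surface reaches a preset threshold. The sphere center and radius of the hypersphere model are then determined, yielding a hypersphere classification model that classifies data using the surface of the hypersphere model as the separating boundary. With that surface as the boundary, data on or inside the surface is normal data and data outside it is abnormal data (the classification boundary is the surface of the hypersphere classification model), without regard to the original distribution of the labeled data.
Determining the sphere center and radius of the hypersphere classification model requires solving with a loss function and constraint conditions. Because the labeled data contains many normal items and few abnormal ones, different penalty coefficients are set for normal and abnormal data when constructing the loss function, so as to increase the influence of abnormal data on the classification model. The loss function is therefore constructed as follows: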
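The constraint conditions are:
$$\|x_i-a\|^2 \le R^2+\xi_i,\quad i: x_i\in L_{in}$$
$$\|x_j-a\|^2 \le R^2-\xi_j,\quad j: x_j\in L_{out}$$
$$\xi_i,\ \xi_j \ge 0$$
where $a$ is the sphere center, $R$ is the radius of the hypersphere model, $\xi_i$ and $\xi_j$ are slack variables, $x_i$ and $x_j$ are labeled data, $L_{in}$ is the set of labeled normal data, $L_{out}$ is the set of labeled abnormal data, $i$ and $j$ are indices identifying individual data items, and the penalty coefficients $C_1$ and $C_2$ are constants between 0 and 1.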
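For orientation only, a hypersphere objective of the usual support-vector-data-description form, written with the symbols just defined, would read as follows. This is an assumed reconstruction rather than the formula of this application; in particular, pairing $C_1$ with the normal set and $C_2$ with the abnormal set is an assumption.

```latex
\min_{R,\,a,\,\xi}\quad R^2 \;+\; C_1 \sum_{i:\,x_i \in L_{in}} \xi_i \;+\; C_2 \sum_{j:\,x_j \in L_{out}} \xi_j
```

The first term keeps the sphere small, while the two weighted sums charge for labeled points that fall on the wrong side of the surface.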
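The above problem is a non-convex optimization problem, for which the Lagrange multiplier method cannot find a global optimum. To address this, the constraints containing the slack variables are expressed in the form of a risk function, which turns the problem above into an unconstrained optimization problem, as follows:
$$\xi_i = l\left(R^2-\|\phi(x_i)-a\|^2\right)$$
$$\xi_j = l\left(\|\phi(x_j)-a\|^2-R^2\right)$$
$\phi(x)$ is a transformation mapping function that maps the original data $x$ into a new feature space after feature transformation;
$l(t)$ is the risk function, whose value is $\max\{-t, 0\}$. To merge the sample-independent variables in the risk function $l(t)$ and simplify the solution, let $T = R^2 - a^2$, which gives the optimization objective:
However, with this risk function $l(t)$ the second derivative does not exist, so gradient methods cannot be applied; the following risk function $l(t)$ is therefore used instead:
Here the constant $\varepsilon$ constrains the values of $t$ so that the risk function $l(t)$ is twice differentiable within a small range while differing little from the original risk function. Based on practical experience, $\varepsilon = 0.5$ is used; substituting the risk function $l(t)$ into the optimization objective gives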
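where the elements of the matrix $K$ are $k_{ij} = k(x_i, x_j) = \langle\phi(x_i),\phi(x_j)\rangle = (x_i x_j)$, and $e_i$ denotes the standard basis of $\mathbb{R}^{n+m}$. Solving in the dual form, ignoring constant terms, and simplifying gives the loss function:
$$i, j: x_i, x_j \in L_{in},\qquad l, m: x_l, x_m \in L_{out}$$
$(x_i x_j)$ denotes the inner product of the $i$-th and $j$-th sample vectors.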
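Simplifying the two inequalities of the constraint conditions gives:
$$\xi_i \ge \|x_i-a\|^2-R^2,\quad i: x_i\in L_{in}$$
$$\xi_j \le R^2-\|x_j-a\|^2,\quad j: x_j\in L_{out}$$
$\xi_i$ and $\xi_j$ are each multiplied by a Lagrange coefficient.
A constrained problem of this kind is usually solved with the Lagrangian method to obtain the optimal solution.
Once the Lagrange multiplier $\alpha_i$ corresponding to $x_i$ has been obtained by solving the above function, the sphere center is calculated:
The value of the sphere center $a$ is then substituted into the loss function, and the radius $R$ is solved with an optimization method.
This yields a preliminarily trained hypersphere classification model.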
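The last step, solving for the radius once the center is fixed, can be illustrated with a deliberately simplified sketch. It is not the dual solution described above. It assumes the identity mapping for $\phi$, takes the mean of the normal data as a stand-in center, and uses the slack definitions $\xi_i = \max\{\|x_i-a\|^2-R^2, 0\}$ and $\xi_j = \max\{R^2-\|x_j-a\|^2, 0\}$ that follow from the risk function $l(t)=\max\{-t,0\}$. The objective form, function names, and data are all assumptions.

```python
import numpy as np

def objective(center, r2, X_in, X_out, c1, c2):
    """Assumed form: R^2 + C1 * sum(normal slack) + C2 * sum(abnormal slack)."""
    d_in = np.sum((X_in - center) ** 2, axis=1)
    d_out = np.sum((X_out - center) ** 2, axis=1)
    slack_in = np.maximum(d_in - r2, 0.0)    # normal point outside the sphere
    slack_out = np.maximum(r2 - d_out, 0.0)  # abnormal point inside the sphere
    return r2 + c1 * slack_in.sum() + c2 * slack_out.sum()

def fit_hypersphere(X_in, X_out, c1=0.5, c2=0.5):
    """Stand-in center (mean of normal data), then an exact search over R^2."""
    center = X_in.mean(axis=0)
    d_in = np.sum((X_in - center) ** 2, axis=1)
    d_out = np.sum((X_out - center) ** 2, axis=1)
    # For a fixed center the objective is convex and piecewise linear in R^2,
    # so its minimum lies at 0 or at one of the squared distances.
    candidates = np.unique(np.concatenate(([0.0], d_in, d_out)))
    values = [objective(center, r2, X_in, X_out, c1, c2) for r2 in candidates]
    best_r2 = candidates[int(np.argmin(values))]
    return center, float(np.sqrt(best_r2))

X_in = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
X_out = np.array([[3.0, 0.0]])
center, radius = fit_hypersphere(X_in, X_out)
print(center, radius)  # center (0, 0), radius 1.0
```

The different penalty coefficients show up directly in how strongly each kind of violation pushes the radius. With `c1 = 0.5` and four normal points, shrinking the sphere below all of them costs more than it saves, so the surface settles on the normal points.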
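S4. Determine whether the hypersphere classification model has reached the training termination condition.
S5. When the training termination condition is reached, input the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.
As shown in Fig. 6, step S5 includes:
S51. Substitute each item of the unlabeled data into the decision function to generate a decision result value;
S52. Determine whether the decision result value is greater than or equal to zero;
S53. When it is greater than or equal to zero, output that unlabeled data item and label it as target abnormal data.
The query strategy selects the primary abnormal data on the basis of the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value to the hypersphere surface of the hypersphere classification model is the smallest.
Specifically, if training of the classification model has not reached the training termination condition, the classification model is retrained with the updated labeled data set, and the cycle repeats. As the labeled data is updated, the sphere-center position, radius and decision function of the hypersphere classification model are adjusted after each retraining. The training termination condition is reached when the change in the classification model after an iteration is smaller than a preset threshold; the trained classification model and the final decision function are then obtained. The decision function $f(x)=\|x-a\|-R$ is the difference between the distance from data point $x$ to the sphere center $a$ and the radius $R$. Once the training termination condition is reached, the values of the sphere center $a$ and the radius $R$ of the classification model are fixed.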
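One way to make the termination test concrete is to compare the sphere center and radius before and after a retraining round. The text only requires that the change in the model fall below a preset threshold; measuring that change as the center shift and the radius difference, and the names and tolerance below, are assumptions for illustration.

```python
import numpy as np

def has_converged(previous, current, tol=1e-3):
    """True when both the center shift and the radius change are below tol.

    `previous` and `current` are (center, radius) pairs from consecutive rounds.
    """
    (a_prev, r_prev), (a_curr, r_curr) = previous, current
    center_shift = np.linalg.norm(np.asarray(a_curr) - np.asarray(a_prev))
    radius_change = abs(r_curr - r_prev)
    return bool(center_shift < tol and radius_change < tol)

stable = has_converged(([0.0, 0.0], 1.0), ([0.0, 0.0005], 1.0002))
moving = has_converged(([0.0, 0.0], 1.0), ([0.1, 0.0], 1.0))
print(stable, moving)  # True False
```

A single tolerance for both quantities keeps the sketch short; since the center and radius share the units of the data, separate thresholds would only be needed if one should be held more tightly than the other.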
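During classification, an unlabeled data item $x_i$ is substituted into the decision function $f(x)$ and the sign is checked: if $f(x)\le 0$, the unlabeled system-indicator data is considered normal; if $f(x)>0$, the corresponding system-indicator data is considered abnormal.
In other words, during classification the distance between the unlabeled data item $x_i$ and the sphere center of the hypersphere classification model is calculated and compared with the radius of the model: if the distance is less than or equal to the radius, $x_i$ is normal data; if the distance is greater than the radius, $x_i$ is abnormal.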
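The classification rule can be written directly from the decision function. The sketch below assumes Euclidean distance and uses hypothetical toy values; the function name is illustrative.

```python
import numpy as np

def classify(X, center, radius):
    """Return f(x) = ||x - a|| - R and a boolean flag that is True for abnormal data."""
    f = np.linalg.norm(X - center, axis=1) - radius
    return f, f > 0

center = np.zeros(2)
radius = 5.0
X = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
f, abnormal = classify(X, center, radius)
print(f)         # [ 0.  5. -4.]
print(abnormal)  # [False  True False]
```

The first point lies exactly on the surface and is treated as normal here, following the $f(x)\le 0$ rule of this paragraph; step S53 instead labels a decision value equal to zero as abnormal, so the comparison would become `f >= 0` under that reading.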
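For example, a hypersphere classification model obtained with the above steps has sphere center $a$ = (92.69%, 3.28%, 3.49%, 52.36%, 495.53, 63, 69.72%, 98, 357, 54, 91.77%, 58.92%) and radius $R$ = 602.94. Taking the actual data in Table 1 as an example, each data item has 12 attribute values. Substituting the data item $x_i$ = (94.76%, 3.76%, 1.29%, 47%, 434.78, 59, 78.37%, 104, 379, 50, 95.47%, 64.55%) into the decision function $f(x)=\|x-a\|-R$ gives $f(x) = -49.09 < 0$, so $x_i$ lies inside the hypersphere classification model and can be regarded as normal data.
For abnormal data, AI operations staff can go on to perform root-cause analysis and similar work to find the cause of the system anomaly and recommend a fix.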
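As shown in Fig. 7, another embodiment of this application further provides a data anomaly detection apparatus, including:
an acquisition module 100, configured to acquire unlabeled data;
a query module 200, configured to extract primary abnormal data from the unlabeled data according to a preset query strategy, wherein the primary abnormal data is the data among the unlabeled data that is selected by the query strategy and meets a preset condition;
a training module 300, configured to identify and label the primary abnormal data, store it in a labeled first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently labeled data, and that separates abnormal data from normal data at the hypersphere surface;
an identification module 400, configured to determine whether the hypersphere classification model has reached a training termination condition;
a result output module 500, configured to, when the training termination condition is reached, input the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.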
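As shown in Fig. 8, another embodiment of this application discloses a computer device 600, including a memory 601, a processor 602, and a computer program stored in the memory 601 and executable on the processor 602; the processor 602 implements the data anomaly detection method described above when executing the computer program. The computer device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. As shown in the figure, the computer device 600 includes at least, but is not limited to, the memory 601, the processor 602 and a network interface 603, which can be communicatively connected to one another through a system bus. Specifically:
In this embodiment, the memory 601 includes at least one type of computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disc. In some embodiments, the memory 601 may be an internal storage unit of the computer device 600, such as its hard disk or main memory. In other embodiments, the memory 601 may be an external storage device of the computer device 600, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 600. The memory 601 may of course include both an internal storage unit and an external storage device of the computer device 600. In this embodiment, the memory 601 is generally used to store the operating system and the application software installed on the computer device 600, such as the program code of the data anomaly detection apparatus 500. The memory 601 may also be used to temporarily store data that has been output or is to be output.
In some embodiments the processor 602 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 602 is generally used to control the overall operation of the computer device 600. In this embodiment, the processor 602 runs the program code stored in the memory 601 or processes data, for example running the data anomaly detection apparatus 500, to implement the data anomaly detection method of the embodiments above.
The network interface 603 may include a wireless network interface or a wired network interface and is generally used to establish communication connections between the computer device 600 and other electronic devices. For example, the network interface 603 connects the computer device 600 to an external terminal through a network and establishes a data transmission channel and a communication connection between them. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, or Wi-Fi.
It should be noted that Fig. 8 shows only the computer device 600 with components 601-603; not all of the components shown are required, and more or fewer components may be implemented instead.
In this embodiment, the data anomaly detection apparatus 500 stored in the memory 601 may also be divided into one or more program modules, which are stored in the memory 601 and executed by one or more processors (the processor 602 in this embodiment) to carry out the data anomaly detection method of this application.
This embodiment further provides a computer-readable storage medium, which may be a non-volatile or a volatile storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, a server, or an app store, on which a computer program is stored that implements the corresponding functions when executed by a processor. The computer-readable storage medium of this embodiment is used to store the data anomaly detection apparatus 500, which implements the data anomaly detection method of this application when executed by a processor.
Another embodiment of this application further discloses a computer-readable storage medium on which a computer program is stored, the program being executed by a processor to implement the data anomaly detection method described above.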
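As shown in Fig. 9, another embodiment of this application provides a data anomaly detection method, including:
S00. Acquire unlabeled data.
S10. Train a hypersphere classification model with the labeled data in the labeled data set.
S20. Extract primary abnormal data from the unlabeled data according to a preset query strategy, wherein the primary abnormal data is the data among the unlabeled data that is selected by the query strategy and meets a preset condition.
S30. Identify and label the primary abnormal data to obtain newly labeled data.
S40. Add the newly labeled data to the labeled data set to obtain an updated labeled data set.
S50. Train the hypersphere classification model with the labeled data in the updated labeled data set.
S60. Determine whether the hypersphere classification model has reached the training termination condition; if it has, go to step S70; if it has not, go to step S20.
S70. Classify the unlabeled data with the hypersphere classification model: input the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.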
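The control flow of steps S10 to S70 can be sketched as a short loop. Everything below is a simplified stand-in under stated assumptions: `fit` uses the mean and largest distance of the normal data instead of solving the penalized optimization (so abnormal labels do not move the sphere), `oracle` is a hypothetical labeler that returns True for abnormal data, queried items are removed from the pool, and convergence is measured on the center and radius.

```python
import numpy as np

def fit(X_normal):
    """Stand-in trainer (assumption): center = mean, radius = largest distance."""
    a = X_normal.mean(axis=0)
    return a, float(np.linalg.norm(X_normal - a, axis=1).max())

def _normalize(v):
    """Subtract the minimum and divide by the maximum, guarding a zero maximum."""
    return (v - v.min()) / v.max() if v.max() > 0 else np.zeros_like(v)

def weighted_distance(U, a, r):
    """0.5 * normalized |f(x)| + 0.5 * normalized nearest-neighbor distance."""
    f_abs = np.abs(np.linalg.norm(U - a, axis=1) - r)
    pair = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)
    np.fill_diagonal(pair, np.inf)
    return 0.5 * _normalize(f_abs) + 0.5 * _normalize(pair.min(axis=1))

def detect(X_normal, U, oracle, tol=1e-3, max_rounds=20):
    a, r = fit(X_normal)                                  # S10
    pool = list(range(len(U)))                            # indices not yet queried
    for _ in range(max_rounds):
        if len(pool) < 2:
            break
        scores = weighted_distance(U[pool], a, r)         # S20
        idx = pool.pop(int(np.argmin(scores)))
        if not oracle(U[idx]):                            # S30: judged normal
            X_normal = np.vstack([X_normal, U[idx]])      # S40
        a_new, r_new = fit(X_normal)                      # S50
        converged = np.linalg.norm(a_new - a) < tol and abs(r_new - r) < tol  # S60
        a, r = a_new, r_new
        if converged:
            break
    return a, r, np.linalg.norm(U - a, axis=1) - r > 0    # S70

X_normal = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
U = np.array([[0.5, 0.0], [0.0, 0.9], [5.0, 5.0], [6.0, 5.0]])
oracle = lambda x: np.linalg.norm(x) > 2.0  # hypothetical labeler: True = abnormal
a, r, abnormal = detect(X_normal, U, oracle)
print(abnormal)  # [False False  True  True]
```

The loop structure is the part that mirrors Fig. 9: query, label, update, retrain, test for termination, and only then classify the whole unlabeled set. Swapping `fit` for a solver of the penalized hypersphere objective would leave the loop unchanged.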
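The data anomaly detection method provided by the embodiments of this application starts from the data and combines unsupervised and supervised learning. A hypersphere classification model built from a small amount of labeled data places no restriction on the original distribution of the data, so it applies more widely. The query strategy, based on boundary distance and sample density, can find the most valuable data fairly accurately while reducing the influence of noise, which greatly reduces the amount of data that operations staff need to label. This preserves classification accuracy while lowering the cost of AI operations, making the method better suited to real industrial scenarios and easier to deploy at scale.
The embodiments of the present disclosure have been described above. These embodiments are for illustration only and are not intended to limit the scope of the present disclosure, which is defined by the appended claims and their equivalents. Those skilled in the art can make various substitutions and modifications without departing from the scope of the present disclosure, and all such substitutions and modifications fall within that scope.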
Claims (20)
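- A data anomaly detection method, comprising: acquiring unlabeled data; extracting primary abnormal data from the unlabeled data according to a preset query strategy, wherein the primary abnormal data is the data among the unlabeled data that is selected by the query strategy and meets a preset condition; identifying and labeling the primary abnormal data, storing it in a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently labeled data, and that separates abnormal data from normal data at the hypersphere surface; determining whether the hypersphere classification model has reached a training termination condition; and when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.
- The method according to claim 1, wherein the training method of the hypersphere classification model comprises: setting different penalty coefficients for abnormal data and normal data to generate a loss function, wherein each penalty coefficient is a constant within a preset range; setting constraint conditions and then calculating a sphere-center value, which characterizes the center position of the hypersphere in the hypersphere classification model, and a radius value, which characterizes the distance between the sphere center and the hypersphere surface; and generating, from the sphere-center value and the radius value, a decision function that distinguishes normal values from abnormal values.
- The method according to claim 2, wherein inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening when the training termination condition is reached, so as to obtain target abnormal data, comprises: substituting each item of the unlabeled data into the decision function to generate a decision result value; determining whether the decision result value is greater than or equal to zero; and when it is greater than or equal to zero, outputting that unlabeled data item and labeling it as target abnormal data.
- The method according to claim 2, wherein the query strategy selects the primary abnormal data on the basis of the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value to the hypersphere surface of the hypersphere classification model is the smallest.
- The method according to claim 4, wherein extracting primary abnormal data from the unlabeled data according to the preset query strategy comprises: substituting the unlabeled data into the decision function and taking the absolute value to obtain the nearest sphere-surface distance; calculating the distances between the unlabeled data and taking the minimum as the nearest-neighbor sample distance; and normalizing the nearest sphere-surface distance and the nearest-neighbor sample distance, and weighting them with preset coefficients to obtain the weighted distance value of each unlabeled data item.
- The method according to claim 5, wherein the method of normalizing the nearest sphere-surface distance comprises: selecting, from the nearest sphere-surface distances of all the unlabeled data, the smallest value as a first minimum and the largest value as a first maximum; and dividing the difference between each unlabeled data item's nearest sphere-surface distance and the first minimum by the first maximum, to obtain the normalized nearest sphere-surface distance of every unlabeled data item.
- The method according to claim 5, wherein the method of normalizing the nearest-neighbor sample distance comprises: selecting, from the nearest-neighbor sample distances of all unlabeled data, the smallest value as a second minimum and the largest value as a second maximum; and calculating the difference between each unlabeled data item's nearest-neighbor sample distance and the second minimum, and dividing each difference by the second maximum, to obtain the normalized nearest-neighbor sample distances of all unlabeled data.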
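- A data anomaly detection apparatus, comprising: an acquisition module, configured to acquire unlabeled data; a query module, configured to extract primary abnormal data from the unlabeled data according to a preset query strategy, wherein the primary abnormal data is the data among the unlabeled data that is selected by the query strategy and meets a preset condition; a training module, configured to identify and label the primary abnormal data, store it in a labeled first data set to form a second data set, and train a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently labeled data, and that separates abnormal data from normal data at the hypersphere surface; an identification module, configured to determine whether the hypersphere classification model has reached a training termination condition; and a result output module, configured to, when the training termination condition is reached, input the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.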
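- A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps of a data anomaly detection method: acquiring unlabeled data; extracting primary abnormal data from the unlabeled data according to a preset query strategy, wherein the primary abnormal data is the data among the unlabeled data that is selected by the query strategy and meets a preset condition; identifying and labeling the primary abnormal data, storing it in a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently labeled data, and that separates abnormal data from normal data at the hypersphere surface; determining whether the hypersphere classification model has reached a training termination condition; and when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.
- The computer device according to claim 9, wherein the training method of the hypersphere classification model comprises: setting different penalty coefficients for abnormal data and normal data to generate a loss function, wherein each penalty coefficient is a constant within a preset range; setting constraint conditions and then calculating a sphere-center value, which characterizes the center position of the hypersphere in the hypersphere classification model, and a radius value, which characterizes the distance between the sphere center and the hypersphere surface; and generating, from the sphere-center value and the radius value, a decision function that distinguishes normal values from abnormal values.
- The computer device according to claim 10, wherein inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening when the training termination condition is reached, so as to obtain target abnormal data, comprises: substituting each item of the unlabeled data into the decision function to generate a decision result value; determining whether the decision result value is greater than or equal to zero; and when it is greater than or equal to zero, outputting that unlabeled data item and labeling it as target abnormal data.
- The computer device according to claim 10, wherein the query strategy selects the primary abnormal data on the basis of the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value to the hypersphere surface of the hypersphere classification model is the smallest.
- The computer device according to claim 12, wherein extracting primary abnormal data from the unlabeled data according to the preset query strategy comprises: substituting the unlabeled data into the decision function and taking the absolute value to obtain the nearest sphere-surface distance; calculating the distances between the unlabeled data and taking the minimum as the nearest-neighbor sample distance; and normalizing the nearest sphere-surface distance and the nearest-neighbor sample distance, and weighting them with preset coefficients to obtain the weighted distance value of each unlabeled data item.
- The computer device according to claim 13, wherein the method of normalizing the nearest sphere-surface distance comprises: selecting, from the nearest sphere-surface distances of all the unlabeled data, the smallest value as a first minimum and the largest value as a first maximum; and dividing the difference between each unlabeled data item's nearest sphere-surface distance and the first minimum by the first maximum, to obtain the normalized nearest sphere-surface distance of every unlabeled data item.
- The computer device according to claim 13, wherein the method of normalizing the nearest-neighbor sample distance comprises: selecting, from the nearest-neighbor sample distances of all unlabeled data, the smallest value as a second minimum and the largest value as a second maximum; and calculating the difference between each unlabeled data item's nearest-neighbor sample distance and the second minimum, and dividing each difference by the second maximum, to obtain the normalized nearest-neighbor sample distances of all unlabeled data.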
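- A computer-readable storage medium on which a computer program is stored, wherein the program is executed by a processor to implement the following steps of a data anomaly detection method: acquiring unlabeled data; extracting primary abnormal data from the unlabeled data according to a preset query strategy, wherein the primary abnormal data is the data among the unlabeled data that is selected by the query strategy and meets a preset condition; identifying and labeling the primary abnormal data, storing it in a labeled first data set to form a second data set, and training a pre-trained hypersphere classification model with the second data set, wherein the hypersphere classification model is a classification model that fits, in a high-dimensional space, a hypersphere covering most sample values of the currently labeled data, and that separates abnormal data from normal data at the hypersphere surface; determining whether the hypersphere classification model has reached a training termination condition; and when the training termination condition is reached, inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening, so as to obtain target abnormal data.
- The computer-readable storage medium according to claim 16, wherein the training method of the hypersphere classification model comprises: setting different penalty coefficients for abnormal data and normal data to generate a loss function, wherein each penalty coefficient is a constant within a preset range; setting constraint conditions and then calculating a sphere-center value, which characterizes the center position of the hypersphere in the hypersphere classification model, and a radius value, which characterizes the distance between the sphere center and the hypersphere surface; and generating, from the sphere-center value and the radius value, a decision function that distinguishes normal values from abnormal values.
- The computer-readable storage medium according to claim 17, wherein inputting the unlabeled data into the hypersphere classification model obtained under the training termination condition for classification and screening when the training termination condition is reached, so as to obtain target abnormal data, comprises: substituting each item of the unlabeled data into the decision function to generate a decision result value; determining whether the decision result value is greater than or equal to zero; and when it is greater than or equal to zero, outputting that unlabeled data item and labeling it as target abnormal data.
- The computer-readable storage medium according to claim 17, wherein the query strategy selects the primary abnormal data on the basis of the pre-trained hypersphere classification model, and the preset condition is that the weighted distance value to the hypersphere surface of the hypersphere classification model is the smallest.
- The computer-readable storage medium according to claim 19, wherein extracting primary abnormal data from the unlabeled data according to the preset query strategy comprises: substituting the unlabeled data into the decision function and taking the absolute value to obtain the nearest sphere-surface distance; calculating the distances between the unlabeled data and taking the minimum as the nearest-neighbor sample distance; and normalizing the nearest sphere-surface distance and the nearest-neighbor sample distance, and weighting them with preset coefficients to obtain the weighted distance value of each unlabeled data item.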
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010468770.4 | 2020-05-28 | ||
CN202010468770.4A CN111813618A (zh) | 2020-05-28 | 2020-05-28 | Data anomaly detection method, apparatus, device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021139249A1 (zh) | 2021-07-15 |
Family
ID=72847794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/118524 WO2021139249A1 (zh) | 2020-05-28 | 2020-09-28 | Data anomaly detection method, apparatus, device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111813618A (zh) |
WO (1) | WO2021139249A1 (zh) |
-
2020
- 2020-05-28 CN CN202010468770.4A patent/CN111813618A/zh active Pending
- 2020-09-28 WO PCT/CN2020/118524 patent/WO2021139249A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111813618A (zh) | 2020-10-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20912933 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20912933 Country of ref document: EP Kind code of ref document: A1 |