CN110825545A

CN110825545A - Anomaly detection method and system for cloud service platform

Info

Publication number: CN110825545A
Application number: CN201910820118.1A
Authority: CN
Inventors: 严俊伟; 杨赟; 娄平; 刘泉; 周祖德
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2019-08-31
Filing date: 2019-08-31
Publication date: 2020-02-21

Abstract

The invention discloses an anomaly detection method and system for a cloud service platform. The method includes the following steps: 1) collecting system metric data in real time while the cloud platform hosts are working normally, and calculating a system operating environment vector from the system metric data; 2) training an anomaly detection model from the normal system operating environment vectors by combining the maximum average deviation algorithm (MMD) with the support vector data description algorithm (SVDD); 3) when new host system metric data is received, classifying it with the hypersphere of the cluster to which the host belongs, so as to detect host anomalies. By combining MMD and SVDD in training, the invention effectively addresses the extreme imbalance between normal and abnormal samples in a cloud service platform, can detect unknown system anomalies, and no longer needs a separate anomaly detection model for each host, which greatly reduces the time and system resources consumed by anomaly modeling.

Description

Anomaly detection method and system for cloud service platform

Technical field

The invention relates to cloud computing security technology, and in particular to an anomaly detection method and system for a cloud service platform.

Background

A cloud service platform is an open public platform that provides a wide range of application services to a large number of users. The reliability of these services is critical to their consumers, and anomalies in the platform call that reliability into question. Because of their scale and complexity, cloud service platforms produce a large number of system anomalies, caused mainly by administrator operating errors, over- or under-provisioning of resources, hardware or software failures, and network attacks. Real-time anomaly detection on the running state of a cloud service platform is therefore of great importance.

The basic principle of anomaly detection is to build, on top of system monitoring, a profile model of the behavior of the system, its users, its processes, or its network; when the running state deviates from the normal profile, it is judged abnormal. Existing anomaly detection methods and their corresponding systems fall mainly into statistics-based algorithms and machine-learning-based algorithms.

Statistics-based methods first mine the characteristics of the performance data with statistical learning, then compute an anomaly score for each sample from the distribution of the data, and raise an alarm when the score exceeds a specified threshold. These methods usually require the time-series distribution of the hosts' system performance data to be known, and they may not adapt well to clusters that keep expanding.

Machine-learning-based methods must first learn a model from a large number of data samples before they can examine new samples and decide whether they are abnormal. Hyunjoo Kim et al. proposed a machine-learning-based network threat detection method that first selects salient features with a random forest, generates clusters by applying K-means and DBSCAN to unlabeled data collected from the cloud platform, and labels the clustering results using the new-Kyoto-2006+ dataset. Elham Besharati et al. combined three classifiers (a neural network, a decision tree, and linear discriminant analysis) to identify various attacks. Their method achieves high accuracy, but it requires building a neural network and a decision tree model for every host, which is costly.

The main technical difficulties at present are: (1) a cloud service platform runs normally most of the time, so abnormal data samples are far fewer than normal ones; (2) machine-learning-based methods usually need a separate anomaly detection model for each host, and training these models consumes a great deal of time and system resources, so such methods are hard to apply to large-scale cloud service platforms.

Summary of the invention

The technical problem to be solved by the invention is to provide an anomaly detection method and system for a cloud service platform that addresses the above defects of the prior art.

The technical solution adopted by the invention is a cloud service platform anomaly detection method comprising the following steps:

1) Collect system metric data in real time while the cloud platform hosts are working normally, and calculate the system operating environment vector from the system metric data;

2) Using the normal system operating environment vectors, train an anomaly detection model by combining the maximum average deviation algorithm (MMD) with the support vector data description algorithm SVDD (support vector domain description);

In step 2), the anomaly detection model is obtained by combining the MMD and SVDD algorithms as follows:

Step 2.1) Use MMD to cluster the normal operating environment vectors of the hosts, thereby dividing the cloud platform hosts into multiple clusters according to the similarity of their system operating environments;

Step 2.2) From the normal system operating environment vectors, use SVDD to construct one hypersphere for each host cluster; the hypersphere describes the data distribution of the cluster's hosts during normal operation and is used to detect abnormal data;

3) When new host system metric data is received, classify it with the hypersphere of the cluster to which the host belongs, so as to detect host anomalies.

According to the above scheme, step 3) is specifically:

Step 3.1) When new host system metric data is received, calculate the system operating environment vector from the system metric data;

Step 3.2) According to the cluster to which the host belongs, select the hypersphere corresponding to that cluster as the anomaly detection model;

Step 3.3) When the host's system operating environment vector falls inside or on the hypersphere, the host is judged to be in a normal state;

Step 3.4) When the host's system operating environment vector falls outside the hypersphere, the host is judged to be in an abnormal state.

According to the above scheme, the hypersphere in step 2.2) is constructed as follows:

2.2.1) Consider the set of operating environment vectors x_i, i = 1, ..., N, where N is the number of hosts in the cluster; the construction of the hypersphere can be expressed by the following formula:

The above formula is subject to the following constraint:

$(x_i - a)^T (x_i - a) \le R^2 + \xi_i, \quad \xi_i \ge 0$

where R is the radius of the hypersphere and a is its center. The constraint means that ξ_i = 0 when a data point lies inside or on the surface of the hypersphere, and ξ_i > 0 otherwise. C is a constant: when C is large, training favors a larger sphere that encloses as many points as possible; when C is small, it favors a smaller sphere;

2.2.2) Set the value of the constant C; sample points whose distance from the origin exceeds the set value are ignored during construction;

2.2.3) Through iterative fitting, the radius and the center of the hypersphere are finally obtained. The fitting process can be expressed by the following formula:
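In the standard SVDD formulation, which is assumed here because it matches the constraint and the symbols R, a, C and ξ_i used above, the construction in 2.2.1) is the primal problem and the fitting in 2.2.3) is typically carried out on its dual. The multipliers α_i are notation introduced for this sketch and do not appear in the text:

```latex
% Primal problem (assumed form of the construction formula)
\begin{aligned}
\min_{R,\,a,\,\xi}\quad & R^2 + C\sum_{i=1}^{N}\xi_i \\
\text{s.t.}\quad & (x_i - a)^T (x_i - a) \le R^2 + \xi_i,\qquad \xi_i \ge 0,\qquad i = 1,\dots,N
\end{aligned}

% Dual problem (assumed form of the fitting formula)
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{N}\alpha_i\, x_i^T x_i \;-\; \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j\, x_i^T x_j \\
\text{s.t.}\quad & 0 \le \alpha_i \le C,\qquad \sum_{i=1}^{N}\alpha_i = 1
\end{aligned}

% Center and radius recovered from the dual solution
a = \sum_{i=1}^{N}\alpha_i x_i,
\qquad
R^2 = (x_k - a)^T (x_k - a)\quad\text{for any } k \text{ with } 0 < \alpha_k < C
```

Points with α_i = 0 lie strictly inside the sphere, points with 0 < α_i < C lie on its surface, and points with α_i = C lie outside it with ξ_i > 0, which is consistent with the role of C described in 2.2.1).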

According to the above scheme, the system metric data in step 1) includes CPU, memory, disk, and network-related system metric data.

A cloud service platform anomaly detection system, comprising:

a collection module, used to collect system metric data in real time while the cloud platform hosts are working normally, and to calculate the system operating environment vector from the system metric data;

a model building module, used to train an anomaly detection model from the normal system operating environment vectors by combining the maximum average deviation algorithm (MMD) with the support vector data description algorithm SVDD (support vector domain description);

In the model building module, the anomaly detection model is obtained by combining the MMD and SVDD algorithms as follows:

1) Use MMD to cluster the normal operating environment vectors of the hosts, thereby dividing the cloud platform hosts into multiple clusters according to the similarity of their system operating environments;

2) From the normal system operating environment vectors, use SVDD to construct one hypersphere for each host cluster; the hypersphere describes the data distribution of the cluster's hosts during normal operation and is used to detect abnormal data;

a detection module, used to classify new host system metric data, when it is received, with the hypersphere of the cluster to which the host belongs, so as to detect host anomalies.

According to the above scheme, the detection module specifically operates as follows:

1) When new host system metric data is received, calculate the system operating environment vector from the system metric data;

2) According to the cluster to which the host belongs, select the hypersphere corresponding to that cluster as the anomaly detection model;

3) When the host's system operating environment vector falls inside or on the hypersphere, the host is judged to be in a normal state;

4) When the host's system operating environment vector falls outside the hypersphere, the host is judged to be in an abnormal state.

According to the above scheme, the hypersphere in the model building module is constructed as follows:

1) Consider the set of operating environment vectors, where N is the number of hosts in the cluster; the construction of the hypersphere can be expressed by the following formula:

where R is the radius of the hypersphere and a is its center. The formula means that ξ_i = 0 when a data point lies inside or on the surface of the hypersphere, and ξ_i > 0 otherwise. C is a constant: when C is large, training favors a larger sphere that encloses as many points as possible; when C is small, it favors a smaller sphere.

2) Set the value of the constant C; sample points whose distance from the origin exceeds the set value are ignored during construction;

3) Through iterative fitting, the radius and the center of the hypersphere are finally obtained. The fitting process can be expressed by the following formula:

According to the above scheme, the system metric data in the collection module includes CPU, memory, disk, and network-related system metric data.

The beneficial effects of the invention are:

1. Training the anomaly detection model with a combination of MMD and SVDD effectively addresses the extreme imbalance between normal and abnormal samples in a cloud service platform and makes it possible to detect unknown system anomalies. A separate anomaly detection model for each host is no longer needed, which greatly reduces the time and system resources consumed by anomaly modeling.

2. The constructed model classifies host system metric data, providing online anomaly detection for cloud service platform hosts, discovering abnormal system behavior promptly, and improving the reliability of the cloud service platform.

Description of the drawings

The invention is further described below with reference to the accompanying drawings and embodiments, in which:

Fig. 1 is a flowchart of the method according to an embodiment of the invention;

Fig. 2 is a pseudocode diagram of the clustering process according to an embodiment of the invention;

Fig. 3 is a flowchart of the hypersphere construction process according to an embodiment of the invention.

Detailed description

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here only explain the invention and do not limit it.

As shown in Fig. 1, a cloud service platform anomaly detection method includes:

Step 1: Collect CPU, memory, disk, and network-related system metric data of the cloud platform hosts in real time, and calculate the system operating environment vector from the system metric data. The calculation proceeds as follows:

1) For the CPU metric data, the CPU usage Usage_cpu is calculated from the CPU system time (SYS), CPU user time (USER), CPU disk I/O wait time (IO_WAIT), CPU hardware interrupt time (IRQ), CPU software interrupt time (SOFT_IRQ), and CPU idle time (IDLE). The calculation formula is as follows:

2) For the memory metric data, the memory usage Usage_mem is calculated from the total memory size (TOTAL), the memory actually in use (ACTUAL), and the memory held in cache and buffers (CACHE+BUFFERS). The calculation formula is as follows:

3) For the disk metric data, the disk I/O frequency Freq_disk is calculated from the number of disk reads (READ_COUNT), the number of disk writes (WRITE_COUNT), and the maximum number of disk I/O operations (MAX_IO_COUNT). The calculation formula is as follows:

4) For the network metric data, the network load Load_net is calculated from the inbound traffic size (IN_SIZE), the outbound traffic size (OUT_SIZE), and the network bandwidth (MAX_SIZE). The calculation formula is as follows:

5) The system operating environment vector is obtained from the above results and is expressed as follows:

RE = (Usage_cpu, Usage_mem, Freq_disk, Load_net)
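The four components can be computed along the following lines. This is a minimal Python sketch, not the formulas of the invention: it assumes busy time over total time for the CPU, used memory minus cache and buffers over total memory, read plus write count over the maximum I/O count, and inbound plus outbound traffic over bandwidth. The `HostMetrics` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HostMetrics:
    """Raw metrics for one sampling interval (hypothetical container)."""
    sys: float
    user: float
    io_wait: float
    irq: float
    soft_irq: float
    idle: float
    mem_total: float
    mem_actual: float
    mem_cache_buffers: float
    read_count: float
    write_count: float
    max_io_count: float
    in_size: float
    out_size: float
    max_size: float


def operating_environment_vector(m: HostMetrics) -> Tuple[float, float, float, float]:
    """Return RE = (Usage_cpu, Usage_mem, Freq_disk, Load_net)."""
    busy = m.sys + m.user + m.io_wait + m.irq + m.soft_irq
    # ASSUMPTION: CPU usage is busy time divided by busy plus idle time.
    usage_cpu = busy / (busy + m.idle)
    # ASSUMPTION: cache and buffers are reclaimable and not counted as used.
    usage_mem = (m.mem_actual - m.mem_cache_buffers) / m.mem_total
    # ASSUMPTION: disk I/O frequency is the I/O count relative to its maximum.
    freq_disk = (m.read_count + m.write_count) / m.max_io_count
    # ASSUMPTION: network load is total traffic relative to the bandwidth.
    load_net = (m.in_size + m.out_size) / m.max_size
    return usage_cpu, usage_mem, freq_disk, load_net
```

Under these assumptions every component is a ratio, so the Euclidean distances used later in clustering are not dominated by any single metric's unit.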

Step 2: Using the normal system operating environment vectors, train an anomaly detection model by combining the MMD and SVDD algorithms, so that the model can detect unknown anomalies even when abnormal samples are missing and can cover multiple hosts with similar operating environments at once, greatly reducing modeling time and system resource consumption;

The training process combining the MMD and SVDD algorithms is as follows:

Step 2.1: Use MMD to cluster the operating environment vectors of the hosts, placing hosts with similar operating environments in the same cluster, while hosts in different clusters have clearly different operating environments. The pseudocode of the clustering process is shown in Fig. 2, and the process is as follows:

1) By the minimum-distance principle, select the point closest to the origin from the set of system operating environment vectors as the first cluster center; distances are Euclidean;

2) Select the point farthest from the first point in the set of system operating environment vectors as the second cluster center;

3) By the minimum-distance principle, assign each sample point to its nearest cluster center and update that center; if the distance from a sample point to every cluster center is greater than the set threshold, make the sample point a new cluster center;

4) Repeat step 3) until all sample points have been assigned.
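The four clustering steps above can be sketched in Python as follows. This is an illustrative reading of the description, not the pseudocode of Fig. 2: the function name `cluster_hosts`, the `threshold` parameter, and the choice to update a center as the mean of its members are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean(points: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def cluster_hosts(vectors: List[Vector], threshold: float) -> Tuple[List[Vector], List[int]]:
    """Return (cluster centers, cluster label for each input vector)."""
    if not vectors:
        return [], []
    n = len(vectors)
    origin = (0.0,) * len(vectors[0])
    labels = [-1] * n
    # 1) First center: the point closest to the origin.
    first = min(range(n), key=lambda i: euclidean(vectors[i], origin))
    centers = [tuple(vectors[first])]
    members = [[tuple(vectors[first])]]
    labels[first] = 0
    # 2) Second center: the point farthest from the first center.
    if n > 1:
        second = max((i for i in range(n) if i != first),
                     key=lambda i: euclidean(vectors[i], vectors[first]))
        centers.append(tuple(vectors[second]))
        members.append([tuple(vectors[second])])
        labels[second] = 1
    # 3) and 4) Assign every remaining point to its nearest center, or open
    # a new cluster when even the nearest center is beyond the threshold.
    for i, v in enumerate(vectors):
        if labels[i] != -1:
            continue
        dists = [euclidean(v, c) for c in centers]
        k = min(range(len(centers)), key=lambda j: dists[j])
        if dists[k] > threshold:
            centers.append(tuple(v))
            members.append([tuple(v)])
            labels[i] = len(centers) - 1
        else:
            members[k].append(tuple(v))
            centers[k] = mean(members[k])  # ASSUMPTION: center = member mean
            labels[i] = k
    return centers, labels
```

Because the number of clusters is driven by the threshold rather than fixed in advance, a newly added host with an unfamiliar operating environment opens its own cluster instead of distorting an existing one.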

Step 2.2: Train on the data samples in each cluster with the SVDD algorithm and construct one hypersphere per cluster. These hyperspheres describe the distribution of operating environment data for the hosts in each cluster under normal operation. The construction process is shown in Fig. 3 and described as follows:
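As a rough illustration of fitting a sphere to one cluster, the Python sketch below approximates the smallest ball enclosing the cluster's vectors with the Badoiu-Clarkson iteration. This is a swapped-in technique, not the SVDD solver of the invention: it corresponds only to the special case without slack variables and without a kernel, so it does not ignore outlying training points the way the constant C allows. The function name and the `iterations` parameter are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fit_enclosing_ball(points: List[Sequence[float]],
                       iterations: int = 2000) -> Tuple[List[float], float]:
    """Approximate the smallest ball enclosing `points`.

    Returns (center a, radius R). Every input point lies inside or on
    the returned ball, because R is the largest distance to the center.
    """
    center = list(points[0])
    for step in range(1, iterations + 1):
        # Find the point currently farthest from the center estimate.
        farthest = max(points, key=lambda p: distance(p, center))
        # Move a shrinking fraction of the way toward that point.
        center = [c + (f - c) / (step + 1) for c, f in zip(center, farthest)]
    radius = max(distance(p, center) for p in points)
    return center, radius
```

The resulting pair (a, R) plays the same role as the center and radius of the SVDD hypersphere during detection, which makes the sketch a convenient stand-in when experimenting with the rest of the pipeline.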

Step 3: When new host system metric data is received, classify it with the hypersphere of the cluster to which the host belongs. The classification formula is as follows:

where Ω denotes the hypersphere. When the host's system operating environment vector falls inside or on the hypersphere, the host is judged to be in a normal state; when it falls outside the hypersphere, the host is judged to be in an abnormal state and an alert is sent by email.
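The decision rule can be sketched in Python as a distance test against the center and radius of the cluster's hypersphere. The sketch assumes the hypersphere lives in the original four-dimensional vector space (no kernel), and the names `is_normal`, `classify`, and the `hyperspheres` mapping from cluster id to (center, radius) are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Hypersphere = Tuple[Sequence[float], float]  # (center a, radius R)


def is_normal(re_vector: Sequence[float],
              center: Sequence[float],
              radius: float) -> bool:
    """True when the vector lies inside or on the hypersphere."""
    squared_distance = sum((x - a) ** 2 for x, a in zip(re_vector, center))
    return squared_distance <= radius ** 2


def classify(re_vector: Sequence[float],
             host_cluster: int,
             hyperspheres: Dict[int, Hypersphere]) -> str:
    """Classify a host's vector with the hypersphere of its own cluster."""
    center, radius = hyperspheres[host_cluster]
    return "normal" if is_normal(re_vector, center, radius) else "abnormal"
```

For example, with `hyperspheres = {0: ((0.5, 0.5, 0.5, 0.5), 0.5)}`, a vector on the surface such as `(0.5, 0.5, 0.5, 1.0)` is still classified as normal, matching the "inside or on" rule. Sending the email alert for an abnormal result is left out of the sketch.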

The cloud platform anomaly detection method proposed in this embodiment has the following beneficial effects:

1) Training the anomaly detection model with a combination of MMD and SVDD effectively addresses the extreme imbalance between normal and abnormal samples in a cloud service platform and makes it possible to detect unknown system anomalies. A separate anomaly detection model for each host is no longer needed, which greatly reduces the time and system resources consumed by anomaly modeling.

2) The constructed model classifies host system metric data, providing online anomaly detection for cloud service platform hosts, discovering abnormal system behavior promptly, and improving the reliability of the cloud service platform.

The invention further provides a cloud service platform anomaly detection system comprising a collection module, a communication module, a modeling module, and a detection module:

The collection module runs inside each cloud platform host, collects CPU, memory, disk, and network-related system resource metric data of the host in real time, and submits the collected data to the client sub-module of the communication module.

After receiving the data submitted by the collection module, the client sub-module of the communication module encapsulates it, adds the local MAC address to the packet header so that metric data from different hosts can be told apart, and pushes the encapsulated packet to the SYS_METRICS topic of the Kafka message queue.

The server sub-module of the communication module periodically pulls data from the SYS_METRICS topic of the Kafka message queue, parses it, and saves the parsed data to the database in a fixed format.
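The encapsulation and parsing performed by the two sub-modules might look like the following Python sketch. The JSON packet layout, the `timestamp` field, and the function names are assumptions made for illustration; only the MAC address in the header and the SYS_METRICS topic name come from the description. Publishing to and consuming from Kafka is omitted, since it depends on the client library in use.

```python
import json
import time
import uuid
from typing import Dict, Tuple

TOPIC = "SYS_METRICS"  # topic name taken from the description


def local_mac_address() -> str:
    """Format the 48-bit address from uuid.getnode() as aa:bb:cc:dd:ee:ff.

    uuid.getnode() may fall back to a random number when no hardware
    address can be read, so a real deployment should verify the value.
    """
    node = uuid.getnode()
    return ":".join(f"{(node >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


def encapsulate(metrics: Dict[str, float], mac: str) -> bytes:
    """Client side: wrap raw metrics with a header carrying the MAC address."""
    packet = {
        "header": {"mac": mac, "timestamp": time.time()},  # hypothetical layout
        "body": metrics,
    }
    return json.dumps(packet).encode("utf-8")


def parse(payload: bytes) -> Tuple[str, Dict[str, float]]:
    """Server side: recover the sending host's MAC address and its metrics."""
    packet = json.loads(payload.decode("utf-8"))
    return packet["header"]["mac"], packet["body"]
```

A client would call `encapsulate(metrics, local_mac_address())` and hand the resulting bytes to its Kafka producer for the `TOPIC` topic; the server would pass each consumed message value to `parse` before writing it to the database.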

The modeling module analyzes and models the system metric data collected under normal conditions (that is, while the system is not under any attack) to generate the anomaly detection model. The specific steps are:

Step 1: Extract incremental data from the HBase database and calculate the system operating environment vectors from the system metric data;

Step 2: Using the normal system operating environment vectors, train an anomaly detection model by combining the MMD and SVDD algorithms, so that the model can detect unknown anomalies even when abnormal samples are missing and can cover multiple hosts with similar operating environments at once, greatly reducing modeling time and system resource consumption. The modeling process is as follows:

Step 2.1: Use MMD to cluster all the calculated system operating environment vectors, thereby dividing the cloud service platform hosts into multiple clusters according to the similarity of their system operating environments;

Step 2.2: From the normal system operating environment vectors, use SVDD to construct one hypersphere for each host cluster, describing the data distribution of the hosts during normal operation.

The detection module loads the hyperspheres of all host clusters. When new host system metric data is received, it first calculates the host's system operating environment vector from that data; it then selects the hypersphere of the cluster to which the host belongs and uses it to classify the data. When the host's system operating environment vector falls inside or on the hypersphere, the host is judged to be in a normal state; when it falls outside the hypersphere, the host is judged to be in an abnormal state and an alert is sent by email.

It should be understood that those of ordinary skill in the art may make improvements or modifications based on the above description, and all such improvements and modifications fall within the scope of protection of the appended claims.

Claims

1. A cloud service platform anomaly detection method, characterized in that it comprises the following steps:

1) Collect system metric data in real time while the cloud platform hosts are working normally, and calculate the system operating environment vector from the system metric data;

2) Using the normal system operating environment vectors, train an anomaly detection model by combining the maximum average deviation algorithm (MMD) with the support vector data description algorithm (SVDD);

In step 2), the anomaly detection model is obtained by combining the MMD and SVDD algorithms as follows:

Step 2.1) Use MMD to cluster the normal operating environment vectors of the hosts, thereby dividing the cloud platform hosts into multiple clusters according to the similarity of their system operating environments;

Step 2.2) From the normal system operating environment vectors, use SVDD to construct one hypersphere for each host cluster; the hypersphere describes the data distribution of the cluster's hosts during normal operation and is used to detect abnormal data;

3) When new host system metric data is received, classify it with the hypersphere of the cluster to which the host belongs, so as to detect host anomalies.

2. The cloud service platform anomaly detection method according to claim 1, characterized in that step 3) is specifically:

Step 3.1) When new host system metric data is received, calculate the system operating environment vector from the system metric data;

Step 3.2) According to the cluster to which the host belongs, select the hypersphere corresponding to that cluster as the anomaly detection model;

Step 3.3) When the host's system operating environment vector falls inside or on the hypersphere, the host is judged to be in a normal state;

Step 3.4) When the host's system operating environment vector falls outside the hypersphere, the host is judged to be in an abnormal state.

3. The cloud service platform anomaly detection method according to claim 1, characterized in that the hypersphere in step 2.2) is constructed as follows:

2.2.1) Consider the set of operating environment vectors, where N is the number of hosts in the cluster; the construction of the hypersphere can be expressed by the following formula:

The above formula is subject to the following constraint:

$(x_i - a)^T (x_i - a) \le R^2 + \xi_i, \quad \xi_i \ge 0$

where R is the radius of the hypersphere and a is its center; the constraint means that ξ_i = 0 when a data point lies inside or on the surface of the hypersphere, and ξ_i > 0 otherwise; C is a constant;

2.2.2) Set the value of the constant C; sample points whose distance from the origin exceeds the set value are ignored during construction;

2.2.3) Through iterative fitting, the radius and the center of the hypersphere are finally obtained. The fitting process can be expressed by the following formula:

4. The cloud service platform anomaly detection method according to claim 1, wherein the system metric data in step 1) includes CPU, memory, disk, and network-related system metric data.

5. A cloud service platform anomaly detection system, characterized in that it comprises:

a collection module, used to collect system metric data in real time while the cloud platform hosts are working normally, and to calculate the system operating environment vector from the system metric data;

a model building module, used to train an anomaly detection model from the normal system operating environment vectors by combining the maximum average deviation algorithm (MMD) with the support vector data description algorithm (SVDD);

In the model building module, the anomaly detection model is obtained by combining the MMD and SVDD algorithms as follows:

1) Use MMD to cluster the normal operating environment vectors of the hosts, thereby dividing the cloud platform hosts into multiple clusters according to the similarity of their system operating environments;

2) From the normal system operating environment vectors, use SVDD to construct one hypersphere for each host cluster; the hypersphere describes the data distribution of the cluster's hosts during normal operation and is used to detect abnormal data;

a detection module, used to classify new host system metric data, when it is received, with the hypersphere of the cluster to which the host belongs, so as to detect host anomalies.

6. The cloud service platform anomaly detection system according to claim 5, wherein the detection module specifically operates as follows:

1) When new host system metric data is received, calculate the system operating environment vector from the system metric data;

2) According to the cluster to which the host belongs, select the hypersphere corresponding to that cluster as the anomaly detection model;

3) When the host's system operating environment vector falls inside or on the hypersphere, the host is judged to be in a normal state;

4) When the host's system operating environment vector falls outside the hypersphere, the host is judged to be in an abnormal state.

7. The cloud service platform anomaly detection system according to claim 5, characterized in that the hypersphere in the model building module is constructed as follows:

1) Consider the set of operating environment vectors, where N is the number of hosts in the cluster; the construction of the hypersphere is expressed by the following formula:

The above formula is subject to the following constraint:

$(x_i - a)^T (x_i - a) \le R^2 + \xi_i, \quad \xi_i \ge 0$

2) Set the value of the constant C; sample points whose distance from the origin exceeds the set value are ignored during construction;

3) Through iterative fitting, the radius and the center of the hypersphere are finally obtained; the fitting process is expressed by the following formula:

8. The cloud service platform anomaly detection system according to claim 5, wherein the system metric data in the collection module includes CPU, memory, disk, and network-related system metric data.