CN110309037A

CN110309037A - A method for selecting features related to energy efficiency in data centers

Info

Publication number: CN110309037A
Application number: CN201811469430.2A
Authority: CN
Inventors: 李云; 张諝晟; 沈子钰; 夏彬; 刘峥
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University
Priority date: 2018-11-28
Filing date: 2018-11-28
Publication date: 2019-10-08

Abstract

The invention proposes a method for selecting features related to data center energy efficiency. To address this feature selection problem, the invention adopts a method based on a K-nearest-neighbor classification loss function and the classification margin. The method collects data center energy consumption data and the corresponding PUE values, grades the PUE values into classes, finds the classification margin for each sample, updates the feature weights and sorts them, and then obtains the feature selection result according to a set threshold. The method can extract the features related to data center energy efficiency and handles noisy data well, thereby improving the accuracy of subsequent energy efficiency prediction and effectively preventing overfitting.

Description

A method for selecting features related to energy efficiency in data centers

Technical field

The invention belongs to the fields of cloud computing and machine learning, and specifically relates to a method for selecting features related to data center energy efficiency.

Background

A data center is infrastructure that carries out large-scale critical computing tasks around the clock, and it is an important facility supporting the operation of the IT industry. As demand from the large-scale cloud services of network operators and Internet companies for data computing, processing, and storage keeps growing, large data centers with thousands or tens of thousands of servers have proliferated. In addition, the migration of high-performance computing to the cloud has advanced as network bandwidth has expanded, which increases the need to build large-scale computing infrastructure. As a result, data centers have become one of the key infrastructures of the fast-growing IT industry.

In recent years, optimizing the energy efficiency of data centers has become critical because of their high economic returns and environmental impact. First, data centers bring many economic benefits, so their size and number keep growing. With the sharp increase in electricity consumption and rising electricity prices, electricity bills have become a major expense for today's data centers. In some cases, a data center's electricity cost can exceed its original capital investment. Second, data center energy use causes many environmental problems, such as heavy electricity consumption, greenhouse gas emissions from air conditioners and other cooling equipment, and cooling water discharge. Moreover, servers consume a large amount of energy even when they are idle. For these reasons, energy efficiency must currently be given priority in data center operations.

The most commonly used metric for data center energy efficiency is Power Usage Effectiveness (PUE). It is defined as the total energy entering the data center divided by the energy used by the IT equipment. Total energy consumption comprises the energy used by the IT equipment plus any overhead consumed by equipment not used for computing or data communication (e.g., cooling and lighting). A PUE of 2.0 means that for every watt the facility supplies to IT equipment, the non-IT equipment consumes another watt. The ideal PUE is 1.0, the hypothetical case in which nothing but the IT equipment consumes energy. This cannot be achieved in practice, so advanced data centers strive for a PUE close to 1.0.
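As a minimal sketch of the definition above, the ratio can be computed as follows. The function name `pue` and the sample readings are illustrative assumptions, not values from the patent.

```python
def pue(total_facility_energy: float, it_equipment_energy: float) -> float:
    """Return PUE = total facility energy / IT equipment energy."""
    if it_equipment_energy <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_energy / it_equipment_energy


# Hypothetical readings: 1000 units reach the IT equipment and another
# 1000 units are spent on overhead such as cooling and lighting.
print(pue(2000.0, 1000.0))  # prints 2.0
```

Both arguments must cover the same measurement period and use the same unit, since only their ratio matters.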

Given the above, solving the energy efficiency prediction problem for data centers is urgent, and it has become a research hotspot in China and abroad. One of the core tasks in energy efficiency prediction is selecting the key attributes (features) related to data center energy efficiency. Most current energy efficiency prediction research is based on highly simplified data center models, for example models that use only server CPU frequency or performance counter metrics, so feature selection is relatively easy. Large-scale data centers, however, have numerous and complex features, and there is little research on selecting the relevant ones. The few existing models are mostly black-box models based on deep neural networks and have poor interpretability.

Summary of the invention

Purpose of the invention: In view of the inadequate feature selection in the prior art described above, the invention provides a method for selecting features related to data center energy efficiency that can be applied to any data center.

Technical solution: A method for selecting features related to data center energy efficiency, comprising the following steps:

(1) Collect data center energy consumption data and the corresponding PUE values;

(2) Grade the PUE values according to a grading standard;

(3) Randomly select a sample, find its K nearest neighbors, and compute the classification margin of that sample;

(4) Establish a feature selection evaluation criterion based on the classification loss and margin;

(5) Update the feature weights by optimizing the designed evaluation criterion with gradient descent;

(6) Sort the feature weights and obtain the feature selection result by setting a threshold.

Further, since the PUE obtained in step (1) is a continuous value, it must be converted into a discrete value through the grading standard. In step (2), the PUE values are graded according to the grading standard: the PUE level y_i ∈ {1, 2, 3} of each data record x_i is determined from the power usage effectiveness grading table. Here x_i denotes the n-dimensional feature vector of the i-th record and x_ij denotes its j-th real-valued feature, expressed as follows:

$$x_i = (x_{i1}, x_{i2}, \dots, x_{in}) \in \mathbb{R}^n$$

The specific steps of step (3), in which a sample is randomly selected, its K nearest neighbors are found, and its classification margin is computed, are as follows:

(31) Obtain the two-dimensional binary label correspondence matrix B and the target neighbor matrix T, where element b_ij ∈ {0,1} of B indicates whether the PUE levels y_i and y_j are the same, and element t_ij ∈ {0,1} of T indicates whether sample x_j is a target neighbor of x_i;

(32) Define the target neighbors of x_i as its K nearest neighbors that have the same PUE level as x_i, where K > 2;

(33) Select a sample x_i from the N samples without replacement, find nearhit(x_i), the nearest sample to x_i with the same PUE level, and nearmiss(x_i), the nearest sample to x_i with a different PUE level, and compute the classification margin θ_i, expressed as:

$$\theta_i = \left| \, \|x_i - \mathrm{nearmiss}(x_i)\|^2 - \|x_i - \mathrm{nearhit}(x_i)\|^2 \, \right|$$

Step (4) comprises taking the loss function L_s(w, x_i) of sample x_i under the feature weights w as the evaluation criterion for feature selection, defined as:

where c is a positive constant, usually obtained by cross-validation, and h is the hinge loss, expressed as:

$$[a]_+ = \max(a, 0)$$

where the weighted Euclidean distance is computed as:

Step (5) comprises computing the gradient of the loss function with respect to each feature f, finally obtaining the n-dimensional gradient vector over all features, and using it to update the feature weight vector w. The gradient of the loss function for feature f is computed as follows:

where the gradient of the hinge loss is defined as follows:

$$g(w_f) = 2 w_f \left( (x_{if} - x_{jf})^2 - (x_{if} - x_{pf})^2 \right)$$

The feature weight vector w is updated as follows:

Finally, steps (3) to (5) are repeated for the set number of iterations.

Step (6) comprises sorting the features by weight w and then setting a threshold to determine the final feature subset; all features in this subset are key features related to data center energy efficiency.

Beneficial effects: Compared with the prior art, the invention has the following notable effects. First, it only needs to compute the weights of n features and sort them, which has lower computational complexity than traditional methods and effectively prevents overfitting. Second, the K-nearest-neighbor-based algorithm it adopts handles the noisy data that may exist in cloud data centers better and improves the accuracy of the data.

Description of the drawings

Fig. 1 is a schematic diagram of the invention;

Fig. 2 is a schematic diagram of the PUE structure;

Fig. 3 is a diagram of the classification margin θ_i in the embodiment.

Detailed description

To describe the disclosed technical solution in detail, it is further explained below with reference to the accompanying drawings and a specific embodiment.

First, the variables used in the invention are introduced as follows:

Assume that N records of data center energy consumption data and their PUE values have been collected, expressed as:

$$\{(x_i, z_i)\}_{i=1}^{N}$$

where x_i denotes the n-dimensional feature vector of the i-th record and x_ij denotes the real-valued j-th feature of the i-th record, that is,

$$x_i = (x_{i1}, x_{i2}, \dots, x_{in}) \in \mathbb{R}^n$$

z_i denotes the PUE value of the i-th record. From z_i the corresponding PUE level y_i ∈ {1,2,3} is obtained, which gives a new sample set, expressed as:

$$\{(x_i, y_i)\}_{i=1}^{N}$$

w is the weight vector formed by the weights of the features in the original energy efficiency dataset; each feature weight is initialized to 1.

The target neighbors of x_i are defined as its K nearest neighbors that have the same PUE level as x_i, where K > 2.

All distances mentioned below are Euclidean distances.

The invention provides a feature selection method applicable to the prediction problems of all data centers. Its flow is shown in Fig. 1, and the specific steps are as follows:

Step 1: Collect N records of data center energy consumption data and their corresponding PUE values.

The collected data center energy consumption data have the following features: total server IT load; total core network room IT load; number of process water pumps running; average process water pump variable-frequency drive speed; number of condenser water pumps; average condenser water pump variable-frequency drive speed; number of cooling towers running; average cooling tower leaving water temperature setpoint; number of chillers running; number of dry coolers running; number of chilled water injection pumps running; average chilled water injection pump temperature setpoint; average heat exchanger temperature; outdoor air wet-bulb temperature; outdoor air dry-bulb temperature; outdoor air enthalpy; outdoor air relative humidity; outdoor wind speed; outdoor wind direction; and so on. Different data centers may collect different features depending on their equipment or layout.

Step 2: Obtain the PUE level y_i of each record x_i from the power usage effectiveness (PUE) grading table given in the DB11/T 1139-2014 standard. A schematic diagram of the PUE structure is shown in Fig. 2, and the PUE grading table is shown in Table 1.

Table 1. PUE grading table

Level     | Class I       | Class II        | Class III
PUE value | 1 < PUE ≤ 1.5 | 1.5 < PUE ≤ 1.8 | 1.8 < PUE ≤ 2.0

For example, if a collected data center record has a PUE value of 1.4, its PUE level is 1.
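The grading in Table 1 can be sketched as a small lookup. The function name `pue_level` is illustrative, and raising an error for values outside the table's range is an assumption, since the text does not say how such values are handled.

```python
def pue_level(pue: float) -> int:
    """Map a continuous PUE value to level 1, 2, or 3 according to Table 1."""
    if 1.0 < pue <= 1.5:
        return 1
    if 1.5 < pue <= 1.8:
        return 2
    if 1.8 < pue <= 2.0:
        return 3
    # Assumption: Table 1 only covers 1 < PUE <= 2.0, so other values are rejected.
    raise ValueError(f"PUE {pue} is outside the range covered by Table 1")


print(pue_level(1.4))  # prints 1, matching the example in the text
```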

Step 3: Randomly select a sample x_i, find its K nearest neighbors, and compute the classification margin θ_i of that sample. Specifically, first obtain the two-dimensional binary label correspondence matrix B and the target neighbor matrix T. Element b_ij ∈ {0,1} of B indicates whether the PUE levels y_i and y_j are the same: b_ij = 1 when samples x_i and x_j have the same PUE level, and b_ij = 0 otherwise. Element t_ij ∈ {0,1} of T indicates whether sample x_j is a target neighbor of x_i: t_ij = 1 when x_j is one of the K samples closest to x_i and the PUE levels y_j and y_i are the same, and t_ij = 0 otherwise.
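A minimal sketch of how B and T might be built is shown below. The helper names are illustrative. One reading of the definition is assumed here: the K target neighbors are the K nearest samples among those sharing the PUE level of x_i. The unweighted Euclidean distance is used, and a sample is not counted as its own neighbor.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def label_matrix(y):
    """B[i][j] = 1 if samples i and j share a PUE level, else 0."""
    n = len(y)
    return [[1 if y[i] == y[j] else 0 for j in range(n)] for i in range(n)]


def target_neighbor_matrix(X, y, k):
    """T[i][j] = 1 if x_j is one of the k nearest same-level samples of x_i."""
    n = len(X)
    T = [[0] * n for _ in range(n)]
    for i in range(n):
        # Assumption: neighbors are ranked among same-level samples only.
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        same.sort(key=lambda j: squared_distance(X[i], X[j]))
        for j in same[:k]:
            T[i][j] = 1
    return T


# Toy data: one feature, four level-1 samples and two level-2 samples.
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0]]
y = [1, 1, 1, 1, 2, 2]
B = label_matrix(y)
T = target_neighbor_matrix(X, y, k=3)
print(T[0])  # prints [0, 1, 1, 1, 0, 0]
```

Both matrices take a pairwise pass over the N samples, which matches the O(N²) cost the description attributes to them.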

Select a sample x_i from the N samples without replacement, find nearhit(x_i), the nearest sample to x_i with the same PUE level, and nearmiss(x_i), the nearest sample to x_i with a different PUE level, and compute the classification margin θ_i, as shown in Fig. 3. Expressed as a formula:

$$\theta_i = \left| \, \|x_i - \mathrm{nearmiss}(x_i)\|^2 - \|x_i - \mathrm{nearhit}(x_i)\|^2 \, \right|$$

The margin θ_i is the absolute value of the difference between the squared distance from x_i to nearmiss(x_i), the closest sample to x_i with a different PUE level, and the squared distance from x_i to nearhit(x_i), the closest sample to x_i with the same PUE level.
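The margin formula can be sketched directly from its definition. The function name `classification_margin` and the toy data are illustrative assumptions; distances are unweighted Euclidean, as in the formula.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def classification_margin(X, y, i):
    """theta_i = | ||x_i - nearmiss(x_i)||^2 - ||x_i - nearhit(x_i)||^2 |."""
    hits = [squared_distance(X[i], X[j])
            for j in range(len(X)) if j != i and y[j] == y[i]]
    misses = [squared_distance(X[i], X[j])
              for j in range(len(X)) if y[j] != y[i]]
    if not hits or not misses:
        raise ValueError("sample needs at least one same-level and one other-level sample")
    # min(hits) is the squared distance to nearhit, min(misses) to nearmiss.
    return abs(min(misses) - min(hits))


# Toy data: nearhit of x_0 is at squared distance 1, nearmiss at squared distance 9.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
y = [1, 1, 2]
print(classification_margin(X, y, 0))  # prints 8.0
```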

Step 4: Design a feature selection evaluation criterion based on the classification margin θ_i.

The loss function L_s(w, x_i) of sample x_i under the feature weights w is taken as the evaluation criterion for feature selection, defined as:

where c is a positive constant, usually obtained by cross-validation, and h is the hinge loss, expressed as

$$[a]_+ = \max(a, 0)$$

where the weighted distance is computed as

The first term of the loss function is the sum of the squared weighted distances between sample x_i and its K target neighbors; updating the feature weights minimizes the weighted distance to target neighbors that lie far from x_i. The second term considers, for every target neighbor of x_i, the squared weighted distance between that neighbor and x_i, plus the classification margin, minus the squared weighted distance between x_i and each of the K samples x_p that are closest to x_i but have a different PUE level. If this value is less than 0, x_p is far from x_i, farther than the weighted distance between x_i and its target neighbor plus the margin, so the hinge loss sets the value to 0. If the value is greater than 0, x_p is relatively close to x_i, and the term must be minimized by updating the weights so that the weighted distance between x_p and x_i becomes larger.
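The displayed formulas for the loss and the weighted distance are not reproduced in this text. One form that matches the two terms described above, and whose hinge argument differentiates to the expression g(w_f) = 2w_f((x_if - x_jf)² - (x_if - x_pf)²) stated for step (5), is sketched below. It is a reconstruction under assumptions, not the patent's own formula: the placement of the constant c, the use of squared weights w_f², and the set M_i (the K samples nearest to x_i with a different PUE level) are all assumed.

$$d_w^2(x_i, x_j) = \sum_{f=1}^{n} w_f^2 \, (x_{if} - x_{jf})^2$$

$$L_s(w, x_i) = \sum_{j} t_{ij} \, d_w^2(x_i, x_j) + c \sum_{j} \sum_{p \in M_i} t_{ij} \left[ d_w^2(x_i, x_j) + \theta_i - d_w^2(x_i, x_p) \right]_+$$

Under this form, differentiating the hinge argument with respect to w_f gives 2w_f((x_if - x_jf)² - (x_if - x_pf)²), which agrees term by term with g(w_f).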

The hinge loss used in the second term turns the evaluation function into a soft-margin criterion, which effectively reduces the influence of outliers. In addition, setting the number of target neighbors to K > 2 allows noisy data to be filtered well in nearest-neighbor classification.
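To illustrate the soft-margin behaviour, the sketch below evaluates a loss with the two terms described in the text. Its exact form is an assumption, because the patent's formula is not reproduced here: the weighted squared distance is taken as the sum over features of (w_f (a_f - b_f))², and the loss as the pull term plus c times the hinge term. The names `weighted_sq_dist`, `sample_loss`, `targets`, and `impostors` are illustrative.

```python
def weighted_sq_dist(w, a, b):
    """Squared weighted distance: sum_f (w_f * (a_f - b_f))^2 (assumed form)."""
    return sum((wf * (p - q)) ** 2 for wf, p, q in zip(w, a, b))


def sample_loss(w, X, i, targets, impostors, theta, c):
    """Assumed loss for sample i: pull term plus c times the hinge (push) term.

    targets:   indices of the target neighbors of x_i (same PUE level)
    impostors: indices of the nearest samples with a different PUE level
    """
    pull = sum(weighted_sq_dist(w, X[i], X[j]) for j in targets)
    push = 0.0
    for j in targets:
        d_ij = weighted_sq_dist(w, X[i], X[j])
        for p in impostors:
            # Hinge: zero once x_p is farther than the target neighbor plus the margin.
            push += max(d_ij + theta - weighted_sq_dist(w, X[i], X[p]), 0.0)
    return pull + c * push


# Toy data: x_1 is a target neighbor of x_0, x_2 has a different PUE level.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
far = sample_loss([1.0, 1.0], X, 0, targets=[1], impostors=[2], theta=1.0, c=0.5)
near = sample_loss([1.0, 0.0], X, 0, targets=[1], impostors=[2], theta=1.0, c=0.5)
print(far, near)  # prints 1.0 2.0
```

With weights [1, 1] the differently labelled sample is far enough away that the hinge term vanishes. Zeroing the second weight collapses that distance and the hinge term becomes active.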

Step 5: Update the feature weights w by optimizing the evaluation criterion with gradient descent.

Compute the gradient of the loss function with respect to each feature f, finally obtaining the n-dimensional gradient vector over all features, and use it to update the feature weight vector w. The gradient of the loss function for feature f is computed as follows:

Repeat steps 3, 4, and 5 for the set number of iterations.
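A single update might look like the sketch below. It relies on assumptions that the text does not confirm: the loss is the pull term plus c times the hinge term, the weighted squared distance uses w_f², and the update is plain gradient descent with a hypothetical `learning_rate` parameter. The hinge contribution uses the stated expression g(w_f) = 2w_f((x_if - x_jf)² - (x_if - x_pf)²).

```python
def weighted_sq_dist(w, a, b):
    """Squared weighted distance: sum_f (w_f * (a_f - b_f))^2 (assumed form)."""
    return sum((wf * (p - q)) ** 2 for wf, p, q in zip(w, a, b))


def loss_gradient(w, X, i, targets, impostors, theta, c):
    """Gradient of the assumed loss for sample i with respect to each weight w_f."""
    xi = X[i]
    grad = [0.0] * len(w)
    for j in targets:
        d_ij = weighted_sq_dist(w, xi, X[j])
        for f in range(len(w)):
            # Pull term: derivative of the squared weighted distance to a target neighbor.
            grad[f] += 2 * w[f] * (xi[f] - X[j][f]) ** 2
        for p in impostors:
            # The hinge contributes only while its argument is positive.
            if d_ij + theta - weighted_sq_dist(w, xi, X[p]) > 0:
                for f in range(len(w)):
                    g = 2 * w[f] * ((xi[f] - X[j][f]) ** 2 - (xi[f] - X[p][f]) ** 2)
                    grad[f] += c * g
    return grad


def gradient_step(w, grad, learning_rate):
    """Plain gradient descent update (the update rule itself is an assumption)."""
    return [wf - learning_rate * gf for wf, gf in zip(w, grad)]


# Toy data: x_1 shares the PUE level of x_0 but differs in feature 1;
# x_2 has a different level and differs only in feature 0.
X = [[0.0, 0.0], [0.0, 2.0], [1.0, 0.0]]
w = [1.0, 1.0]  # all weights start at 1, as in the description
grad = loss_gradient(w, X, i=0, targets=[1], impostors=[2], theta=1.0, c=0.5)
w_new = gradient_step(w, grad, learning_rate=0.05)
print(grad, w_new)
```

In this toy case the weight of feature 0, which separates the two PUE levels, increases, while the weight of feature 1, which only separates same-level samples, decreases.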

In the invention, computation time is dominated by computing the matrices B and T and the vector w, with time complexities O(N²), O(N²), and O(N²Kn), respectively. K is usually a small constant, so the total time complexity of the method is 2O(N²) + O(N²n) ≈ O(N²n), which is better than the O(N²n²) time complexity of traditional methods.

Step 6: Sort the features by weight w and then set a threshold to determine the final feature subset. All features in this subset are key features related to data center energy efficiency.

Suppose the features sorted by decreasing weight are: total server IT load; number of chillers running; number of cooling towers running; average cooling tower leaving water temperature setpoint; number of process water pumps running; average process water pump variable-frequency drive speed; number of dry coolers running; outdoor air wet-bulb temperature; outdoor air enthalpy; total core network room IT load; number of condenser water pumps; average condenser water pump variable-frequency drive speed; number of chilled water injection pumps running; average chilled water injection pump temperature setpoint; average heat exchanger temperature; outdoor air dry-bulb temperature; outdoor air relative humidity; outdoor wind speed; outdoor wind direction.

If the threshold is set to 14, that is, the 14 highest-weighted features are kept, the resulting feature subset is: total server IT load; number of chillers running; number of cooling towers running; average cooling tower leaving water temperature setpoint; number of process water pumps running; average process water pump variable-frequency drive speed; number of dry coolers running; outdoor air wet-bulb temperature; outdoor air enthalpy; total core network room IT load; number of condenser water pumps; average condenser water pump variable-frequency drive speed; number of chilled water injection pumps running; average chilled water injection pump temperature setpoint.
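The ranking and cut-off can be sketched as follows, treating the threshold as the number of top-ranked features to keep, as in the example with a threshold of 14. The function name `select_features` and the weights shown are made up for illustration and are not results from the patent.

```python
def select_features(names, weights, top_k):
    """Sort features by decreasing weight and keep the first top_k of them."""
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical learned weights for four of the collected features.
names = ["total server IT load", "outdoor wind speed",
         "number of chillers running", "outdoor wind direction"]
weights = [1.8, 0.3, 1.5, 0.1]
print(select_features(names, weights, top_k=2))
# prints ['total server IT load', 'number of chillers running']
```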

This feature subset can then be used as the input features for subsequent data center energy efficiency prediction.

Claims

1. A method for selecting features related to data center energy efficiency, characterized by comprising the following steps:

(1) Collect data center energy consumption data and the corresponding PUE values;

(2) Grade the PUE values according to a grading standard;

(3) Randomly select a sample, find its K nearest neighbors, and compute the classification margin of that sample;

(4) Establish a feature selection evaluation criterion based on the classification loss and margin;

(5) Update the feature weights by optimizing the designed evaluation criterion with gradient descent;

(6) Sort the feature weights and obtain the feature selection result by setting a threshold.

2. The method for selecting features related to data center energy efficiency according to claim 1, characterized in that: in step (2), the PUE values are graded according to the grading standard, and the PUE level y_i ∈ {1, 2, 3} corresponding to each data record x_i is determined from the power usage effectiveness grading table, where x_i denotes the n-dimensional feature vector of the i-th record and x_ij denotes its j-th real-valued feature, expressed as follows:

$$x_i = (x_{i1}, x_{i2}, \dots, x_{in}) \in \mathbb{R}^n$$

3. The method for selecting features related to data center energy efficiency according to claim 1, characterized in that the specific steps of randomly selecting a sample, finding its K nearest neighbors, and computing the classification margin of the sample in step (3) are as follows:

(31) Obtain the two-dimensional binary label correspondence matrix B and the target neighbor matrix T, where element b_ij ∈ {0, 1} of B indicates whether the PUE levels y_i and y_j are the same, and element t_ij ∈ {0, 1} of T indicates whether sample x_j is a target neighbor of x_i;

(32) Define the target neighbors of x_i as its K nearest neighbors that have the same PUE level as x_i, where K > 2;

(33) Select a sample x_i from the N samples without replacement, find nearhit(x_i), the nearest sample to x_i with the same PUE level, and nearmiss(x_i), the nearest sample to x_i with a different PUE level, and compute the classification margin θ_i, expressed as:

$$\theta_i = \left| \, \|x_i - \mathrm{nearmiss}(x_i)\|^2 - \|x_i - \mathrm{nearhit}(x_i)\|^2 \, \right|$$

4. The method for selecting features related to data center energy efficiency according to claim 1, characterized in that: step (4) comprises taking the loss function L_s(w, x_i) of sample x_i under the feature weights w as the evaluation criterion for feature selection, defined as:

where c is a positive constant, usually obtained by cross-validation, and h is the hinge loss, expressed as:

$$[a]_+ = \max(a, 0)$$

where the weighted Euclidean distance is computed as:

5. The method for selecting features related to data center energy efficiency according to claim 1, characterized in that: step (5) comprises computing the gradient of the loss function with respect to each feature f, finally obtaining the n-dimensional gradient vector over all features, and using it to update the feature weight vector w; the gradient of the loss function for feature f is computed as follows:

where the gradient of the hinge loss is defined as follows:

$$g(w_f) = 2 w_f \left( (x_{if} - x_{jf})^2 - (x_{if} - x_{pf})^2 \right)$$

The feature weight vector w is updated as follows:

Finally, steps (3) to (5) are repeated for the set number of iterations.

6. The method for selecting features related to data center energy efficiency according to claim 1, characterized in that: step (6) comprises sorting the features by weight w and then setting a threshold to determine the final feature subset, all features in the feature subset being key features related to data center energy efficiency.