CN105739929A

CN105739929A - Data center selection method for big data to migrate to cloud

Info

Publication number: CN105739929A
Application number: CN201610067866.3A
Authority: CN
Inventors: 张江涛; 黄荷姣; 王轩
Original assignee: Harbin Institute of Technology Shenzhen
Current assignee: Harbin Institute of Technology Shenzhen
Priority date: 2016-01-29
Filing date: 2016-01-29
Publication date: 2016-07-06
Anticipated expiration: 2036-01-29
Also published as: CN105739929B

Abstract

The invention proposes a method for selecting data centers (DCs) when big data (BD) is migrated to the cloud. First, because some DCs are unavailable owing to factors such as user preferences and legal restrictions, the infrastructure is modeled as an incomplete graph, and an activation level is used to describe each user's data generation. Four criteria are then defined: fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimizing data placement (TCMDP), and cost-minimizing data placement (CMDP). DCs are selected according to one of these criteria. The method addresses the requirements of moving BD to the cloud, studies the migration mechanism from the user's point of view, and can shorten data access delay and reduce data cost. It reflects both DC availability and user preferences, and it can migrate data automatically over the network at low cost and low delay, avoiding the shipment of hardware and facilitating automated management.

Description

How to choose a data center when migrating big data to the cloud

Technical Field

The invention relates to the field of cloud computing, and in particular to a method for selecting data centers when big data is migrated to the cloud.

Background

Cloud computing has become the platform of choice for big data (BD) analysis. This is especially true when data is generated at multiple geographically distributed locations, local users frequently need their local data, and the data must sometimes be consolidated for further analysis. For example, in a multinational sales company with subsidiaries around the world, the subsidiary in each country needs to analyze the data generated by its local users promptly for business purposes, while all of the data must also be aggregated and analyzed for reporting to headquarters or to support cross-border transactions. A large cloud is generally networked in a distributed manner and comprises multiple geographically distributed data centers (DCs); for example, Amazon has at least 11 DCs across 4 continents, and Google has at least 13 DCs across 4 continents. Each DC is provisioned with computing and storage resources offered on a pay-as-you-go basis. Such infrastructure can serve users from nearby locations and is particularly well suited to geographically distributed data.

Before BD can be processed in the cloud, it must be migrated to and stored in suitable DCs. Physically shipping hardware is one way to move large volumes of data. For example, the Amazon Import/Export service recommends shipping data on removable storage devices, and sometimes it is even possible to ship entire machines. However, this is suitable only for intermittent or one-off bulk transfers. It incurs long delays and cannot meet the growing demand for real-time data analysis. It also conflicts with the idea of automated management and requires more human labor, which is becoming increasingly expensive. Uploading data over the Internet is very expensive and is impractical because of the long delay: according to Amazon, transferring 1 TB of data over a 10 Mbps Internet connection takes roughly 13 days. For real-time data, a high-speed dedicated connection (such as AWS Direct Connect) is usually recommended, which speeds up the transfer. Even with high-speed dedicated connections, however, transferring data across continents remains difficult. For example, AWS Direct Connect does not offer an intercontinental service, and international leased lines are too expensive. This limits the ability to move large-scale data, which is often spread across the globe, to a single DC. Moreover, storing the data in a single DC increases the delay of local data analysis, which is the more frequent operation.

Furthermore, in some regions (for example, some countries in the European Union), data security laws require certain data to be stored locally. In short, users need to follow certain rules when choosing where to store their data. As Amazon suggests, a location may be chosen to be closer to users and so reduce data access delay, to meet specific legal and regulatory requirements, or to reduce cost.

Some MapReduce-based frameworks, such as G-Hadoop and G-MR, can already analyze data across clusters and DCs. Compared with a scheme that uses a single DC, a scheme that uses multiple DCs not only meets the need for consolidated analysis but also provides faster data access at lower cost.

Selecting multiple DCs when moving BD to the cloud is related to the facility location problem (FLP) and the k-median problem. The FLP aims to select facilities to serve customers according to various criteria; DCs can be regarded as the facilities and local data users as the customers. The k-median problem seeks at most k points such that, when each unselected point is assigned to a selected point, the sum of the edge lengths between the assigned pairs is minimized.

Among the variants of the FLP, the k-supplier problem requires selecting at most k suppliers (corresponding to DCs) from a given set so that the maximum distance between any customer and its nearest supplier is minimized. The supplier-customer network is generally modeled as a complete graph. In a generalized variant of the k-supplier problem, each supplier is assigned a weight and the total weight of the selected suppliers must not exceed k. However, regulations may prevent some DCs from serving certain data, so the graph is not always complete. Moreover, the data is associated with the users rather than with the DCs (suppliers).

The uncapacitated facility location (UFL) problem is another variant of the FLP, in which facilities have no capacity limits and each facility has a fixed opening cost (weight). The objective is to minimize the sum of the total fixed cost and the total service cost. The algorithms obtained so far all carry a constant violation factor, meaning that the number of facilities the algorithm requires is no less than some multiple of k. The k-median problem, in turn, does not consider facility weights, whereas data movement depends on the size of the data.

Summary of the Invention

To solve the problems in the prior art, the invention proposes a method for selecting data centers when big data is migrated to the cloud, achieving low-cost, high-speed data access when BD is moved to the cloud in distributed cloud computing.

The invention is realized through the following technical solution:

A method for selecting data centers when big data is migrated to the cloud comprises: constructing an underlying incomplete graph; using activation levels to describe the amount of data each user generates; defining four criteria, namely fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimizing data placement (TCMDP), and cost-minimizing data placement (CMDP); and selecting DCs according to one of these criteria.

In the incomplete graph G = (U, V, E), U is the set of users, V is the set of DCs, the edge lengths e_ij ∈ E (i ∈ U, j ∈ V) satisfy the triangle inequality, and k is a positive integer (k ≤ |U|, k ≤ |V|). For any i ∈ U and j ∈ V, an edge exists between them if the data of user i can be moved to DC j. The method aims to find a subset D ⊆ V of the available DCs, with |D| ≤ k, in which to store the data of all users in U according to the chosen criterion. In the formulas below, j denotes the DC in D to which user i is assigned.

The FDP criterion minimizes the largest distance between a user and its assigned DC, so that every local user can access its data with minimum delay:

$$\min_{D \subseteq V,\ |D| \le k}\ \max_{i \in U,\ j \in D} \left( e_{ij} \right)$$

The PDP criterion minimizes the largest weighted distance between a user and its assigned DC, so that local users with more data can access their data with minimum delay:

$$\min_{D \subseteq V,\ |D| \le k}\ \max_{i \in U,\ j \in D} \left( w_i e_{ij} \right)$$

The TCMDP criterion minimizes the sum of the weighted distances between all users and their assigned DCs:

$$\min_{D \subseteq V,\ |D| \le k} \left( \sum_{i \in U,\ j \in D} w_i e_{ij} \right)$$

The CMDP criterion minimizes the sum of the costs c_ij of all users:

$$\min_{D \subseteq V,\ |D| \le k} \left( \sum_{i \in U,\ j \in D} c_{ij} \right)$$

The beneficial effects of the invention are as follows. The method addresses the requirements of moving BD to the cloud and studies the migration mechanism from the user's point of view; it can shorten data access delay and reduce data cost. Four criteria are studied: fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimizing data placement (TCMDP), and cost-minimizing data placement (CMDP). The method reflects both the availability of DCs and the preferences of users. For the first two criteria, the algorithm is guaranteed to find a solution whose value is at most 3 times that of the optimal solution. The method can migrate data automatically over the network at low cost and low delay, avoiding the shipment of hardware and facilitating automated management.

Brief Description of the Drawings

Fig. 1 is an incomplete bipartite graph of distributed user data and data centers according to the invention;

Fig. 2 is a schematic diagram of a threshold graph.

Detailed Description

To make the objectives, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The invention considers the feasibility of moving geographically distributed big data to the cloud and studies the problem of selecting multiple target DCs. The underlying infrastructure is represented by an incomplete bipartite graph, which overcomes the limitation that existing formulations all assume a complete bipartite graph. This better matches real situations in which not every DC is available because of user preferences or legal restrictions.

Fig. 1 shows an incomplete bipartite graph formed by distributed user data and the DCs of a distributed cloud. Because users have preferences or are constrained by security laws, not every DC can be selected by every user. An edge indicates that the DC is available to the user or that the user does not rule it out.

The objective of the invention is faster data access and lower cost. This problem generalizes the traditional k-supplier, UFL, and k-median problems.

Consider the underlying incomplete graph G = (U, V, E), where U is the set of users, V is the set of DCs, the edge lengths e_ij ∈ E (i ∈ U, j ∈ V) satisfy the triangle inequality, and k is a positive integer (k ≤ |U|, k ≤ |V|). The invention aims to find a subset D ⊆ V of the available DCs, with |D| ≤ k, in which to store the data of all users in U according to different criteria. For any i ∈ U and j ∈ V, an edge exists between them if the data of user i can be moved to DC j (that is, the move is at least not prohibited by regulation and not excluded by the user). Every user i is assumed to be adjacent to at least one DC j; otherwise the problem has no solution. Let |E| = m, where m ≤ |U|·|V|.
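One way to hold this model in code is sketched below. The names `Edge`, `is_feasible`, `users`, `dcs`, and `edges`, and all numeric values, are illustrative assumptions and do not come from the patent. A missing (user, DC) key stands for a pair that regulation or user preference rules out.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (user, dc) pair; hypothetical naming


def is_feasible(users: Set[str], edges: Dict[Edge, float]) -> bool:
    """Return True when every user is adjacent to at least one DC."""
    covered = {user for (user, _dc) in edges}
    return users <= covered


# Illustrative data: edge lengths e_ij for the admissible pairs only.
users = {"u1", "u2", "u3"}
dcs = {"dc_a", "dc_b"}
edges = {
    ("u1", "dc_a"): 3.0,
    ("u1", "dc_b"): 5.0,
    ("u2", "dc_a"): 4.0,
    ("u3", "dc_b"): 2.0,  # ("u3", "dc_a") is absent: not admissible
}
```

Here m = |E| = 4, which respects the bound m ≤ |U|·|V| = 6, and `is_feasible(users, edges)` holds because each user keeps at least one admissible DC.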

User weights: each user i is assigned a weight w_i that represents the activation level of its data generation, now or in the foreseeable future, or the importance of the local user. w_i increases with data volume or importance. Using activation levels rather than raw data volumes tolerates dynamic changes in the data while still approximating the data volume reasonably well. The activation level can be derived from the daily upload volume. For example, for a typical company that uploads 200 GB per day, 10 GB can serve as the step between activation levels: a subsidiary that generates less than 10 GB per day is given weight 1, a subsidiary that generates 20-30 GB per day is given weight 3, and so on. For a user i with activation level w_i, moving its data to DC j costs w_i e_ij.
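The mapping from daily upload volume to activation level can be sketched as follows. The function names and the treatment of exact boundaries (10 GB falls into level 2) are assumptions; the text only fixes "below 10 GB gives 1" and "20-30 GB gives 3".

```python
def activation_level(daily_upload_gb: float, step_gb: float = 10.0) -> int:
    """Map a daily upload volume to an integer activation level w_i.

    Assumption: level r covers the half-open range [(r-1)*step, r*step).
    """
    if daily_upload_gb < 0:
        raise ValueError("upload volume must be non-negative")
    return int(daily_upload_gb // step_gb) + 1


def transfer_fee(level: int, edge_length: float) -> float:
    """Fee w_i * e_ij for moving the data of user i to DC j."""
    return level * edge_length
```

With these assumptions, a subsidiary uploading 5 GB per day gets level 1 and one uploading 25 GB per day gets level 3, matching the two cases given in the text.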

DC weights: each DC has its own prices for computing and storage resources, and lower prices are naturally preferred so that data can be stored and processed more economically. For a given DC j, suppose that a VM instance costs a_j per hour and that, on average, such an instance can analyze b_j GB of data per hour. The price of processing 10 GB of data is then p'_j = (10/b_j)·a_j. If storing 10 GB of data costs p''_j, the total DC-side cost for a user with activation level 1 is p_j = p'_j + p''_j. A user i with activation level w_i that stores and processes its data in DC j must therefore pay w_i p_j.

The total cost for a user with activation level w_i is w_i(p_j + e_ij). Because e_ij (for example, thousands of kilometers) and p_j (a few US dollars per hour at Amazon) differ by orders of magnitude in practice, the normalized form c_ij = w_i(p_j' + e_ij') is used, where p_j' = p_j / max_{h∈V}(p_h) and e_ij' = e_ij / max_{l∈U,h∈V}(e_lh). Here p_j' denotes the normalized price, not the processing price p'_j defined above. Note that the normalized edge lengths e_ij' still satisfy the triangle inequality.
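The price and cost definitions can be worked through numerically as in the sketch below. All names and numbers are illustrative assumptions. In particular, the maximum edge length is taken over the admissible edges only, since absent edges of the incomplete graph have no length.

```python
from typing import Dict, Tuple


def dc_price(a_j: float, b_j: float, storage_price: float,
             unit_gb: float = 10.0) -> float:
    """p_j = (unit_gb / b_j) * a_j + p''_j for one activation unit of data."""
    return unit_gb / b_j * a_j + storage_price


def normalized_costs(weights: Dict[str, int],
                     prices: Dict[str, float],
                     edges: Dict[Tuple[str, str], float]) -> Dict[Tuple[str, str], float]:
    """c_ij = w_i * (p_j / max p + e_ij / max e) for every admissible edge."""
    max_p = max(prices.values())
    max_e = max(edges.values())  # assumption: maximum over existing edges
    return {(i, j): weights[i] * (prices[j] / max_p + e / max_e)
            for (i, j), e in edges.items()}


weights = {"u1": 1, "u2": 3}
prices = {
    "dc_a": dc_price(a_j=2.0, b_j=20.0, storage_price=0.5),  # 1.5
    "dc_b": dc_price(a_j=4.0, b_j=10.0, storage_price=1.0),  # 5.0
}
edges = {("u1", "dc_a"): 1000.0, ("u1", "dc_b"): 4000.0, ("u2", "dc_b"): 2000.0}
costs = normalized_costs(weights, prices, edges)
```

After normalization both terms lie in (0, 1], so neither the distance in kilometers nor the price in dollars dominates the sum purely because of its unit.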

The objectives can then be stated as follows, where j denotes the DC in D to which user i is assigned:

a) Fair data placement (FDP). The largest distance between a user and its assigned DC is minimized, so that every local user can access its data with minimum delay:

$$\min_{D \subseteq V,\ |D| \le k}\ \max_{i \in U,\ j \in D} \left( e_{ij} \right)$$

b) Preferred data placement (PDP). The largest weighted distance between a user and its assigned DC is minimized, so that local users with more data can access their data with minimum delay:

$$\min_{D \subseteq V,\ |D| \le k}\ \max_{i \in U,\ j \in D} \left( w_i e_{ij} \right)$$

Where convenient, the weighted distance is also written w(i,j) = w_i e_ij.

c) Transmission-cost-minimizing data placement (TCMDP). The transmission cost, defined as the sum of the weighted distances between all users and their assigned DCs, is minimized:

$$\min_{D \subseteq V,\ |D| \le k} \left( \sum_{i \in U,\ j \in D} w_i e_{ij} \right)$$

d) Cost-minimizing data placement (CMDP). The total cost, defined as the sum of the costs c_ij of all users, is minimized:

$$\min_{D \subseteq V,\ |D| \le k} \left( \sum_{i \in U,\ j \in D} c_{ij} \right)$$

Because a) and c) are special cases of b) and d), respectively, only the algorithms for b) and d) are given below.
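For a fixed assignment of users to DCs, the four objective values can be evaluated as in the following sketch. The function name `objectives`, the dictionary layout, and the numbers are assumptions; the edge lengths and prices are taken to be already normalized.

```python
from typing import Dict, Tuple


def objectives(assign: Dict[str, str],
               weights: Dict[str, int],
               edges: Dict[Tuple[str, str], float],
               prices: Dict[str, float]) -> Dict[str, float]:
    """Evaluate FDP, PDP, TCMDP and CMDP for a given user -> DC assignment."""
    pairs = list(assign.items())
    return {
        "FDP": max(edges[(i, j)] for i, j in pairs),
        "PDP": max(weights[i] * edges[(i, j)] for i, j in pairs),
        "TCMDP": sum(weights[i] * edges[(i, j)] for i, j in pairs),
        "CMDP": sum(weights[i] * (prices[j] + edges[(i, j)]) for i, j in pairs),
    }


values = objectives(
    assign={"u1": "dc_a", "u2": "dc_b"},
    weights={"u1": 1, "u2": 3},
    edges={("u1", "dc_a"): 0.25, ("u2", "dc_b"): 0.5},
    prices={"dc_a": 0.3, "dc_b": 1.0},
)
```

Setting every weight to 1 makes the PDP value equal the FDP value, and setting every price to 0 makes the CMDP value equal the TCMDP value, which is the sense in which a) and c) are special cases of b) and d).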

Basic idea of the preferred data placement (PDP) algorithm

Several concepts used to describe the algorithm are introduced first.

1) Bottleneck graph: the optimal value of the PDP problem is always attained at some user-weighted edge, so the weighted edges are examined one by one in increasing order until all constraints are satisfied. The bottleneck graphs are constructed on this idea. The m user-weighted edges are sorted in non-decreasing order and written w(1,j) ≤ w(2,g) ≤ … ≤ w(m,h), where j, g, h ∈ V need not be distinct. The bottleneck graphs G_1, G_2, …, G_m are edge subgraphs of G, with G_r = (U, V, E_r) (r = 1, 2, …, m) and E_r = {e_ij | w_i e_ij ≤ w(r,g)}. That is, G_r consists of all vertices of G and all edges whose weighted length does not exceed the r-th smallest weighted edge w(r,g).

2) Threshold graph: for each G_r, the corresponding threshold graph T_r is constructed on the user set U as follows. For any two vertices u, v ∈ U, the edge (u, v) is in T_r if and only if there exists a DC j ∈ V such that both (u, j) and (v, j) are in G_r. For example, in Fig. 2 the edge (5, 6) does not exist because users 5 and 6 have no common adjacent DC in G_m.

3) Maximal clique: given an undirected graph H, a clique of H is a complete subgraph. A clique that is not contained in any other clique is called a maximal clique, denoted C(H). A maximal clique C(H) is easy to find in polynomial time. The simple method used here is as follows. First add an arbitrary vertex of H to C(H); then add those of its neighbors that, together with all vertices already in C(H), form a complete graph. The neighbors of all vertices in C(H) are checked one by one until no further vertex can be added, which yields a maximal clique. Next, delete all vertices of this maximal clique and repeat the process on the remaining vertices until H is empty. This yields a collection of pairwise disjoint maximal cliques.

Note that if a set of users can all be served by the same DC, the corresponding vertices of the threshold graph form a clique. For example, in Fig. 1 users 6-9 can be served by a common DC, so in T_m they form the clique containing 6-9 shown in Fig. 2. Finding maximal cliques in the threshold graph therefore amounts to grouping users by DC availability, so that the users of each group can be represented by one of its members.
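The three constructions above (bottleneck edge set, threshold graph, and greedy partition into maximal cliques) can be sketched as follows. Function names and sample data are assumptions; vertices are visited in sorted order only to make the result deterministic, whereas the text allows an arbitrary starting vertex.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (user, dc)


def bottleneck_edges(weights: Dict[str, int], edges: Dict[Edge, float],
                     limit: float) -> Set[Edge]:
    """E_r: edges whose weighted length w_i * e_ij is at most `limit`."""
    return {(i, j) for (i, j), e in edges.items() if weights[i] * e <= limit}


def threshold_graph(users: Set[str], edge_set: Set[Edge]) -> Dict[str, Set[str]]:
    """Adjacency on users: u ~ v iff they share a DC in edge_set."""
    dcs_of: Dict[str, Set[str]] = {u: set() for u in users}
    for i, j in edge_set:
        dcs_of[i].add(j)
    adj: Dict[str, Set[str]] = {u: set() for u in users}
    for u, v in combinations(sorted(users), 2):
        if dcs_of[u] & dcs_of[v]:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def clique_partition(adj: Dict[str, Set[str]]) -> List[Set[str]]:
    """Greedily split the vertices into pairwise disjoint maximal cliques."""
    remaining = set(adj)
    cliques: List[Set[str]] = []
    for start in sorted(adj):
        if start not in remaining:
            continue
        clique = {start}
        for cand in sorted(adj[start] & remaining):
            # Keep cand only if it is adjacent to every current member.
            if all(cand in adj[member] for member in clique):
                clique.add(cand)
        cliques.append(clique)
        remaining -= clique
    return cliques


users = {"u1", "u2", "u3", "u4"}
weights = {"u1": 1, "u2": 1, "u3": 1, "u4": 1}
edges = {("u1", "a"): 1.0, ("u2", "a"): 2.0, ("u2", "b"): 1.0,
         ("u3", "b"): 3.0, ("u4", "c"): 1.0}
e_r = bottleneck_edges(weights, edges, limit=2.0)
cliques = clique_partition(threshold_graph(users, e_r))
```

In this example u1 and u2 share DC `a` and end up in one clique, while u3 (whose only edge exceeds the limit) and u4 each form a singleton.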

4) Maximum-weight-first (BWF) algorithm

Input: G = (U, V, E), a user-weighted incomplete bipartite graph; k, a positive integer

Output: D, the set of selected DCs

Sort the user-weighted edges in non-decreasing order: w(1,j) ≤ w(2,g) ≤ … ≤ w(m,h)

For each G_r, construct the corresponding threshold graph T_r. For each T_r, find its set of maximal cliques. Suppose there are H cliques, and denote the set of maximal cliques of T_r by C_r = C(T_r)_h, where r = 1, 2, …, m and h = 1, …, H

Let t = min{r | |C_r| ≤ k}

For {h = 1, …, H}

{

Let l be an unassigned vertex of maximum weight in C(T_t)_h, that is, w_l = max{w_j | j ∈ C(T_t)_h, j not yet assigned}. Let N be the set of DCs neighboring C(T_t)_h.

Let e = min_{j∈N}{e_lj | e_lj ∈ G_t}. Let v be a vertex in N with e_lv = e. D = D ∪ {v}

}

It can be proved that the BWF algorithm gives a 3-approximation for the PDP problem on an incomplete underlying bipartite graph (that is, the value of the solution obtained is at most 3 times the optimal value). Moreover, this bound is tight (that is, no approximation ratio smaller than 3 is possible).

When the underlying graph is a complete bipartite graph, independent sets can be used instead of maximal cliques; the algorithm is then faster and the solution quality is the same. The BWF algorithm tries to place the largest volumes of data in the nearest DCs. It examines the bottleneck graphs one by one in increasing order of user-weighted edge and constructs the corresponding threshold graphs. It then finds the clique set of each threshold graph. Each clique represents a group of users that can be served by at least one common DC, with the weighted edges between the users and the common DC no larger than the largest weighted edge of the bottleneck graph. Among all clique sets containing at most k cliques, the one with the smallest index is selected, so that the largest user-weighted edge is as small as possible. Binary search can of course be used to speed this up further.

For the users in the clique set C_t of the threshold graph, the target DCs are found as follows. In each clique, the user with the largest weight is assigned to its nearest neighboring DC in G_t, and all other users in the same clique are implicitly assigned to that DC as well. This is repeated until all users in C_t have been assigned (the for loop). The algorithm thus establishes a mapping between users and DCs that partitions the bipartite graph into clusters, each centered on a DC.
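A minimal end-to-end sketch of the BWF steps is given below; it is not the patented implementation. Names and data are assumptions, ties are broken by sorted order, and the scan is linear rather than a binary search. One safeguard is added that the text does not state: a bottleneck graph is skipped while some user has no admissible DC in it, because such a user could not be assigned.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str]  # (user, dc)


def greedy_cliques(users: Set[str], edge_set: Set[Edge]) -> List[Set[str]]:
    """Disjoint maximal cliques of the threshold graph induced by edge_set."""
    dcs_of = {u: {j for (i, j) in edge_set if i == u} for u in users}
    adj = {u: {v for v in users if v != u and dcs_of[u] & dcs_of[v]}
           for u in users}
    remaining, cliques = set(users), []
    for start in sorted(users):
        if start not in remaining:
            continue
        clique = {start}
        for cand in sorted(adj[start] & remaining):
            if all(cand in adj[member] for member in clique):
                clique.add(cand)
        cliques.append(clique)
        remaining -= clique
    return cliques


def bwf(users: Set[str], weights: Dict[str, int], edges: Dict[Edge, float],
        k: int) -> Optional[Tuple[Set[str], Dict[str, str]]]:
    """Return (D, user -> DC) or None when no bottleneck graph qualifies."""
    for limit in sorted({weights[i] * e for (i, _j), e in edges.items()}):
        edge_set = {(i, j) for (i, j), e in edges.items()
                    if weights[i] * e <= limit}
        if {i for (i, _j) in edge_set} != set(users):
            continue  # added safeguard: some user has no DC in this G_r
        cliques = greedy_cliques(users, edge_set)
        if len(cliques) <= k:
            break  # smallest index t with at most k cliques
    else:
        return None
    selected: Set[str] = set()
    assignment: Dict[str, str] = {}
    for clique in cliques:
        rep = max(sorted(clique), key=lambda u: weights[u])  # heaviest user
        options = sorted(j for (i, j) in edge_set if i == rep)
        dc = min(options, key=lambda j: edges[(rep, j)])  # nearest DC in G_t
        selected.add(dc)
        for user in clique:
            assignment[user] = dc  # implicit assignment of the whole clique
    return selected, assignment


users = {"u1", "u2", "u3", "u4"}
weights = {"u1": 1, "u2": 3, "u3": 1, "u4": 2}
edges = {("u1", "a"): 2.0, ("u2", "a"): 1.0, ("u2", "b"): 4.0,
         ("u3", "b"): 1.0, ("u4", "b"): 2.0}
result = bwf(users, weights, edges, k=2)
```

In this example the scan stops at weighted length 4, where the threshold graph splits into the cliques {u1, u2} and {u3, u4}; the heaviest users u2 and u4 pick DCs `a` and `b`. With k = 1 no bottleneck graph yields a single clique, so the sketch returns `None`.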

Basic idea of the cost-minimizing data placement (CMDP) algorithm

For the CMDP problem, the invention provides the NPA-CMDP algorithm, which first tries to assign every user to its nearest DC (that is, the DC with the smallest cost to the user). If the number of DCs used exceeds k, one DC is removed and its users are reassigned. This is repeated until the constraint is satisfied.

Algorithm 2. CMDP nearest-first algorithm (NPA-CMDP).

Output: D, the set of selected DCs

Assign each user to its nearest DC, where the weighted distance (cost) between user i and DC j is defined as w_i e_ij + w_i p_j. Denote the set of selected DCs by D

While {|D| > k}

{

Find the DC in D whose weighted distance to the users assigned to it is the smallest among all DCs in D. Remove this DC from D,

and reassign the users originally assigned to this DC to the nearest remaining DC in D

}
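The NPA-CMDP loop can be sketched as follows. Names and data are assumptions. Two interpretive choices are also assumptions: the "weighted distance between a DC and its users" is taken as the sum of the costs of the users currently assigned to that DC, and an error is raised if a displaced user has no admissible DC left in D, a case the text does not address.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (user, dc)


def npa_cmdp(weights: Dict[str, int], prices: Dict[str, float],
             edges: Dict[Edge, float], k: int) -> Tuple[Set[str], Dict[str, str]]:
    """Nearest-first assignment, then drop DCs until at most k remain."""

    def cost(i: str, j: str) -> float:
        return weights[i] * (edges[(i, j)] + prices[j])  # w_i e_ij + w_i p_j

    def nearest(i: str, allowed: Iterable[str]) -> str:
        options = sorted(j for j in allowed if (i, j) in edges)
        if not options:
            raise ValueError(f"user {i} has no admissible DC left")
        return min(options, key=lambda j: cost(i, j))

    assign = {i: nearest(i, prices) for i in weights}
    selected = set(assign.values())
    while len(selected) > k:
        # Assumption: a DC's weighted distance is the total cost of its users.
        load = {j: sum(cost(i, j) for i, d in assign.items() if d == j)
                for j in selected}
        drop = min(sorted(selected), key=load.get)
        selected.remove(drop)
        for i, d in list(assign.items()):
            if d == drop:
                assign[i] = nearest(i, selected)
    return selected, assign


weights = {"u1": 1, "u2": 1, "u3": 2}
prices = {"a": 0.1, "b": 0.2, "c": 0.3}
edges = {("u1", "a"): 0.2, ("u1", "b"): 0.5, ("u2", "a"): 0.6,
         ("u2", "b"): 0.2, ("u3", "b"): 0.4, ("u3", "c"): 0.1}
result = npa_cmdp(weights, prices, edges, k=2)
```

Initially u1, u2, and u3 choose `a`, `b`, and `c`. Since three DCs exceed k = 2, the DC with the smallest total cost (`a`) is removed and u1 moves to `b`, its only remaining admissible DC.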

The invention considers the feasibility of moving geographically distributed big data to the cloud and studies the problem of selecting multiple target DCs. The underlying infrastructure is represented by an incomplete bipartite graph, which overcomes the limitation that existing formulations all assume a complete bipartite graph. This better matches real situations in which not every DC is available because of user preferences or legal restrictions. Four criteria are studied: fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimizing data placement (TCMDP), and cost-minimizing data placement (CMDP). In practice, one of them can be chosen according to the needs of operators and users.

The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. Persons of ordinary skill in the art to which the invention belongs may make a number of simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the scope of protection of the invention.

Claims

1. A method for selecting a data center when big data migrates to the cloud, characterized in that the method comprises:

constructing an underlying incomplete graph, using an activation level w_i to describe each user's data generation, defining four criteria, namely fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimizing data placement (TCMDP), and cost-minimizing data placement (CMDP), and selecting DCs according to one of these criteria;

wherein, in the incomplete graph G = (U, V, E), U represents the users, V represents the DCs, the edge lengths e_ij ∈ E (i ∈ U, j ∈ V) satisfy the triangle inequality, and k is a positive integer (k ≤ |U|, k ≤ |V|); for any i ∈ U and j ∈ V, an edge exists between them if the data of user i can be moved to DC j; the method aims to find a subset D (|D| ≤ k) of the available DC set V in which to store the data of all users in U according to different criteria;

the FDP criterion is: the largest distance between a user and its assigned DC is minimized, so that every local user can access its data with minimum delay:

$$\min_{D \subseteq V,\ |D| \le k}\ \max_{i \in U,\ j \in D} \left( e_{ij} \right);$$

the PDP criterion is: the largest weighted distance between a user and its assigned DC is minimized, so that local users with more data can access their data with minimum delay:

$$\min_{D \subseteq V,\ |D| \le k}\ \max_{i \in U,\ j \in D} \left( w_i e_{ij} \right);$$

the TCMDP criterion is: the sum of the weighted distances between all users and their assigned DCs is minimized:

$$\min_{D \subseteq V,\ |D| \le k} \left( \sum_{i \in U,\ j \in D} w_i e_{ij} \right);$$

the CMDP criterion is: the sum of the costs of all users is minimized:

$$\min_{D \subseteq V,\ |D| \le k} \left( \sum_{i \in U,\ j \in D} c_{ij} \right),$$

where c_ij is the total cost of a user with activation level w_i.

2. The method according to claim 1, characterized in that the incomplete graph is a weighted incomplete graph whose weighted edge length is w_i e_ij, where w_i is the activation level of the user and w_i e_ij is the fee required to move the data to DC j.

3. The method according to claim 1, wherein the activation level w_i is determined according to the amount of data uploaded per day.

4. The method according to claim 1, characterized in that a user i with activation level w_i that wants to store and process its data in DC j needs to pay w_i p_j, where p_j = p'_j + p''_j, p'_j is the price the DC charges for processing a unit volume of data, and p''_j is the price the DC charges for storing a unit volume of data.

5. The method according to claim 1, characterized in that the selection of DCs based on the PDP criterion adopts the maximum-weight-first (BWF) algorithm, and the BWF algorithm comprises the following steps:

sorting the user-weighted edges in non-decreasing order, written w(1,j) ≤ w(2,g) ≤ … ≤ w(m,h), where j, g, h ∈ V; the bottleneck graphs G_1, G_2, …, G_m are edge subgraphs of G, with G_r = (U, V, E_r) (r = 1, 2, …, m) and E_r = {e_ij | w_i e_ij ≤ w(r,g)};

constructing the corresponding threshold graph T_r for each G_r, wherein for any two vertices u, v ∈ U the edge (u, v) is in T_r if and only if there exists a DC j ∈ V such that (u, j) and (v, j) are both in G_r;

finding the set of maximal cliques of each T_r, a clique being called maximal if it is not contained in any other clique; supposing there are H cliques, denoting the set of maximal cliques of T_r by C_r = C(T_r)_h, where r = 1, 2, …, m and h = 1, …, H, and letting t = min{r | |C_r| ≤ k};

for (h = 1, …, H)

{ letting l be an unassigned vertex of maximum weight, w_l = max{w_j | j ∈ C(T_t)_h, j not yet assigned};

letting N be the set of DCs neighboring C(T_t)_h, and e = min_{j∈N}{e_lj | e_lj ∈ G_t};

letting v be a vertex in N with e_lv = e, and D = D ∪ {v} }.

6. The method according to claim 5, characterized in that: when the underlying graph is a complete bipartite graph, independent sets are used instead of maximal cliques to accelerate the BWF algorithm.

7. The method according to claim 1, characterized in that the selection of DCs based on the CMDP criterion adopts the CMDP nearest-first (NPA-CMDP) algorithm, and the NPA-CMDP algorithm comprises the steps of:

assigning each user to its nearest DC, the weighted distance between user i and DC j being defined as w_i e_ij + w_i p_j, and denoting the set of selected DCs by D;

while {|D| > k}

{ finding the DC in D whose weighted distance to the users assigned to it is the smallest among all DCs in D;

removing this DC from D;

and reassigning the users originally assigned to this DC to the nearest remaining DC in D }.