CN105739929A - Data center selection method for big data to migrate to cloud


Info

Publication number: CN105739929A
Authority: CN (China)
Prior art keywords: data, user, graph, CMDP, users
Legal status: Granted; Expired - Fee Related (the legal status is an assumption and is not a legal conclusion)
Application number: CN201610067866.3A
Other languages: Chinese (zh)
Other versions: CN105739929B (en)
Inventors: 张江涛, 黄荷姣, 王轩
Current assignee: Harbin Institute of Technology Shenzhen (the listed assignees may be inaccurate)
Original assignee: Harbin Institute of Technology Shenzhen
Application filed by: Harbin Institute of Technology Shenzhen
Priority: CN201610067866.3A
Publication of CN105739929A: CN105739929A
Publication of CN105739929B (granted): CN105739929B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 - Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention proposes a method for selecting data centers (DCs) when big data (BD) is migrated to the cloud. Because user preferences and legal restrictions can make some DCs unavailable to some users, the underlying network is modeled as an incomplete graph, and each user's data generation volume is described by an activation level. Four criteria are defined: fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimized data placement (TCMDP), and cost-minimized data placement (CMDP). DCs are then selected according to one of these criteria. The method addresses the requirements of moving BD to the cloud from the user's perspective, shortening data access latency and reducing data cost. It reflects both DC availability and user preferences. It migrates data automatically over the network at low cost and low latency, avoids shipping hardware, and thus supports automated management.

Description

Data center selection method for migrating big data to the cloud

Technical Field

The invention relates to the field of cloud computing, and in particular to a method for selecting data centers when big data is migrated to the cloud.

Background

Cloud computing has become the platform of choice for big data (BD) analysis. This is especially true when data is generated at multiple geographically distributed locations, local users frequently need the local data, and the data sometimes has to be integrated for further analysis. For example, in a multinational sales company with subsidiaries around the world, the subsidiary in each country needs to analyze the data generated by its local users in a timely manner for business purposes, while all the data must also be aggregated and analyzed to report to headquarters or to support cross-border transactions. A large cloud is typically networked in a distributed manner and comprises multiple geographically distributed data centers (DCs); for example, Amazon has at least 11 DCs across 4 continents, and Google has at least 13 DCs across 4 continents. Each DC provides computing and storage resources on a pay-as-you-go basis. Such infrastructure can serve users from a nearby location and is well suited to geographically distributed data.

To process BD in the cloud, the data must first be migrated to and stored in suitable DCs. Shipping hardware is one option for moving large-scale data. For example, the Amazon Import/Export service recommends shipping data on removable storage devices; sometimes it is even possible to ship entire machines. However, this is only suitable for intermittent or one-time bulk transfers. It incurs a long delay and cannot meet the growing demand for real-time data analysis. It also conflicts with the idea of automated management and requires increasingly expensive manual labor. Uploading data over the Internet is very expensive and, because of the long delay, impractical: according to Amazon, transferring 1 TB of data over a 10 Mbps Internet connection takes roughly 13 days. Real-time data is usually recommended to be sent over a high-speed dedicated connection such as AWS Direct Connect, which speeds up the transfer. Even with high-speed dedicated connections, however, transferring data across continents remains difficult: AWS Direct Connect, for example, does not provide intercontinental service, and international leased lines are too expensive. This makes it hard to move large-scale, globally distributed data into a single DC. Moreover, storing all data in a single DC increases the latency of local data analysis, which is the more frequent operation.

In some regions in particular, data security laws require certain data to be stored locally (for example, in some EU countries). In short, users need to follow some rules when choosing where to store their data, as Amazon suggests: stay close to users to reduce data access latency, meet specific legal and regulatory requirements, or reduce cost.

Currently, some MapReduce-based frameworks, such as G-Hadoop and G-MR, already support data analysis across clusters and DCs. Compared with a single-DC scheme, a multi-DC scheme not only meets the need for integrated analysis but also provides faster data access at lower cost.

Selecting multiple DCs when moving BD to the cloud is related to the facility location problem (FLP) and the k-median problem. FLP aims to select facilities to serve customers under various criteria; DCs can be regarded as facilities and local data users as customers. The k-median problem seeks at most k points such that, when every unselected point is assigned to a selected point, the sum of the edge lengths between the assigned pairs is minimized.

Among the variants of FLP, the k-supplier problem selects at most k suppliers (corresponding to DCs) from a given set so that the maximum distance between any customer and its nearest supplier is minimized. The supplier-customer network is usually modeled as a complete graph. In a generalized variant of the k-supplier problem, each supplier is assigned a weight and the total weight of the selected suppliers must not exceed k. However, regulations may prevent some DCs from serving certain data, so the graph is not always complete. Moreover, the data is associated with users rather than with DCs (suppliers).

The uncapacitated facility location (UFL) problem is another variant of FLP, in which facilities have no capacity limit and each facility has a fixed opening cost. The objective is to minimize the sum of the total fixed cost and the total service cost. The algorithms obtained so far all violate the constraint by a constant factor, that is, the number of facilities they require is no less than some multiple of k. The k-median problem, in turn, does not consider facility weights, whereas data movement depends on the size of the data.

Summary of the Invention

To solve the problems in the prior art, the invention proposes a method for selecting data centers when big data is migrated to the cloud, achieving low-cost, high-rate data access when BD is moved to the cloud in distributed cloud computing.

The invention is realized through the following technical solution:

A method for selecting data centers when big data is migrated to the cloud, the method comprising: constructing an underlying incomplete graph; describing each user's data generation volume by an activation level; defining four criteria, namely fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimized data placement (TCMDP), and cost-minimized data placement (CMDP); and selecting DCs based on one of these criteria. In the incomplete graph G = (U, V, E), U denotes the users and V the DCs; the edge lengths e_ij ∈ E (i ∈ U, j ∈ V) satisfy the triangle inequality; k is a positive integer (k ≤ |U|, k ≤ |V|); and for any i ∈ U and j ∈ V, an edge exists between them if the data of user i can be moved to DC j. The method aims to find a subset D of the available DC set V (|D| ≤ k) to store the data of all users in U according to the different criteria. The FDP criterion minimizes the maximum distance between a user and its assigned DC, so that every local user can access data with minimal delay: $\min_{D \subseteq V,\, |D| \le k} \max_{i \in U,\, j \in D} e_{ij}$. The PDP criterion minimizes the maximum weighted distance between a user and its assigned DC, so that local users with more data can access data with minimal delay: $\min_{D \subseteq V,\, |D| \le k} \max_{i \in U,\, j \in D} (w_i e_{ij})$. The TCMDP criterion minimizes the sum of the weighted distances between all users and their assigned DCs: $\min_{D \subseteq V,\, |D| \le k} \left( \sum_{i \in U,\, j \in D} w_i e_{ij} \right)$. The CMDP criterion minimizes the sum of the costs of all users: $\min_{D \subseteq V,\, |D| \le k} \left( \sum_{i \in U,\, j \in D} c_{ij} \right)$.

Beneficial effects of the invention: the proposed method addresses the requirements of moving BD to the cloud and studies the migration mechanism from the user's perspective, which shortens data access latency and reduces data cost. The invention studies four criteria: fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimized data placement (TCMDP), and cost-minimized data placement (CMDP). The method reflects both DC availability and user preferences. For the first two criteria, the algorithm guarantees a solution whose value is at most 3 times that of the optimal solution. The method migrates data automatically over the network at low cost and low latency, avoids shipping hardware, and thus supports automated management.

Brief Description of the Drawings

Fig. 1 is the incomplete bipartite graph of distributed user data and data centers according to the invention;

Fig. 2 is a schematic diagram of a threshold graph.

Detailed Description

To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it.

The invention considers the feasibility of moving geographically distributed big data to the cloud and studies the problem of selecting multiple target DCs. It represents the underlying infrastructure as an incomplete bipartite graph, overcoming the limitation that existing problem formulations all assume a complete bipartite graph. This better matches practice, where user preferences or legal restrictions mean that not every DC is available to every user.

Fig. 1 models the incomplete bipartite graph formed by distributed user data and the DCs of a distributed cloud. Because users have preferences or are constrained by security laws, not every DC can be selected by every user. An edge indicates that the DC is available to the user, or that the user has not excluded it.

The purpose of the invention is faster data access and lower cost. The problem generalizes the traditional k-supplier, UFL, and k-median problems.

Consider the underlying incomplete graph G = (U, V, E), where U denotes the users, V denotes the DCs, the edge lengths e_ij ∈ E (i ∈ U, j ∈ V) satisfy the triangle inequality, and k is a positive integer (k ≤ |U|, k ≤ |V|). The invention aims to find a subset D of the available DC set V (|D| ≤ k) to store the data of all users in U according to different criteria. For any i ∈ U and j ∈ V, an edge exists between them if the data of user i can be moved to DC j (that is, it is at least neither restricted by regulations nor excluded by the user). Every user i is assumed to be adjacent to at least one DC j; otherwise the problem has no solution. Let |E| = m, where m ≤ |U|·|V|.
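The graph model above can be sketched in code as follows. This is only an illustration under stated assumptions: the function name `isolated_users` and the dictionary representation (a missing `(user, dc)` key means the DC is unavailable to that user) are hypothetical and do not come from the patent text.

```python
def isolated_users(users, dcs, edges, k):
    """Return the users that are adjacent to no DC (the problem is unsolvable if any exist).

    users, dcs: lists of hashable identifiers (U and V).
    edges: dict mapping (user, dc) -> distance e_ij; a missing pair means that
           the DC is unavailable to that user (assumed representation).
    k: maximum number of DCs that may be selected.
    """
    # k must be a positive integer with k <= |U| and k <= |V|.
    if not (1 <= k <= len(users) and k <= len(dcs)):
        raise ValueError("k must satisfy 1 <= k <= |U| and k <= |V|")
    # Every edge must connect a known user to a known DC.
    for (u, j) in edges:
        if u not in users or j not in dcs:
            raise ValueError(f"edge {(u, j)!r} refers to an unknown vertex")
    # m = |E| can never exceed |U| * |V| in a bipartite graph.
    assert len(edges) <= len(users) * len(dcs)
    return [u for u in users if not any((u, j) in edges for j in dcs)]
```

An empty result means the precondition "every user is adjacent to at least one DC" holds.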

User weight: each user i is assigned a weight w_i that represents the activation level of its current or foreseeable data generation, or the importance of the local user. w_i increases with data volume or importance. Using an activation level instead of the raw data volume tolerates dynamic changes in the data while still approximating the volume reasonably well. The activation level can be set according to the amount of data uploaded per day. For example, for a typical company that uploads 200 GB per day, 10 GB can serve as the step threshold between activation levels: a subsidiary that generates less than 10 GB per day gets weight 1, one that generates between 20 and 30 GB gets weight 3, and so on. For a user i with activation level w_i, moving its data to DC j costs w_i e_ij.
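The mapping from daily upload volume to activation level can be sketched as below. The function name `activation_level` and the handling of exact boundary values (for instance, exactly 10 GB falling into level 2) are assumptions; the text only gives the examples "less than 10 GB is level 1" and "20-30 GB is level 3".

```python
import math


def activation_level(daily_upload_gb, step_gb=10.0):
    """Map a daily upload volume in GB to an integer activation level w_i.

    Each full step of `step_gb` raises the level by one, so [0, 10) -> 1,
    [10, 20) -> 2, [20, 30) -> 3, and so on (boundary handling is assumed).
    """
    if daily_upload_gb < 0 or step_gb <= 0:
        raise ValueError("volume must be non-negative and step must be positive")
    return math.floor(daily_upload_gb / step_gb) + 1
```

With these levels, moving the data of a level-3 user over an edge of length 1200 would cost 3 × 1200 = 3600 in the text's w_i e_ij units.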

DC weight: each DC has its own prices for computing and storage resources, and lower prices are naturally preferred for storing and processing data economically. For a given DC j, suppose a VM instance costs a_j per hour and, on average, can analyze b_j GB of data per hour. Processing 10 GB of data then costs p'_j = (10 / b_j) · a_j. If storing 10 GB of data costs p''_j, the total DC-side cost for a user with activation level 1 is p_j = p'_j + p''_j. A user i with activation level w_i that wants to store and process its data in DC j has to pay w_i p_j.
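As a worked illustration of the DC weight, the sketch below computes p_j from the hourly VM price a_j, the throughput b_j, and the storage price p''_j. The function and parameter names are hypothetical; only the formula p_j = (10 / b_j) · a_j + p''_j is taken from the text.

```python
def dc_unit_price(hourly_vm_price, gb_per_hour, storage_price, unit_gb=10.0):
    """Return p_j, the DC-side price for one activation level (10 GB by default).

    hourly_vm_price: a_j, price of one VM instance per hour.
    gb_per_hour:     b_j, data volume the instance analyzes per hour.
    storage_price:   p''_j, price of storing `unit_gb` of data.
    """
    if gb_per_hour <= 0:
        raise ValueError("throughput b_j must be positive")
    processing_price = unit_gb / gb_per_hour * hourly_vm_price  # p'_j
    return processing_price + storage_price                     # p_j = p'_j + p''_j


# A user with activation level w_i = 3 pays w_i * p_j at this DC.
level_3_payment = 3 * dc_unit_price(2.0, 5.0, 0.5)
```

Here a VM that costs 2.0 per hour and analyzes 5 GB per hour needs two hours for 10 GB, so p'_j = 4.0 and p_j = 4.5.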

The total cost for a user with activation level w_i is w_i (p_j + e_ij). Because e_ij (e.g., thousands of kilometers) and p_j (a few dollars per hour at Amazon) differ by orders of magnitude in practice, the normalized form is used: c_ij = w_i (p_j' + e_ij'), where p_j' = p_j / max_{h∈V} p_h and e_ij' = e_ij / max_{l∈U, h∈V} e_lh. Note that the normalized edge lengths e_ij' still satisfy the triangle inequality.
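The normalization can be sketched as follows, assuming the same dictionary representation as elsewhere in these examples (`edges[(user, dc)]` holds e_ij, `prices[dc]` holds p_j, `weights[user]` holds w_i). The function name `normalized_costs` is illustrative only.

```python
def normalized_costs(edges, weights, prices):
    """Return a dict (user, dc) -> c_ij = w_i * (p_j / max p + e_ij / max e)."""
    max_distance = max(edges.values())   # max over all existing edges e_lh
    max_price = max(prices.values())     # max over all DC prices p_h
    return {
        (u, j): weights[u] * (prices[j] / max_price + d / max_distance)
        for (u, j), d in edges.items()
    }


costs = normalized_costs(
    edges={("u", "d1"): 1000.0, ("u", "d2"): 4000.0},
    weights={"u": 2},
    prices={"d1": 4.0, "d2": 2.0},
)
```

In this toy instance the nearer but more expensive DC d1 ends up cheaper overall (2.5 versus 3.0), which shows why both terms need to be on a comparable scale.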

The objectives can then be stated as follows:

a) Fair data placement (FDP). Minimize the maximum distance between a user and its assigned DC, so that every local user can access data with minimal delay: $\min_{D \subseteq V,\, |D| \le k} \max_{i \in U,\, j \in D} e_{ij}$

b) Preferred data placement (PDP). Minimize the maximum weighted distance between a user and its assigned DC, so that local users with more data can access data with minimal delay: $\min_{D \subseteq V,\, |D| \le k} \max_{i \in U,\, j \in D} (w_i e_{ij})$. Where needed, w(i, j) = w_i e_ij also denotes the weighted distance.

c) Transmission-cost-minimized data placement (TCMDP). Minimize the transmission cost, defined as the sum of the weighted distances between all users and their assigned DCs: $\min_{D \subseteq V,\, |D| \le k} \left( \sum_{i \in U,\, j \in D} w_i e_{ij} \right)$

d) Cost-minimized data placement (CMDP). Minimize the total cost, defined as the sum of the costs of all users: $\min_{D \subseteq V,\, |D| \le k} \left( \sum_{i \in U,\, j \in D} c_{ij} \right)$

Since a) and c) are special cases of b) and d), respectively, only the algorithms for b) and d) are given below.
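To make the four objectives concrete, the following minimal sketch evaluates each of them for a given user-to-DC assignment. The function name `objective_values` and the dictionary inputs are assumptions for illustration; `cost` is assumed to hold the per-pair costs c_ij.

```python
def objective_values(assignment, dist, weight, cost):
    """Evaluate the FDP, PDP, TCMDP and CMDP objectives for one assignment.

    assignment: dict user -> assigned DC.
    dist:       dict (user, dc) -> e_ij.
    weight:     dict user -> activation level w_i.
    cost:       dict (user, dc) -> c_ij.
    """
    pairs = list(assignment.items())
    return {
        "FDP": max(dist[u, j] for u, j in pairs),                # max distance
        "PDP": max(weight[u] * dist[u, j] for u, j in pairs),    # max weighted distance
        "TCMDP": sum(weight[u] * dist[u, j] for u, j in pairs),  # total weighted distance
        "CMDP": sum(cost[u, j] for u, j in pairs),               # total cost
    }


values = objective_values(
    assignment={"a": "d1", "b": "d1"},
    dist={("a", "d1"): 2.0, ("b", "d1"): 3.0},
    weight={"a": 3, "b": 1},
    cost={("a", "d1"): 1.5, ("b", "d1"): 0.5},
)
```

Setting every weight to 1 makes the PDP value coincide with the FDP value, which is the sense in which a) is a special case of b).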

Basic idea of the preferred data placement (PDP) algorithm

Several concepts used to describe the algorithm are introduced first.

1) Bottleneck graph: the optimal value of the PDP problem is always attained at some user-weighted edge, so the weighted edges should be examined one by one in increasing order until all constraints are satisfied. The bottleneck graphs are built on this idea. The m user-weighted edges are sorted in non-decreasing order as w(1, j) ≤ w(2, g) ≤ … ≤ w(m, h), where j, g, h ∈ V may coincide. The bottleneck graphs G_1, G_2, …, G_m are edge subgraphs of G, with G_r = (U, V, E_r) (r = 1, 2, …, m) and E_r = {e_ij | w_i e_ij ≤ w(r, g)}. That is, G_r consists of all vertices of G and all edges whose weighted length does not exceed that of the r-th shortest weighted edge w(r, g).

2) Threshold graph: for each G_r, the corresponding threshold graph T_r is built on the user set U as follows. For any two vertices u, v ∈ U, the edge (u, v) is in T_r if and only if there exists a DC j ∈ V such that (u, j) and (v, j) are both in G_r. For example, in Fig. 2 the edge (5, 6) does not exist, because users 5 and 6 have no common adjacent DC in G_m.

3) Maximal clique: given an undirected graph H, a clique of H is a complete subgraph. A clique that is not contained in any other clique is called a maximal clique, denoted C(H). A maximal clique is easy to find in polynomial time. The simple method used here is as follows: add an arbitrary vertex of H to C(H), then add those of its neighbors that form a complete graph together with all vertices already in C(H). The neighbors of all vertices in C(H) are checked one by one until no further vertex can be added, which yields a maximal clique. Then delete all vertices of this maximal clique and repeat the process on the remaining vertices until H becomes empty. This produces a series of pairwise disjoint maximal cliques.

Note that if a set of users can be served by the same DC, the corresponding subgraph of the threshold graph is a clique. For example, users 6-9 in Fig. 1 can be served by a common DC, so their counterpart in T_m is the clique containing 6-9 in Fig. 2. Finding maximal cliques in the threshold graph therefore groups users by DC availability, and the users of each group can be represented by one of its members.
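The three concepts above (bottleneck graph, threshold graph, greedy partition into maximal cliques) can be sketched as below. All function names and the dictionary and set representations are assumptions made for illustration. The clique routine scans every remaining vertex instead of walking neighbor lists, which yields the same kind of greedy maximal clique as the procedure described in the text.

```python
from itertools import combinations


def bottleneck_edges(edges, weights, r):
    """Edge set E_r: edges whose weighted length w_i * e_ij does not exceed
    the r-th smallest weighted length (r is 1-based)."""
    ranked = sorted(weights[u] * d for (u, _), d in edges.items())
    limit = ranked[r - 1]
    return {(u, j) for (u, j), d in edges.items() if weights[u] * d <= limit}


def threshold_graph(users, edges_r):
    """Edges of T_r: two users are adjacent iff they share a DC in G_r."""
    dcs_of = {u: set() for u in users}
    for u, j in edges_r:
        dcs_of[u].add(j)
    return {
        frozenset((u, v))
        for u, v in combinations(users, 2)
        if dcs_of[u] & dcs_of[v]
    }


def greedy_clique_partition(users, t_edges):
    """Split the users into pairwise disjoint cliques, each maximal among the
    vertices that were still unassigned when it was built."""
    remaining = list(users)
    cliques = []
    while remaining:
        clique = [remaining[0]]
        for v in remaining[1:]:
            # v joins only if it is adjacent to every member chosen so far.
            if all(frozenset((v, c)) in t_edges for c in clique):
                clique.append(v)
        cliques.append(clique)
        remaining = [v for v in remaining if v not in clique]
    return cliques


# Toy instance: three users, two DCs, one missing edge per outer user.
users = ["a", "b", "c"]
edges = {("a", "d1"): 1.0, ("b", "d1"): 2.0, ("b", "d2"): 1.0, ("c", "d2"): 1.0}
weights = {"a": 1, "b": 1, "c": 2}
t_last = threshold_graph(users, bottleneck_edges(edges, weights, len(edges)))
groups = greedy_clique_partition(users, t_last)
```

In the toy instance, users a and b share d1 and users b and c share d2, but a and c share nothing, so the greedy partition groups a with b and leaves c alone.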

4) Maximum-weight-first algorithm (BWF)

Input: G = (U, V, E), a user-weighted incomplete bipartite graph; k, a positive integer

Output: D, the set of selected DCs

Sort the user-weighted edges in non-decreasing order: w(1, j) ≤ w(2, g) ≤ … ≤ w(m, h)

For each G_r, build the corresponding threshold graph T_r, and for each T_r find its set of maximal cliques. Suppose there are H cliques. Denote the set of maximal cliques of T_r by C_r = C(T_r)_h, where r = 1, 2, …, m and h = 1, …, H

Let t = min{r : |C_r| ≤ k}

For {h = 1, ..., H}

{

Let l be the unassigned user with the largest weight, w_l = max{w_j | j ∈ C(T_t)_h, j not yet assigned}. Let N be the set of neighbor DCs of C(T_t)_h.

e = min_{j∈N} {e_lj | e_lj ∈ G_t}. Let v be the DC in N with e_lv = e. D = D ∪ {v}

}

It can be shown that BWF yields a 3-approximation for the PDP problem (the value of the solution obtained is at most 3 times the optimal value) when the underlying bipartite graph is incomplete. This bound is tight (no approximation ratio smaller than 3 is possible).

When the underlying graph is a complete bipartite graph, independent sets can be used instead of maximal cliques; the algorithm is then faster and the solution quality is the same. BWF tries to place the largest data on the nearest DC. It examines the bottleneck graphs one by one in increasing order of user-weighted edge and builds the corresponding threshold graphs, then finds the clique set of each threshold graph. Each clique represents a group of users that can be served by at least one common DC, with the weighted edges between these users and the common DC no larger than the largest weighted edge of the bottleneck graph. Among all clique sets with at most k cliques, the one with the smallest index is chosen, so that the largest user-weighted edge is as small as possible. Binary search can be used to speed this up further.

For the users in the clique set C_t of the threshold graph, target DCs are found as follows. In each clique, the user with the largest weight is assigned to its nearest neighbor DC in G_t, and all other users of the same clique are implicitly assigned to that DC as well. This is repeated until all users in C_t are assigned (the for loop). The algorithm thereby builds a mapping between users and DCs, which partitions the bipartite graph into clusters, each centered on a DC.
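The BWF procedure described above can be sketched end to end as follows. This is a simplified reading, not the patent's reference implementation, and it rests on several assumptions: the thresholds are scanned over distinct weighted edge lengths (which produces the same bottleneck graphs as scanning the index r); thresholds at which some user has no DC yet are skipped; and the neighbor set N is taken to be the DCs adjacent in G_t to the heaviest user of each clique, because the original definition of N did not survive in the text. The function name `bwf` is hypothetical.

```python
def bwf(users, edges, weights, k):
    """Sketch of maximum-weight-first DC selection for the PDP criterion.

    users:   list of user identifiers.
    edges:   dict (user, dc) -> distance e_ij (missing pair = unavailable).
    weights: dict user -> activation level w_i.
    Returns the set of selected DCs, or None if no threshold yields <= k groups.
    """
    thresholds = sorted({weights[u] * d for (u, _), d in edges.items()})
    for limit in thresholds:  # increasing weighted edge length
        # Bottleneck graph: per-user neighbor DCs with w_i * e_ij <= limit.
        nbrs = {u: {} for u in users}
        for (u, j), d in edges.items():
            if weights[u] * d <= limit:
                nbrs[u][j] = d
        if any(not nbrs[u] for u in users):
            continue  # assumption: every user needs at least one DC in G_t

        # Greedy clique partition of the threshold graph (users are adjacent
        # iff they share a DC in the bottleneck graph).
        remaining, cliques = list(users), []
        while remaining:
            clique = [remaining[0]]
            for v in remaining[1:]:
                if all(nbrs[v].keys() & nbrs[c].keys() for c in clique):
                    clique.append(v)
            cliques.append(clique)
            remaining = [v for v in remaining if v not in clique]

        if len(cliques) <= k:
            selected = set()
            for clique in cliques:
                heaviest = max(clique, key=lambda u: weights[u])
                # Nearest neighbor DC of the heaviest user in G_t.
                selected.add(min(nbrs[heaviest], key=nbrs[heaviest].get))
            return selected
    return None
```

The linear scan over thresholds could be replaced by binary search, as the text notes, because the number of cliques produced tends to shrink as more edges are admitted; that refinement is left out here to keep the sketch short.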

Basic idea of the cost-minimized data placement (CMDP) algorithm

For the CMDP problem, the invention provides the NPA-CMDP algorithm, which tries to assign each user to its nearest DC (that is, the DC with the lowest cost for that user). If the number of DCs exceeds k, one DC is removed and its users are reassigned. This is repeated until the constraint is satisfied.

Algorithm 2. Nearest-first algorithm for CMDP (NPA-CMDP).

Input: G = (U, V, E), a user-weighted incomplete bipartite graph; k, a positive integer

Output: D, the set of selected DCs

Assign each user to its nearest DC, where the weighted distance (cost) between user i and DC j is defined as w_i e_ij + w_i p_j. Denote the set of selected DCs by D

While {|D| > k}

{

Find the DC in D whose weighted distance to the users assigned to it is the smallest among all DCs in D, and remove it from D.

Reassign the users originally assigned to this DC to the nearest remaining DC in D

}
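Algorithm 2 can be sketched as below. Two points are assumptions rather than statements from the text: the "weighted distance between a DC and its users" is interpreted as the summed cost of the users currently assigned to that DC, and a `ValueError` is raised when a displaced user has no edge to any remaining DC, a case the text does not address for incomplete graphs. The function name `npa_cmdp` is hypothetical.

```python
def npa_cmdp(users, edges, weights, prices, k):
    """Sketch of nearest-first DC selection for the CMDP criterion.

    edges:  dict (user, dc) -> distance e_ij (missing pair = unavailable).
    prices: dict dc -> p_j.
    Returns (selected DC set, dict user -> assigned DC).
    """
    def cost(u, j):
        # Weighted distance (cost) w_i * e_ij + w_i * p_j from the text.
        return weights[u] * (edges[(u, j)] + prices[j])

    def nearest(u, allowed):
        candidates = [j for j in allowed if (u, j) in edges]
        if not candidates:
            raise ValueError(f"user {u!r} cannot reach any remaining DC")
        return min(candidates, key=lambda j: cost(u, j))

    all_dcs = {j for _, j in edges}
    assignment = {u: nearest(u, all_dcs) for u in users}
    selected = set(assignment.values())

    while len(selected) > k:
        # Total cost currently carried by each selected DC (assumed reading).
        load = {j: 0.0 for j in selected}
        for u, j in assignment.items():
            load[j] += cost(u, j)
        dropped = min(selected, key=load.get)
        selected.remove(dropped)
        # Move the displaced users to their nearest remaining DC.
        for u, j in list(assignment.items()):
            if j == dropped:
                assignment[u] = nearest(u, selected)
    return selected, assignment
```

Dropping the DC that carries the least cost first keeps the sketch greedy and simple; it offers no approximation guarantee, and the text does not claim one for the CMDP criterion either.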

The invention considers the feasibility of moving geographically distributed big data to the cloud and studies the problem of selecting multiple target DCs. It represents the underlying infrastructure as an incomplete bipartite graph, overcoming the limitation that existing problem formulations all assume a complete bipartite graph. This better matches practice, where user preferences or legal restrictions mean that not every DC is available to every user. Four criteria are studied: fair data placement (FDP), preferred data placement (PDP), transmission-cost-minimized data placement (TCMDP), and cost-minimized data placement (CMDP). In practice, a criterion can be chosen according to the needs of operators and users.

The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not limited to these descriptions. A person of ordinary skill in the art may make simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the protection scope of the invention.

Claims (7)

1. A method for selecting data centers (DCs) when big data is migrated to the cloud, characterized in that the method comprises: constructing an underlying incomplete graph; describing each user's data generation volume by an activation level $w_i$; defining four criteria, namely fair data placement (FDP), preferential data placement (PDP), transmission-cost-minimizing data placement (TCMDP) and cost-minimizing data placement (CMDP); and selecting DCs based on one of these criteria;
wherein the incomplete graph is $G=(U,V,E)$, $U$ denotes the users, $V$ denotes the DCs, the edge lengths $e_{ij}\in E$ ($i\in U$, $j\in V$) satisfy the triangle inequality, and $k$ is a positive integer ($k\le|U|$, $k\le|V|$); for any $i\in U$ and $j\in V$, an edge exists between them if the data of user $i$ can be moved to DC $j$; the method aims to find, from the available DC set $V$, a DC subset $D$ ($|D|\le k$) that stores the data of all users in $U$ according to the different criteria;
the FDP criterion is: the maximum distance between a user and its assigned DC is minimized, so that every local user can access its data with minimum delay: $\min_{D\subseteq V,\ |D|\le k}\ \max_{i\in U,\ j\in D} e_{ij}$;
the PDP criterion is: the maximum weighted distance between a user and its assigned DC is minimized, so that local users with more data can access their data with minimum delay: $\min_{D\subseteq V,\ |D|\le k}\ \max_{i\in U,\ j\in D} (w_i e_{ij})$;
the TCMDP criterion is: the sum of the weighted distances between all users and their assigned DCs is minimized: $\min_{D\subseteq V,\ |D|\le k} \left(\sum_{i\in U,\ j\in D} w_i e_{ij}\right)$;
the CMDP criterion is: the sum of the costs of all users is minimized: $\min_{D\subseteq V,\ |D|\le k} \left(\sum_{i\in U,\ j\in D} c_{ij}\right)$, where $c_{ij}$ is the total cost of the user with activation level $w_i$.

2. The method according to claim 1, characterized in that: the incomplete graph is a weighted incomplete graph with edge lengths $w_i e_{ij}$, where $w_i$ is the activation level of user $i$ and $w_i e_{ij}$ is the fee payable for moving the data to DC $j$.

3. The method according to claim 1, characterized in that: the activation level $w_i$ is determined by the amount of data uploaded per day.

4. The method according to claim 1, characterized in that: a user $i$ with activation level $w_i$ that wants to store and process data at DC $j$ pays $w_i p_j$, where $p_j = p'_j + p''_j$, $p'_j$ is the price the DC charges for processing a unit volume of data, and $p''_j$ is the fee the DC charges for storing a unit volume of data.

5. The method according to claim 1, characterized in that: DC selection based on the PDP criterion uses the maximum-weight-first (BWF) algorithm, and the BWF algorithm comprises the following steps:
sorting the users' weighted edges in non-decreasing order, denoted $w(1,j)\le w(2,g)\le\dots\le w(m,h)$, where $j,g,h\in V$; the bottleneck graphs $G_1,G_2,\dots,G_m$ are edge subgraphs of $G$, with $G_r=(U,V,E_r)$ ($r=1,2,\dots,m$) and $E_r=\{e_{ij}\mid w_i e_{ij}\le w(r,g)\}$;
constructing, for each $G_r$, the corresponding threshold graph $T_r$: for any two vertices $u,v\in U$, the edge $(u,v)$ is in $T_r$ if and only if there exists a DC $j\in V$ such that $(u,j)$ and $(v,j)$ are both in $G_r$;
finding, for each $T_r$, its set of maximal cliques, a clique being maximal if it is not contained in any other clique; assuming there are $H$ cliques, the set of maximal cliques of $T_r$ is denoted $C_r=C(T_r)_h$, where $r=1,2,\dots,m$ and $h=1,\dots,H$, and $t=\min\{r \mid |C_r|\le k\}$;
for ($h=1,\dots,H$)
{ let $l$ be the vertex of maximum weight, $w_l=\max\{w_j \mid j\in C(T_t)_h,\ j \text{ not yet assigned}\}$;
let $N$ be the neighbor DCs of $C(T_t)_h$ and let $e=\min_{j\in N}\{e_{lj}\mid e_{lj}\in G_t\}$;
let $v$ be the vertex in $N$ with $e_{lv}=e$; $D=D\cup\{v\}$ }.

6. The method according to claim 5, characterized in that: when the underlying graph is a complete bipartite graph, independent sets are used in place of maximal cliques to accelerate the BWF algorithm.

7. The method according to claim 1, characterized in that: DC selection based on the CMDP criterion uses the CMDP nearest-priority (NPA-CMDP) algorithm, and the NPA-CMDP algorithm comprises the following steps:
assigning each user to its nearest DC, where the weighted distance between a user and a DC is defined as $w_i e_{ij}+w_i p_j$, and denoting the set of selected DCs by $D$;
while ($|D|>k$)
{ finding the DC in $D$ whose weighted distance to the users assigned to it is the smallest among all DCs in $D$;
deleting this DC from $D$;
reassigning the users originally assigned to this DC to the nearest DC remaining in $D$ }.
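
The NPA-CMDP steps of claim 7 can be illustrated with a minimal Python sketch. It is not the patent's implementation. The function name `npa_cmdp` and the dictionary inputs `w` (activation levels), `p` (unit prices $p_j$) and `e` (edge lengths, present only for user-DC pairs that have an edge) are hypothetical. The sketch also makes two assumptions the claim does not spell out: the "weighted distance between a DC and the users assigned to it" is taken as the sum of $w_i e_{ij}+w_i p_j$ over those users, and a user left with no edge to any remaining DC raises an error.

```python
from typing import Dict, Hashable, Set, Tuple

User = Hashable
DC = Hashable


def npa_cmdp(
    w: Dict[User, float],
    p: Dict[DC, float],
    e: Dict[Tuple[User, DC], float],
    k: int,
) -> Tuple[Set[DC], Dict[User, DC]]:
    """Greedy NPA-CMDP sketch: returns (selected DCs, user -> DC assignment)."""
    if k < 1:
        raise ValueError("k must be a positive integer")

    def cost(i: User, j: DC) -> float:
        # Weighted distance from claim 7: w_i * e_ij + w_i * p_j.
        return w[i] * e[(i, j)] + w[i] * p[j]

    def nearest(i: User, candidates) -> DC:
        # Incomplete graph: only DCs joined to user i by an edge are reachable.
        reachable = [j for j in candidates if (i, j) in e]
        if not reachable:
            # Assumption of this sketch; the claim does not cover this case.
            raise ValueError(f"user {i!r} has no edge to any remaining DC")
        return min(reachable, key=lambda j: cost(i, j))

    # Step 1: assign every user to its nearest DC.
    assign = {i: nearest(i, p) for i in w}
    selected = set(assign.values())

    # Step 2: while more than k DCs are in use, drop the DC with the smallest
    # total weighted distance (assumed: sum over its assigned users).
    while len(selected) > k:
        load = {j: 0.0 for j in selected}
        for i, j in assign.items():
            load[j] += cost(i, j)
        victim = min(selected, key=lambda j: load[j])
        selected.remove(victim)
        # Reassign the dropped DC's users to the nearest DC still selected.
        for i, j in list(assign.items()):
            if j == victim:
                assign[i] = nearest(i, selected)

    return selected, assign
```

Because a dropped DC is never reconsidered, this is a greedy heuristic: it bounds the number of selected DCs by $k$ but is not guaranteed to reach the minimum of the CMDP objective.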
CN201610067866.3A 2016-01-29 2016-01-29 The selection method of data center when big data is migrated to cloud Expired - Fee Related CN105739929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610067866.3A CN105739929B (en) 2016-01-29 2016-01-29 The selection method of data center when big data is migrated to cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610067866.3A CN105739929B (en) 2016-01-29 2016-01-29 The selection method of data center when big data is migrated to cloud

Publications (2)

Publication Number Publication Date
CN105739929A (en) 2016-07-06
CN105739929B (en) 2019-01-11

Family

ID=56247304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610067866.3A Expired - Fee Related CN105739929B (en) 2016-01-29 2016-01-29 The selection method of data center when big data is migrated to cloud

Country Status (1)

Country Link
CN (1) CN105739929B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120155468A1 (en) * 2010-12-21 2012-06-21 Microsoft Corporation Multi-path communications in a data center environment
CN103441942A (en) * 2013-08-26 2013-12-11 重庆大学 Data center network system and data communication method based on software definition
CN104022928A (en) * 2014-05-21 2014-09-03 中国科学院计算技术研究所 Topology construction method of high-density server and system thereof
CN104809539A (en) * 2014-01-29 2015-07-29 宏碁股份有限公司 Dynamic planning method for server resources of data center
CN105264457A (en) * 2014-02-28 2016-01-20 华为技术有限公司 Energy consumption monitoring method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776876A (en) * 2016-11-29 2017-05-31 用友网络科技股份有限公司 Data migration method and data mover system
CN107103381A (en) * 2017-03-17 2017-08-29 华为技术有限公司 A kind of method and system for planning of data center
CN109388486A (en) * 2018-10-09 2019-02-26 北京航空航天大学 A kind of data placement and moving method for isomery memory with polymorphic type application mixed deployment scene
CN109388486B (en) * 2018-10-09 2021-08-24 北京航空航天大学 A data placement and migration method for heterogeneous memory and multi-type application hybrid deployment scenarios
CN111427926A (en) * 2020-03-23 2020-07-17 平安医疗健康管理股份有限公司 Abnormal medical insurance group identification method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN105739929B (en) 2019-01-11

Similar Documents

Publication Publication Date Title
CN107392412B (en) Order scheduling method and device
Naas et al. iFogStor: an IoT data placement strategy for fog infrastructure
US20230017632A1 (en) Reducing the environmental impact of distributed computing
CN110645983B (en) Path planning method, device and system for unmanned vehicle
US20200082316A1 (en) Cognitive handling of workload requests
US20170046653A1 (en) Planning of transportation requests
CN113128744B (en) Distribution planning method and device
CN111027853B (en) Order distribution method and device for dense warehousing and electronic equipment
CN110516985B (en) Warehouse selection method, system, computer system and computer readable storage medium
CN111044062B (en) Path planning and recommending method and device
CN105739929A (en) Data center selection method for big data to migrate to cloud
US10789558B2 (en) Non-linear systems and methods for destination selection
CN106202092A (en) Method and system for data processing
CN107633358A (en) Facility addressing and the method and apparatus of distribution
US20230196182A1 (en) Database resource management using predictive models
CN107301519A (en) A kind of task weight pricing method in mass-rent express system
CN109934427B (en) Method and device for generating item distribution scheme
Vasyanin et al. A Methodology of the Mathematical Modeling for Perspective Development of Nodes and Transport Routes in the Multicommodity Hierarchical Network. I. Optimization Problems
Guan et al. Multidimensional resource fragmentation-aware virtual network embedding for IoT applications in MEC networks
CN113222490A (en) Inventory allocation method and device
CN110958666A (en) Network slice resource mapping method based on reinforcement learning
US11720850B1 (en) Dynamic package selection algorithm for delivery
Xia et al. Data locality-aware big data query evaluation in distributed clouds
US20250068996A1 (en) System and Method for Automated Task Allocation
CN107528914A (en) The resource requisition dispatching method of data fragmentation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190111

Termination date: 20220129
