CN106126340B

CN106126340B - Reducer selection method for a cross-data center cloud computing system

Info

Publication number: CN106126340B
Application number: CN201610459589.0A
Authority: CN
Inventors: 包卫东; 朱晓敏; 周文; 肖文华; 纪浩然; 王吉; 陈超; 邵屹杨; 刘桂鹏
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2016-06-23
Filing date: 2016-06-23
Publication date: 2018-11-02
Anticipated expiration: 2036-06-23
Also published as: CN106126340A

Abstract

The invention discloses a reducer selection method for a cross-data center cloud computing system, comprising: obtaining system parameters from the cloud computing system; obtaining an objective function and constraints according to the system parameters; obtaining the drift-penalty factor of the objective function and its upper bound using the Lyapunov optimization framework; and extracting the reducer selection term from the upper bound of the drift-penalty factor and generating the reducer selection method. By using the Lyapunov optimization framework to obtain the drift-penalty factor of the objective function and its upper bound, and then extracting the reducer selection term, the invention balances costs across data centers and solves the scheduling problem of a cross-data center cloud computing system with high performance, high availability and minimum cost.

Description

Reducer selection method for a cross-data center cloud computing system

Technical Field

The invention relates to the field of virtualized cloud computing, and in particular to a reducer selection method for a cross-data center cloud computing system.

Background

Data has never been generated as fast as it is today: YouTube produces nearly 4 billion video viewing records per day, together with nearly 432,000 hours of new video. With the advent of the big data era, the data volume in every industry keeps growing, and its potentially huge value is worth mining. Social websites such as Facebook, for example, can reveal user usage patterns and latent relationships by analyzing website history records (including click records, activity records and the like), and can detect trending social events or support market decisions. However, fast processing of large volumes of geographically dispersed data is so complex that traditional PCs cannot meet the processing requirements, so many companies have deployed multi-data center clouds and hybrid clouds to address this problem. These cloud technologies provide powerful and efficient solutions for handling diverse large data sets that grow ever faster, and some have already been put into commercial use to cope with ever-increasing computing demands while providing users with a guaranteed quality of service.

The map-reduce model (MapReduce) is a distributed programming model for massively parallel data processing that has shown compelling advantages in many applications. The original MapReduce model was not designed for cross-data center use, although recent research has extended it from a single data center to multiple data centers. The most interesting problems include: (1) how to migrate large-scale data generated at different locations to geographically distributed data centers; and (2) how many computing resources to provision in these data centers to guarantee performance while minimizing cost. The heterogeneity, diversity and dynamics of big data, together with utility-driven resource pricing models, make these two problems very challenging. In addition, internal dependencies between the phases of a distributed computation, such as the interaction of the map phase with the reduce phase in MapReduce, further increase the complexity of data migration, resource provisioning and reducer selection among geographically distributed data centers.

The prior art lacks a scheduling scheme for cross-data center cloud computing systems based on the map-reduce model, and no effective solution to this problem is currently available.

Disclosure of Invention

In view of the above, the present invention is directed to a method for selecting a reducer of a cross-data center cloud computing system, which can balance the cost of the cross-data center to achieve high performance and high availability, and solve the scheduling problem of the cross-data center cloud computing system with the minimum cost.

Based on the above purpose, the technical scheme provided by the invention is as follows:

According to one aspect of the invention, a reducer selection method for a cross-data center cloud computing system is provided, comprising the following steps:

acquiring system parameters from a cloud computing system;

obtaining an objective function and constraints according to the system parameters;

obtaining a drift-penalty factor of the objective function and an upper bound thereof by using a Lyapunov optimization framework;

and extracting a reducer selection term from the upper bound of the drift-penalty factor and generating a reducer selection method.

Wherein obtaining the objective function according to the system parameters comprises:

describing decision variables by using system parameters;

describing the cost of the data center by using system parameters and decision variables;

and describing the objective function and the constraint according to the decision variables and the cost of the data center.

The cloud computing system comprises a plurality of data sources and a plurality of data centers, each data center comprising a mapper and a reducer. When the cloud computing system migrates data, the data of a data source is transferred to the mapper of any data center, which performs the map operation and generates intermediate key-value pairs; the intermediate key-value pairs are then transferred from the mappers of the data centers to the reducer of a single data center, which performs the reduce operation.

And, wherein the system parameters include:

a set of data centers, a set of virtual machine types and a set of data sources;

the amount of data transferred from a given data source to a given data center at a given time, the amount of data generated by a given data source at a given time, and the maximum amount of data generated by a given data source at each time;

the price of transferring a unit of data from a given data source to a given data center, the price of storing a unit of data in a data center, the amount of unprocessed data in a data center at a given time, the delay from a given data source to a given data center, the conversion factor between delay and monetary cost, the price of a given type of virtual machine in a given data center at a given time, the amount of data transferred out of a given data center at a given time, and the migration cost between two given data centers.

Meanwhile, the decision variables comprise data allocation variables, virtual machine supply variables and reducer selection variables, and describing the decision variables by using the system parameters comprises the following steps:

describing the data allocation variables by using the amount of data transferred from a given data source to a given data center at a given time, the amount of data generated by a given data source at a given time, and the maximum amount of data generated by a given data source at each time;

describing the virtual machine supply variables by using the number of virtual machines of a given type provided by a given data center at a given time for map operations and the number of virtual machines of a given type provided by that data center at that time for reduce operations;

and describing the reducer selection variable by using the data center to which all data generated by the mappers at a given time is aggregated.

And the cost of the data center comprises bandwidth cost, storage cost, delay cost, computation cost and migration cost, and describing the cost of the data center by using system parameters and decision variables comprises:

describing bandwidth spending using price and data allocation variables for transferring unit data amount from a certain data source to a certain data center;

describing storage cost by using the storage price of unit data in a data center, the data volume which is not processed by the data center at a certain time and a data distribution variable;

describing delay expenses by using delay from a certain data source to a certain data center, a delay economic expense conversion factor and a data distribution variable;

describing the computation cost by using the price of a given type of virtual machine in a given data center at a given time and the virtual machine supply variables;

describing the migration cost by using the amount of data transferred out of a given data center at a given time, the migration cost between two given data centers, the virtual machine supply variables and the reducer selection variable.

And, describing the objective function and the constraints according to the decision variables and the cost of the data center comprises:

the sum of the amounts of data transferred from a data source to all data centers at a given time equals the amount of data generated by that data source at that time;

the number of virtual machines used for map and reduce operations in a data center at a given time is less than or equal to the number of virtual machines available in that data center at that time;

exactly one data center is selected as the reducer at any given time;

the average data arrival rate of a data center is less than or equal to the average data processing rate of that data center;

and the objective is to minimize the sum of the bandwidth, storage, delay, computation and migration costs.
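The per-slot constraints listed above can be checked mechanically. The following is a minimal Python sketch, not the patented procedure: the function name `check_constraints`, its argument names and the restriction to a single virtual machine type are illustrative assumptions, and the long-run rate constraint (average arrival rate not exceeding average processing rate) is left out because it cannot be verified from a single time slot.

```python
def check_constraints(alloc, generated, map_vms, reduce_vms,
                      available_vms, reducer_choice, tol=1e-9):
    """Return descriptions of the per-slot constraints that are violated.

    All names are hypothetical, chosen for this sketch only:
      alloc[r][d]       data sent from source r to data center d
      generated[r]      data generated by source r in this slot
      map_vms[d]        VMs used for map operations in data center d
      reduce_vms[d]     VMs used for reduce operations in data center d
      available_vms[d]  VMs available in data center d
      reducer_choice[d] 1 if data center d is the reducer, else 0
    """
    violations = []

    # Everything a source generates must be allocated to some data center.
    for r, row in enumerate(alloc):
        if abs(sum(row) - generated[r]) > tol:
            violations.append(f"source {r}: allocated data != generated data")

    # Map and reduce VMs together may not exceed what the data center offers.
    for d, limit in enumerate(available_vms):
        if map_vms[d] + reduce_vms[d] > limit:
            violations.append(f"data center {d}: too many virtual machines")

    # Exactly one data center acts as the reducer.
    is_binary = all(x in (0, 1) for x in reducer_choice)
    if not is_binary or sum(reducer_choice) != 1:
        violations.append("reducer selection is not one-hot")

    return violations
```

An empty list means that the candidate decision is feasible for the slot with respect to these three constraints.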

Wherein, obtaining the drift-penalty factor of the objective function and its upper bound by using the Lyapunov optimization framework comprises:

constructing actual queues and virtual queues according to the objective function and the constraints, and constructing a Lyapunov function by using the Lyapunov optimization framework;

calculating the one-slot Lyapunov drift and the drift-penalty factor from the Lyapunov function;

and calculating an upper bound of the drift-penalty factor.

And, constructing the actual queues and virtual queues according to the objective function and the constraints, and constructing the Lyapunov function by using the Lyapunov optimization framework comprises:

describing the actual map queue according to the objective function, the constraints and the amount of unprocessed data in the mapper of a given data center at a given time;

describing the virtual map queue according to the objective function, the constraints and the maximum delay of the actual map queue;

describing the actual reduce queue according to the objective function, the constraints and the amount of unprocessed data in the reducer of a given data center at a given time;

describing the virtual reduce queue according to the objective function, the constraints and the maximum delay of the actual reduce queue;

and constructing the Lyapunov function by using the Lyapunov optimization framework according to the actual map queue, the virtual map queue, the actual reduce queue and the virtual reduce queue.

Meanwhile, extracting the reducer selection term from the upper bound of the drift-penalty factor and generating the reducer selection method comprises the following steps:

extracting the polynomial containing the reducer selection variable from the upper bound of the drift-penalty factor;

minimizing the polynomial subject to the constraints on the reducer selection variable;

and generating the reducer selection method according to the value of the reducer selection variable at which the polynomial attains its minimum.
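The three steps above can be illustrated with a small Python sketch. It rests on an assumption that the text does not state: that the extracted polynomial is linear in the selection variables, that is, a sum over data centers of x_d(t) multiplied by a coefficient that is already known at time t. Under that assumption, and with exactly one data center allowed, the minimum is reached by choosing the data center with the smallest coefficient. The name `select_reducer` and the `coefficients` argument are hypothetical.

```python
def select_reducer(coefficients):
    """Minimise sum_d x_d * c_d subject to x_d in {0, 1} and sum_d x_d == 1.

    `coefficients[d]` is the (assumed known) weight that multiplies the
    selection variable of data center d in the extracted polynomial.
    Returns the one-hot selection vector; ties go to the lowest index.
    """
    if not coefficients:
        raise ValueError("at least one data center is required")
    best = min(range(len(coefficients)), key=lambda d: coefficients[d])
    return [1 if d == best else 0 for d in range(len(coefficients))]
```

For example, `select_reducer([4.0, -1.5, 2.0])` selects the second data center. If the polynomial were not linear in the selection variables, enumerating the candidate data centers and evaluating the polynomial for each would still work, because only one variable can be 1 at a time.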

As can be seen from the above description, the technical solution provided by the present invention uses the Lyapunov optimization framework to obtain the drift-penalty factor of the objective function and its upper bound and to extract the reducer selection term; it thereby balances costs across data centers and solves the scheduling problem of a cross-data center cloud computing system with high performance, high availability and minimum cost.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.

Fig. 1 is a flowchart of a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 2 is a system structure diagram of cross-data center big data processing using MapReduce in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 3 is a graph of the user visit data of the 1998 World Cup website from June 21 to June 27;

Fig. 4 is a line graph of the total system cost over time in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 5 is a line graph of the individual system costs over time when the MiniBDP algorithm is used in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 6 is a line graph of the average system cost as a function of the parameter V when the MiniBDP algorithm is used in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 7 is a detailed matrix diagram of the data allocation amounts from data sources to data centers in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 8 is a detailed matrix diagram of the distances from data sources to data centers in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 9 is a histogram of the number of times each data center is selected as the Reducer in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 10 is a bar graph comparing the costs of various policies in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 11 is a line graph comparing the queue lengths under several policies in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention;

Fig. 12 is a graph comparing the cumulative costs of the MiniBDP algorithm and an offline optimization method in a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention will be further described in detail, clearly and completely, with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.

According to an embodiment of the invention, a reducer selection method for a cross-data center cloud computing system is provided.

As shown in Fig. 1, a reducer selection method for a cross-data center cloud computing system according to an embodiment of the present invention includes:

step S101, obtaining system parameters from a cloud computing system;

step S103, obtaining an objective function and constraints according to the system parameters;

step S105, obtaining a drift-penalty factor of the objective function and an upper bound thereof by using a Lyapunov optimization framework;

and step S107, extracting the reducer selection term from the upper bound of the drift-penalty factor and generating a reducer selection method.

The technical features of the present invention will be further described below with reference to specific embodiments.

In the map-reduce model (MapReduce), a Mapper processes an input data set and outputs a series of intermediate key-value pairs, denoted <key, value>, as the result of the map phase; the Reducer receives all the intermediate data from the Mappers and merges the values that share a key into a smaller set of values. Mappers and Reducers can be deployed in different data centers.

In a distributed data center environment, the execution path used to process geographically dispersed data is important. Processing geographically dispersed data across data centers with MapReduce can follow one of three execution paths: COPY, MULTI and GEO. COPY copies all sub-data (the intermediate data generated by Map) to a single data center; it is inefficient when the output data generated by MapReduce is smaller than the input data. MULTI performs the MapReduce operation on each data subset separately and then summarizes the results; its drawback is that the expected result is obtained only when the order of the MapReduce operations does not affect the final result. GEO executes the Map operations in different data centers and then copies all intermediate results to a single data center for the Reduce operation; it suits applications in which the jobs of the Reduce nodes depend on one another, such as calculating the median page size in a web cache. Since the jobs in most applications are interrelated, this embodiment adopts the GEO execution path when modeling.
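The median example can be made concrete with a few lines of Python. This is an illustrative sketch only: the two lists stand in for page sizes held at two hypothetical data centers, and the function names `multi_median` and `geo_median` are not taken from the source. It shows that summarizing per-site results (MULTI) can give a different answer from reducing over all intermediate data at one site (GEO).

```python
from statistics import median


def multi_median(partitions):
    """MULTI-style: compute the median at each site, then combine the results."""
    return median(median(p) for p in partitions)


def geo_median(partitions):
    """GEO-style: gather every site's intermediate values at one site, then reduce."""
    merged = [value for p in partitions for value in p]
    return median(merged)


# Hypothetical page sizes stored at two data centers.
partitions = [[1, 2, 3], [4, 5, 6, 7, 100]]

print(multi_median(partitions))  # median of the per-site medians 2 and 6
print(geo_median(partitions))    # median of all eight values
```

The two calls return different values for this input, which is why a job such as the median needs a single Reduce site.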

Fig. 2 is a system structure diagram illustrating a Data Service Provider (DSP) managing a plurality of Data sources (e.g., user request records of multiple areas of a large website) and transmitting all Data to a cloud for processing. As shown in fig. 2, Data sources (Data sources) in different geographic locations continuously generate a large amount of Data, Data analysis applications are deployed in the cloud, and the Data sources are connected to Data centers in different locations. In the model, once the data in the data source is generated, the data is transferred to the data center in real time to be processed in an incremental mode, wherein the incremental mode is that only the newly added data is calculated, and the intermediate data generated in the past can be reused. In particular, mappers for Map operations and reducers for Reduce operations are deployed in each data center.

Due to the aforementioned GEO execution path in terms of MapReduce computation across data centers, there are two corresponding phases to the data migration process: in the first stage, data can be transferred to any data center for Map operation; in the second phase, the intermediate data generated by Map operations of the data centers must be transferred to a single data center in consideration of the relevance between the data centers. As shown in FIG. 2, the thick lines represent example execution paths, which show that the original data from data sources 1 and 2 are transferred to multiple data centers for Map operations, and then the intermediate data output by each Map operation is aggregated into the Reducer of data center 1 for Reduce operations.

Formally, consider a set of D geographically distributed data centers, indexed by d (1 ≤ d ≤ D), and a set of virtual machine types, indexed by k. Each virtual machine type has a different CPU and memory configuration, i.e. a specific computing speed v_k, and each data center can provide all types of virtual machines. Data are generated dynamically at R locations, indexed by r (1 ≤ r ≤ R). Data generated at any location can be transferred to any data center for the Map operation, and the intermediate data generated by each Mapper is then aggregated at a single data center for the Reduce operation. For better practicality, this embodiment assumes that the bandwidth B_rd from data location r to data center d is limited and is the bottleneck that affects system performance, while the network bandwidth inside a data center is very high. In addition, the data generated by each region are independent; resource prices (e.g., for virtual machines and storage) differ between data centers and also vary over time.

The cloud computing system operates in slotted time, t = 0, 1, …, T. In each time slot, the data service provider needs to make the following decisions:

(1) How much data should be moved from data location r to data center d?

(2) How many resources should be leased from each data center to process the data?

(3) Which data center should be selected to perform the Reduce operation?

Our goal is to minimize the overall cost of cloud big data analysis and to guarantee processing delays over long runs. Based on the above system model, we mathematically model the problem, describing three decisions using three decision variables.

Data allocation variables: these denote the amount of data transferred from data location r to data center d at time t, meaning that the data generated at each location can be transferred to any data center for data analysis. Let a_r(t) be the amount of data generated at location r at time t, which is bounded by the maximum amount of data that location r generates in any time slot. Thus, we have:

Equation (2) ensures that, for each location r, the sum of the data distributed to all data centers at a given time equals the total amount of data generated at that location at that time. The data allocation variables are collected into a single set.

Virtual machine supply variables: these record the numbers of type-k virtual machines provided by data center d at time t for Map operations and for Reduce operations, respectively; these numbers can change continuously over time. Because the computing resources of a single data center are limited, the number of type-k virtual machines in data center d has a maximum value. Therefore:

The above constraint means that, in a particular data center, the number of resources used for Map operations and Reduce operations does not exceed the number of resources available in that data center. The set of supply variables is defined accordingly; n(t) can be defined similarly.

Reducer selection variable: all data generated by the Mappers at time t is aggregated at a single data center for the Reduce operation, and x_d(t) is defined as a binary variable. When x_d(t) = 1, data center d is selected to perform the Reduce operation; otherwise it does not perform the Reduce operation. Namely:

where the condition that the x_d(t) sum to 1 over all data centers ensures that only one data center is selected for the Reduce operation at time t. The set of selection variables is defined accordingly.

The cost is further described in terms of 3 decision variables. The goal of the data service provider is to minimize the overall costs incurred by the system by optimizing the data distributed to each data center, the resources provided by each data center, and the appropriate Reduce target data center at a given time. The present embodiment considers the following costs: bandwidth costs, storage costs, latency costs, computational costs, and migration costs.

Bandwidth cost: in general, the bandwidth price differs between VPN links because they belong to different network operators. Given the price of transferring 1 Gb of data from data source r to data center d, the total bandwidth cost of transferring data into the cloud at time t is the sum, over all data sources and data centers, of this price multiplied by the amount of data transferred.

Storage cost: because a large amount of data needs to be analyzed, the storage price is an important factor in selecting a data center. The total storage cost at time t is expressed in terms of the data storage price of data center d and W_d(t), the amount of data that data center d has not yet processed. In particular, from formula (16) and formula (18), we obtain W_d(t) = M_d(t) + R_d(t).

Delay cost: the delay in uploading data to the data center also has a significant impact on system performance and should be minimized during data processing. The delay between location r and data center d is determined by the geographical distance between the data source and the data center, and in practice it can be measured with a simple command such as ping. We convert the delay into a monetary cost, so the delay cost is defined using α, the conversion factor between delay and monetary cost.

The sum of the bandwidth cost, storage cost and delay cost is:

Computation cost: because virtual machine prices change constantly over time, the number of virtual machines rented from each data center is critical to both the overall cost and the performance of a big data analytics application. Given the price of type-k virtual machines in data center d at time t, the computation cost is calculated as follows:

Migration cost: in many applications, analyzing data requires not only the new data at the current time but also historical data (e.g., when new data arrives, incremental data analysis reuses historical computation results rather than recalculating them). Therefore, historical intermediate data generated at other data centers must be migrated to the selected Reducer, which inevitably incurs a data migration cost. Here f_i(τ) is the intermediate data generated by the Map operations in data center i at time τ; for a particular application it can be estimated, since a factor γ relates the amount of raw data to the amount of intermediate data output. β_τ ∈ [0,1] indicates the proportion of historical data that needs to be migrated and must satisfy β_a < β_b (a < b), meaning that the significance of historical data decreases with age; the specific value is determined by the particular application. In addition, let Φ_id(·) be the migration cost function (including bandwidth cost and latency cost) for migrating data from data center i to data center d, which can be determined by the bandwidth price and the geographic distance between the two data centers. Since data migration inside the same data center is not considered, the migration cost function must satisfy Φ_id(·) = 0 when i = d. Thus, the total migration cost generated by the system at time t is:

based on the above mathematical description of 5 costs, the total cost generated by the system at time t can be described as:
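To make the five cost terms tangible, the following Python sketch evaluates them for one time slot. It is an illustration under explicit assumptions rather than the formulas of the embodiment: every argument name is hypothetical, the storage cost is assumed to be charged on the unprocessed backlog plus the newly arrived data, the migration cost function is assumed to be linear (a unit price multiplied by the volume moved), and `history[i]` is assumed to already hold the weighted amount of historical intermediate data at data center i.

```python
def slot_cost(alloc, bw_price, delay, alpha, storage_price, backlog,
              vm_price, map_vms, reduce_vms, reducer, history, mig_price):
    """Evaluate the five cost terms for one time slot (illustrative only).

    alloc[r][d], bw_price[r][d], delay[r][d]: per source / data center pair
    storage_price[d], backlog[d]:             per data center
    vm_price[d][k], map_vms[d][k], reduce_vms[d][k]: per data center / VM type
    reducer:        index of the data center chosen for the Reduce operation
    history[i]:     historical intermediate data at data center i to migrate
    mig_price[i][d]: assumed linear unit price for moving data from i to d
    """
    sources = range(len(alloc))
    centers = range(len(storage_price))

    bandwidth = sum(alloc[r][d] * bw_price[r][d] for r in sources for d in centers)
    latency = alpha * sum(alloc[r][d] * delay[r][d] for r in sources for d in centers)
    # Assumption: storage is paid for the backlog plus the data arriving now.
    storage = sum(storage_price[d] * (backlog[d] + sum(alloc[r][d] for r in sources))
                  for d in centers)
    compute = sum(vm_price[d][k] * (map_vms[d][k] + reduce_vms[d][k])
                  for d in centers for k in range(len(vm_price[d])))
    # No cost is charged for data that already sits at the selected reducer.
    migration = sum(mig_price[i][reducer] * history[i]
                    for i in centers if i != reducer)

    total = bandwidth + latency + storage + compute + migration
    return {"bandwidth": bandwidth, "delay": latency, "storage": storage,
            "compute": compute, "migration": migration, "total": total}
```

Returning the individual terms alongside the total makes it easy to see which cost dominates for a given decision, in the spirit of the per-cost curves of Fig. 5.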

Thus, the problem of minimizing the average cost of data migration and data processing over the period [0, T] can be formalized as:

where the constraints involve the following time averages over the horizon T: the average amount of data assigned to data center d, the average number of virtual machines that data center d provides for Map operations, the average number of virtual machines that data center d provides for Reduce operations, and the mean amount of data input into data center d for Reduce operations. Constraint (15) guarantees the stability of the Map load queue M_d and the Reduce load queue by ensuring that the data arrival rate does not exceed the average data processing rate.

Since data generation is random, x is an integer-constrained variable, and h_i(t) is a nonlinear function, the above problem is easily verified to be a stochastic integer nonlinear optimization problem. In general, when T is very large, it is difficult to solve this problem efficiently with a centralized approach. In view of this, the present embodiment uses the Lyapunov optimization framework. The distinctive advantage of the Lyapunov optimization method is that, by greedily minimizing the drift-penalty term in each time slot, it obtains a solution that is provably close to the offline optimal solution without requiring any information about the future. In this embodiment, problem P1 is first converted into the problem of minimizing the Lyapunov drift-penalty term, and an algorithm is then designed to solve it.

Since the present embodiment considers incremental data processing, the data processing process can be modeled as an evolving queue model. In each data center, in order to describe the two phases of data processing MapReduce, the corresponding queues are designed as follows:

In the Map phase: let M_d(t) be the amount of unprocessed data in the Map queue of data center d at time t. With the initialization M_d(0) = 0, the update of the queue can be described as follows:

The above update rule involves two quantities: the amount of data processed at data center d at time t and the amount of data newly arrived at data center d at time t.

To guarantee that the worst-case delay of queue M_d(t) is l_m, a corresponding virtual queue Y_d(t) is designed. It is likewise initialized as Y_d(0) = 0 and follows the update rule:

where the first indicator equals 1 when M_d(t) > 0 and 0 otherwise; similarly, the second indicator equals 1 when M_d(t) = 0 and 0 otherwise. ε_d is a preset constant that controls the worst-case delay of the Map queue. It can be shown that if the queues M_d(t) and Y_d(t) are bounded, the maximum delay of data processing is l_m time slots, where l_m is determined by the maximum lengths of the queues M_d(t) and Y_d(t).
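The behaviour of the actual and virtual Map queues can be sketched in Python as follows. The update rules here are assumptions, because the update equations themselves are not reproduced in the text: the actual queue uses the standard form "serve, then add arrivals", and the virtual queue follows the ε-persistent pattern common in Lyapunov optimization, growing by ε_d while the actual queue is non-empty and draining at the maximum service rate when it is empty. All function and argument names are hypothetical.

```python
def update_map_queue(backlog, processed, arrived):
    """Actual queue: remove what was processed (never below zero), add arrivals."""
    return max(backlog - processed, 0.0) + arrived


def update_virtual_queue(virtual, backlog, processed, epsilon, max_service):
    """Virtual delay queue in the epsilon-persistent style (assumed form).

    While the actual queue holds data, the virtual queue grows by `epsilon`
    and shrinks by the service actually given; when the actual queue is
    empty, it drains at the maximum possible service rate.
    """
    if backlog > 0:
        return max(virtual + epsilon - processed, 0.0)
    return max(virtual - max_service, 0.0)
```

With these rules, a queue that holds data but receives little service makes the virtual queue grow, which in turn pushes a drift-minimizing controller to allocate more processing capacity; this is how a bounded virtual queue translates into a bounded waiting time.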

In the Reduce stage: similar to the Map phase, the corresponding queue in data center d is R_d(t) (with R_d(0) = 0), and the update of this queue is as follows:

where the additional term is the historical intermediate data of time u migrated from other data centers at time t. From the above equation, the system allows only a portion of the data in the same time slot to be processed and migrated together with the intermediate data. In an actual deployment, the system waits for all intermediate results before outputting the final result.

Accordingly, its virtual queue can be defined as:

In theory, the worst-case delay of queue R_d(t) can be guaranteed in the same way.

Let M(t) = [M_d(t)], Y(t) = [Y_d(t)], R(t) = [R_d(t)] and Z(t) = [Z_d(t)]; together they form the joint matrix of the Map queues and the Reduce queues. To measure the degree of congestion of the system during data processing, let Θ(t) = [M(t); R(t); Y(t); Z(t)]. The Lyapunov function can be defined as follows:

where L(Θ(t)) represents the queue backlog in the system. To guarantee queue stability by keeping the Lyapunov function in a low-congestion state, we introduce the one-slot Lyapunov drift as follows:

According to Lyapunov optimization theory, the drift-penalty factor is obtained by adding the system cost function to the Lyapunov drift:

where V is a non-negative factor that balances overall system cost against stability. Intuitively, the larger V is, the lower the system cost and the larger the queue backlog, and vice versa. Therefore, problem P1 can be transformed into solving problem P2:

P2.min：(22) (23)

s.t.：(10)(11)(12)(13)(14). (24)
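For orientation, the Lyapunov function, the 1-slot drift, and the drift-penalty factor referred to above usually take the following standard forms in Lyapunov optimization. This is a sketch that assumes the common quadratic Lyapunov function and writes C(t) as shorthand for the total system cost at time t; the index set \mathcal{D} for data centers and the symbol \Delta are notational assumptions.

```latex
\begin{align}
L(\Theta(t)) &= \frac{1}{2}\sum_{d\in\mathcal{D}}
  \left[ M_d(t)^2 + R_d(t)^2 + Y_d(t)^2 + Z_d(t)^2 \right], \\
\Delta(\Theta(t)) &= \mathbb{E}\left[ L(\Theta(t+1)) - L(\Theta(t)) \mid \Theta(t) \right], \\
\text{drift-penalty factor} &= \Delta(\Theta(t)) + V\,\mathbb{E}\left[ C(t) \mid \Theta(t) \right].
\end{align}
```

Minimizing the third expression in every slot trades queue growth (the drift) against cost (the penalty), with V setting the exchange rate between the two.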

To solve problem P2, we minimize an upper bound of expression (22) rather than minimizing the expression directly. This approach has been shown not to affect the optimality of the result or the performance of the algorithm. The core of the problem is therefore to find the upper bound of expression (22). It can be shown that for any decision scheme, expression (22) satisfies:

wherein,

We extract the reducer selection problem from problem P2 by analyzing the right-hand side of inequality (25). The terms on the right-hand side that involve the reducer selection variable can be written as:

Because τ ∈ [t−μ, t−1], f_i(τ) is already known at time t, and so are the quantities derived from it. The problem is thus reduced to a simple max-flow problem, whose solution is easily obtained:

wherein

The present embodiment provides an online algorithm that runs for a long time as follows:
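As a rough illustration of the reducer selection step inside such an online loop: because exactly one data center is selected per slot, a cost that is linear in x_d(t) is minimized by scoring each candidate data center and picking the smallest score. The Python sketch below assumes a score of V times a linear migration cost plus a queue-weighted term. The function name `select_reducer`, its arguments, and this particular score are illustrative assumptions, not the expression derived in this embodiment.

```python
from typing import List, Sequence


def select_reducer(
    migrated: Sequence[float],             # intermediate data held at data center i
    unit_cost: Sequence[Sequence[float]],  # assumed linear price from i to d; 0 when i == d
    r_queue: Sequence[float],              # Reduce queue backlog R_d(t)
    z_queue: Sequence[float],              # virtual queue backlog Z_d(t)
    v: float,                              # cost/stability trade-off factor V
) -> List[int]:
    """Return a one-hot list x with x[d] == 1 for the chosen reducer."""
    n = len(migrated)
    total = sum(migrated)
    scores = []
    for d in range(n):
        # Cost of moving every other data center's intermediate data to d.
        migration = sum(unit_cost[i][d] * migrated[i] for i in range(n))
        # Assumed score: weighted migration cost plus backlog pressure at d.
        scores.append(v * migration + (r_queue[d] + z_queue[d]) * total)
    best = min(range(n), key=scores.__getitem__)
    return [1 if d == best else 0 for d in range(n)]
```

With empty queues the data center that already holds most of the intermediate data wins; a heavily backlogged data center loses the selection even when migrating to it would be cheap.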

The effectiveness of the algorithm is verified by comparison experiments. We evaluate the algorithm using the WorldCup98 dataset, which records user visits to the 1998 World Cup website from April 30 to July 26, 1998; the data comes from 30 servers in 4 locations (Paris 4, Herndon 10, Plano 10, Santa Clara 6). Each record contains the following details: request time, requesting client, requested object, the server that handled the request, and so on. We extracted one week of data, from June 21 to June 27, for the experiments. To simulate a large-scale website, the original number of requests was scaled up 1000 times, the requests were aggregated every 30 minutes, and the record content of each request was taken to be 100 KB, which yields the data variation curve shown in Fig. 3.
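The preprocessing described above can be sketched in Python as follows. The function name, the representation of request times as integer seconds, and the choice to report volumes in KB are assumptions made for illustration; only the 30-minute slot, the 1000-fold scaling, and the 100 KB record size come from the text.

```python
from collections import Counter
from typing import Dict, Iterable

SLOT_SECONDS = 30 * 60   # requests are aggregated every 30 minutes
SCALE = 1000             # each original request is counted 1000 times
RECORD_KB = 100          # record content of one request, in KB

# One week of 30-minute slots.
SLOTS_PER_WEEK = 7 * 24 * 3600 // SLOT_SECONDS


def slot_volumes_kb(timestamps: Iterable[int], start: int) -> Dict[int, int]:
    """Map slot index -> data volume in KB for request timestamps in seconds."""
    counts = Counter(
        (ts - start) // SLOT_SECONDS for ts in timestamps if ts >= start
    )
    return {slot: n * SCALE * RECORD_KB for slot, n in sorted(counts.items())}
```

For example, three requests in the first half hour and one in the second produce 300000 KB and 100000 KB for slots 0 and 1.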

In the experiment, we assume that the model contains four data sources (corresponding to the four data locations of the dataset: Santa Clara, Plano, and Herndon in the USA, and Paris) and 12 data centers (corresponding to 12 Amazon server locations in Europe and America: Ashburn, Dallas, Los Angeles, Miami, Newark, Palo Alto, Seattle, St. Louis, Amsterdam, Dublin, Frankfurt, and London). Five types of virtual machine instances provided by Amazon EC2 (c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge) are considered. The distances between the data centers and the data sources are obtained with an online tool.

The model parameters were set as follows: the link delay between a data source and a data center is measured by the round-trip time (RTT), with RTT (ms) = 0.02 × distance (km) + 5; the virtual machine prices and the storage prices follow Amazon's spot instance prices and S3 prices, respectively; the unit price of data uploaded over link <r, d> lies in [0.1, 0.25] dollar/GB; the data migration cost is set as a linear function of the data volume; only the intermediate data of the previous two time slots is used as historical data, i.e. β_{t−1} > β_{t−2} > β_{t−3}; the other parameters are V = 60, γ = 0.5, α = 0.01, ε_d = 1, and σ_d = γ × ε_d.
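The parameter settings above translate directly into a few lines of Python. The RTT formula and the constants are taken from the text; the `delay_cost` function assumes that the delay cost is α times data volume times RTT, which is an illustrative assumption rather than the expression used in this embodiment.

```python
V = 60          # cost/stability trade-off factor
GAMMA = 0.5
ALPHA = 0.01    # conversion factor between delay and economic cost
EPSILON_D = 1.0
SIGMA_D = GAMMA * EPSILON_D


def rtt_ms(distance_km: float) -> float:
    """Link delay model from the text: RTT (ms) = 0.02 * distance (km) + 5."""
    return 0.02 * distance_km + 5.0


def delay_cost(distance_km: float, data_gb: float, alpha: float = ALPHA) -> float:
    """Assumed form of the delay cost: alpha * data volume * RTT."""
    return alpha * data_gb * rtt_ms(distance_km)
```

Under this model a 1000 km link has an RTT of about 25 ms, and the fixed 5 ms term dominates for nearby data centers.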

The experiments mainly consider two indicators, cost and queue length: cost represents the economic aspect of the system, and queue length describes its stability. For comparison, we use the Cost Ratio (CR), the ratio of the cost of a given case to the total cost, as an index. It is calculated as CR = C_cur / Σ_{i=1}^{N} C_i, where C_i is the cost of case i, C_cur is the cost of the current case, and N is the total number of cases.

We performed experiments with fixed parameters to show the validity of MiniBDP (the name of the algorithm implemented by the invention). Fig. 4 shows the total cost of the system over time. As can be seen from Fig. 3 and Fig. 4, the total system cost follows the change in data volume, which shows that MiniBDP can adaptively and dynamically adjust the supply of virtual machines to meet the changing data processing demand without predicting future demand. Fig. 5 shows the individual costs (processing cost, storage cost, bandwidth cost, delay cost, and migration cost) over time. The results show that the data processing cost accounts for a large portion of the total cost, while the other types of cost stay at a lower level. Viewed from another angle, this shows that the algorithm selects suitable data centers for data processing and thereby reduces additional costs.

To examine the internal behavior of the algorithm, we show detailed results for data allocation and reducer selection. As can be seen from Fig. 7 and Fig. 8, the results exhibit data localization: data tends to be sent for processing to data centers near the data source. Even though North America is less expensive than Europe, data generated in Paris is rarely transferred to the North American data centers (Ashburn, Dallas, Los Angeles, Miami, Newark, Palo Alto, Seattle, St. Louis). Fig. 9 shows the number of times each data center is selected as the Reducer; most of the Reduce operations are concentrated in the data centers in North America. This is because it is more economical to migrate intermediate data from the 4 data centers in Europe to the 8 data centers in North America than to migrate in the opposite direction.

We also analyzed the effect of parameter V on algorithm performance through experiments. Fig. 6 shows how cost and queue length vary with V. The time-averaged cost of the system decreases as V increases, and when V is large enough the average cost reaches a minimum. This result provides theoretical guidance for reducing cost when deploying real systems. However, as V increases, the queue length also increases, which in turn causes data processing delays. It is therefore important to select an appropriate V that balances the total system cost against delay.
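The trade-off governed by V can be reproduced on a toy problem. The Python sketch below is not MiniBDP: it assumes a single queue with a constant arrival of one unit per slot, a price that alternates between cheap and expensive slots, and the simple drift-plus-penalty rule "process only when the backlog exceeds V times the price". All names and numbers are hypothetical.

```python
from typing import Tuple


def simulate(v: float, slots: int = 1000) -> Tuple[float, float]:
    """Return (time-averaged cost, time-averaged backlog) for a toy queue."""
    arrival, capacity = 1.0, 3.0   # units of data per slot (hypothetical)
    prices = (1.0, 3.0)            # cheap slot, expensive slot, alternating
    queue = 0.0
    total_cost = 0.0
    total_backlog = 0.0
    for t in range(slots):
        price = prices[t % 2]
        total_backlog += queue
        # Drift-plus-penalty rule: minimizing V * price * b - queue * b over
        # b in [0, capacity] means serving only if queue > V * price.
        served = min(capacity, queue) if queue > v * price else 0.0
        total_cost += price * served
        queue = queue - served + arrival
    return total_cost / slots, total_backlog / slots
```

With V = 0 the queue is drained in every slot regardless of price; with V = 2 the backlog is allowed to build up so that all processing shifts to the cheap slots, giving a lower average cost at the expense of a longer queue.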

We also compare the algorithm MiniBDP herein with other algorithms that are combined from different data allocation policies, resource provisioning policies and Reducer selection policies.

For the data allocation part, three representative strategies are considered: ① the proximity-aware data allocation strategy (PDA), which sends the data generated by each data source to the nearest data center; this strategy has the minimum delay and suits delay-sensitive scenarios; ② the load-balancing data allocation strategy (LBDA), which sends data to the data center with the lowest load; ③ the strategy abbreviated MPDA in the schemes listed below.

For the resource provisioning part, two simple strategies are considered: ① the heuristic strategy (HVP), which determines the virtual machine supply at the current time from the resource demand at past times; specifically, it supplies the amount of resources required in the previous slot plus an additional 50%, to cope with strong load fluctuation; ② the fixed strategy (SVP), which keeps a fixed supply of each type of virtual machine. For comparison, the fixed value is set to the average of the results obtained by MiniBDP, so that the total supply of this strategy over the T time slots equals that of MiniBDP.

For the Reducer selection part, two strategies are considered: ① minimum migration cost selection (MCRS), which selects the data center with the lowest data migration cost as the Reducer; ② load balance selection (LBRS), which selects the data center with the lowest Reduce load as the Reducer.
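The baseline strategies described above are simple enough to sketch in a few lines of Python. Each selection strategy is read here as an argmin over one per-data-center quantity, and HVP as 1.5 times the previous slot's requirement; the function names and argument shapes are assumptions made for illustration.

```python
from typing import Sequence


def argmin(values: Sequence[float]) -> int:
    """Index of the smallest value (first one on ties)."""
    return min(range(len(values)), key=values.__getitem__)


def pda(distance_km: Sequence[float]) -> int:
    """PDA: send a source's data to the nearest data center."""
    return argmin(distance_km)


def lbda(map_load: Sequence[float]) -> int:
    """LBDA: send data to the data center with the lowest load."""
    return argmin(map_load)


def mcrs(migration_cost: Sequence[float]) -> int:
    """MCRS: pick the reducer with the lowest data migration cost."""
    return argmin(migration_cost)


def lbrs(reduce_load: Sequence[float]) -> int:
    """LBRS: pick the reducer with the lowest Reduce load."""
    return argmin(reduce_load)


def hvp(prev_required: float) -> float:
    """HVP: previous slot's requirement plus an additional 50%."""
    return 1.5 * prev_required


def svp(fixed_supply: float) -> float:
    """SVP: the same fixed supply in every slot."""
    return fixed_supply
```

Each baseline optimizes a single criterion in isolation, which is what makes them useful points of comparison for a method that optimizes the three decisions jointly.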

Thus, combining the above strategies can form different scenarios as follows:

MiniBDP；

SVP+PDA+MCRS、SVP+PDA+LBRS、SVP+LBDA+MCRS；

SVP+LBDA+LBRS、SVP+MPDA+MCRS、SVP+MPDA+LBRS；

HVP+PDA+MCRS、HVP+PDA+LBRS、HVP+LBDA+MCRS；

HVP+LBDA+LBRS、HVP+MPDA+MCRS、HVP+MPDA+LBRS。
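The schemes listed above are the Cartesian product of the two provisioning strategies, the three allocation strategies, and the two Reducer selection strategies, plus MiniBDP itself. A short Python sketch that enumerates them (the variable names are illustrative):

```python
from itertools import product

PROVISIONING = ["SVP", "HVP"]
ALLOCATION = ["PDA", "LBDA", "MPDA"]
SELECTION = ["MCRS", "LBRS"]

# MiniBDP first, followed by every combination of baseline strategies.
schemes = ["MiniBDP"] + [
    "+".join(combo) for combo in product(PROVISIONING, ALLOCATION, SELECTION)
]
```

This gives 2 × 3 × 2 = 12 combined baselines, so 13 schemes are compared in total.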

Fig. 10 compares the time-averaged cost of the different schemes. As shown in Fig. 10:

However, because the loads of the two schemes SVP+PDA+MCRS and SVP+PDA+LBRS grow over time, long-term operation of the system cannot be guaranteed, so these two schemes are not feasible in practice. The MiniBDP algorithm also exhibits data localization; taking the above results together, MiniBDP balances data localization against system stability.

The highest costs are incurred by HVP+LBDA+MCRS and HVP+LBDA+LBRS. Both schemes use the load-balancing data allocation strategy, which distributes data evenly across all data centers and migrates large volumes of data from the USA to Paris without considering delay cost or resource price, which inevitably results in high delay cost and computation cost.

As shown in Fig. 11, MiniBDP is the most stable over a long run, because its queue length remains the most stable. In contrast, the queue lengths of the other strategies grow over time, which inevitably leads to a breakdown of the system. Note again that the SVP resource provisioning strategy supplies the same total amount of resources as MiniBDP, yet it incurs a higher cost and lower system stability; MiniBDP can therefore jointly optimize the three decisions to reduce the overall cost and improve system stability. As mentioned above, HVP supplies the amount of virtual machine resources required in the previous slot plus an additional 50%; the schemes using this strategy do not perform well, because their queue lengths are not stable over time.

In addition, we compared MiniBDP with the offline optimal result. The original problem contains 60480 variables (in each slot, m and n each contain 60 variables, x contains 12 variables, and λ contains 48 variables, so there are 180 × 336 variables over 336 slots), and it is very difficult to solve such a large-scale integer nonlinear programming problem efficiently on a PC with existing optimization tools (e.g., GLPK, CPLEX, lp_solve). We therefore divide the T slots into several time segments at regular intervals and solve each segment separately. Since the data arrival rate is known in this setting, the result obtained is an offline sub-optimal solution. In this case, the maximum delay of data processing is effectively the interval length, because data must be processed to completion within each interval. In the experiments, we compared the effect of different intervals on the results.
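The variable count can be checked with a few lines of Python. The sketch reads the 60 variables of m and of n as 12 data centers × 5 virtual machine types, and the 48 variables of λ as 4 data sources × 12 data centers; this decomposition is an interpretation of the experimental setup, and the names are illustrative.

```python
DATA_CENTERS = 12
VM_TYPES = 5
DATA_SOURCES = 4
SLOTS = 7 * 48   # one week of 30-minute slots

per_slot = {
    "m": DATA_CENTERS * VM_TYPES,           # Map virtual machine counts
    "n": DATA_CENTERS * VM_TYPES,           # Reduce virtual machine counts
    "x": DATA_CENTERS,                      # reducer selection
    "lambda": DATA_SOURCES * DATA_CENTERS,  # data allocation
}
variables_per_slot = sum(per_slot.values())
total_variables = variables_per_slot * SLOTS
```

The totals come to 180 variables per slot and 60480 variables over the week, matching the figures quoted above.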

Fig. 12 compares the accumulated cost over time for different intervals (optimal-x denotes an interval of x slots). As shown in Fig. 12, the cost of MiniBDP is lower than in the cases with intervals of 1, 2, and 4, and among those cases the cost is lower the larger the interval is. We believe this is mainly due to two reasons: first, under optimal-1, optimal-2, and optimal-4, data processing must be completed within 1, 2, and 4 time slots, respectively; second, a smaller interval requires more virtual machine resources in order to complete data processing more quickly. MiniBDP, in contrast, has a soft delay control mechanism that can be tuned through the parameters ε and σ; setting a longer delay reduces the overall cost. MiniBDP was also compared with the offline sub-optimal method in terms of solving time, and the experimental results show that the solving time of MiniBDP is far lower, which is a clear advantage.

In summary, the present invention designs a theoretical framework for data movement that aims to minimize the total cost. With this technical scheme, the drift-penalty factor of the objective function and its upper bound are obtained using the Lyapunov optimization framework and the reducer selection terms are extracted, so that the costs across data centers are balanced and the scheduling problem of the cross-data-center cloud computing system is solved with high performance, high availability, and minimum cost. Five types of cost generated in the two stages of cross-data-center MapReduce data processing are balanced: bandwidth cost, storage cost, computation cost, migration cost, and delay cost. The complex cost optimization problem is modeled as a joint stochastic integer nonlinear optimization problem that minimizes these five costs simultaneously. Using the Lyapunov technique, the original problem is transformed into the corresponding reducer selection sub-problem. A detailed theoretical analysis of the MiniBDP algorithm establishes its performance in terms of cost optimality, worst-case delay, and so on. Based on real-world historical data, simulation experiments verify the correctness of the theoretical analysis and the superiority of MiniBDP over other typical algorithms.

Those of ordinary skill in the art will understand that the invention is not limited to the specific embodiments described above; all modifications, changes, and equivalents that come within the spirit and scope of the invention are intended to be covered.

Claims

1. A reducer selection method for a cross-data-center cloud computing system, wherein the cloud computing system comprises a plurality of data sources and a plurality of data centers, each data center comprising a mapper and a reducer; when the cloud computing system performs data migration, data of a data source is first transferred to the mapper of any data center to perform the Map operation and generate intermediate key-value pairs, and the intermediate key-value pairs are then transferred from the mapper of any data center to the reducer of a single data center to perform the Reduce operation; the cloud computing system operates in time slots, denoted t = 0, 1, …, T;

the method comprises the following steps:

acquiring system parameters from the cloud computing system; the system parameters include: a data center set, a virtual machine type set, and a data set; the data volume transferred from a given data source to a given data center at a given time, the data volume generated by a given data source at a given time, and the maximum data volume generated by a given data source at each time; the price for a given data source to transfer a unit of data to a given data center, the storage price of a unit of data in a data center, the data volume not yet processed by a data center at a given time, the delay from a given data source to a given data center, the conversion factor between delay and economic cost, the price of a given type of virtual machine in a given data center at a given time, the data volume transferred out of a given data center at a given time, and the migration cost between two data centers;

obtaining an objective function and a constraint according to the system parameters, comprising: describing decision variables using the system parameters; describing costs of a data center using the system parameters and the decision variables; describing an objective function and a constraint according to the decision variables and the cost of the data center; the decision variables include data distribution variables, virtual machine supply variables, and reducer selection variables, and the describing the decision variables using the system parameters includes:

describing the data allocation variable using the data volume transferred from a given data source to a given data center at a given time, the data volume generated by a given data source at a given time, and the maximum data volume generated by a given data source at each time: the data allocation variable, an element of λ(t), is the amount of data transferred from data location r to data center d at time t; let a_r(t) be the amount of data generated at location r at time t, bounded by the maximum amount of data generated at location r in any time slot; the expression is:

in the above formula, the two index sets are the set of geographically distributed data locations and the set of data centers; the variable set corresponding to the data allocation variable is λ(t);

describing the virtual machine supply variables using the numbers of virtual machines of a given type provided by a given data center at a given time for the Map operation and for the Reduce operation: the elements of m(t) and n(t) denote the numbers of type-k virtual machines provided by data center d at time t for the Map operation and the Reduce operation, respectively; let the number of type-k virtual machines in data center d be bounded by a maximum; the expression is:

the variable sets corresponding to the virtual machine supply variables are m(t) and n(t);

describing the reducer selection variable using the data center that collects all the data generated by the mappers at a given time; the reducer selection variable is x_d(t), defined as a binary variable; x_d(t) = 1 indicates that data center d is selected to perform the Reduce operation, and otherwise it does not perform the Reduce operation; the expression is:

in the above formula, the constraint Σ_d x_d(t) = 1 ensures that only one data center is selected for the Reduce operation at time t; the variable set corresponding to the reducer selection variable is x(t);

The cost of the data center includes:

bandwidth cost: given the price of transferring 1 Gb of data from data source r to data center d, the total bandwidth cost of transferring the data into the cloud at time t is:

storage cost: let the storage price of data center d and W_d(t) denote, respectively, the data storage price and the data volume not yet processed in data center d; the total storage cost at time t is:

delay cost: given the delay between location r and data center d, the delay cost is obtained using α, the conversion factor between delay and economic cost;

the sum of the bandwidth cost, the storage cost, and the delay cost, denoted C_sbl(λ(t)), is:

computation cost: given the price of type-k virtual machines in data center d at time t, the computation cost C_p(m(t), n(t)) is expressed as:

migration cost: the amount of data transferred out of data center i at time t is defined in terms of f_i(τ), the intermediate data generated by data center i at time τ, and β_τ ∈ [0, 1], the proportion of the historical data that needs to be migrated, which satisfies β_a < β_b for a < b; let Φ_id(·) be the migration cost function for migrating data from data center i to data center d, which satisfies Φ_id(·) = 0 when i = d;

the total migration cost generated by the system at time t is:

the total cost of the system at time t is:

C(m(t)，n(t)，λ(t)，x(t))＝C_p(m(t)，n(t))+C_sbl(λ(t))+C_mgr(m(t)，x(t))

describing an objective function and a constraint according to the decision variables and the cost of the data center, and expressing as follows:

wherein the time-averaged quantities are, respectively: the average amount of data allocated to data center d over the T time slots; the average number of virtual machines provided by data center d for the Map operation; the average number of virtual machines provided by data center d for the Reduce operation; and the average amount of intermediate data input into data center d for the Reduce operation;

obtaining the drift-penalty factor of the objective function and its upper bound using the Lyapunov optimization framework, comprising: modeling the data processing as evolving queues using an incremental data processing approach;

in the Map stage, let M_d(t) be the amount of unprocessed data in the Map queue of data center d at time t, initialized with M_d(0) = 0; the queue update is then described as:

to guarantee a worst-case delay of l_m for queue M_d(t), a corresponding virtual queue Y_d(t) is designed, initialized with Y_d(0) = 0, which follows the update rule:

wherein the indicator 1_{M_d(t)>0} equals 1 when M_d(t) > 0 and 0 otherwise, and the indicator 1_{M_d(t)=0} equals 1 when M_d(t) = 0 and 0 otherwise; ε_d is a preset constant that controls the worst-case delay of the Map queue; if queues M_d(t) and Y_d(t) are bounded, the maximum delay of data processing is l_m time slots, a bound determined by the maximum lengths of queues M_d(t) and Y_d(t);

in the Reduce stage, the corresponding queue in data center d is R_d(t), and the update process of this queue is:

wherein the migrated-data term denotes the historical intermediate data of time u that is migrated from other data centers at time t; accordingly, its virtual queue is:

let M(t) = [M_d(t)], Y(t) = [Y_d(t)], R(t) = [R_d(t)], and Z(t) = [Z_d(t)], which together form the joint matrix of the Map queues and the Reduce queues; to measure the degree of congestion of the system during data processing, let Θ(t) = [M(t); R(t); Y(t); Z(t)]; the Lyapunov function is defined as follows:

wherein L (Θ (t)) represents the queue backlog condition in the system;

the 1-slot Lyapunov drift is introduced as follows:

adding the system cost function to the 1-slot Lyapunov drift to obtain the drift-penalty factor:

wherein V is a non-negative factor balancing the total cost and stability of the system;

extracting the reducer selection terms from the upper bound of the drift-penalty factor and generating the reducer selection method, comprising:

the objective function and constraints are transformed into:

the optimal reducer selection method can be obtained by solving the polynomial of the following reducer selection terms:

wherein τ ∈ [t−μ, t−1];

solving the above expression yields the reducer selection:

wherein,

2. The method of claim 1, wherein the cost of the data center comprises bandwidth cost, storage cost, delay cost, computation cost, and migration cost, and wherein describing the cost of the data center using the system parameters and the decision variables comprises:

3. The method of claim 2, wherein describing objective functions and constraints based on the decision variables and the cost of the data center comprises:

only one data center is selected as the reducer at any one time;

the sum of the bandwidth cost, storage cost, delay cost, computation cost, and migration cost is minimal.

4. The method of claim 1, wherein obtaining the drift-penalty factor and its upper bound for the objective function using the Lyapunov optimization framework comprises:

calculating an upper bound for the drift-penalty factor.

5. The method of claim 4, wherein constructing actual queues and virtual queues according to the objective function and constraints, and constructing a Lyapunov function using the Lyapunov optimization framework, comprises:

describing the actual Map queue according to the objective function and constraints and the amount of unprocessed data in the mapper of a given data center at a given time;

describing the virtual Map queue according to the objective function and constraints and the maximum delay of the actual Map queue;

describing the actual Reduce queue according to the objective function and constraints and the amount of unprocessed data in the reducer of a given data center at a given time;

describing the virtual Reduce queue according to the objective function and constraints and the maximum delay of the actual Reduce queue;

6. The method of claim 4, wherein extracting virtual machine supplies from the upper bound of the drift-penalty factor and generating the reducer selection method comprises:

minimizing the polynomial subject to the constraints described by the reducer selection variables;

and generating the reducer selection method according to the solution of the virtual machine supply variables at which the polynomial attains its minimum.