CN108446383A - Data task redistribution method based on geographically distributed data query

Info

Publication number: CN108446383A
Application number: CN201810233064.4A
Authority: CN
Inventors: 黄晶; 黄蛟; 高尚; 杨博
Original assignee: Jilin University
Current assignee: Jilin University
Priority date: 2018-03-21
Filing date: 2018-03-21
Publication date: 2018-08-24
Anticipated expiration: 2038-03-21
Also published as: CN108446383B

Abstract

The invention discloses a data task redistribution method based on geographically distributed data query. The MFHC algorithm is used to redistribute the tasks and data of a distributed system: the data held at different data centers is rearranged, and the data centers that execute query tasks are reassigned, so that a single task can be executed across different data centers. This reduces the total cost consumed by analytical query processing and solves the problem that some data cannot be transferred to other data centers because of privacy or other restrictions. The method thereby lowers the total query cost, improves the query speed of tasks, and saves cost.

Description

A data task redistribution method based on geographically distributed data query

Technical field

The present invention relates to the field of communication technology, and in particular to a data task redistribution method based on geographically distributed data query.

Background art

With the development of networks, data has become ubiquitous in the big data era. To speed up user access to servers and reduce the bandwidth occupied by data transmission during data queries, many companies have built data centers around the world. Microsoft and Amazon, for example, have established tens of data centers worldwide. These data centers, located in different regions, continuously generate large volumes of data, such as user activity logs, server management logs, and performance logs. Analyzing the data distributed across these data centers is an important job: analyzing user query logs can inform advertising decisions, analyzing network logs can detect DoS attacks, and analyzing system logs can support building prediction models. However, analyzing data distributed across different data centers incurs a cost, which consists mainly of the link cost of moving data and the computation cost of running tasks at the data centers. For a company, minimizing this cost is essential.

The prior art mainly uses a centralized approach: the data to be analyzed is transferred from data centers in different regions to a single data center, where the analysis is then carried out. This approach has the following drawbacks: 1) for today's large-scale data applications, the centralized approach must transfer a large amount of data, which wastes considerable bandwidth and takes a long time; 2) the data held by some data centers involves privacy concerns and cannot be freely transferred to other data centers.

Summary of the invention

In view of the above defects or deficiencies, the purpose of the present invention is to provide a data task redistribution method based on geographically distributed data query. By rearranging the data held at different data centers and reassigning the data centers that execute query tasks, a task can be executed across different data centers, which reduces the total cost consumed by analytical query processing.

To achieve the above purpose, the technical solution of the present invention is:

A data task redistribution method based on geographically distributed data query, including:

1) obtaining multiple query tasks;

2) according to the state of the query tasks and the data at the current time, predicting future information by a statistical method, and using the fixed horizon control (FHC) algorithm to minimize the cost over a future period of time, thereby obtaining the data processing centers for future time instants;

3) performing task allocation by the MFHC algorithm, and moving the data from its storage location to the data processing center according to the task allocation;

4) performing SQL operations on the data according to the task allocation, so that the data is moved for processing at the next time instant's operation;

5) repeating steps 1) to 4) until all query tasks are completed.
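The five steps above can be read as a control loop. The following is a minimal Python sketch of that loop, not the patented implementation: the callables `choose_centers`, `move_data`, and `run_sql`, and the `max_steps` guard, are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List

Placement = Dict[str, str]  # task id -> data center id (hypothetical representation)


def run_redistribution(
    tasks: List[str],
    choose_centers: Callable[[List[str], int], Placement],
    move_data: Callable[[Placement], None],
    run_sql: Callable[[str, str], bool],
    max_steps: int = 100,
) -> List[Placement]:
    """Repeat steps 1) to 4) until every query task reports completion."""
    pending = list(tasks)                       # step 1: the obtained query tasks
    history: List[Placement] = []
    for t in range(max_steps):                  # assumed safety bound on iterations
        if not pending:
            break
        placement = choose_centers(pending, t)  # steps 2-3: prediction plus FHC/MFHC choice
        move_data(placement)                    # step 3: move data to the chosen centers
        # step 4: run the SQL work; run_sql returns True when the task is finished
        pending = [g for g in pending if not run_sql(g, placement[g])]
        history.append(placement)
    return history                              # step 5 is the loop itself
```

Passing the decision logic in as callables keeps the loop independent of how the prediction and optimization are actually carried out.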

Step 2) specifically includes:

2.1 Obtain the exchange bandwidth cost C_or occupied by the raw data, the computation cost C_com of running the tasks, and the switching cost C_sw caused by reassigning the data centers that execute the tasks:

where D is the set of data centers, each of which contains several regions; P is the set of regions; G is the set of tasks, where each task g belongs to an analytical query and each analytical query can be represented as a DAG; O_p is the volume of raw data in region p; l_{DC(p),d} is the link cost from the data center where region p is located to data center d; c_{k,g} is the unit computation cost of task g at data center k; b_{k,g} is the total amount of data that task g is to process at data center k; I_{i,g} is the volume of intermediate data that task g is to process at data center i; x_{p,d} is a binary variable equal to 1 if the raw data of region p is to be transferred to data center d and 0 otherwise; y_{d,g} is a binary variable equal to 1 if task g is to be executed at data center d and 0 otherwise;
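The formulas for the three cost terms named in step 2.1 are not reproduced in this text. As a reading aid only, the symbol definitions suggest forms such as the following. This is an assumed reconstruction, not the patent's own equations: the time index t and the inter-data-center link cost l_{i,d} are introduced here for illustration.

```latex
\begin{align}
C_{or}  &= \sum_{p \in P} \sum_{d \in D} O_p \, l_{DC(p),d} \, x_{p,d} \\
C_{com} &= \sum_{g \in G} \sum_{k \in D} c_{k,g} \, b_{k,g} \, y_{k,g} \\
C_{sw}  &= \sum_{g \in G} \sum_{i \in D} \sum_{d \in D} I_{i,g} \, l_{i,d} \, y_{i,g}(t-1) \, y_{d,g}(t)
\end{align}
```

Under this reading, C_or charges each region's raw data volume for every link it is shipped over, C_com charges the data processed at each data center that runs a task, and C_sw charges the intermediate data that must follow a task when its executing data center changes between consecutive time instants.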

2.2 Sum the costs:

2.3 Minimize the total cost using the FHC algorithm:

subject to: for t = τ, ..., τ+ω

The two symbols X and Y involved are defined as:

where f_p is the minimum number of replicas of the data of region p;

The FHC algorithm minimizes the total cost over the time period [t, t+w] to obtain y, and then derives x from the value of y.
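The idea of minimizing cost over a window to obtain y, and then deriving x from y, can be illustrated with a small brute-force sketch for a single task. It is a simplification under stated assumptions, not the optimization defined by the patent: the flat `switch_cost`, the per-step `run_cost` table, and the `needs` mapping from tasks to regions are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple


def fhc_window(
    centers: Sequence[str],
    run_cost: List[Dict[str, float]],  # run_cost[s][d]: predicted cost at center d in step s
    switch_cost: float,                # assumed flat cost of changing data center
    start: str,                        # data center used just before the window
) -> Tuple[float, Tuple[str, ...]]:
    """Enumerate every placement sequence over the window and keep the cheapest."""
    best: Tuple[float, Tuple[str, ...]] = (float("inf"), ())
    for seq in product(centers, repeat=len(run_cost)):
        cost, prev = 0.0, start
        for s, d in enumerate(seq):
            cost += run_cost[s][d] + (switch_cost if d != prev else 0.0)
            prev = d
        if cost < best[0]:
            best = (cost, seq)
    return best


def x_from_y(y_now: Dict[str, str], needs: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Derive x from y: each region's data goes wherever a task that reads it runs."""
    x: Dict[str, Set[str]] = {}
    for task, center in y_now.items():
        for region in needs[task]:
            x.setdefault(region, set()).add(center)
    return x
```

Exhaustive enumeration grows exponentially with the window length, so it is only suitable for illustrating the objective on tiny inputs.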

Step 3) specifically includes:

3.1 At time t, the MFHC algorithm combines the results of (w+1) FHC algorithms and uses a majority voting algorithm to obtain the final y value; the value of x is then determined from the value of y;

3.2 According to the obtained x values, the data is redistributed and transferred to the designated data centers; then, according to the obtained y values, the tasks are executed at the data centers by performing SQL operations on the data.
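The majority-voting step in 3.1 can be sketched as follows. This is an illustrative sketch assuming each of the (w+1) FHC runs proposes exactly one data center per task; the function name `mfhc_vote` and the dictionary representation of y are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def mfhc_vote(proposals: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine the y values proposed by (w+1) FHC runs by per-task majority vote."""
    final: Dict[str, str] = {}
    for task in proposals[0]:
        votes = Counter(p[task] for p in proposals)
        # most_common(1) returns the top entry; ties go to the first-seen proposal
        final[task] = votes.most_common(1)[0][0]
    return final
```

For example, if three FHC runs propose data centers A, B, and A for the same task, the vote selects A.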

Compared with the prior art, the beneficial effects of the present invention are:

The present invention provides a data task redistribution method based on geographically distributed data query. The MFHC algorithm is used to redistribute the tasks and data of a distributed system: the data held at different data centers is rearranged, and the data centers that execute query tasks are reassigned, so that a single task can be executed across different data centers. This reduces the total cost consumed by analytical query processing and solves the problem that some data cannot be transferred to other data centers because of privacy or other restrictions. The method thereby lowers the total query cost, improves the query speed of tasks, and saves cost.

Description of the drawings

Fig. 1 is a flowchart of the data task redistribution method based on geographically distributed data query according to the present invention;

Fig. 2 is a control flow block diagram of the present invention;

Fig. 3 is a flowchart of the MFHC algorithm of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative work fall within the scope of protection of the present invention.

As shown in Fig. 1, the present invention provides a data task redistribution method based on geographically distributed data query:

1) obtaining multiple query tasks;

2) according to the state of the query tasks and the data at the current time, predicting future information by a statistical method, and using the fixed horizon control (FHC) algorithm to minimize the cost over a future period of time, thereby obtaining the data processing centers for future time instants;

Step 2) specifically includes:

D is the set of data centers, each of which contains several regions; P is the set of regions; G is the set of tasks, where each task g belongs to an analytical query and each analytical query can be represented as a DAG; O_p is the volume of raw data in region p; l_{DC(p),d} is the link cost from the data center where region p is located to data center d; c_{k,g} is the unit computation cost of task g at data center k; b_{k,g} is the total amount of data that task g is to process at data center k; I_{i,g} is the volume of intermediate data that task g is to process at data center i; x_{p,d} is a binary variable equal to 1 if the raw data of region p is to be transferred to data center d and 0 otherwise; y_{d,g} is a binary variable equal to 1 if task g is to be executed at data center d and 0 otherwise;

2.2 Sum the costs:

subject to: for t = τ, ..., τ+ω

The two symbols X and Y involved are defined as:

where f_p is the minimum number of replicas of the data of region p;
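The definitions of X and Y are not reproduced in this text. Given that x and y are binary variables and that f_p is a minimum replica count, one plausible reading, stated here purely as an assumption, is that X and Y are the feasible sets of the two decision variables:

```latex
\begin{align}
X &= \Big\{ x \;:\; x_{p,d} \in \{0,1\},\ \sum_{d \in D} x_{p,d} \ge f_p \quad \forall p \in P \Big\} \\
Y &= \Big\{ y \;:\; y_{d,g} \in \{0,1\},\ \sum_{d \in D} y_{d,g} \ge 1 \quad \forall g \in G \Big\}
\end{align}
```

Under this assumed reading, the constraint on X keeps at least f_p copies of each region's data, and the constraint on Y ensures every task runs at one or more data centers. A privacy restriction on region p could then be expressed by fixing x_{p,d} = 0 for every data center d other than DC(p).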

As shown in Fig. 3, step 3) specifically includes:

By way of example, as shown in Fig. 2, A represents a data center executing a task, and the outer ring of A represents the transfer of that data center's data. Initially, the algorithm decides to execute the task at data center A; at the next time instant it decides to execute tasks at D and E, so the intermediate data generated at data center A is transferred to data center D and data center E. After the tasks at these two data centers finish, the final task is to be executed at data center D, so the intermediate result of data center E is transferred to data center D, and the task is finally executed at data center D.
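The Fig. 2 walkthrough above can be traced with a short sketch that lists which intermediate-data transfers follow from a sequence of execution sites. The function name and the representation of the schedule as a list of sets are assumptions made for illustration; the rule that every previous site sends to every new site is likewise a simplification.

```python
from typing import List, Set, Tuple


def intermediate_transfers(schedule: List[Set[str]]) -> List[Tuple[str, str]]:
    """List (source, destination) moves of intermediate data between consecutive steps."""
    moves: List[Tuple[str, str]] = []
    for prev, nxt in zip(schedule, schedule[1:]):
        for src in sorted(prev):
            for dst in sorted(nxt):
                if src != dst:  # data already at dst needs no transfer
                    moves.append((src, dst))
    return moves


# Execution sites from the Fig. 2 example: A, then D and E, then D.
fig2 = [{"A"}, {"D", "E"}, {"D"}]
print(intermediate_transfers(fig2))  # [('A', 'D'), ('A', 'E'), ('E', 'D')]
```

The output matches the narrative: A sends its intermediate data to D and E, and E later sends its intermediate result to D.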

In the present invention, data is initially distributed across different data centers, and multiple query tasks need to be executed. For each task and the given data distribution, the system decides, with the goal of minimizing the overall system cost within a given time period, at which data center each task is executed and to which data centers the data to be processed should be copied. The process is repeated in the next time period.

The present invention uses the FHC algorithm to redistribute the tasks and data of the distributed system.

5) repeating steps 1) to 4) until all query tasks are completed.

It should be noted that, in the description of the present application, unless otherwise stated, "plurality" means two or more.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as would be understood by those skilled in the art to which the embodiments of the present application pertain.

It should be understood that each part of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will appreciate that all or part of the steps carried by the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, includes one or a combination of the steps of the method embodiments.

It will be apparent to those skilled in the art that the above specific embodiments are preferred solutions of the present invention. Improvements and variations that those skilled in the art may make to certain parts of the present invention still embody the principle of the present invention and achieve its purpose, and fall within the scope of protection of the present invention.

Claims

1. A data task redistribution method based on geographically distributed data query, characterized by including:

1) obtaining multiple query tasks;

2) according to the state of the query tasks and the data at the current time, predicting future information by a statistical method, and using the fixed horizon control (FHC) algorithm to minimize the cost over a future period of time, thereby obtaining the data processing centers for future time instants;

3) performing task allocation by the MFHC algorithm, and moving the data from its storage location to the data processing center according to the task allocation;

4) performing SQL operations on the data according to the task allocation, so that the data is moved for processing at the next time instant's operation;

5) repeating steps 1) to 4) until all query tasks are completed.

2. The data task redistribution method based on geographically distributed data query according to claim 1, characterized in that step 2) specifically includes:

2.1 obtaining the exchange bandwidth cost C_or occupied by the raw data, the computation cost C_com of running the tasks, and the switching cost C_sw caused by reassigning the data centers that execute the tasks:

where D is the set of data centers, each of which contains several regions; P is the set of regions; G is the set of tasks, where each task g belongs to an analytical query and each analytical query can be represented as a DAG; O_p is the volume of raw data in region p; l_{DC(p),d} is the link cost from the data center where region p is located to data center d; c_{k,g} is the unit computation cost of task g at data center k; b_{k,g} is the total amount of data that task g is to process at data center k; I_{i,g} is the volume of intermediate data that task g is to process at data center i; x_{p,d} is a binary variable equal to 1 if the raw data of region p is to be transferred to data center d and 0 otherwise; y_{d,g} is a binary variable equal to 1 if task g is to be executed at data center d and 0 otherwise;

2.2 summing the costs:

subject to: for t = τ, ..., τ+ω

the two symbols X and Y involved are defined as:

where f_p is the minimum number of replicas of the data of region p;

the FHC algorithm minimizes the total cost over the time period [t, t+w] to obtain y, and then derives x from the value of y.

3. The data task redistribution method based on geographically distributed data query according to claim 2, characterized in that step 3) specifically includes:

3.1 at time t, combining, by the MFHC algorithm, the results of (w+1) FHC algorithms and using a majority voting algorithm to obtain the final y value, and then determining the value of x from the value of y;

3.2 according to the obtained x values, redistributing the data and transferring it to the designated data centers; then, according to the obtained y values, executing the tasks at the data centers by performing SQL operations on the data.