CN103309942B

CN103309942B - A scheduler and a method for reducing redundancy overhead in asynchronous iterative processing

Info

Publication number: CN103309942B
Application number: CN201310173239.4A
Authority: CN
Inventors: 廖小飞; 金海; 张宇
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2013-05-10
Filing date: 2013-05-10
Publication date: 2016-04-13
Anticipated expiration: 2033-05-10
Also published as: CN103309942A

Abstract

The invention discloses a method for reducing redundancy overhead in asynchronous iterative processing, comprising the following steps: establishing a hash table in which each entry corresponds to one data group and comprises three fields; receiving data D from a message receiver; calculating the weight Pri(D) of the data D from its ITC value and IN value; determining whether the hash table contains a data group G(D) having the same key as the data D and, if so, updating the weight and data list of the data group G(D), otherwise creating in the hash table a data group G(D) with the same key as the data D and initializing it; and determining whether the task executor is idle and, if so, selecting the data group with the largest weight from the hash table and sending the data of that group to the task executor for processing. The method addresses the problems of existing methods, namely large redundant computation and communication overhead, wasted computing resources, and slow convergence of iterative computation.

Description

A scheduler and a method for reducing redundancy overhead in asynchronous iterative processing

Technical field

The invention belongs to the field of big data processing and, more specifically, relates to a scheduler and a method thereof for reducing redundancy overhead in asynchronous iterative processing.

Background art

Asynchronous iterative processing is widely used in fields such as web applications, data mining, and scientific computing. Examples include the PageRank algorithm in web search engines, the Adsorption algorithm in link analysis and recommender systems, and the Jacobi method for solving systems of linear equations. It allows intermediate results produced in a previous iteration to be used immediately in the computation of the current iteration, which accelerates the convergence of the iterative computation, avoids load-balancing problems, and makes better use of cloud computing infrastructure.

However, when many intermediate results are available, asynchronous iterative processing triggers a large amount of unnecessary computation and communication overhead in subsequent iterations, for two reasons: (1) intermediate results are selected for processing blindly, without considering the effect of each datum on convergence speed or the overhead that processing it first will cause; and (2) each intermediate result triggers a cascade of computation and communication in subsequent iterations. As a result, for most iterative applications, asynchronous iterative processing incurs substantial redundant computation and communication overhead, which slows convergence and wastes a large amount of computing resources. In fact, this redundantly triggered computation can be avoided through scheduling. However, none of the current scheduling algorithms for asynchronous iterative processing, such as priority scheduling and round-robin scheduling, considers the overhead of processing intermediate results first when selecting which ones to process. Moreover, these scheduling algorithms do not schedule data in groups, so every intermediate result triggers its own cascade of computation and communication in subsequent iterations. Consequently, for many iterative applications, asynchronous iterative processing that uses these scheduling algorithms still suffers from a large amount of cascaded redundant computation and communication overhead, which wastes computing resources and slows the convergence of the iterative computation.

Summary of the invention

In view of the above defects of the prior art and the need for improvement, the invention provides a method for reducing redundancy overhead in asynchronous iterative processing. Its purpose is to solve the problems of existing methods: large redundant computation and communication overhead, wasted computing resources, and slow convergence of iterative computation.

To achieve the above object, according to one aspect of the invention, a method for reducing redundancy overhead in asynchronous iterative processing is provided. The method is applied in a scheduler that is communicatively connected to a task executor and a message receiver, respectively, and comprises the following steps:

(1) Establish a hash table in which each entry corresponds to one data group and comprises three fields: the first field stores the key of the data group, the second field stores the weight of the data group, and the third field stores the data list of the data group, where the data list contains the value of each datum and the iteration level at which the datum resides;

(2) Receive data D from the message receiver;

(3) Calculate the weight Pri(D) of the data D from its ITC value and IN value, specifically comprising the following sub-steps:

(3-1) Calculate the ITC value ITC(D) and the IN value IN(D) of the data D, where ITC(D) = ±D, and IN(D) is information recorded in the data D, specifically the number of times the data has been processed in going from the original raw data of D to D;

(3-2) Calculate the weight Pri(D) of the data D from ITC(D) and IN(D) using the following equation: Pri(D) = t₁ × ITC(D) + t₂ × IN(D) / T, where t₁ and t₂ are weighting values representing the importance of ITC(D) and IN(D), respectively, each being a decimal between 0 and 1, and T is a value that adjusts the range of IN(D) and is an integer greater than 1;

(4) Determine whether the hash table contains a data group G(D) having the same key as the data D; if so, update the weight and data list of the data group G(D); otherwise create in the hash table a data group G(D) with the same key as the data D and initialize it;

(5) Determine whether the task executor is idle; if so, proceed to step (6); otherwise return to step (2);

(6) Select the data group with the largest weight from the hash table and send the data of that group to the task executor for processing, then proceed to step (7);

(7) Determine whether the application running in the task executor has finished; if so, the process ends; otherwise proceed to step (8);

(8) Determine whether any unprocessed data group remains in the hash table; if so, return to step (5); otherwise return to step (2).
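The control flow of steps (1) to (8) can be sketched as follows. This is only a minimal single-threaded illustration under stated assumptions, not the patented implementation: the function name `schedule`, the `(key, value, level)` message tuple, the `is_idle` callback, and the default values of `t1`, `t2`, `T`, and `sign` are all hypothetical. Step (7), the check for application termination, is omitted, and groups still waiting when the messages run out simply remain in the table.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Datum = Tuple[str, float, int]  # hypothetical layout: (key, value, iteration level IN)


def schedule(
    messages: Iterable[Datum],
    is_idle: Callable[[], bool],
    t1: float = 0.5,
    t2: float = 0.5,
    T: int = 10,
    sign: float = 1.0,
) -> List[Tuple[str, List[Tuple[float, int]]]]:
    """Return the groups in the order they would be handed to the executor."""
    # Step (1): hash table mapping key -> [weight, data list].
    table: Dict[str, list] = {}
    dispatched = []
    for key, value, level in messages:              # step (2): receive D
        pri = t1 * sign * value + t2 * level / T    # step (3): Pri(D)
        if key in table:                            # step (4): update existing group
            table[key][0] += pri
            table[key][1].append((value, level))
        else:                                       # step (4): create and initialize group
            table[key] = [pri, [(value, level)]]
        # Steps (5), (6), (8): while groups remain and the executor is idle,
        # dispatch the group with the largest weight.
        while table and is_idle():
            best = max(table, key=lambda k: table[k][0])
            dispatched.append((best, table.pop(best)[1]))
    return dispatched
```

In this sketch, data that arrive while the executor is busy accumulate in their groups, so a later dispatch hands over a whole group at once rather than one datum at a time.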

Preferably, step (4) comprises the following sub-steps:

(4-1) Obtain the key D_key of the data D and apply a hash function to the key to obtain a unique group identifier K;

(4-2) Look up the hash table using the unique identifier K to determine whether a data group G(D) with key D_key exists; if so, proceed to step (4-3); otherwise proceed to step (4-4);

(4-3) Insert the data D into the data group G(D) with key D_key, then proceed to step (4-5);

(4-4) Create a data group G(D) with key D_key and insert the data D into the data group G(D), then proceed to step (4-6);

(4-5) Update the weight of the data group with key D_key in the hash table using the formula Pri(G(D)) = Pri(G(D)) + Pri(D), then proceed to step (5);

(4-6) Set the weight of the data group G(D) to Pri(D), then proceed to step (5).
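Sub-steps (4-1) to (4-6) amount to an insert-or-update on the hash table. The sketch below is an illustration only: the function name `update_group` and the dictionary layout of a group are assumptions, and a Python `dict` stands in for the hash table, so the hashing of sub-step (4-1) happens inside the dictionary lookup rather than as an explicit call.

```python
from typing import Dict


def update_group(table: Dict[str, dict], d_key: str, value: float,
                 level: int, pri: float) -> float:
    """Insert one datum into its group and return the group's new weight."""
    group = table.get(d_key)            # (4-1), (4-2): hash the key and look it up
    if group is None:
        # (4-4) create the group and insert D; (4-6) weight starts at Pri(D)
        group = {"key": d_key, "weight": pri, "data": [(value, level)]}
        table[d_key] = group
    else:
        group["data"].append((value, level))   # (4-3) insert D into G(D)
        group["weight"] += pri                 # (4-5) Pri(G(D)) += Pri(D)
    return group["weight"]
```

Both branches touch only the one entry for `d_key`, which is what lets the group weight be maintained in constant time per received datum.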

According to another aspect of the invention, a scheduler is provided that is communicatively connected to a task executor and a message receiver, respectively, the scheduler comprising:

A first module for establishing a hash table in which each entry corresponds to one data group and comprises three fields: the first field stores the key of the data group, the second field stores the weight of the data group, and the third field stores the data list of the data group, where the data list contains the value of each datum and the iteration level at which the datum resides;

A second module for receiving data D from the message receiver;

A third module for calculating the weight Pri(D) of the data D from its ITC value and IN value, specifically comprising the following submodules:

A first submodule for calculating the ITC value ITC(D) and the IN value IN(D) of the data D, where ITC(D) = ±D, and IN(D) is information recorded in the data D, specifically the number of times the data has been processed in going from the original raw data of D to D;

A second submodule for calculating the weight Pri(D) of the data D from ITC(D) and IN(D) using the following equation: Pri(D) = t₁ × ITC(D) + t₂ × IN(D) / T, where t₁ and t₂ are weighting values representing the importance of ITC(D) and IN(D), respectively, each being a decimal between 0 and 1, and T is a value that adjusts the range of IN(D) and is an integer greater than 1;

A fourth module for determining whether the hash table contains a data group G(D) having the same key as the data D; if so, updating the weight and data list of the data group G(D); otherwise creating in the hash table a data group G(D) with the same key as the data D and initializing it;

A fifth module for determining whether the task executor is idle; if so, control passes to the sixth module; otherwise it returns to the second module;

A sixth module for selecting the data group with the largest weight from the hash table and sending the data of that group to the task executor for processing, after which control passes to the seventh module;

A seventh module for determining whether the application running in the task executor has finished; if so, the process ends; otherwise control passes to the eighth module;

An eighth module for determining whether any unprocessed data group remains in the hash table; if so, control returns to the fifth module; otherwise it returns to the second module.

Preferably, the fourth module comprises:

A third submodule for obtaining the key D_key of the data D and applying a hash function to the key to obtain a unique group identifier K;

A fourth submodule for looking up the hash table using the unique identifier K to determine whether a data group G(D) with key D_key exists; if so, control passes to the fifth submodule; otherwise it passes to the sixth submodule;

A fifth submodule for inserting the data D into the data group G(D) with key D_key, after which control passes to the seventh submodule;

A sixth submodule for creating a data group G(D) with key D_key and inserting the data D into the data group G(D), after which control passes to the eighth submodule;

A seventh submodule for updating the weight of the data group with key D_key in the hash table using the formula Pri(G(D)) = Pri(G(D)) + Pri(D), after which control passes to the eighth submodule;

An eighth submodule for setting the weight of the data group G(D) to Pri(D), after which control passes to the seventh submodule.

In general, compared with the prior art, the above technical solution conceived by the invention achieves the following beneficial effects:

1. Low redundant computation and communication: step (1) eliminates part of the redundant computation and communication overhead caused by each datum triggering its own cascade, while steps (3) to (4) and step (6) eliminate the redundancy overhead caused by blindly processing data in arbitrary order. The method can therefore effectively eliminate the large redundant computation and communication overhead present in asynchronous iterative processing.

2. Fast convergence of iterative computation: because of step (1), steps (3) to (4), and step (6), the computation and communication that contribute most to faster convergence of the asynchronous iterative process are processed first. The method therefore accelerates the convergence of asynchronous iterative processing and improves the resource utilization of cloud computing infrastructure.

Brief description of the drawings

Fig. 1 shows the application environment of the method of the invention for reducing redundancy overhead in asynchronous iterative processing.

Fig. 2 is a flowchart of the method of the invention for reducing redundancy overhead in asynchronous iterative processing.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

The general idea of the invention is a scheduling algorithm that schedules data in groups and, when ordering intermediate results for processing, considers both the importance of the data to the convergence speed of the iterative computation and the overhead of processing the data first. The scheduling algorithm uses a collector to gather pending data into groups and calculates a weight for each group. The weight of a group takes into account both the importance of the data collected by the group to the convergence speed of the iterative computation and the overhead that processing those data first would incur. The scheduling algorithm then selects the group with the largest weight so that its data are processed first. This effectively eliminates the large redundant computation and communication overhead present in asynchronous iterative processing, accelerates its convergence, and improves the resource utilization of cloud computing infrastructure.

As shown in Fig. 1, the method of the invention for reducing redundancy overhead in asynchronous iterative processing is applied in a system that supports program execution (the runtime system for short). The system comprises a task manager and data processors. The task manager manages the initialization and termination of the data processors. Each data processor receives messages and processes data, and comprises a message receiver, the scheduler of the invention, and a task executor. The message receiver receives the messages, containing data to be processed, that are sent by the other data processors; the scheduler schedules the processing order of the data in the received messages; and the task executor runs the user program to process the pending data output by the scheduler.

As shown in Fig. 2, the method of the invention for reducing redundancy overhead in asynchronous iterative processing is applied in a scheduler that is communicatively connected to a task executor and a message receiver, respectively, and comprises the following steps:

(1) Establish a hash table in which each entry corresponds to one data group and comprises three fields: the first field stores the key of the data group, the second field stores the weight of the data group, and the third field stores the data list of the data group, where the data list contains the value of each datum and the iteration level (iteration number) at which the datum resides, as shown in Table 1;

The advantage of this step is that the hash table enables fast lookup, insertion, and update of data groups.

Data group	Key of the data group	Weight of the data group	Data list

Table 1
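The three-field entry of Table 1 could be modeled as shown below. This is a sketch under assumptions: the class name `GroupEntry`, its field names, the string key type, and the example key `"page42"` are illustrative choices that do not appear in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class GroupEntry:
    key: str                      # field 1: key of the data group
    weight: float = 0.0           # field 2: weight of the data group
    # field 3: data list of (value, iteration level) pairs
    data: List[Tuple[float, int]] = field(default_factory=list)


# The hash table maps each key to its group entry.
table: Dict[str, GroupEntry] = {}
table["page42"] = GroupEntry(key="page42", weight=0.3, data=[(0.25, 3)])
```

Using `default_factory=list` gives every entry its own data list, so appending to one group never affects another.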

(2) Receive data D from the message receiver;

(3) Calculate the weight Pri(D) of the data D from its importance to convergence (Importance To Convergence, ITC for short) value and its iteration number (Iteration Number, IN for short) value, specifically comprising the following sub-steps:

(3-1) Calculate the ITC value ITC(D) and the IN value IN(D) of the data D. Specifically, ITC(D) = ±D, where the sign is supplied by the application in the task executor through a configuration file: for a given application, if a larger value of the data D contributes more to convergence, the positive sign is used; otherwise the negative sign is used. IN(D) is information recorded in the data D, specifically the number of times the data has been processed in going from the original raw data of D to D.

The advantage of this sub-step is that approximating the ITC and IN values effectively reduces the cost of computing weights.

(3-2) Calculate the weight Pri(D) of the data D from ITC(D) and IN(D) using the following equation: Pri(D) = t₁ × ITC(D) + t₂ × IN(D) / T, where t₁ and t₂ are weighting values representing the importance of ITC(D) and IN(D), respectively, each being a decimal between 0 and 1, and T is a value that adjusts the range of IN(D), mainly so that its range is roughly consistent with that of ITC(D), and is an integer greater than 1.

The advantage of this sub-step is that, by calculating a weight for each datum, the weight of a data group can be obtained incrementally by adding the weight of the data D to the group's previous weight, so the group weight is obtained efficiently in real time. The reason for calculating Pri(D) in this way is explained below.
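As a worked illustration of sub-steps (3-1) and (3-2), the weight of a single datum might be computed as follows. The function name `priority` and the boolean `larger_is_better`, which stands in for the sign read from the application's configuration file, are assumptions made for this sketch.

```python
def priority(value: float, level: int, t1: float, t2: float, T: int,
             larger_is_better: bool = True) -> float:
    """Pri(D) = t1 * ITC(D) + t2 * IN(D) / T, with ITC(D) = +D or -D."""
    if not (0.0 <= t1 <= 1.0 and 0.0 <= t2 <= 1.0):
        raise ValueError("t1 and t2 must lie between 0 and 1")
    if T <= 1:
        raise ValueError("T must be an integer greater than 1")
    # (3-1): the sign of ITC(D) depends on the application.
    itc = value if larger_is_better else -value
    # (3-2): dividing IN(D) by T brings it into a range comparable to ITC(D).
    return t1 * itc + t2 * level / T


# D = 2.0 at iteration level 4, with t1 = 0.5, t2 = 0.25, T = 8:
# 0.5 * 2.0 + 0.25 * 4 / 8 = 1.0 + 0.125
print(priority(2.0, 4, 0.5, 0.25, 8))  # 1.125
```

With the negative sign, the same datum would get 0.5 × (-2.0) + 0.125 = -0.875, so it would be scheduled later than data with smaller values.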

(4) Determine whether the hash table contains a data group G(D) having the same key as the data D; if so, update the weight and data list of the data group G(D); otherwise create in the hash table a data group G(D) with the same key as the data D and initialize it. Specifically, this step comprises the following sub-steps:

(4-3) Insert the data D into the data group G(D) with key D_key, which completes the update of the data list of the data group G(D) in the hash table, then proceed to step (4-5);

The advantage of this sub-step is that the weight Pri(G(D)) of the data group is obtained incrementally by adding the weight Pri(D) of the data D to the group's previous weight, so the group weight is obtained efficiently in real time. The reason for calculating Pri(G(D)) in this way is explained below.

(4-6) Set the weight of the data group G(D) to Pri(D), then proceed to step (5);

The reasons for calculating Pri(D) and Pri(G(D)) in the manner described in this method are explained below.

Before explaining the reasons, we first define the weight Pri(G(D)) of a group G(D). The definition of the weight of a data group mainly considers the following two factors.

Factor one: ITC. A group G(D) with a larger ITC value makes the iterative computation converge faster, so processing it first prevents data groups with small ITC values from triggering a large amount of subsequent redundant computation and communication overhead. The ITC value of a group, ITC(G(D)), can be defined as: ITC(G(D)) = ∑_{d ∈ G(D)} ITC(d).

Factor two: CTG (cost of processing this group first), that is, the overhead of processing the data of this group first. When a group of data is selected for processing, data that arrive at the group afterwards must be processed separately and will again trigger computation and communication in subsequent iterations; this computation and communication overhead is the CTG value of the group. If groups with large CTG values are postponed and groups with small CTG values are processed first, the groups with large CTG values have more opportunity to collect more data, which reduces redundancy overhead. In effect, CTG serves to increase the average amount of data collected by all groups and reflects the importance of the collected data (data that trigger more are more important). To calculate the CTG of a group, we introduce two new variables: 1) Iteration Number (or IN), the iteration level at which the collected data reside; 2) Completion Ratio (or CR), the ratio of the data with key n that have already been processed at a given iteration level to all the data that need to be processed, i.e. CR(i, n) = Num_p(i, n) / Num_t(i, n), where Num_p(i, n) and Num_t(i, n) are, respectively, the amount of data already processed and the total amount of data that need to be processed to update data object R at iteration level i. Because Num_t(i, n) is difficult to know in advance, and Num_t(i, n) is the same for most data objects R, Num_t(i, n) is approximated by a fixed value T, giving CR(i, n) = Num_p(i, n) / T.

We now give the method for calculating the CTG value of a group G(D), i.e. CTG(G(D)). The CTG value of each group G(D) depends on the amount of data that will arrive at the group and on the amount of computation and communication overhead that each arriving datum will trigger. Because the amount of data that will arrive at group G(D) in the i-th iteration depends on CR(i, D_key), and the amount of triggering caused by a datum D that reaches the group depends on its IN value IN(D), we have CTG(G(D)) = -∑_{i ∈ S} i × CR(i, D_key), where S = {n | D ∈ G(D) and IN(D) = n} and D_key is the key of the data D. Because Pri(G(D)) is positively correlated with ITC(G(D)) and negatively correlated with CTG(G(D)), we have:

Pri(G(D)) = t₁ × ITC(G(D)) - t₂ × CTG(G(D))

To obtain the weight Pri(G(D)) of each group G(D) efficiently in real time, Pri(G(D)) can be calculated as follows:

\begin{cases} Pri(G(D)) = Pri(G(D)) + Pri(D) \\ Pri(D) = t_{1} \times ITC(D) + t_{2} \times IN(D) / T \end{cases}

Thus, whenever a datum D is received by the scheduler, the weight of the data group G(D) can be obtained incrementally using the above formulas.
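The agreement between the group-level definition and the incremental update can be checked numerically. The sketch below rests on two assumptions that the source does not state explicitly: Num_p(i, D_key) is taken to be the number of data in the group whose IN value equals i, and ITC uses the positive sign. Under these assumptions, -t₂ × CTG(G(D)) equals t₂ × ∑ IN(d) / T over the data in the group, so summing Pri(d) datum by datum reproduces Pri(G(D)). The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Data = List[Tuple[float, int]]  # (value, iteration level IN) pairs of one group


def weight_from_definition(data: Data, t1: float, t2: float, T: int) -> float:
    """Pri(G) = t1 * ITC(G) - t2 * CTG(G), computed from the group as a whole."""
    itc_group = sum(value for value, _ in data)        # ITC(G) = sum of ITC(d)
    counts = Counter(level for _, level in data)       # assumed Num_p(i, key)
    ctg = -sum(i * n / T for i, n in counts.items())   # CTG(G) = -sum of i * CR(i, key)
    return t1 * itc_group - t2 * ctg


def weight_incremental(data: Data, t1: float, t2: float, T: int) -> float:
    """Pri(G) accumulated one datum at a time: Pri(G) += Pri(d)."""
    total = 0.0
    for value, level in data:
        total += t1 * value + t2 * level / T           # Pri(d)
    return total


group = [(0.5, 1), (0.25, 2), (1.0, 2)]
print(weight_from_definition(group, 0.5, 0.5, 4))  # 1.5
print(weight_incremental(group, 0.5, 0.5, 4))      # 1.5
```

The incremental form needs only the arriving datum, which is why the scheduler can keep group weights current without rescanning each group's data list.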

The scheduler of the invention is communicatively connected to a task executor and a message receiver, respectively, and comprises:

A second module for receiving data D from the message receiver;

A fourth module for determining whether the hash table contains a data group G(D) having the same key as the data D; if so, updating the weight and data list of the data group G(D); otherwise creating in the hash table a data group G(D) with the same key as the data D and initializing it. The fourth module comprises:

The scheduler of the invention can be stored in a computer-readable medium.

Those skilled in the art will readily understand that the foregoing describes only preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for reducing redundancy overhead in asynchronous iterative processing, applied in a scheduler that is communicatively connected to a task executor and a message receiver, respectively, characterized in that the method comprises the following steps:

(2) Receive data D from the message receiver;

(3-2) Calculate the weight Pri(D) of the data D from ITC(D) and IN(D) using the following equation: Pri(D) = t₁ × ITC(D) + t₂ × IN(D) / T, where t₁ and t₂ are weighting values representing the importance of ITC(D) and IN(D), respectively, each being a decimal between 0 and 1, and T is a value that adjusts the range of IN(D) and is an integer greater than 1;

(4) Determine whether the hash table contains a data group G(D) having the same key as the data D; if so, update the weight and data list of the data group G(D); otherwise create in the hash table a data group G(D) with the same key as the data D and initialize it; this step comprises the following sub-steps:

(4-6) Set the weight of the data group G(D) to Pri(D), then proceed to step (5);

2. A scheduler using the method of claim 1, communicatively connected to a task executor and a message receiver, respectively, characterized in that the scheduler comprises:

A second module for receiving data D from the message receiver;

A third module for calculating the weight Pri(D) of the data D from its ITC value and IN value, specifically comprising the following submodules:

A second submodule for calculating the weight Pri(D) of the data D from ITC(D) and IN(D) using the following equation: Pri(D) = t₁ × ITC(D) + t₂ × IN(D) / T, where t₁ and t₂ are weighting values representing the importance of ITC(D) and IN(D), respectively, each being a decimal between 0 and 1, and T is a value that adjusts the range of IN(D) and is an integer greater than 1;

An eighth submodule for setting the weight of the data group G(D) to Pri(D), after which control passes to the seventh submodule;