CN113434299B - Coding distributed computing method based on MapReduce framework - Google Patents

Coding distributed computing method based on MapReduce framework

Info

Publication number
CN113434299B
CN113434299B
Authority
CN
China
Prior art keywords
node
distributed computing
nodes
intermediate values
input file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110756959.8A
Other languages
Chinese (zh)
Other versions
CN113434299A (en)
Inventor
周玲玲
蒋静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Normal University
Original Assignee
Guangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi Normal University
Priority to CN202110756959.8A
Publication of CN113434299A
Application granted
Publication of CN113434299B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a coding distributed computing method based on the MapReduce framework. First, N input files are divided into several parts and stored on different distributed computing nodes. Then, during output function allocation, a new output function set W_k is designed for each distributed computing node, which greatly reduces the number of output functions required. Finally, each distributed computing node obtains, by random selection, the intermediate values of the input files it does not store from the other distributed computing nodes, so that it holds the intermediate values of all input files; it then computes its allocated output functions using these intermediate values, completing the distributed computing task. Through the new file-allocation and function-allocation scheme, the method reduces the number of input files and output functions actually required at the cost of a small amount of additional communication load, so it solves practical problems better and can be widely applied in practice.

Description

Coding distributed computing method based on MapReduce framework
Technical Field
The invention relates to the technical field of distributed computing, in particular to a coding distributed computing method based on a MapReduce framework.
Background
Driven by the rapid development of machine learning and data science, modern computing paradigms have shifted from traditional single processor systems to large distributed computing systems, and one popular framework in distributed computing is the MapReduce framework. Distributed computing has shown its own strong advantages in processing large-scale data, and has become a popular research direction in recent years.
While the MapReduce framework has become a popular framework for distributed computing, it also has a significant disadvantage: it requires a large amount of data exchange. For example, when running "SelfJoin" on Amazon EC2 clusters, 70% of the execution time is spent on data exchange. To alleviate this communication bottleneck, Ali et al. proposed coded distributed computing ("Coded Distributed Computing", CDC) based on the MapReduce framework in 2018 and gave a general scheme that achieves the optimal communication load. Although that scheme attains the optimal communication load, the number of input files and the number of output functions it requires grow exponentially with the number of nodes, which makes it difficult to apply to practical problems.
Disclosure of Invention
The invention aims to solve the problem that existing coded distributed computing methods based on the MapReduce framework require a large number of input files and output functions, and provides a coding distributed computing method based on the MapReduce framework.
In order to solve the problems, the invention is realized by the following technical scheme:
the coding distributed computing method based on the MapReduce framework comprises a Map stage, a Shuffle stage and a Reduce stage, and comprises the following steps:
1) Map stage:
Step 1, divide the given input files evenly and without repetition to obtain C(K', r') input file subsets;
Step 2, randomly select r' integers from the integers 0 to K'-1 as the label of each input file subset, with no label repeated;
Step 3, take the number of each node modulo the node factor K' to obtain the label of each node;
Step 4, based on the label of each input file subset and the label of each node, allocate each input file subset to every node whose label appears in the subset's label, and store it there;
step 5, each node calculates the intermediate value of each stored input file subset by using a Map function;
2) The Shuffle stage:
step 6, each node encodes the intermediate values of all the stored subsets of the input files into signals and transmits the signals to other nodes;
Step 7, allocate to each node the set of output functions it must compute, wherein the node numbered k is assigned an output function set W_k containing t output functions;
3) Reduce stage
Step 8, each node randomly selects the intermediate value of each non-stored input file subset of the node from the intermediate values transmitted by other nodes; combining the node to store the intermediate values of the input file subsets to obtain the intermediate values of all the input file subsets; and calculating the output function set distributed by the node by utilizing the intermediate values of all the input file subsets to finish distributed calculation.
Here K is the total number of nodes; K' is a node factor; r is the number of times each input file is computed; r' = rK'/K; C(K', r') denotes the number of r'-element combinations of K' elements; [·] denotes the rounding function, i.e. rounding up; k ∈ {0, 1, ..., K-1}; t is the number of output functions allocated to each node, with t = s / gcd(K, s), where gcd(K, s) denotes the greatest common divisor of K and s, and s is the number of times each output function is computed.
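For concreteness, the parameter relations above can be checked numerically. The following is an illustrative Python sketch (not part of the claimed method) using the embodiment's values K = 10, K' = 5, r = 4, s = 6; the variable names are ours.

```python
from math import comb, gcd

K, K_prime, r, s = 10, 5, 4, 6   # values used in the embodiment below

r_prime = r * K_prime // K        # label size: each file is computed by r nodes,
                                  # i.e. by r' = rK'/K distinct node labels
N = comb(K_prime, r_prime)        # number of input file subsets, C(K', r')
t = s // gcd(K, s)                # output functions allocated to each node

print(r_prime, N, t)              # -> 2 10 3
```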
Compared with the prior art, the invention has the following characteristics:
1. Compared with the general MapReduce framework, the method uses a carefully designed file allocation so that each file block is computed by r different distributed computing nodes, and then exploits the redundant computation on the nodes to create coded multicast opportunities, whereby data can be transmitted to r nodes simultaneously, reducing the time required for data transmission.
2. Compared with the scheme proposed by Ali, the number of input files and the number of output functions required are reduced. The reasons are as follows: (1) during file allocation in the Map stage, the node numbers are first taken modulo K', and files are then allocated to each distributed computing node according to the resulting labels, which reduces the number of input files required; (2) during the Shuffle stage, a new output function set W_k is designed for each distributed computing node, which greatly reduces the number of output functions required; moreover, when s ≥ K', our scheme does not require all K distributed computing nodes to participate in signal transmission, only K' of them, which lightens the computing tasks on some distributed computing nodes; (3) with the new file-allocation and function-allocation scheme, the number of input files and output functions actually required can be reduced at the cost of a small amount of communication load, so practical problems can be solved better and the method can be widely applied in practice.
Drawings
Fig. 1 is an execution process of the MapReduce framework.
Detailed Description
The present invention will be further described in detail with reference to specific examples in order to make the objects, technical solutions and advantages of the present invention more apparent.
As shown in fig. 1, the MapReduce framework computes Q output functions of N input files on K distributed computing nodes. The coding distributed computing method based on the MapReduce framework first selects a new way of dividing the files, which reduces the number of input files actually needed. Then, during output function allocation, a new output function set W_k is designed for each distributed computing node, which greatly reduces the number of output functions required. Finally, each distributed computing node obtains, by random selection, the intermediate values of the input files it does not store from the other distributed computing nodes, so that it holds the intermediate values of all N input files. Each distributed computing node then computes its allocated output functions using the intermediate values of the N input files, completing the distributed computing task.
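For readers unfamiliar with the three phases named above, the following is a minimal, hypothetical Python sketch of the plain (uncoded) MapReduce flow of fig. 1; the word-count Map and Reduce functions are placeholders and are not the functions of the claimed scheme.

```python
from collections import defaultdict

def map_phase(files):
    """Map: emit (key, value) intermediate pairs for each input file."""
    intermediates = []
    for _name, text in files.items():
        for word in text.split():
            intermediates.append((word, 1))      # placeholder Map function
    return intermediates

def shuffle_phase(intermediates, num_reducers):
    """Shuffle: route each intermediate value to the node owning its key."""
    buckets = defaultdict(list)
    for key, value in intermediates:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets

def reduce_phase(bucket):
    """Reduce: combine all values that share a key."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value                      # placeholder Reduce function
    return dict(totals)

files = {"f0": "a b a", "f1": "b c"}
buckets = shuffle_phase(map_phase(files), num_reducers=2)
print([reduce_phase(b) for b in buckets.values()])
```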
The coding distributed computing method based on the MapReduce framework consists of a Map stage, a Shuffle stage and a Reduce stage, and comprises the following steps:
1) Map stage
Step 1, carrying out repeated-free average division on a given input file to obtainA subset of the input files.
If the total number of the given input files is N, the number of the input files allocated to each input file subset isWherein (1)>Represents ∈K' taken from>Are combined, i.e.)>K is the total number of nodes, K ' is a node factor, K ' is a factor of K which is not equal to 1, and K ' noteqK, r is the number of times each input file is calculated.
In this embodiment, k=10, K' = 5,r =4, then
Step 2, randomly selecting from 0-K' -1 integersThe integer numbers act as labels for each subset of input files.
In this embodiment, k=10, K '= 5,r =4, then K' -1=5-1=4,i.e. 2 integers are randomly selected from the 5 integers of 0-4 as the labels for each subset of input files. For example, the labels of the n=10 input file subsets are respectively: {0,1},{0,2},{0,3},{0,4},{1,2},{1,3},{1,4},{2,3},{2,4},{3,4}.
Step 3, take the number of each node modulo the node factor K' to obtain the label of each node.
In this embodiment, the labels of the 10 nodes are:
for the node numbered 0: 0 mod 5 = 0, so its label is 0;
for the node numbered 1: 1 mod 5 = 1, so its label is 1;
for the node numbered 2: 2 mod 5 = 2, so its label is 2;
for the node numbered 3: 3 mod 5 = 3, so its label is 3;
for the node numbered 4: 4 mod 5 = 4, so its label is 4;
for the node numbered 5: 5 mod 5 = 0, so its label is 0;
for the node numbered 6: 6 mod 5 = 1, so its label is 1;
for the node numbered 7: 7 mod 5 = 2, so its label is 2;
for the node numbered 8: 8 mod 5 = 3, so its label is 3;
for the node numbered 9: 9 mod 5 = 4, so its label is 4.
Step 4, based on the label of each input file subset and the label of each node, allocate each input file subset to every node whose label appears in the subset's label, and store it there.
In this embodiment, the input file subsets stored by the 10 nodes are:
since the node numbered 0 has label 0, it stores the 4 input file subsets whose labels contain 0, i.e. {0,1}, {0,2}, {0,3}, {0,4};
since the node numbered 1 has label 1, it stores the 4 input file subsets whose labels contain 1, i.e. {0,1}, {1,2}, {1,3}, {1,4};
since the node numbered 2 has label 2, it stores the 4 input file subsets whose labels contain 2, i.e. {0,2}, {1,2}, {2,3}, {2,4};
since the node numbered 3 has label 3, it stores the 4 input file subsets whose labels contain 3, i.e. {0,3}, {1,3}, {2,3}, {3,4};
since the node numbered 4 has label 4, it stores the 4 input file subsets whose labels contain 4, i.e. {0,4}, {1,4}, {2,4}, {3,4};
since the node numbered 5 has label 0, it stores the 4 input file subsets whose labels contain 0, i.e. {0,1}, {0,2}, {0,3}, {0,4};
since the node numbered 6 has label 1, it stores the 4 input file subsets whose labels contain 1, i.e. {0,1}, {1,2}, {1,3}, {1,4};
since the node numbered 7 has label 2, it stores the 4 input file subsets whose labels contain 2, i.e. {0,2}, {1,2}, {2,3}, {2,4};
since the node numbered 8 has label 3, it stores the 4 input file subsets whose labels contain 3, i.e. {0,3}, {1,3}, {2,3}, {3,4};
since the node numbered 9 has label 4, it stores the 4 input file subsets whose labels contain 4, i.e. {0,4}, {1,4}, {2,4}, {3,4}.
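Steps 1-4 of this embodiment can be reproduced with the short sketch below. It is an illustrative rendering under one assumption: the labels are assigned deterministically (all r'-element subsets in lexicographic order) rather than randomly, which yields exactly the allocation listed above.

```python
from itertools import combinations

K, K_prime, r = 10, 5, 4
r_prime = r * K_prime // K                    # 2

# Steps 1-2: one subset per r'-element label drawn from {0, ..., K'-1}
subset_labels = list(combinations(range(K_prime), r_prime))   # 10 labels

# Step 3: each node's label is its number modulo K'
node_labels = {k: k % K_prime for k in range(K)}

# Step 4: node k stores every subset whose label contains node k's label
stored = {k: [lab for lab in subset_labels if node_labels[k] in lab]
          for k in range(K)}

print(stored[2])   # -> [(0, 2), (1, 2), (2, 3), (2, 4)]
```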
Step 5, each node uses the Map function to compute the intermediate value of every input file subset it currently stores, thereby obtaining the intermediate values of the files stored on that node.
2) Shuffle stage
Step 6, each node encodes the intermediate values of the files it stores into signals and transmits them to the other nodes.
Since each node in the MapReduce framework needs the intermediate values of all input files to compute its assigned function set, and in the Map stage each node stored only part of the files, i.e. it obtained only the intermediate values of the files it stores, the other nodes must supply it with the intermediate values of the files it does not store.
Step 7, allocate to each node the set of output functions it must compute.
In the MapReduce framework, Q output functions of N input files are computed on K distributed computing nodes. Each distributed computing node is allocated a set of output functions W_k; that is, the output functions contained in W_k are exactly those that node k needs to compute. The node numbered k is assigned an output function set W_k containing t output functions.
Here [·] denotes the rounding function, i.e. rounding up; k ∈ {0, 1, ..., K-1}; t is the number of output functions assigned to each node; and s is the number of times each output function is computed.
In this embodiment, s = 6 and t = s / gcd(K, s) = 6 / gcd(10, 6) = 3, where gcd(K, s) denotes the greatest common divisor of K and s. Taking the node numbered 2 as an example, its output function set is W_2 = {1, 3, 4}; that is, node 2 needs to compute 3 output functions: output function 1, output function 3 and output function 4.
3) Reduce stage
Step 8, each node randomly selects the intermediate values of the input file subsets it does not store from the signals transmitted by the other nodes; combined with the intermediate values of the input file subsets it stores, it obtains the intermediate values of all input file subsets and computes the output function set allocated to it.
Step 8.1, after the allocation of output function sets in the Shuffle stage, each node already has the intermediate values of the files it stores; that is, the intermediate values computed by the node numbered k are { v_{q,n} : q ∈ W_k, w_n ∈ M_k }, where M_k denotes the set of files stored by the node numbered k.
Step 8.2, in the Map stage the other nodes have likewise computed all intermediate values of the files they store; each node then encodes the intermediate values it has computed into signals and transmits them to the remaining nodes.
Step 8.3, each node solves for the intermediate values it needs from its own computed intermediate values { v_{q,n} : q ∈ W_k, w_n ∈ M_k } and the signals received from the other nodes. The specific process is as follows:
Since each file is computed by r' node labels, the intermediate values of that file are owned by r' of the participating nodes. A node that transmits intermediate values always has one stored file in common with the node that must compute the output function; the transmitting node therefore encodes all the intermediate values it has computed into signals and sends them to the node computing the output function. Because the two nodes share common intermediate values, the receiver can eliminate the values it already knows from each equation; and because each system of equations contains 3 unknowns, the transmitting node sends 3 equations in its computed intermediate values, so that the node computing the output function can solve for the intermediate values it needs by recovering one unknown from each equation.
In this example implementation, each file is computed by r' = 2 node labels, so the intermediate values of each file are owned by r' = 2 of the participating nodes. Take the node numbered 2 as an example: it is assigned the output functions numbered 1, 3 and 4. Nodes 0, 1, 3 and 4 store the files {0,2}, {1,2}, {2,3} and {2,4}, respectively, in common with node 2, and each node stores 4 files; node 2 therefore lacks 3 files relative to each of these nodes, i.e. it lacks 3 intermediate values per assigned output function, and so each of these nodes needs to transmit 3 signals to node 2.
Since the node numbered 2 stores the 4 input file subsets {0,2}, {1,2}, {2,3}, {2,4}, node 2 already has the intermediate values of its stored files for output functions 1, 3 and 4, namely:
v_{1,{0,2}}, v_{1,{1,2}}, v_{1,{2,3}}, v_{1,{2,4}}, v_{3,{0,2}}, v_{3,{1,2}}, v_{3,{2,3}}, v_{3,{2,4}}, v_{4,{0,2}}, v_{4,{1,2}}, v_{4,{2,3}}, v_{4,{2,4}}.
But node 2 still lacks the intermediate values of the non-stored files for output functions 1, 3 and 4, namely:
v_{1,{0,1}}, v_{1,{0,3}}, v_{1,{0,4}}, v_{1,{1,3}}, v_{1,{1,4}}, v_{1,{3,4}}, v_{3,{0,1}}, v_{3,{0,3}}, v_{3,{0,4}},
v_{3,{1,3}}, v_{3,{1,4}}, v_{3,{3,4}}, v_{4,{0,1}}, v_{4,{0,3}}, v_{4,{0,4}}, v_{4,{1,3}}, v_{4,{1,4}}, v_{4,{3,4}}.
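The two lists above can be checked mechanically; the following illustrative sketch (the variable names are ours, not the patent's) enumerates the intermediate values node 2 has and lacks, confirming the counts of 12 and 18.

```python
from itertools import combinations

W_2 = [1, 3, 4]                                   # output functions of node 2
all_labels = list(combinations(range(5), 2))      # all file-subset labels
M_2 = [lab for lab in all_labels if 2 in lab]     # files node 2 stores

has   = [(q, lab) for q in W_2 for lab in M_2]
lacks = [(q, lab) for q in W_2 for lab in all_labels if lab not in M_2]
print(len(has), len(lacks))   # -> 12 18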
The intermediate values of these non-stored files are obtained from the other nodes: node 0, node 1, node 3 and node 4 each encode the intermediate values they have computed that node 2 needs into signals and send them to node 2, each node transmitting 3 coded signals. Each signal is a linear combination of intermediate values with coefficients α_1, α_2, α_3, α_4 (and corresponding coefficients for the other signals); the coefficients α_1, α_2, α_3, α_4 are not all equal to 1, and the coefficient vectors of different signals are not all equal, i.e. the coefficient vectors are linearly independent, which ensures that the 3 unknowns can be solved from such a system of equations.
Node 2 can therefore solve for the intermediate values it needs from these signals. Within each signal, the intermediate values connected by ⊕ are combined by bitwise exclusive-or, with each intermediate value represented in binary. That is, from the system of equations sent by node 0, node 2 solves for the 9 intermediate values it needs: v_{1,{0,1}}, v_{1,{0,3}}, v_{1,{0,4}}, v_{3,{0,1}}, v_{3,{0,3}}, v_{3,{0,4}}, v_{4,{0,1}}, v_{4,{0,3}}, v_{4,{0,4}}; from the system of equations sent by node 1, it solves for the 6 intermediate values it needs: v_{1,{1,3}}, v_{1,{1,4}}, v_{3,{1,3}}, v_{3,{1,4}}, v_{4,{1,3}}, v_{4,{1,4}}; and from the system of equations sent by node 3, it solves for the 3 intermediate values it needs: v_{1,{3,4}}, v_{3,{3,4}}, v_{4,{3,4}}. (The solution is not unique; it suffices that the required intermediate values be solved from the signals of the transmitting nodes.)
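To make the elimination step concrete, the following hypothetical sketch shows the special case in which all coefficients equal 1, so each coded signal is a bitwise XOR; the 8-bit values and the pairing of one unknown with one known value per signal are invented for illustration and are not the embodiment's actual signals.

```python
# A minimal sketch of XOR-coded multicast decoding, assuming each signal
# is the bitwise XOR of one unknown intermediate value with a value the
# receiver already holds (all values shown are made-up 8-bit integers).

known = {                      # intermediate values node 2 already computed
    ("v", 1, "{0,2}"): 0b1011_0010,
    ("v", 3, "{0,2}"): 0b0100_1110,
    ("v", 4, "{0,2}"): 0b1110_0001,
}

# Hypothetical signals from node 0: each XORs one unknown with one known value.
signals = [
    (("v", 1, "{0,1}"), ("v", 1, "{0,2}"), 0b1011_0010 ^ 0b0101_0101),
    (("v", 3, "{0,1}"), ("v", 3, "{0,2}"), 0b0100_1110 ^ 0b0011_1100),
    (("v", 4, "{0,1}"), ("v", 4, "{0,2}"), 0b1110_0001 ^ 0b1000_0111),
]

# Decoding: XOR the known value back out of each signal, one unknown per equation.
recovered = {unknown: coded ^ known[shared]
             for unknown, shared, coded in signals}

print(recovered)   # -> the three unknown intermediate values
```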
Since s = 6 > K', once the K' distributed computing nodes participating in the exchange have recovered the intermediate values they need, the intermediate values required by the s = 6 distributed computing nodes that compute each output function are recovered at the same time, because 2 of those nodes must store identical files.
Table 1 shows the file and function allocation on the nodes. It lists the files stored on each node; the grey cells mark the output functions allocated to each node, and the table shows at a glance the intermediate values needed in this distributed system, i.e. all the rows of the table. The intermediate values required by each node are those in grey cells; of these, each node already holds part of its required intermediate values after the Map stage (the grey cells with a shaded background), while the remaining grey-cell intermediate values must be transmitted by the other nodes among the K' participants.
Table 1: file and function allocation on nodes
Taking the third column of Table 1 (i.e. node 0) as an example, the table shows that the intermediate values node 0 already has include:
v_{1,{0,1}}, v_{1,{0,2}}, v_{1,{0,3}}, v_{1,{0,4}}, v_{2,{0,1}}, v_{2,{0,2}}, v_{2,{0,3}}, v_{2,{0,4}}, v_{4,{0,1}}, v_{4,{0,2}}, v_{4,{0,3}}, v_{4,{0,4}};
and the intermediate values it still requires include:
v_{1,{1,2}}, v_{1,{1,3}}, v_{1,{1,4}}, v_{1,{2,3}}, v_{1,{2,4}}, v_{1,{3,4}}, v_{2,{1,2}}, v_{2,{1,3}}, v_{2,{1,4}},
v_{2,{2,3}}, v_{2,{2,4}}, v_{2,{3,4}}, v_{4,{1,2}}, v_{4,{1,3}}, v_{4,{1,4}}, v_{4,{2,3}}, v_{4,{2,4}}, v_{4,{3,4}}.
table 2 shows the results of comparison with the Ali scheme. K in the table represents the number of nodes required; r represents the number of times each file is calculated; s represents the number of times each output function is calculated. N (Ali) represents the number of files needed in the article being compared; n (New) represents the number of files that we need in this approach; q (Ali) represents the number of functions required in the article being compared; q (New) represents the number of functions that we need in this approach; the final L/L represents the ratio to the traffic load of the article being compared. From table 2, we can intuitively see that using our method significantly reduces the number of files and the number of output functions, but the traffic load is not much increased, but is less than twice as much as the original traffic load.
Table 2: comparison results with Ali scheme
The innovations of the invention are as follows:
in the Map stage, the method of the invention selects a node factor K' of K to obtain the target valueAnd K' to divide the total file into +.>A block; when the file blocks are stored, the number of each node is firstly subjected to modulo K', and then the judgment of which file blocks are stored in the node is carried out according to the modulo result serving as a mark. The proposal proposed by Ali is to divide the file directly into +.>When storing the file blocks, the blocks only need to store the file blocks with the node numbers on the corresponding nodes. Therefore, the scheme of the invention can process fewer files, and the files processed by the scheme proposed by Ali are more, so that the method of the invention has wider application in practice.
In the Shuffle stage, when the method of the invention allocates output functions to each node, a new function allocation rule determines the output functions allocated to each node, whereas the scheme proposed by Ali directly divides all output functions into C(K, s) sets and, during allocation, stores on each node the output function sets whose labels contain that node's number. The method of the invention therefore requires a smaller number of output functions than the method proposed by Ali.
When s ≥ K', our scheme does not require all K distributed computing nodes to participate; signal transmission is performed only on K' distributed computing nodes, which reduces the computing tasks on some distributed computing nodes.
In summary, the method of the present invention reduces the number of files and functions required relative to the method proposed by Ali, so that it can be applied better in practice.
It should be noted that, although the examples described above are illustrative, this is not a limitation of the present invention, and thus the present invention is not limited to the above-described specific embodiments. Other embodiments, which are apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein, are considered to be within the scope of the invention as claimed.

Claims (1)

1. A coding distributed computing method based on the MapReduce framework, characterized by comprising the following steps:
Step 1, divide the given input files evenly and without repetition to obtain C(K', r') input file subsets, where C(K', r') denotes the number of r'-element combinations of K' elements;
Step 2, randomly select r' integers from the integers 0 to K'-1 as the label of each input file subset, with no label repeated;
Step 3, take the number of each node modulo the node factor K' to obtain the label of each node;
Step 4, based on the label of each input file subset and the label of each node, allocate each input file subset to every node whose label appears in the subset's label, and store it there;
step 5, each node calculates the intermediate value of each stored input file subset by using a Map function;
step 6, each node encodes the intermediate values of all the stored subsets of the input files into signals and transmits the signals to other nodes;
Step 7, allocate to each node the set of output functions it must compute, wherein the node numbered k is assigned an output function set W_k containing t output functions;
Step 8, each node randomly selects the intermediate values of the input file subsets it does not store from the intermediate values transmitted by the other nodes; combined with the intermediate values of the input file subsets it stores, it obtains the intermediate values of all input file subsets; it then computes the output function set allocated to it using the intermediate values of all input file subsets, completing the distributed computation;
wherein K is the total number of nodes; K' is a node factor, i.e. a factor of K with K' ≠ 1 and K' ≠ K; r is the number of times each input file is computed; r' = rK'/K; t is the number of output functions allocated to each node, with t = s / gcd(K, s), where s is the number of times each output function is computed and gcd(K, s) denotes the greatest common divisor of K and s; [·] denotes the rounding function, i.e. rounding up; and k ∈ {0, 1, ..., K-1}.
CN202110756959.8A 2021-07-05 2021-07-05 Coding distributed computing method based on MapReduce framework Active CN113434299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756959.8A CN113434299B (en) 2021-07-05 2021-07-05 Coding distributed computing method based on MapReduce framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756959.8A CN113434299B (en) 2021-07-05 2021-07-05 Coding distributed computing method based on MapReduce framework

Publications (2)

Publication Number Publication Date
CN113434299A CN113434299A (en) 2021-09-24
CN113434299B true CN113434299B (en) 2024-02-06

Family

ID=77758959

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756959.8A Active CN113434299B (en) 2021-07-05 2021-07-05 Coding distributed computing method based on MapReduce framework

Country Status (1)

Country Link
CN (1) CN113434299B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011134285A1 (en) * 2010-04-29 2011-11-03 中科院成都计算机应用研究所 Distributed self-adaptive coding and storage method
CN103106253A (en) * 2013-01-16 2013-05-15 西安交通大学 Data balance method based on genetic algorithm in MapReduce calculation module
US8738581B1 (en) * 2012-02-15 2014-05-27 Symantec Corporation Using multiple clients for data backup
CN111045843A (en) * 2019-11-01 2020-04-21 河海大学 Distributed data processing method with fault tolerance capability
CN111490795A (en) * 2020-05-25 2020-08-04 南京大学 Intermediate value length isomerism-oriented encoding MapReduce method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011134285A1 (en) * 2010-04-29 2011-11-03 中科院成都计算机应用研究所 Distributed self-adaptive coding and storage method
US8738581B1 (en) * 2012-02-15 2014-05-27 Symantec Corporation Using multiple clients for data backup
CN103106253A (en) * 2013-01-16 2013-05-15 西安交通大学 Data balance method based on genetic algorithm in MapReduce calculation module
CN111045843A (en) * 2019-11-01 2020-04-21 河海大学 Distributed data processing method with fault tolerance capability
CN111490795A (en) * 2020-05-25 2020-08-04 南京大学 Intermediate value length isomerism-oriented encoding MapReduce method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of coding techniques for improving the performance of large-scale distributed machine learning; Wang Yan; Li Nianshuang; Wang Xiling; Zhong Fengyan; Journal of Computer Research and Development (Issue 03); full text *

Also Published As

Publication number Publication date
CN113434299A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN111062472B (en) Sparse neural network accelerator based on structured pruning and acceleration method thereof
CN111382844B (en) Training method and device for deep learning model
CN101510781B (en) Method and device for filling dummy argument for interlace and de-interlace process as well as processing system
JPH11259441A (en) All-to-all communication method for parallel computer
CN111490795B (en) Intermediate value length isomerism-oriented encoding MapReduce method
WO2022134465A1 (en) Sparse data processing method for accelerating operation of re-configurable processor, and device
CN111104215A (en) Random gradient descent optimization method based on distributed coding
CN113434299B (en) Coding distributed computing method based on MapReduce framework
CN107800700B (en) Router and network-on-chip transmission system and method
CN116842998A (en) Distributed optimization-based multi-FPGA collaborative training neural network method
CN113505021B (en) Fault tolerance method and system based on multi-master-node master-slave distributed architecture
US20200242724A1 (en) Device and method for accelerating graphics processor units, and computer readable storage medium
CN112799852B (en) Multi-dimensional SBP distributed signature decision system and method for logic node
CN110766136B (en) Compression method of sparse matrix and vector
CN115103031B (en) Multistage quantization and self-adaptive adjustment method
US11297127B2 (en) Information processing system and control method of information processing system
CN112769522B (en) Partition structure-based encoding distributed computing method
CN117574966B (en) Model quantization method, device, electronic equipment and storage medium
CN113722666B (en) Application specific integrated circuit chip and method, block chain system and block generation method
CN111966404B (en) GPU-based regular sparse code division multiple access SCMA high-speed parallel decoding method
CN113704681B (en) Data processing method, device and super computing system
CN110598175B (en) Sparse matrix column vector comparison device based on graph computation accelerator
CN113052332B (en) Distributed model parallel equipment distribution optimization method based on equipment balance principle
JP3524430B2 (en) Reduction processing method for parallel computers
JP2019086976A (en) Information processing system, arithmetic processing unit, and control method for information processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant