CN114116293A - MPI-IO-based MapReduce overflow writing improving method - Google Patents


Info

Publication number
CN114116293A
Authority
CN
China
Prior art keywords
mpi
reduce
processing result
file
key value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111208323.6A
Other languages
Chinese (zh)
Inventor
卢宇彤
颜承橹
陈志广
刘志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111208323.6A
Publication of CN114116293A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • G06F16/137Hash-based
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0643Management of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0674Disk device
    • G06F3/0676Magnetic disk device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an MPI-IO-based MapReduce overflow write improvement method, which comprises the following steps: a Map-end MPI process reads a data set slice from a target file; the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs; when the size of the mapping result is judged to exceed the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result; the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result; and the Reduce end writes the Reduce result to disk. By writing one large file in parallel, the method aggregates the IO requests of multiple MPI processes, reduces a large amount of file reading and writing, avoids generating excessive intermediate files, and relieves the pressure on the metadata server. The invention can be widely applied in the fields of big data processing frameworks and high-performance computing.

Description

MPI-IO-based MapReduce overflow writing improving method
Technical Field
The invention relates to the field of big data processing frameworks and high-performance computing, and in particular to an MPI-IO-based MapReduce overflow write improvement method.
Background
Big data processing frameworks are an important research topic in the field of high-performance computing. They are used to read, write, distribute and compute over large data sets, with the goal of completing parallel processing tasks efficiently and finally extracting valuable information from the data. At present, with the explosive growth of data, the demands on system storage capacity and processing capability keep rising; the computing and storage capacity required to process massive data already exceeds what a single machine can provide, and running a big data processing framework on a cluster to perform distributed computation and parallel processing of large-scale data has become mainstream.
MapReduce is a parallel computing model and method oriented to large-scale data processing; technically it can be regarded as a programming model. It abstracts data processing into the concepts of Map and Reduce and provides abstract operations and parallel programming interfaces, so that programming and computation over large-scale data can be completed simply and conveniently by specifying a Map function and a Reduce function. The Map function maps data into a set of key-value pairs, operating on each element of a conceptual list of independent elements. Map operations are highly parallelizable and are therefore very useful for applications with high performance requirements and for the needs of the parallel computing field. The Reduce function aggregates the values of the same key over all mapped key-value pairs, which is often the step that produces the final result. Because the nodes of a large-scale cluster operate relatively independently, Reduce tasks can also be executed in parallel on the framework.
In a MapReduce-based distributed processing framework, each element is processed independently: data processing does not modify the original file in which an element resides, but writes the mapping result into new temporary files. In large-scale data processing, when a Map task produces too much output, memory may overflow, so under certain conditions the data in the buffer must be written to disk before the buffer is reused. This process of writing data from memory to disk is called an overflow write (spill). When the traditional MapReduce framework performs an overflow write, each process creates a new temporary file for the mapping result of each data block. Because the POSIX interface is used and is constrained by the Linux VFS, the directory containing a temporary file is locked while the file is created, so data files cannot be created in parallel. Second, each Map operation generates a large number of temporary files that are operated on frequently, which overloads the metadata server. In addition, the large number of temporary files brings the reading and writing of many small files, so the storage bandwidth cannot be fully utilized and file read/write speed also becomes a bottleneck of framework performance.
Disclosure of Invention
The invention aims to provide an MPI-IO-based MapReduce overflow write improvement method, which overcomes the technical defect that the traditional POSIX API cannot read and write a file in parallel. By writing one large file in parallel, the method aggregates the IO requests of multiple MPI processes, reduces a large number of file reads and writes, avoids generating excessive intermediate files, and relieves the pressure on the metadata server.
The technical scheme adopted by the invention is as follows: an MPI-IO-based MapReduce overflow write improvement method, comprising the following steps:
S1, a Map-end MPI process reads a data set slice from a target file;
S2, the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs;
S3, when it is judged that the size of the mapping result exceeds the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result;
S4, the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result;
and S5, the Reduce end writes the Reduce result to disk.
Further, the step in which the Map-end MPI process reads a data set slice from the target file specifically includes the following sub-steps (a minimal MPI sketch of the slice assignment is given after the list):
S11, querying and caching the metadata of the target data set from the distributed file system;
S12, the MPI root process calculates the total number of data slices into which the target data set is divided, according to the size of the target data set and a predefined data slice size;
S13, the MPI root process broadcasts the data slice size and the total number of slices to the other MPI processes in the group;
S14, each MPI process calculates the numbers of the slices it is responsible for, according to its process number and the total number of MPI processes in the group;
S15, each MPI process calculates the absolute offset of its data slices within the target file, according to the data slice size and the slice numbers, obtaining the start address of each data slice;
and S16, through the MPI-IO interface, the MPI processes read their assigned data slices from the target data set into the cache in parallel and periodically, starting from the start addresses of the respective slices.
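As an illustration only, and not part of the claimed method, the slice-count computation and broadcast of steps S12-S14 could be sketched in C with MPI roughly as follows; the 64 MB slice size and the helper function are assumptions made for the example.

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

#define SLICE_SIZE (64LL * 1024 * 1024)   /* assumed predefined slice size */

/* Sketch of steps S12-S14: the root process derives the total slice count
 * from the data set size (known from the metadata of step S11) and
 * broadcasts it; every process then computes the slice numbers it owns
 * (0-based here: rank, rank+m, rank+2m, ...). */
int64_t assign_slices(int64_t file_size_on_root, int64_t **my_slices,
                      int64_t *my_count) {
    int rank, m;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &m);

    int64_t total = 0;
    if (rank == 0)   /* S12: total = ceil(fileSize / spliteSize) */
        total = (file_size_on_root + SLICE_SIZE - 1) / SLICE_SIZE;

    MPI_Bcast(&total, 1, MPI_INT64_T, 0, MPI_COMM_WORLD);   /* S13 */

    *my_count = 0;   /* S14: count and record the slices owned by this rank */
    for (int64_t s = rank; s < total; s += m) (*my_count)++;
    *my_slices = malloc((size_t)*my_count * sizeof(int64_t));
    int64_t k = 0;
    for (int64_t s = rank; s < total; s += m) (*my_slices)[k++] = s;
    return total;
}
```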
Further, the step in which the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs specifically includes:
S21, the Map-end MPI process reads the data slice by rows and extracts the data to be used as keys;
S22, a Map operation is performed on each key, mapping it into a key-value pair to obtain the mapping result;
S23, the key-value pairs are hashed, and key-value pairs with different keys are scattered into r different hash partitions, where r denotes the number of Reduce-end nodes executing the Reduce task, obtaining the partitioned key-value pairs.
Further, the step of judging that the size of the mapping result exceeds the memory capacity threshold, executing the overflow write operation at the Map end, and spilling the partition-sorted key-value pairs in parallel to the same disk file to obtain the overflow write result specifically includes:
S31, when the size of the mapping result is judged to exceed the memory capacity threshold, creating r overflow write files on the disk;
S32, each MPI process generating r IO requests to write the key-value pairs of the r partitions of the mapping result to disk;
S33, MPI-IO aggregating the IO requests of different MPI processes that target the same overflow write file;
S34, issuing POSIX API calls to the system through the MPI-IO middleware, and writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file, according to each MPI process's process number and its write offset within the file;
S35, returning to step S33 until all mappings generated by the Map operation have been written;
S36, sorting the key-value pairs in the overflow write file by key, and reducing the data with the same key to obtain the overflow write result.
Further, the step in which the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain the Reduce result specifically includes:
S41, the Reduce-end MPI process receiving the overflow write file from each Map end through the MPI_File_read interface;
S42, the Reduce end merging the received overflow write files and sorting all key-value pairs by key comparison;
and S43, summarizing all values with the same key according to the requirements of the Reduce task to obtain the Reduce result.
Further, the step in which the Reduce end writes the Reduce result to disk specifically includes:
S51, performing a secondary hash on the Reduce result by key and distributing the result to different files;
S52, the MPI process writing the Reduce result into a buffer, and returning to step S51 until the memory is judged to have reached the threshold;
S53, jumping to step S54 if the key has already been written into the file corresponding to the data, and jumping to step S55 if it has not;
S54, merging the value of the existing key with the value of the key to be written, and jumping to step S56;
S55, inserting the key-value pair into the file in key order;
and S56, returning to step S51 until the Reduce result has been completely written.
The beneficial effects of the method and the system are as follows: the invention uses MPI-IO in place of the POSIX interface so that the IO requests of multiple MPI processes are aggregated; development efficiency is improved through the MPI interface, and MPI-IO allows multiple processes to write data to the same file in parallel, which directly avoids generating a large number of intermediate files and also saves the overhead of merging small files. In addition, the method reduces the memory pressure on the metadata server and avoids a large number of processes spilling to the same parallel file system independently. In summary, while introducing no extra overhead, the method improves the overflow write performance of the MapReduce framework and improves the task processing efficiency of the system through the MPI-IO parallel read/write interface.
Drawings
FIG. 1 is a flowchart illustrating steps of a MapReduce overflow write improving method based on MPI-IO according to the present invention;
FIG. 2 is a schematic diagram of the MPI process of the present invention reading a slice of a data set.
FIG. 3 is a diagram illustrating the Map task executed by the present invention.
FIG. 4 is a flowchart of a file-over-write process using an MPI-IO aggregate IO request of multiple MPI processes according to the present invention.
FIG. 5 is a diagram illustrating the writing of data of each MPI process to different areas on an overflow write file according to the present invention.
FIG. 6 is a diagram illustrating the Reduce-side MPI process pulling data from the Map side according to the present invention.
FIG. 7 is a flow chart illustrating writing the Reduce processing result to the disk according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Taking a MapReduce framework running the WordCount task as an example, and referring to FIG. 1, the invention provides an MPI-IO-based MapReduce overflow write improvement method, which comprises the following steps:
S1, a Map-end MPI process reads a data set slice from a target file;
S2, the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs;
Specifically, a Shuffle operation is performed on the mapping result.
S3, when it is judged that the size of the mapping result exceeds the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result;
S4, the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result;
Specifically, MPI-IO is used in place of the POSIX interface to aggregate the write requests of multiple MPI processes, and the key-value pair data assigned to the same partition by the Shuffle is spilled in parallel to the same disk file.
And S5, the Reduce end writes the Reduce result to disk.
As a further preferred embodiment of the method, referring to FIG. 2, the step in which the Map-end MPI process reads a data set slice from the target file specifically includes:
S11, the MPI process with process number 0 (the root process) queries and caches the metadata of the target data set from the distributed file system;
S12, the MPI root process calculates the total number of data slices into which the target data set is divided, according to the predefined data slice size spliteSize and the target data set size fileSize, namely the total number of slices is ⌈fileSize / spliteSize⌉ (fileSize divided by spliteSize, rounded up);
S13, the MPI root process broadcasts the data slice size and the total number of slices to the other MPI processes in the group;
S14, each MPI process calculates its slice numbers {i, i+m, i+2m, i+3m, ...} according to its process number i and the total number m of MPI processes in the group;
S15, each MPI process calculates the absolute offset of its data slices within the target file according to the data slice size and the slice numbers, obtaining the start address p of the data it will read, measured from the file header;
and S16, through the MPI-IO interface, the MPI processes read their data slices from the target data set into the cache in parallel and periodically, starting from the respective start addresses p.
In this embodiment, the Map-end MPI root process obtains from the distributed file system the metadata of the WordCount file to be processed and divides the WordCount file into multiple data slices of 64 MB; each MPI process then reads the slices corresponding to its number, where a slice is assigned to the process whose number equals the slice number modulo the number of MPI processes. For example, if 10 MPI processes read the data slices, process 1 reads the slices numbered 1, 11, 21, and so on. Once the slice numbers are known, the offset of the data to be read within the file can be calculated; for example, slice No. 1 covers offsets 0-64 MB and slice No. 11 covers 640-704 MB. The MPI-IO interface is then used to let multiple MPI processes read the data file in parallel.
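As a non-normative sketch (assuming MPI-3 and 64 MB slices; the independent MPI_File_read_at is used for illustration, though a collective read such as MPI_File_read_at_all could equally be used), steps S15-S16 might look as follows:

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

#define SLICE_SIZE (64LL * 1024 * 1024)   /* same assumed slice size as above */

/* Sketch of steps S15-S16: each process turns its slice numbers into absolute
 * offsets from the file header and reads its slices through MPI-IO. The last
 * slice of the file may be shorter than SLICE_SIZE; that case and error
 * handling are omitted for brevity. */
void read_my_slices(const char *path, const int64_t *my_slices,
                    int64_t my_count, char **out_bufs) {
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    for (int64_t k = 0; k < my_count; k++) {
        MPI_Offset p = (MPI_Offset)my_slices[k] * SLICE_SIZE;  /* start address p */
        out_bufs[k] = malloc(SLICE_SIZE);
        MPI_File_read_at(fh, p, out_bufs[k], (int)SLICE_SIZE, MPI_BYTE,
                         MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
}
```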
As a further preferred embodiment of the method, referring to FIG. 3, the step in which the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs specifically includes:
S21, the Map-end MPI process reads the data slice by rows and extracts the data to be used as keys;
S22, a Map operation is performed on each key, mapping it into a key-value pair to obtain the mapping result;
S23, the key-value pairs are hashed, and key-value pairs with different keys are scattered into r different hash partitions, where r denotes the number of Reduce-end nodes executing the Reduce task, obtaining the partitioned key-value pairs.
In this embodiment, each MPI process maps the 64 MB data slice it has read: it reads the data line by line and then maps each word in a line, taken as a key, into a key-value pair. For example, "how are you" is mapped to (how,1), (are,1), (you,1), and these key-value pairs are then hashed. Assuming the MapReduce framework has 8 Reduce ends, each key-value pair is assigned to one of 8 partitions according to the hash of its key. Since all MPI processes use the same hash function, identical keys are eventually assigned to the same partition.
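For illustration, the partitioning of step S23 can be expressed with any hash function shared by all processes; the djb2-style string hash below is an arbitrary choice for the sketch and is not prescribed by the method.

```c
/* Sketch of step S23: every Map-end process applies the same hash, so a given
 * key always falls into the same one of the r partitions (r = number of
 * Reduce ends, e.g. r = 8 in the WordCount example). */
static unsigned long word_hash(const char *key) {
    unsigned long h = 5381;                 /* djb2-style string hash */
    for (const char *p = key; *p; p++)
        h = h * 33 + (unsigned char)*p;
    return h;
}

int partition_of(const char *key, int r) {
    return (int)(word_hash(key) % (unsigned long)r);
}
```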
Further, as a preferred embodiment of the method, referring to FIG. 4, the step of judging that the size of the mapping result exceeds the memory capacity threshold, executing the overflow write operation at the Map end, and spilling the partition-sorted key-value pairs in parallel to the same disk file to obtain the overflow write result specifically includes:
S31, when the size of the mapping result is judged to exceed the memory capacity threshold, creating r overflow write files on the disk;
S32, each MPI process generating r IO requests to write the key-value pairs of the r partitions of the mapping result to disk;
S33, MPI-IO aggregating the IO requests of different MPI processes that target the same overflow write file;
S34, issuing POSIX API calls to the system through the MPI-IO middleware, and writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file, according to each MPI process's process number and its write offset within the file;
Specifically, referring to FIG. 5, writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file specifically includes:
S341, each MPI process generates r IO requests to write its key-value pairs into r files, where r is the number of Reduce ends. Since the hash space is shared, the j-th IO request of every MPI process carries the key-value pairs assigned to the j-th partition, so these requests can all be stored in one file.
S342, with m MPI processes at the Map end, the data from the m processes is written into the overflow write file with a period of m and a fixed data block size. The MPI process with process number i writes its data into blocks i, i+m, i+2m, ...; if the block size is blockSize, the process with number i writes to the intervals [(i-1)·blockSize, i·blockSize], [(i-1+m)·blockSize, (i+m)·blockSize], [(i-1+2m)·blockSize, (i+2m)·blockSize], ... in the file space.
S343, the multiple requests of different MPI processes writing to the same file are aggregated by MPI-IO into one large request, and the data from the different MPI processes is written in parallel to different areas of the file according to the process numbers and the partitioning method of S342.
S35, returning to step S33 until all mappings generated by the Map operation have been written;
S36, sorting the key-value pairs in the overflow write file by key, and reducing the data with the same key to obtain the overflow write result.
In this embodiment, the following example is added. When the MPI-IO interface is used to write a file in parallel, each MPI process supplies the file header to write to, the absolute offset, the data size, the data type and its process state, and the MPI-IO interface manages them in a unified way. The interval written by the key-value pairs of each process can be abstracted as one block, and the overflow write file is composed of such blocks. Assuming a block size of 64 MB, when data is written into the overflow write file and the current intervals are full, a new space of 10 × 64 MB is extended, where the interval at offset 0-64 MB belongs to process 1, the interval at 64-128 MB belongs to process 2, and the interval at 576-640 MB belongs to process 10. Each process writes its data into its corresponding interval, whose position is calculated from the file header and the absolute offset.
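As an illustrative sketch only (assuming a 64 MB block, MPI-3, and a collective write, which is one way an MPI-IO implementation such as ROMIO can aggregate the requests of different processes into larger file-system writes), the per-round offset computation of S342 and the aggregated write of S343 might look like this:

```c
#include <mpi.h>

#define BLOCK_SIZE (64LL * 1024 * 1024)   /* assumed block size, as in the example */

/* Sketch of steps S342-S343: in spill round `round`, the process with 0-based
 * rank writes its partition data into block rank + round*m of the shared
 * overflow write file (equivalent to the 1-based intervals
 * [(i-1+round*m)*blockSize, (i+round*m)*blockSize] given in the text).
 * The collective MPI_File_write_at_all lets the MPI-IO layer merge the m
 * per-process requests on the same file into large contiguous writes. */
void spill_block(MPI_File spill_fh, int rank, int m, long long round,
                 const void *buf, int len /* bytes used, <= BLOCK_SIZE */) {
    MPI_Offset off = ((MPI_Offset)rank + (MPI_Offset)round * m) * BLOCK_SIZE;
    MPI_File_write_at_all(spill_fh, off, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
}
```

In such a sketch, spill_fh would be opened once per partition file by all Map-end processes with MPI_File_open on a shared communicator, and one call per spill round would then correspond to one pass through steps S33-S34.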
As a further preferred embodiment of the method, referring to FIG. 6, the step in which the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain the Reduce result specifically includes:
S41, the Reduce-end MPI process receiving the overflow write file from each Map end through the MPI_File_read interface;
Specifically, the Map-end MPI process sends the overflow write file in blocking mode using the MPI_File_write interface, ensuring that the data and its envelope are safely stored; the Reduce-end MPI process, through the MPI_File_read interface, transfers the data between processes under the premise that the receive buffer is larger than the sent data, receiving the overflow write file from each Map end, where the k-th Reduce end receives the data of the k-th partition of each Map end.
S42, the Reduce end merging the received overflow write files and sorting all key-value pairs by key comparison; when two keys are equal, their values are placed into one value iterator.
And S43, summarizing all values with the same key according to the requirements of the Reduce task to obtain the Reduce result.
In this example, the Reduce end obtains data from the Map end through the MPI interface, and each Reduce end receives the key-value pairs of its corresponding partition; for example, Reduce end No. 1 receives overflow write file No. 1 from each Map end, and Reduce end No. 2 receives overflow write file No. 2 from each Map end.
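A minimal sketch of step S41, assuming the k-th partition's overflow write file is named "spill_k" (the file name is an assumption for the example) and that it uses the same block layout as the spill phase:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (64LL * 1024 * 1024)   /* same assumed block size as the spill phase */

/* Sketch of step S41: Reduce end k opens overflow write file k and reads, for
 * each spill round, the block written by each of the m Map-end processes.
 * The number of valid bytes per block would come from accompanying metadata;
 * whole blocks are read here for simplicity. */
char **pull_partition(int k, int m, long long rounds) {
    char name[64];
    snprintf(name, sizeof(name), "spill_%d", k);   /* hypothetical file name */

    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, name, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    char **blocks = malloc((size_t)(m * rounds) * sizeof(char *));
    for (long long r = 0; r < rounds; r++) {
        for (int i = 0; i < m; i++) {
            MPI_Offset off = ((MPI_Offset)i + (MPI_Offset)r * m) * BLOCK_SIZE;
            blocks[r * m + i] = malloc(BLOCK_SIZE);
            MPI_File_read_at(fh, off, blocks[r * m + i], (int)BLOCK_SIZE,
                             MPI_BYTE, MPI_STATUS_IGNORE);
        }
    }
    MPI_File_close(&fh);
    return blocks;
}
```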
As a further preferred embodiment of the method, referring to FIG. 7, the step in which the Reduce end writes the Reduce result to disk specifically includes:
S51, performing a secondary hash on the Reduce result by key and distributing the result to different files;
S52, the MPI process writing the Reduce result into a buffer, and returning to step S51 until the memory is judged to have reached the threshold;
S53, jumping to step S54 if the key has already been written into the file corresponding to the data, and jumping to step S55 if it has not;
S54, merging the value of the existing key with the value of the key to be written, and jumping to step S56;
S55, inserting the key-value pair into the file in key order;
and S56, returning to step S51 until the Reduce result has been completely written.
In this example, the Reduce end distributes a reduced result, e.g. (you,300), to different files according to the hash of its key. The reduced result is written into a cache; because memory is limited, it may happen that the memory cannot hold all keys at the same time, in which case the reduced key-value pairs are spilled. It is then checked whether a key has already been written into the target file: if not, the key is inserted in order; if it has, for example if (you,500) has already been written to the disk file, the two are combined to produce a new key-value pair (you,800).
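As an in-memory simplification (the patent applies the same logic to the key-ordered on-disk file; keeping everything in an array here is an assumption made for brevity), the merge-or-insert decision of steps S53-S55 for WordCount could look like this:

```c
#include <string.h>

/* Sketch of steps S53-S55: if the key already exists in the key-ordered
 * result, the values are summed (S54); otherwise the pair is inserted at its
 * sorted position (S55). The caller guarantees `arr` has room for one more
 * element. */
typedef struct { char key[64]; long long value; } kv_t;

void merge_or_insert(kv_t *arr, size_t *n, const char *key, long long value) {
    size_t i = 0;
    while (i < *n && strcmp(arr[i].key, key) < 0) i++;
    if (i < *n && strcmp(arr[i].key, key) == 0) {
        arr[i].value += value;                /* S54: (you,500)+(you,300) -> (you,800) */
    } else {
        memmove(&arr[i + 1], &arr[i], (*n - i) * sizeof(kv_t));
        strncpy(arr[i].key, key, sizeof(arr[i].key) - 1);
        arr[i].key[sizeof(arr[i].key) - 1] = '\0';
        arr[i].value = value;                 /* S55: insert in key order */
        (*n)++;
    }
}
```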
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. An MPI-IO-based MapReduce overflow write improvement method, characterized by comprising the following steps:
S1, a Map-end MPI process reads a data set slice from a target file;
S2, the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs;
S3, when it is judged that the size of the mapping result exceeds the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result;
S4, the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result;
and S5, the Reduce end writes the Reduce result to disk.
2. The method as claimed in claim 1, wherein the step in which the Map-end MPI process reads a data set slice from the target file specifically includes:
S11, querying and caching the metadata of the target data set from the distributed file system;
S12, the MPI root process calculates the total number of data slices into which the target data set is divided, according to the size of the target data set and a predefined data slice size;
S13, the MPI root process broadcasts the data slice size and the total number of slices to the other MPI processes in the group;
S14, each MPI process calculates the numbers of the slices it is responsible for, according to its process number and the total number of MPI processes in the group;
S15, each MPI process calculates the absolute offset of its data slices within the target file according to the data slice size and the slice numbers, obtaining the start address of each data slice;
and S16, through the MPI-IO interface, the MPI processes read their assigned data slices from the target data set into the cache in parallel and periodically, starting from the start addresses of the respective slices.
3. The MPI-IO-based MapReduce overflow write improvement method as claimed in claim 2, wherein the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs, specifically including:
S21, the Map-end MPI process reads the data slice by rows and extracts the data to be used as keys;
S22, a Map operation is performed on each key, mapping it into a key-value pair to obtain the mapping result;
S23, the key-value pairs are hashed, and key-value pairs with different keys are scattered into r different hash partitions, where r denotes the number of Reduce-end nodes executing the Reduce task, obtaining the partitioned key-value pairs.
4. The method according to claim 2, wherein the step of judging that the size of the mapping result exceeds the memory capacity threshold, executing the overflow write operation at the Map end, and spilling the partition-sorted key-value pairs in parallel to the same disk file to obtain the overflow write result specifically includes:
S31, when the size of the mapping result is judged to exceed the memory capacity threshold, creating r overflow write files on the disk;
S32, each MPI process generating r IO requests to write the key-value pairs of the r partitions of the mapping result to disk;
S33, MPI-IO aggregating the IO requests of different MPI processes that target the same overflow write file;
S34, issuing POSIX API calls to the system through the MPI-IO middleware, and writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file, according to each MPI process's process number and its write offset within the file;
S35, returning to step S33 until all mappings generated by the Map operation have been written;
S36, sorting the key-value pairs in the overflow write file by key, and reducing the data with the same key to obtain the overflow write result.
5. The method as claimed in claim 4, wherein the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain the Reduce result, specifically including:
S41, the Reduce-end MPI process receiving the overflow write file from each Map end through the MPI_File_read interface;
S42, the Reduce end merging the received overflow write files and sorting all key-value pairs by key comparison;
and S43, summarizing all values with the same key according to the requirements of the Reduce task to obtain the Reduce result.
6. The method as claimed in claim 5, wherein the step in which the Reduce end writes the Reduce result to disk specifically includes:
S51, performing a secondary hash on the Reduce result by key and distributing the result to different files;
S52, the MPI process writing the Reduce result into a buffer, and returning to step S51 until the memory is judged to have reached the threshold;
S53, jumping to step S54 if the key has already been written into the file corresponding to the data, and jumping to step S55 if it has not;
S54, merging the value of the existing key with the value of the key to be written, and jumping to step S56;
S55, inserting the key-value pair into the file in key order;
and S56, returning to step S51 until the Reduce result has been completely written.
CN202111208323.6A 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method Pending CN114116293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111208323.6A CN114116293A (en) 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111208323.6A CN114116293A (en) 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method

Publications (1)

Publication Number Publication Date
CN114116293A true CN114116293A (en) 2022-03-01

Family

ID=80375878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111208323.6A Pending CN114116293A (en) 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method

Country Status (1)

Country Link
CN (1) CN114116293A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012153A1 (en) * 2022-07-14 2024-01-18 华为技术有限公司 Data processing method and apparatus



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination