CN114116293A - MPI-IO-based MapReduce overflow writing improving method - Google Patents


Info

Publication number
CN114116293A
Authority
CN
China
Prior art keywords
mpi
reduce
processing result
file
key value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111208323.6A
Other languages
Chinese (zh)
Inventor
卢宇彤
颜承橹
陈志广
刘志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202111208323.6A
Publication of CN114116293A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • G06F16/137Hash-based
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/172Caching, prefetching or hoarding of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0643Management of files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/0644Management of space entities, e.g. partitions, extents, pools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0673Single storage device
    • G06F3/0674Disk device
    • G06F3/0676Magnetic disk device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an MPI-IO-based MapReduce overflow write improvement method, which comprises the following steps: a Map-end MPI process reads a data set slice from a target file; the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs; when the size of the mapping result is judged to exceed the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result; the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result; and the Reduce end writes the Reduce result to disk. By writing one large file in parallel, the method aggregates the IO requests of multiple MPI processes, reduces a large amount of file reading and writing, avoids generating excessive intermediate files, and relieves the pressure on the metadata server. The invention can be widely applied in the fields of big data processing frameworks and high-performance computing.

Description

MPI-IO-based MapReduce overflow writing improving method
Technical Field
The invention relates to the field of big data processing frameworks and high-performance computing, and in particular to an MPI-IO-based MapReduce overflow write improvement method.
Background
Big data processing frameworks are an important research topic in the field of high-performance computing. They are used to read, write, distribute and compute over large data sets, with the goal of completing parallel processing tasks efficiently and finally extracting valuable information from the data. At present, with the explosive growth of data, the demands on system storage capacity and processing capability keep rising; the computing and storage capacity required to process massive data already exceeds what a single machine can provide, and running a big data processing framework on a cluster to perform distributed computation and parallel processing of large-scale data has become mainstream.
MapReduce is a parallel computing model and method oriented to large-scale data processing; technically it can be regarded as a programming model. It abstracts data processing into the concepts of Map and Reduce and provides abstract operations and parallel programming interfaces, so that programming and computation over large-scale data can be completed simply and conveniently by specifying a Map function and a Reduce function. The Map function maps data into a set of key-value pairs, operating on each element of a conceptual list of independent elements. Map operations are highly parallelizable and are therefore very useful for applications with high performance requirements and for the needs of the parallel computing field. The Reduce function aggregates the values of the same key over all mapped key-value pairs, which is often the step that produces the final result. Because the nodes of a large-scale cluster operate relatively independently, Reduce tasks can also be executed in parallel on the framework.
In a MapReduce-based distributed processing framework, each element is processed independently: data processing does not modify the original file in which an element resides, but writes the mapping result into new temporary files. In large-scale data processing, when a Map task produces too much output, memory may overflow, so under certain conditions the data in the buffer must be written to disk before the buffer is reused. This process of writing data from memory to disk is called an overflow write (spill). When the traditional MapReduce framework performs an overflow write, each process creates a new temporary file for the mapping result of each data block. Because the POSIX interface is used and is constrained by the Linux VFS, the directory containing a temporary file is locked while the file is created, so data files cannot be created in parallel. Second, each Map operation generates a large number of temporary files that are operated on frequently, which overloads the metadata server. In addition, the large number of temporary files brings the reading and writing of many small files, so the storage bandwidth cannot be fully utilized and file read/write speed also becomes a bottleneck of framework performance.
Disclosure of Invention
The invention aims to provide an MPI-IO-based MapReduce overflow write improvement method, which overcomes the technical defect that the traditional POSIX API cannot read and write a file in parallel. By writing one large file in parallel, the method aggregates the IO requests of multiple MPI processes, reduces a large number of file reads and writes, avoids generating excessive intermediate files, and relieves the pressure on the metadata server.
The technical scheme adopted by the invention is as follows: an MPI-IO-based MapReduce overflow write improvement method, comprising the following steps:
S1, a Map-end MPI process reads a data set slice from a target file;
S2, the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs;
S3, when it is judged that the size of the mapping result exceeds the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result;
S4, the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result;
and S5, the Reduce end writes the Reduce result to disk.
Further, the step in which the Map-end MPI process reads a data set slice from the target file specifically includes the following sub-steps (a minimal MPI sketch of the slice assignment is given after the list):
S11, querying and caching the metadata of the target data set from the distributed file system;
S12, the MPI root process calculates the total number of data slices into which the target data set is divided, according to the size of the target data set and a predefined data slice size;
S13, the MPI root process broadcasts the data slice size and the total number of slices to the other MPI processes in the group;
S14, each MPI process calculates the numbers of the slices it is responsible for, according to its process number and the total number of MPI processes in the group;
S15, each MPI process calculates the absolute offset of its data slices within the target file, according to the data slice size and the slice numbers, obtaining the start address of each data slice;
and S16, through the MPI-IO interface, the MPI processes read their assigned data slices from the target data set into the cache in parallel and periodically, starting from the start addresses of the respective slices.
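As an illustration only, and not part of the claimed method, the slice-count computation and broadcast of steps S12-S14 could be sketched in C with MPI roughly as follows; the 64 MB slice size and the helper function are assumptions made for the example.

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

#define SLICE_SIZE (64LL * 1024 * 1024)   /* assumed predefined slice size */

/* Sketch of steps S12-S14: the root process derives the total slice count
 * from the data set size (known from the metadata of step S11) and
 * broadcasts it; every process then computes the slice numbers it owns
 * (0-based here: rank, rank+m, rank+2m, ...). */
int64_t assign_slices(int64_t file_size_on_root, int64_t **my_slices,
                      int64_t *my_count) {
    int rank, m;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &m);

    int64_t total = 0;
    if (rank == 0)   /* S12: total = ceil(fileSize / spliteSize) */
        total = (file_size_on_root + SLICE_SIZE - 1) / SLICE_SIZE;

    MPI_Bcast(&total, 1, MPI_INT64_T, 0, MPI_COMM_WORLD);   /* S13 */

    *my_count = 0;   /* S14: count and record the slices owned by this rank */
    for (int64_t s = rank; s < total; s += m) (*my_count)++;
    *my_slices = malloc((size_t)*my_count * sizeof(int64_t));
    int64_t k = 0;
    for (int64_t s = rank; s < total; s += m) (*my_slices)[k++] = s;
    return total;
}
```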
Further, the step in which the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs specifically includes:
S21, the Map-end MPI process reads the data slice by rows and extracts the data to be used as keys;
S22, a Map operation is performed on each key, mapping it into a key-value pair to obtain the mapping result;
S23, the key-value pairs are hashed, and key-value pairs with different keys are scattered into r different hash partitions, where r denotes the number of Reduce-end nodes executing the Reduce task, obtaining the partitioned key-value pairs.
Further, the step of judging that the size of the mapping result exceeds the memory capacity threshold, executing the overflow write operation at the Map end, and spilling the partition-sorted key-value pairs in parallel to the same disk file to obtain the overflow write result specifically includes:
S31, when the size of the mapping result is judged to exceed the memory capacity threshold, creating r overflow write files on the disk;
S32, each MPI process generating r IO requests to write the key-value pairs of the r partitions of the mapping result to disk;
S33, MPI-IO aggregating the IO requests of different MPI processes that target the same overflow write file;
S34, issuing POSIX API calls to the system through the MPI-IO middleware, and writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file, according to each MPI process's process number and its write offset within the file;
S35, returning to step S33 until all mappings generated by the Map operation have been written;
S36, sorting the key-value pairs in the overflow write file by key, and reducing the data with the same key to obtain the overflow write result.
Further, the step in which the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain the Reduce result specifically includes:
S41, the Reduce-end MPI process receiving the overflow write file from each Map end through the MPI_File_read interface;
S42, the Reduce end merging the received overflow write files and sorting all key-value pairs by key comparison;
and S43, summarizing all values with the same key according to the requirements of the Reduce task to obtain the Reduce result.
Further, the step in which the Reduce end writes the Reduce result to disk specifically includes:
S51, performing a secondary hash on the Reduce result by key and distributing the result to different files;
S52, the MPI process writing the Reduce result into a buffer, and returning to step S51 until the memory is judged to have reached the threshold;
S53, jumping to step S54 if the key has already been written into the file corresponding to the data, and jumping to step S55 if it has not;
S54, merging the value of the existing key with the value of the key to be written, and jumping to step S56;
S55, inserting the key-value pair into the file in key order;
and S56, returning to step S51 until the Reduce result has been completely written.
The beneficial effects of the method and the system are as follows: the invention uses MPI-IO in place of the POSIX interface so that the IO requests of multiple MPI processes are aggregated; development efficiency is improved through the MPI interface, and MPI-IO allows multiple processes to write data to the same file in parallel, which directly avoids generating a large number of intermediate files and also saves the overhead of merging small files. In addition, the method reduces the memory pressure on the metadata server and avoids a large number of processes spilling to the same parallel file system independently. In summary, while introducing no extra overhead, the method improves the overflow write performance of the MapReduce framework and improves the task processing efficiency of the system through the MPI-IO parallel read/write interface.
Drawings
FIG. 1 is a flowchart illustrating steps of a MapReduce overflow write improving method based on MPI-IO according to the present invention;
FIG. 2 is a schematic diagram of the MPI process of the present invention reading a slice of a data set.
FIG. 3 is a diagram illustrating the Map task executed by the present invention.
FIG. 4 is a flowchart of a file-over-write process using an MPI-IO aggregate IO request of multiple MPI processes according to the present invention.
FIG. 5 is a diagram illustrating the writing of data of each MPI process to different areas on an overflow write file according to the present invention.
FIG. 6 is a diagram illustrating the Reduce-side MPI process pulling data from the Map side according to the present invention.
FIG. 7 is a flow chart illustrating writing the Reduce processing result to the disk according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
Taking a MapReduce framework running the WordCount task as an example, and referring to FIG. 1, the invention provides an MPI-IO-based MapReduce overflow write improvement method, which comprises the following steps:
S1, a Map-end MPI process reads a data set slice from a target file;
S2, the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs;
Specifically, a Shuffle operation is performed on the mapping result.
S3, when it is judged that the size of the mapping result exceeds the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result;
S4, the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result;
Specifically, MPI-IO is used in place of the POSIX interface to aggregate the write requests of multiple MPI processes, and the key-value pair data assigned to the same partition by the Shuffle is spilled in parallel to the same disk file.
And S5, the Reduce end writes the Reduce result to disk.
As a further preferred embodiment of the method, referring to FIG. 2, the step in which the Map-end MPI process reads a data set slice from the target file specifically includes:
S11, the MPI process with process number 0 (the root process) queries and caches the metadata of the target data set from the distributed file system;
S12, the MPI root process calculates the total number of data slices into which the target data set is divided, according to the predefined data slice size spliteSize and the target data set size fileSize, namely the total number of slices is ⌈fileSize / spliteSize⌉ (fileSize divided by spliteSize, rounded up);
S13, the MPI root process broadcasts the data slice size and the total number of slices to the other MPI processes in the group;
S14, each MPI process calculates its slice numbers {i, i+m, i+2m, i+3m, ...} according to its process number i and the total number m of MPI processes in the group;
S15, each MPI process calculates the absolute offset of its data slices within the target file according to the data slice size and the slice numbers, obtaining the start address p of the data it will read, measured from the file header;
and S16, through the MPI-IO interface, the MPI processes read their data slices from the target data set into the cache in parallel and periodically, starting from the respective start addresses p.
In this embodiment, the Map-end MPI root process obtains from the distributed file system the metadata of the WordCount file to be processed and divides the WordCount file into multiple data slices of 64 MB; each MPI process then reads the slices corresponding to its number, where a slice is assigned to the process whose number equals the slice number modulo the number of MPI processes. For example, if 10 MPI processes read the data slices, process 1 reads the slices numbered 1, 11, 21, and so on. Once the slice numbers are known, the offset of the data to be read within the file can be calculated; for example, slice No. 1 covers offsets 0-64 MB and slice No. 11 covers 640-704 MB. The MPI-IO interface is then used to let multiple MPI processes read the data file in parallel.
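As a non-normative sketch (assuming MPI-3 and 64 MB slices; the independent MPI_File_read_at is used for illustration, though a collective read such as MPI_File_read_at_all could equally be used), steps S15-S16 might look as follows:

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

#define SLICE_SIZE (64LL * 1024 * 1024)   /* same assumed slice size as above */

/* Sketch of steps S15-S16: each process turns its slice numbers into absolute
 * offsets from the file header and reads its slices through MPI-IO. The last
 * slice of the file may be shorter than SLICE_SIZE; that case and error
 * handling are omitted for brevity. */
void read_my_slices(const char *path, const int64_t *my_slices,
                    int64_t my_count, char **out_bufs) {
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    for (int64_t k = 0; k < my_count; k++) {
        MPI_Offset p = (MPI_Offset)my_slices[k] * SLICE_SIZE;  /* start address p */
        out_bufs[k] = malloc(SLICE_SIZE);
        MPI_File_read_at(fh, p, out_bufs[k], (int)SLICE_SIZE, MPI_BYTE,
                         MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
}
```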
As a further preferred embodiment of the method, referring to FIG. 3, the step in which the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs specifically includes:
S21, the Map-end MPI process reads the data slice by rows and extracts the data to be used as keys;
S22, a Map operation is performed on each key, mapping it into a key-value pair to obtain the mapping result;
S23, the key-value pairs are hashed, and key-value pairs with different keys are scattered into r different hash partitions, where r denotes the number of Reduce-end nodes executing the Reduce task, obtaining the partitioned key-value pairs.
In this embodiment, each MPI process maps the 64 MB data slice it has read: it reads the data line by line and then maps each word in a line, taken as a key, into a key-value pair. For example, "how are you" is mapped to (how,1), (are,1), (you,1), and these key-value pairs are then hashed. Assuming the MapReduce framework has 8 Reduce ends, each key-value pair is assigned to one of 8 partitions according to the hash of its key. Since all MPI processes use the same hash function, identical keys are eventually assigned to the same partition.
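For illustration, the partitioning of step S23 can be expressed with any hash function shared by all processes; the djb2-style string hash below is an arbitrary choice for the sketch and is not prescribed by the method.

```c
/* Sketch of step S23: every Map-end process applies the same hash, so a given
 * key always falls into the same one of the r partitions (r = number of
 * Reduce ends, e.g. r = 8 in the WordCount example). */
static unsigned long word_hash(const char *key) {
    unsigned long h = 5381;                 /* djb2-style string hash */
    for (const char *p = key; *p; p++)
        h = h * 33 + (unsigned char)*p;
    return h;
}

int partition_of(const char *key, int r) {
    return (int)(word_hash(key) % (unsigned long)r);
}
```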
Further, as a preferred embodiment of the method, referring to FIG. 4, the step of judging that the size of the mapping result exceeds the memory capacity threshold, executing the overflow write operation at the Map end, and spilling the partition-sorted key-value pairs in parallel to the same disk file to obtain the overflow write result specifically includes:
S31, when the size of the mapping result is judged to exceed the memory capacity threshold, creating r overflow write files on the disk;
S32, each MPI process generating r IO requests to write the key-value pairs of the r partitions of the mapping result to disk;
S33, MPI-IO aggregating the IO requests of different MPI processes that target the same overflow write file;
S34, issuing POSIX API calls to the system through the MPI-IO middleware, and writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file, according to each MPI process's process number and its write offset within the file;
Specifically, referring to FIG. 5, writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file specifically includes:
S341, each MPI process generates r IO requests to write its key-value pairs into r files, where r is the number of Reduce ends. Since the hash space is shared, the j-th IO request of every MPI process carries the key-value pairs assigned to the j-th partition, so these requests can all be stored in one file.
S342, with m MPI processes at the Map end, the data from the m processes is written into the overflow write file with a period of m and a fixed data block size. The MPI process with process number i writes its data into blocks i, i+m, i+2m, ...; if the block size is blockSize, the process with number i writes to the intervals [(i-1)·blockSize, i·blockSize], [(i-1+m)·blockSize, (i+m)·blockSize], [(i-1+2m)·blockSize, (i+2m)·blockSize], ... in the file space.
S343, the multiple requests of different MPI processes writing to the same file are aggregated by MPI-IO into one large request, and the data from the different MPI processes is written in parallel to different areas of the file according to the process numbers and the partitioning method of S342.
S35, returning to step S33 until all mappings generated by the Map operation have been written;
S36, sorting the key-value pairs in the overflow write file by key, and reducing the data with the same key to obtain the overflow write result.
In this embodiment, the following example is added. When the MPI-IO interface is used to write a file in parallel, each MPI process supplies the file header to write to, the absolute offset, the data size, the data type and its process state, and the MPI-IO interface manages them in a unified way. The interval written by the key-value pairs of each process can be abstracted as one block, and the overflow write file is composed of such blocks. Assuming a block size of 64 MB, when data is written into the overflow write file and the current intervals are full, a new space of 10 × 64 MB is extended, where the interval at offset 0-64 MB belongs to process 1, the interval at 64-128 MB belongs to process 2, and the interval at 576-640 MB belongs to process 10. Each process writes its data into its corresponding interval, whose position is calculated from the file header and the absolute offset.
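As an illustrative sketch only (assuming a 64 MB block, MPI-3, and a collective write, which is one way an MPI-IO implementation such as ROMIO can aggregate the requests of different processes into larger file-system writes), the per-round offset computation of S342 and the aggregated write of S343 might look like this:

```c
#include <mpi.h>

#define BLOCK_SIZE (64LL * 1024 * 1024)   /* assumed block size, as in the example */

/* Sketch of steps S342-S343: in spill round `round`, the process with 0-based
 * rank writes its partition data into block rank + round*m of the shared
 * overflow write file (equivalent to the 1-based intervals
 * [(i-1+round*m)*blockSize, (i+round*m)*blockSize] given in the text).
 * The collective MPI_File_write_at_all lets the MPI-IO layer merge the m
 * per-process requests on the same file into large contiguous writes. */
void spill_block(MPI_File spill_fh, int rank, int m, long long round,
                 const void *buf, int len /* bytes used, <= BLOCK_SIZE */) {
    MPI_Offset off = ((MPI_Offset)rank + (MPI_Offset)round * m) * BLOCK_SIZE;
    MPI_File_write_at_all(spill_fh, off, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
}
```

In such a sketch, spill_fh would be opened once per partition file by all Map-end processes with MPI_File_open on a shared communicator, and one call per spill round would then correspond to one pass through steps S33-S34.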
As a further preferred embodiment of the method, referring to FIG. 6, the step in which the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain the Reduce result specifically includes:
S41, the Reduce-end MPI process receiving the overflow write file from each Map end through the MPI_File_read interface;
Specifically, the Map-end MPI process sends the overflow write file in blocking mode using the MPI_File_write interface, ensuring that the data and its envelope are safely stored; the Reduce-end MPI process, through the MPI_File_read interface, transfers the data between processes under the premise that the receive buffer is larger than the sent data, receiving the overflow write file from each Map end, where the k-th Reduce end receives the data of the k-th partition of each Map end.
S42, the Reduce end merging the received overflow write files and sorting all key-value pairs by key comparison; when two keys are equal, their values are placed into one value iterator.
And S43, summarizing all values with the same key according to the requirements of the Reduce task to obtain the Reduce result.
In this example, the Reduce end obtains data from the Map end through the MPI interface, and each Reduce end receives the key-value pairs of its corresponding partition; for example, Reduce end No. 1 receives overflow write file No. 1 from each Map end, and Reduce end No. 2 receives overflow write file No. 2 from each Map end.
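A minimal sketch of step S41, assuming the k-th partition's overflow write file is named "spill_k" (the file name is an assumption for the example) and that it uses the same block layout as the spill phase:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_SIZE (64LL * 1024 * 1024)   /* same assumed block size as the spill phase */

/* Sketch of step S41: Reduce end k opens overflow write file k and reads, for
 * each spill round, the block written by each of the m Map-end processes.
 * The number of valid bytes per block would come from accompanying metadata;
 * whole blocks are read here for simplicity. */
char **pull_partition(int k, int m, long long rounds) {
    char name[64];
    snprintf(name, sizeof(name), "spill_%d", k);   /* hypothetical file name */

    MPI_File fh;
    MPI_File_open(MPI_COMM_SELF, name, MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    char **blocks = malloc((size_t)(m * rounds) * sizeof(char *));
    for (long long r = 0; r < rounds; r++) {
        for (int i = 0; i < m; i++) {
            MPI_Offset off = ((MPI_Offset)i + (MPI_Offset)r * m) * BLOCK_SIZE;
            blocks[r * m + i] = malloc(BLOCK_SIZE);
            MPI_File_read_at(fh, off, blocks[r * m + i], (int)BLOCK_SIZE,
                             MPI_BYTE, MPI_STATUS_IGNORE);
        }
    }
    MPI_File_close(&fh);
    return blocks;
}
```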
As a further preferred embodiment of the method, referring to FIG. 7, the step in which the Reduce end writes the Reduce result to disk specifically includes:
S51, performing a secondary hash on the Reduce result by key and distributing the result to different files;
S52, the MPI process writing the Reduce result into a buffer, and returning to step S51 until the memory is judged to have reached the threshold;
S53, jumping to step S54 if the key has already been written into the file corresponding to the data, and jumping to step S55 if it has not;
S54, merging the value of the existing key with the value of the key to be written, and jumping to step S56;
S55, inserting the key-value pair into the file in key order;
and S56, returning to step S51 until the Reduce result has been completely written.
In this example, the Reduce end distributes a reduced result, e.g. (you,300), to different files according to the hash of its key. The reduced result is written into a cache; because memory is limited, it may happen that the memory cannot hold all keys at the same time, in which case the reduced key-value pairs are spilled. It is then checked whether a key has already been written into the target file: if not, the key is inserted in order; if it has, for example if (you,500) has already been written to the disk file, the two are combined to produce a new key-value pair (you,800).
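As an in-memory simplification (the patent applies the same logic to the key-ordered on-disk file; keeping everything in an array here is an assumption made for brevity), the merge-or-insert decision of steps S53-S55 for WordCount could look like this:

```c
#include <string.h>

/* Sketch of steps S53-S55: if the key already exists in the key-ordered
 * result, the values are summed (S54); otherwise the pair is inserted at its
 * sorted position (S55). The caller guarantees `arr` has room for one more
 * element. */
typedef struct { char key[64]; long long value; } kv_t;

void merge_or_insert(kv_t *arr, size_t *n, const char *key, long long value) {
    size_t i = 0;
    while (i < *n && strcmp(arr[i].key, key) < 0) i++;
    if (i < *n && strcmp(arr[i].key, key) == 0) {
        arr[i].value += value;                /* S54: (you,500)+(you,300) -> (you,800) */
    } else {
        memmove(&arr[i + 1], &arr[i], (*n - i) * sizeof(kv_t));
        strncpy(arr[i].key, key, sizeof(arr[i].key) - 1);
        arr[i].key[sizeof(arr[i].key) - 1] = '\0';
        arr[i].value = value;                 /* S55: insert in key order */
        (*n)++;
    }
}
```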
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. An MPI-IO-based MapReduce overflow write improvement method, characterized by comprising the following steps:
S1, a Map-end MPI process reads a data set slice from a target file;
S2, the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs;
S3, when it is judged that the size of the mapping result exceeds the memory capacity threshold, the Map end executes an overflow write operation and spills the partition-sorted key-value pairs in parallel to the same disk file to obtain an overflow write result;
S4, the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain a Reduce result;
and S5, the Reduce end writes the Reduce result to disk.
2. The method as claimed in claim 1, wherein the step in which the Map-end MPI process reads a data set slice from the target file specifically includes:
S11, querying and caching the metadata of the target data set from the distributed file system;
S12, the MPI root process calculates the total number of data slices into which the target data set is divided, according to the size of the target data set and a predefined data slice size;
S13, the MPI root process broadcasts the data slice size and the total number of slices to the other MPI processes in the group;
S14, each MPI process calculates the numbers of the slices it is responsible for, according to its process number and the total number of MPI processes in the group;
S15, each MPI process calculates the absolute offset of its data slices within the target file according to the data slice size and the slice numbers, obtaining the start address of each data slice;
and S16, through the MPI-IO interface, the MPI processes read their assigned data slices from the target data set into the cache in parallel and periodically, starting from the start addresses of the respective slices.
3. The MPI-IO-based MapReduce overflow write improvement method as claimed in claim 2, wherein the Map-end MPI process runs a Map task, maps the data slice and partitions the mapping result to obtain partitioned key-value pairs, specifically including:
S21, the Map-end MPI process reads the data slice by rows and extracts the data to be used as keys;
S22, a Map operation is performed on each key, mapping it into a key-value pair to obtain the mapping result;
S23, the key-value pairs are hashed, and key-value pairs with different keys are scattered into r different hash partitions, where r denotes the number of Reduce-end nodes executing the Reduce task, obtaining the partitioned key-value pairs.
4. The method according to claim 2, wherein the step of judging that the size of the mapping result exceeds the memory capacity threshold, executing the overflow write operation at the Map end, and spilling the partition-sorted key-value pairs in parallel to the same disk file to obtain the overflow write result specifically includes:
S31, when the size of the mapping result is judged to exceed the memory capacity threshold, creating r overflow write files on the disk;
S32, each MPI process generating r IO requests to write the key-value pairs of the r partitions of the mapping result to disk;
S33, MPI-IO aggregating the IO requests of different MPI processes that target the same overflow write file;
S34, issuing POSIX API calls to the system through the MPI-IO middleware, and writing the key-value pairs that belong to the same hash partition in different MPI processes to different areas of the overflow write file, according to each MPI process's process number and its write offset within the file;
S35, returning to step S33 until all mappings generated by the Map operation have been written;
S36, sorting the key-value pairs in the overflow write file by key, and reducing the data with the same key to obtain the overflow write result.
5. The method as claimed in claim 4, wherein the Reduce-end MPI process pulls the overflow write result from the Map end and uses a Reduce task to reduce the key-value pairs to obtain the Reduce result, specifically including:
S41, the Reduce-end MPI process receiving the overflow write file from each Map end through the MPI_File_read interface;
S42, the Reduce end merging the received overflow write files and sorting all key-value pairs by key comparison;
and S43, summarizing all values with the same key according to the requirements of the Reduce task to obtain the Reduce result.
6. The method as claimed in claim 5, wherein the step in which the Reduce end writes the Reduce result to disk specifically includes:
S51, performing a secondary hash on the Reduce result by key and distributing the result to different files;
S52, the MPI process writing the Reduce result into a buffer, and returning to step S51 until the memory is judged to have reached the threshold;
S53, jumping to step S54 if the key has already been written into the file corresponding to the data, and jumping to step S55 if it has not;
S54, merging the value of the existing key with the value of the key to be written, and jumping to step S56;
S55, inserting the key-value pair into the file in key order;
and S56, returning to step S51 until the Reduce result has been completely written.
CN202111208323.6A 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method Pending CN114116293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111208323.6A CN114116293A (en) 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111208323.6A CN114116293A (en) 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method

Publications (1)

Publication Number Publication Date
CN114116293A true CN114116293A (en) 2022-03-01

Family

ID=80375878

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111208323.6A Pending CN114116293A (en) 2021-10-18 2021-10-18 MPI-IO-based MapReduce overflow writing improving method

Country Status (1)

Country Link
CN (1) CN114116293A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024012153A1 (en) * 2022-07-14 2024-01-18 华为技术有限公司 Data processing method and apparatus



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination