CN107329705B

CN107329705B - Shuffle method for heterogeneous storage

Info

Publication number: CN107329705B
Application number: CN201710532428.4A
Authority: CN
Inventors: 潘锋烽; 熊劲
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2017-07-03
Filing date: 2017-07-03
Publication date: 2020-06-05
Anticipated expiration: 2037-07-03
Also published as: CN107329705A

Abstract

The invention relates to a Shuffle method for heterogeneous storage, comprising the following steps: forming resource pools from the heterogeneous storage devices according to their media types; and writing the Shuffle data into the corresponding resource pool according to the load type.

Description

Shuffle method for heterogeneous storage

Technical Field

The invention relates to the technical field of big data processing, in particular to a Shuffle method for heterogeneous storage.

Background

With the development of science and technology, the world has entered the big data era, and the Shuffle stage is an extremely important stage in big data processing. FIG. 1 is a schematic flow diagram of the Shuffle stage. As shown in FIG. 1, Shuffle refers to the process of exchanging data between stages of different types, so that data is redistributed across the nodes according to a certain rule. In general, the performance of Shuffle directly affects the running efficiency of the whole program.

In the prior art, there are the following methods for optimizing the Shuffle stage:

Themis, published in Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC), 2012, proposes a dynamic memory allocation strategy for holding intermediate data during the Shuffle stage, so that data is read from and written to disk only twice during processing and the remaining steps do not interact with the disk. SpongeFiles, published in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, proposes sharing the unused memory space of Tasks. Both methods accelerate Shuffle only through memory and are easily limited by the memory type;

in addition, Sailfish, published in Proceedings of the 3rd ACM Symposium on Cloud Computing (SoCC), 2012, proposes aggregating the data of the partition corresponding to each Map Task when writing Shuffle data, and storing that data in a distributed file system. Hadoop-A, published in Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, proposes a network-levitated merge algorithm that executes the Shuffle stage by exploiting the characteristics of a high-speed network (RDMA). However, both methods depend too heavily on network performance and have low read-write efficiency on storage devices.

Therefore, a Shuffle optimization method that can read efficiently from different types of storage devices is needed.

Disclosure of Invention

The invention aims to provide a Shuffle method for heterogeneous storage, which can overcome the defects of the prior art and specifically comprises the following steps:

step 1), forming resource pools from the heterogeneous storage devices according to their media types;

step 2), writing the Shuffle data into the corresponding resource pool according to the load type.

Preferably, the resource pools of step 1) are a resource pool composed of SSDs and a resource pool composed of HDDs.

Preferably, for a load type whose Shuffle phase handles a large amount of data or accounts for a large proportion of the execution time, the data is stored in the resource pool formed by the SSDs.

Preferably, for a load type whose Shuffle phase handles a small amount of data or accounts for a small proportion of the execution time, the data is stored in the resource pool formed by the HDDs.

Preferably, the Shuffle data is read in a manner of directly reading from the corresponding resource pool.

Preferably, for a load type whose Shuffle phase characteristics are unknown, the data is stored across the heterogeneous storage devices in a polling manner.

Preferably, the Shuffle data is then read in a sequential, segmented manner: first from the SSD and then from the HDD, or first from the HDD and then from the SSD.

According to another aspect of the invention, a MapReduce programming method is provided, which comprises adopting the Shuffle method for heterogeneous storage.

According to another aspect of the present invention, there is provided a computer system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to perform the steps as described above.

According to another aspect of the present invention, there is provided a computer readable storage medium comprising a computer program stored on the readable storage medium, wherein the program performs the steps as described above.

Compared with the prior art, the invention has the following beneficial technical effects. In the Shuffle method for heterogeneous storage provided by the invention, storage devices of the same type in a heterogeneous storage environment are combined into resource pools, and the storage devices are selected according to the characteristics of the load, which optimizes the efficiency of storing data in the Shuffle stage. When data is read, a sequential, segmented reading mode is adopted; compared with the traditional random, unordered reading mode, it fully exploits the high-performance storage resources among the heterogeneous storage devices and thereby speeds up the reading of Shuffle data.

Drawings

FIG. 1 is a schematic flow chart of the Shuffle stage.

FIG. 2 shows test results of the Shuffle phase under different storage configurations.

FIG. 3 is a schematic diagram of a method for managing heterogeneous storage resources in a Shuffle phase according to the present invention.

FIG. 4 is a schematic diagram of placing Shuffle data in different storage resources in a polling manner.

FIG. 5 is a schematic diagram of sequentially reading Shuffle data provided by the present invention.

FIG. 6 shows the effect of different data distribution ratios on Sort execution time under the method provided by the present invention.

FIG. 7 shows the percentage improvement achieved by the present invention at different data distribution ratios.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more clearly apparent, the Shuffle method for heterogeneous storage provided in the embodiments of the present invention is described below with reference to the accompanying drawings.

In order to study the performance of the Shuffle phase when reading and writing are performed on different types of storage devices, the inventor takes the Sort application as an example and evaluates the execution time of the application on the different types of storage devices.

FIG. 2 shows the test results of the Shuffle phase under different storage configurations: HS (two storage devices, one a mechanical hard disk (HDD) and the other a solid state disk (SSD)), HH (two HDDs), and SS (two SSDs). As shown in FIG. 2, in the heterogeneous configuration the Shuffle phase is managed with the polling method commonly used in current big data processing platforms, so that Shuffle data is distributed evenly across the underlying storage devices. However, the execution time of HS is close to that of HH, and the performance advantage of the SSD is not fully exploited. Therefore, in a heterogeneous storage scenario, if storage device types are not distinguished, high-performance storage resources are occupied without delivering their benefit, which wastes resources.
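
One way to see why HS can end up close to HH under polling is a deliberately simple model, offered here only as an illustration and not as the inventors' measurement. It assumes that each device receives an equal share of the Shuffle data and that the stage finishes only when the slowest device finishes. The bandwidth figures and data size are hypothetical.

```python
def stage_time(data_mb, device_bandwidths_mb_s):
    """Time for a polling-managed stage under the assumed model:
    equal data share per device, completion gated by the slowest device."""
    share = data_mb / len(device_bandwidths_mb_s)
    return max(share / bw for bw in device_bandwidths_mb_s)

# Hypothetical sequential bandwidths in MB/s (not taken from the patent).
HDD_BW = 150
SSD_BW = 500

configs = {
    "HH": [HDD_BW, HDD_BW],
    "HS": [HDD_BW, SSD_BW],
    "SS": [SSD_BW, SSD_BW],
}
times = {name: stage_time(60_000, bws) for name, bws in configs.items()}

for name, seconds in times.items():
    print(name, seconds)
```

Under these assumptions the HS configuration takes exactly as long as HH, because the HDD's half of the data dominates, while SS is much faster. Real systems overlap I/O with computation, so measured results will differ from this model.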

In order to optimize the performance of the Shuffle phase, the inventor proposes that, in a heterogeneous storage scenario, upper-layer applications of different types should be aware of the types of the underlying storage resources used by the Shuffle phase and allocate storage resources according to their own characteristics, for example according to how strongly the Shuffle phase affects the execution time of the load. Data should then be read from the different heterogeneous storage resources accordingly, so as to exploit the advantages of heterogeneous storage.

In an embodiment of the present invention, a Shuffle method for heterogeneous storage is provided, in which data access in the Shuffle phase is performed according to the storage type. The method includes the following steps:

S10, writing the Shuffle data into storage

In a heterogeneous storage scenario, existing approaches to writing data in the Shuffle stage generally use only polling, which cannot exploit the performance advantage of heterogeneous storage. Therefore, the inventor proposes managing heterogeneous storage resources in the Shuffle stage according to load characteristics: among storage resources of several different types, resources of the same type are combined into a resource pool, and the resource pools are used selectively according to the load characteristics.

Taking heterogeneous storage composed of HDDs and SSDs as an example, FIG. 3 is a schematic diagram of the resource management scheme for heterogeneous storage in the Shuffle phase provided by the present invention. As shown in FIG. 3, according to the method provided by the present invention, all HDDs are combined into an HDD resource pool, and all SSDs are combined into an SSD resource pool.
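
The pooling step can be sketched as a simple grouping by media type. This is a minimal illustration rather than the patented implementation; the `StorageDevice` class, the `build_resource_pools` function, and the mount paths are hypothetical names introduced for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageDevice:
    path: str        # hypothetical local directory used for Shuffle files
    media_type: str  # "SSD" or "HDD"


def build_resource_pools(devices):
    """Group devices into one resource pool per media type."""
    pools = defaultdict(list)
    for device in devices:
        pools[device.media_type].append(device)
    return dict(pools)


# Hypothetical device list for a node with two HDDs and one SSD.
devices = [
    StorageDevice("/mnt/hdd0", "HDD"),
    StorageDevice("/mnt/ssd0", "SSD"),
    StorageDevice("/mnt/hdd1", "HDD"),
]
pools = build_resource_pools(devices)
```

Keeping the pools as plain lists keyed by media type means further media types could be added without changing the grouping logic.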

As is well known, compared with SSDs, HDDs offer large capacity at low cost, whereas SSDs offer high read speed and low power consumption. Given these characteristics, the user can select different storage policies for different load types. For example, for a Shuffle-heavy load, in which the Shuffle stage handles a large volume of data and accounts for a large proportion of the execution time (such as the Sort load), the ALL_SSD policy can be selected to place the Shuffle data on the SSDs, which raises the data read speed and optimizes Shuffle performance. For a Shuffle-light load, in which the Shuffle stage handles a small volume of data and accounts for a small proportion of the execution time, or for a CPU-intensive load, which consumes substantial CPU resources at run time and whose performance bottleneck is generally not in the Shuffle stage (such as the Wordcount and Kmeans loads), the ALL_HDD policy can be selected to place the Shuffle data on the HDDs and save cost.
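
The mapping from load type to storage policy can be expressed as a small lookup. The sketch below is illustrative only: the policy names ALL_SSD, ALL_HDD and Random come from this description, while the load-type labels and the `select_policy` function are hypothetical. How a load is classified is left to the user, as in the text.

```python
# Hypothetical load-type labels mapped to the policies named in the description.
POLICY_BY_LOAD_TYPE = {
    "shuffle_heavy": "ALL_SSD",  # large Shuffle data volume or time share
    "shuffle_light": "ALL_HDD",  # small Shuffle data volume or time share
    "cpu_intensive": "ALL_HDD",  # bottleneck is generally not in the Shuffle stage
}


def select_policy(load_type=None):
    """Return the storage policy for a load type.

    Unknown or unspecified load types fall back to the Random (polling) policy.
    """
    return POLICY_BY_LOAD_TYPE.get(load_type, "Random")


print(select_policy("shuffle_heavy"))
print(select_policy())
```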

In addition, if the user cannot judge the characteristics of the Shuffle stage, the Random policy can be selected. FIG. 4 is a schematic diagram of placing Shuffle data in different storage resources in a polling manner. As shown in FIG. 4, the data is distributed evenly across the underlying storage devices in the conventional polling manner.
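
The write path under the three policies might look like the following sketch. It assumes that partitions are assigned to devices one at a time and that polling simply cycles through the candidate devices; the `ShufflePolicy` enum, both functions, and the device paths are hypothetical.

```python
from enum import Enum
from itertools import cycle


class ShufflePolicy(Enum):
    ALL_SSD = "ALL_SSD"
    ALL_HDD = "ALL_HDD"
    RANDOM = "Random"  # polling across every device, regardless of type


def candidate_devices(policy, pools):
    """Return the devices that a policy is allowed to write to."""
    if policy is ShufflePolicy.ALL_SSD:
        return list(pools["SSD"])
    if policy is ShufflePolicy.ALL_HDD:
        return list(pools["HDD"])
    # Random policy: every device from every pool, in a stable order.
    return [dev for media in sorted(pools) for dev in pools[media]]


def place_partitions(partition_ids, policy, pools):
    """Assign each Shuffle partition to a device by cycling over the candidates."""
    devices = candidate_devices(policy, pools)
    if not devices:
        raise ValueError(f"no devices available for policy {policy.value}")
    return {pid: dev for pid, dev in zip(partition_ids, cycle(devices))}


# Hypothetical pools with one device each.
pools = {"SSD": ["/mnt/ssd0"], "HDD": ["/mnt/hdd0"]}
on_ssd = place_partitions(range(4), ShufflePolicy.ALL_SSD, pools)
polled = place_partitions(range(4), ShufflePolicy.RANDOM, pools)
```

With a single device per pool, the Random policy alternates partitions between the HDD and the SSD, which corresponds to the even distribution described above.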

S20, reading Shuffle data from storage

In the Shuffle stage, when data is read from heterogeneous storage resources, different reading modes can be adopted according to the storage policy. If the data is stored on only one type of storage device, such as HDD or SSD, it can be read directly. If the data is stored on different types of storage devices, for example under polling (round-robin) storage, it can be read in sequential segments.
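
The choice between the two reading modes can be sketched as a check on how many media types hold the data. The function name, the mode labels, and the `partition_media` mapping (partition id to media type) are hypothetical.

```python
def choose_read_mode(partition_media):
    """Pick a reading mode from the media types that hold the Shuffle partitions.

    partition_media maps a partition id to "SSD" or "HDD".
    """
    media_types = set(partition_media.values())
    if len(media_types) <= 1:
        return "direct"
    return "sequential_segmented"


print(choose_read_mode({0: "SSD", 1: "SSD"}))
print(choose_read_mode({0: "HDD", 1: "SSD"}))
```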

Taking heterogeneous storage composed of an HDD and an SSD as an example, FIG. 5 is a schematic diagram of sequentially reading Shuffle data as provided by the present invention. As shown in FIG. 5, when Shuffle data that was stored evenly in a polling manner is to be read, the data is read one device type at a time. For example, data on the SSD is pulled first and then data on the HDD, or data on the HDD is pulled first and then data on the SSD.

With this reading mode, data can be pulled from the SSD without being slowed by the HDD, so the fast reads of the SSD are fully exploited and the advantage of the high-performance storage device in heterogeneous storage is maximized.
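
Sequential segmented reading amounts to grouping the partitions by media type and fetching one group at a time. The sketch below only computes the fetch order and leaves the actual I/O out; `segmented_read_order` and its `first` parameter are hypothetical names.

```python
def segmented_read_order(partition_media, first="SSD"):
    """Split partition ids into read phases, one phase per media type.

    partition_media maps a partition id to its media type ("SSD" or "HDD").
    The media type named by `first` is read in the first phase.
    """
    other_media = sorted(m for m in set(partition_media.values()) if m != first)
    phases = []
    for media in [first] + other_media:
        ids = [pid for pid, m in partition_media.items() if m == media]
        if ids:
            phases.append((media, ids))
    return phases


# Partitions stored by polling across one HDD and one SSD.
partition_media = {0: "HDD", 1: "SSD", 2: "HDD", 3: "SSD"}
ssd_first = segmented_read_order(partition_media, first="SSD")
hdd_first = segmented_read_order(partition_media, first="HDD")
```

Either order satisfies the description; reading the SSD phase first lets the reader consume the fast partitions without waiting on HDD requests.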

In another embodiment of the present invention, the efficiency of sequential segmented reading on heterogeneous storage devices is related to the proportion of data placed on the high-performance storage device. For example, for heterogeneous storage composed of an HDD and an SSD, and taking the Sort load as an example, FIG. 6 shows the influence of different data distribution ratios on the execution time of Sort. As shown in FIG. 6, with sequential segmented reading, the larger the proportion of data on the SSD, the shorter the execution time and the better the performance.

In another embodiment of the present invention, FIG. 7 shows the percentage improvement over the conventional data reading strategy at different data distribution ratios. As shown in FIG. 7, compared with the conventional random, unordered reading mode, the sequential segmented reading adopted by the method of the present invention achieves its maximum percentage improvement when the proportion of data stored on the SSD is 1/2.

Compared with the prior art, the Shuffle method for heterogeneous storage provided by the embodiment of the invention applies the characteristics of the heterogeneous storage to the Shuffle stage, so that the upper layer application can select different storage resources to store data according to the characteristics of load execution; when the Shuffle data is read, the reading sequence is divided into a plurality of stages, and each stage reads the data partition on the same type of storage resource, so that the reading of the Shuffle data is accelerated.

Although the present invention has been described by way of preferred embodiments, the present invention is not limited to the embodiments described herein, and various changes and modifications may be made without departing from the scope of the present invention.

Claims

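1. A Shuffle method for heterogeneous storage, wherein a heterogeneous storage device is composed of an SSD and an HDD, the method comprising the steps of:

step 1), forming resource pools from the heterogeneous storage devices according to their media types;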

step 2), writing the Shuffle data into the corresponding resource pool according to the load type; wherein the load types include: the load type with large quantity of data or large occupied time proportion of the Shuffle stage, the load type with small quantity of data or small occupied time proportion of the Shuffle stage and the load type with unknown characteristics of the Shuffle stage;

for the load type with large quantity of the Shuffle stage data or large occupied time proportion, storing the Shuffle data in the SSD of the heterogeneous storage device;

for the load type with small quantity of the Shuffle stage data or small occupied time proportion, storing the Shuffle data in the HDD of the heterogeneous storage device;

for the load type with unknown characteristics of the Shuffle stage, the Shuffle data is stored in the heterogeneous storage device in a polling mode, and the Shuffle data is read from the SSD and then from the HDD in a sequential and segmented reading mode, or the Shuffle data is read from the HDD and then from the SSD.

2. The Shuffle method for heterogeneous storage according to claim 1, wherein the resource pools of step 1) are a resource pool composed of SSDs and a resource pool composed of HDDs.

3. The Shuffle method for heterogeneous storage according to claim 2, wherein for a load type with a large data amount or a large occupied time proportion in a Shuffle phase, the data is stored in a resource pool formed by the SSD.

4. The Shuffle method for heterogeneous storage according to claim 2, wherein for a load type with a small data amount or a small occupied time proportion in a Shuffle phase, the data is stored in a resource pool formed by the HDDs.

5. The Shuffle method for heterogeneous storage according to claim 3 or 4, wherein Shuffle data is read in a manner of directly reading from a corresponding resource pool.

6. A MapReduce programming method comprising the Shuffle method for heterogeneous storage according to any one of claims 1 to 5.

7. A computer system comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of claim 6 when executing the program.

8. A computer-readable storage medium comprising a computer program stored on the readable storage medium, wherein the program performs the steps of claim 6.