CN108804040B - Hadoop map-reduce calculation acceleration method based on kernel bypass technology - Google Patents


Info

Publication number
CN108804040B
CN108804040B (application CN201810568335.1A)
Authority
CN
China
Prior art keywords
hadoop
kernel bypass
method based
map
reduce
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810568335.1A
Other languages
Chinese (zh)
Other versions
CN108804040A (en)
Inventor
赵继胜 (Zhao Jisheng)
吴宇 (Wu Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI FUDIAN INTELLIGENT TECHNOLOGY Co.,Ltd.
Original Assignee
Shanghai Fudian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Fudian Intelligent Technology Co., Ltd.
Priority to CN201810568335.1A
Publication of CN108804040A
Application granted
Publication of CN108804040B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G06F 3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Hadoop map-reduce computation acceleration method based on kernel bypass technology, which comprises the following steps: 1. improving the read/write speed of the solid-state disk (SSD) through the kernel bypass technique; and 2. reading and writing the network at high speed through the kernel bypass technique. These two high-speed read/write techniques accelerate, respectively, the cache reads/writes and the network I/O of Hadoop map-reduce. The shuffle process is the main consumer of cache resources and network bandwidth in map-reduce computation (see the left side of the abstract figure), and integrating the two high-speed read/write techniques effectively improves the performance of the data-processing process (see the right side of the abstract figure). Because a map-reduce computation consists of multiple iterations, each of which includes a shuffle process, tuning shuffle performance yields a significant performance improvement for the whole map-reduce computation.

Description

Hadoop map-reduce calculation acceleration method based on kernel bypass technology
Technical Field
The invention belongs to the field of information technology, and in particular relates to an I/O performance optimization method based on the operating-system kernel bypass technique, mainly used to improve the computational performance of Hadoop map-reduce.
Background
Apache Hadoop, as a computation engine for big-data processing, is widely used in enterprise, education, scientific research, and other fields. As a parallel-processing engine, Hadoop offers a simple, intuitive programming model and good fault tolerance, so applications can be developed quickly and deployed across massive numbers of compute nodes, greatly improving the productivity of big-data application development. Software frameworks that use Hadoop as their computation engine have also developed rapidly: Spark, Hive, Mahout, and others cover application areas ranging from distributed data warehouses to machine learning. Hadoop is becoming an increasingly important industry standard for big data and parallel processing.
Facing ever-expanding applications, Hadoop as a computation engine inevitably comes under pressure to improve performance, so performance-optimization techniques for the Hadoop computation model are continuously explored and researched in industry and academia. In this patent, we propose using the Intel NVMe [1] protocol in a kernel-bypass manner to improve the performance of cache and network I/O during Hadoop computation, thereby improving the overall efficiency of Hadoop map-reduce-based computation.
Disclosure of Invention
Aiming at the Hadoop map-reduce computation framework, this patent provides a method for improving the performance of the shuffle process within Hadoop map-reduce, thereby improving the overall performance of Hadoop map-reduce computation.
To achieve this, the invention provides a method for improving I/O performance based on kernel bypass and the Intel NVMe protocol. Big-data applications built on Hadoop mainly exploit its distributed parallel-processing capability, which in turn is based on the map-reduce computation model. Map-reduce consists of the following three steps:
1. Map process: the computing task is partitioned by data and placed on different distributed compute nodes (e.g. x86 server nodes), and the nodes compute in parallel;
2. Shuffle process: the result data of the map process is stored on a local storage medium (a mechanical disk or a solid-state disk, SSD), and the data is then sent to other nodes in shuffle form for the reduce process (see map shuffle in figure 1);
3. Reduce process: the data sent by each node is aggregated by a reduce formula (e.g. accumulation or multiplication), and the final result is output (see reduce in figure 1).
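The three steps above can be illustrated with a minimal, self-contained sketch; the word-count example and all function names below are hypothetical illustrations for clarity, not code from the patent:

```python
from collections import defaultdict
from functools import reduce

def map_phase(documents):
    # Map: each "node" emits (key, value) pairs from its own input split
    return [[(word, 1) for word in doc.split()] for doc in documents]

def shuffle_phase(mapped):
    # Shuffle: group intermediate pairs by key; in a real cluster this is
    # where results hit local storage and cross the network to other nodes
    groups = defaultdict(list)
    for partition in mapped:
        for key, value in partition:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: fold each key's values with an aggregation formula (here, sum)
    return {key: reduce(lambda a, b: a + b, vals) for key, vals in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["a b a", "b c"])))
```

Every iteration of a larger map-reduce job repeats this pipeline, which is why the shuffle step in the middle dominates cache and network costs.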
The performance improvement made by this patent is: optimizing the two I/O operations in the shuffle process, namely writing to the local storage medium and distributing data to the different nodes for the reduce process.
For improving storage-medium read/write performance, the solid-state disk (SSD) is read and written efficiently using the kernel-bypass NVMe device access mode, avoiding the extra latency and memory overhead introduced by the operating-system kernel;
For improving network transmission performance, network data is transmitted efficiently using a kernel-bypass IP network communication mode based on the NVMe protocol, avoiding the operating-system kernel accesses of the traditional TCP protocol stack.
In the implementation section of the present invention, we describe in detail how to use the SPDK [2] and DPDK [3] function libraries (with their software packages and drivers) to achieve efficient NVMe-based read/write performance improvements for the SSD and the IP network, respectively.
Drawings
FIG. 1 is a basic illustration of the Hadoop map-reduce process, in which the storage medium is an SSD.
FIG. 2 compares storage I/O with and without kernel bypass (the left diagram shows the conventional storage-medium access mode; the right diagram shows the kernel-bypass access mode).
FIG. 3 compares network I/O with and without kernel bypass (the left diagram shows the conventional network access mode; the right diagram shows the kernel-bypass access mode).
Citations of documents
[1] http://nvmexpress.org/wp-content/uploads/2013/04/NVM_whitepaper.pdf
[2] https://software.intel.com/en-us/articles/introduction-to-the-storage-performance-development-kit-spdk
[3]https://software.intel.com/en-us/networking/dpdk
Detailed Description
As shown in fig. 1, the present invention is specifically implemented as follows:
1. The overall I/O optimization introduced by this patent covers, under the Hadoop framework, both the I/O optimization of writing map result data to the storage medium and the network I/O optimization of distributing shuffle data to the reduce process. By replacing part of Hadoop's original I/O functions, I/O acceleration based on kernel bypass and the NVMe driver is integrated into Hadoop, improving performance during map-reduce computation (see figures 2 and 3);
2. The I/O optimization of this patent comprises the following steps:
a. In the Hadoop operation initialization stage, detect whether the distributed system supports the NVMe driver; if not, use the standard Hadoop I/O mode. Pseudocode is as follows:
(pseudocode figure in the original publication; not reproduced in this text)
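The pseudocode for this detection step appears only as an image in the source. A hypothetical sketch of the control flow (the `/dev` scan heuristic and both backend names are assumptions, not the patent's actual logic) might look like:

```python
import os

def nvme_supported():
    # Heuristic: check whether any NVMe block device is visible to the OS.
    # (A real deployment would also verify that the user-space driver binds.)
    try:
        return any(name.startswith("nvme") for name in os.listdir("/dev"))
    except OSError:
        return False

def select_io_backend():
    # Fall back to the standard Hadoop I/O path when kernel-bypass NVMe
    # access is unavailable, as the initialization step describes.
    return "kernel-bypass-nvme" if nvme_supported() else "standard-hadoop-io"
```

The important property is the graceful fallback: the job runs either way, and only the I/O backend differs.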
b. In the process of writing map results to storage, adopt the kernel-bypass data read/write mechanism implemented by SPDK (see figure 2); the code does not need to be modified, only the storage I/O back-end module needs to be replaced with the SPDK implementation that supports kernel-bypass operation;
c. When data is distributed in the shuffle process, adopt the kernel-bypass network I/O mechanism implemented by DPDK (see figure 3);
d. In steps b and c, if an I/O error occurs, retry; the number of retries depends on a preset threshold, and once it is exceeded, the I/O exception handling provided by Hadoop takes over.
3. Data exchange processing: the Hadoop engine is Java-based, while the SPDK/DPDK function libraries are written in C/C++, so data must be exchanged between the Java virtual machine and the heap memory of the Linux process (through the Java Native Interface, JNI) so that Hadoop can write data into the SPDK/DPDK cache for further I/O; a dedicated data-conversion module is therefore provided for Hadoop. The pseudocode of the data read/write mechanism is as follows (taking an SSD write as an example):
(pseudocode figures in the original publication; not reproduced in this text)
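The data read/write pseudocode appears only as images in the source. The following hypothetical sketch illustrates the idea the text describes: copying data out of the managed (JVM) heap into a native buffer before the user-space driver consumes it. The buffer size, function names, and the list standing in for the device queue are all assumptions:

```python
NATIVE_BUFFER_SIZE = 4096  # assumed size of the driver's DMA-able buffer

def copy_to_native(jvm_data: bytes, native_buffer: bytearray) -> int:
    # Mimics the JNI step: move bytes from the JVM heap into a pinned
    # native buffer that the SPDK-style driver can read directly.
    n = min(len(jvm_data), len(native_buffer))
    native_buffer[:n] = jvm_data[:n]
    return n

def write_via_bypass(jvm_data: bytes, device_queue: list) -> int:
    # Stand-in for the kernel-bypass SSD write: "device_queue" is a plain
    # list collecting the chunks the driver would submit to the NVMe queue.
    buf = bytearray(NATIVE_BUFFER_SIZE)
    written = 0
    for off in range(0, len(jvm_data), NATIVE_BUFFER_SIZE):
        n = copy_to_native(jvm_data[off:off + NATIVE_BUFFER_SIZE], buf)
        device_queue.append(bytes(buf[:n]))
        written += n
    return written
```

The single copy from heap to native buffer is the cost the conversion module pays once, in exchange for skipping the kernel's own buffering on every subsequent write.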
4. Error detection and handling strategy: read/write operations on the SSD and the IP network are implemented in the SPDK/DPDK function-library layer, written in C. A two-layer handling strategy is adopted for possible I/O faults (such as read/write timeouts or device-state errors): multiple attempts are made at the C level that calls the SPDK/DPDK function libraries, and when the number of failed attempts exceeds a preset threshold, the error is returned to the Java operation layer as a Java exception. This avoids the extra overhead of returning to the Java layer repeatedly during retries. Pseudocode is as follows:
(pseudocode figure in the original publication; not reproduced in this text)
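The error-handling pseudocode appears only as an image in the source. A hypothetical sketch of the two-layer strategy (retry at the driver-call level, then surface a single Java-style exception), with all names and the threshold value assumed:

```python
MAX_RETRIES = 3  # assumed preset threshold

class HadoopIOException(Exception):
    """Stand-in for the Java-level exception raised once retries are exhausted."""

def submit_with_retry(io_op, retries=MAX_RETRIES):
    # First layer: retry at the (simulated) driver level, so transient
    # faults never cross back into the Java operation layer.
    last_error = None
    for _ in range(retries):
        try:
            return io_op()
        except IOError as err:
            last_error = err
    # Second layer: surface the failure once, as a single exception,
    # instead of bouncing each failed attempt up to the Java layer.
    raise HadoopIOException(f"I/O failed after {retries} attempts") from last_error
```

Keeping the retry loop below the language boundary is the point: only one crossing back to Java happens per logical I/O, whether it succeeds or finally fails.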

Claims (7)

1. A Hadoop map-reduce computation acceleration method based on kernel bypass technology, comprising the following steps:
step one: in the Hadoop operation initialization stage, detecting whether the distributed system supports the NVMe driver, and if not, using the standard Hadoop I/O mode;
step two: adopting the kernel-bypass data read/write mechanism implemented by SPDK in the process of writing map results to storage, wherein the code does not need to be modified; only the storage I/O back-end module needs to be replaced with the SPDK implementation supporting kernel-bypass operation;
step three: adopting the kernel-bypass network I/O mechanism implemented by DPDK when data is distributed in the shuffle process;
step four: in steps two and three, if an I/O error occurs, retrying, wherein the number of retries depends on a preset threshold, and the I/O exception handling provided by Hadoop is invoked after the threshold is exceeded;
wherein the I/O exception handling comprises: read/write operations on the SSD and the IP network are implemented in the SPDK/DPDK function-library layer written in C, and a two-layer handling strategy is adopted for I/O exceptions: multiple attempts are made at the C level that calls the SPDK/DPDK function libraries, and when the number of failed attempts exceeds the preset threshold, the error is returned to the Java operation layer in the form of a Java exception.
2. The Hadoop map-reduce computation acceleration method based on kernel bypass technology as claimed in claim 1, characterized by establishing an efficient storage and network I/O mechanism.
3. The Hadoop map-reduce computation acceleration method based on kernel bypass technology as claimed in claim 1, characterized in that a data caching and distribution mechanism for device I/O in kernel-bypass mode based on the Intel NVMe driver is established.
4. The Hadoop map-reduce computation acceleration method based on kernel bypass technology as claimed in claim 1, characterized in that Apache Hadoop is extended by means of a software plug-in.
5. The Hadoop map-reduce computation acceleration method based on kernel bypass technology as claimed in claim 1, characterized in that I/O error processing is divided into two layers: trial and error is first performed at the driver processing layer, and when the number of attempts exceeds a preset threshold, control returns to Hadoop's Java-based exception-handling layer.
6. The Hadoop map-reduce computation acceleration method based on kernel bypass technology as claimed in claim 5, characterized in that the I/O acceleration is implemented using the open-source SPDK and DPDK function libraries provided by Intel.
7. The Hadoop map-reduce computation acceleration method based on kernel bypass technology as claimed in claim 5, characterized in that the extension of Apache Hadoop is implemented by adopting an extended software interface to add support for device I/O in kernel-bypass mode based on the Intel NVMe driver.
CN201810568335.1A 2018-06-05 2018-06-05 Hadoop map-reduce calculation acceleration method based on kernel bypass technology Active CN108804040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810568335.1A CN108804040B (en) 2018-06-05 2018-06-05 Hadoop map-reduce calculation acceleration method based on kernel bypass technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810568335.1A CN108804040B (en) 2018-06-05 2018-06-05 Hadoop map-reduce calculation acceleration method based on kernel bypass technology

Publications (2)

Publication Number Publication Date
CN108804040A CN108804040A (en) 2018-11-13
CN108804040B true CN108804040B (en) 2020-07-07

Family

ID=64087152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810568335.1A Active CN108804040B (en) 2018-06-05 2018-06-05 Hadoop map-reduce calculation acceleration method based on kernel bypass technology

Country Status (1)

Country Link
CN (1) CN108804040B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653476A (en) * 2014-11-12 2016-06-08 华为技术有限公司 Communication method between data processor and memory equipment, and related device
CN106506513A (en) * 2016-11-21 2017-03-15 国网四川省电力公司信息通信公司 Firewall policy data analysis set-up and method based on network traffics
CN107480080A (en) * 2017-07-03 2017-12-15 香港红鸟科技股份有限公司 A kind of Zero-copy data stream based on RDMA

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170180272A1 (en) * 2012-10-03 2017-06-22 Tracey Bernath System and method for accelerating network applications using an enhanced network interface and massively parallel distributed processing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653476A (en) * 2014-11-12 2016-06-08 华为技术有限公司 Communication method between data processor and memory equipment, and related device
CN106506513A (en) * 2016-11-21 2017-03-15 国网四川省电力公司信息通信公司 Firewall policy data analysis set-up and method based on network traffics
CN107480080A (en) * 2017-07-03 2017-12-15 香港红鸟科技股份有限公司 A kind of Zero-copy data stream based on RDMA

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JVM-Bypass for Efficient Hadoop Shuffling; Yandong Wang et al.; 2013 IEEE 27th International Symposium on Parallel & Distributed Processing; 2013-12-31; pp. 569-578 *

Also Published As

Publication number Publication date
CN108804040A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
US10496597B2 (en) On-chip data partitioning read-write method, system, and device
KR102028252B1 (en) Autonomous memory architecture
US10025533B2 (en) Logical block addresses used for executing host commands
Qin et al. How to apply the geospatial data abstraction library (GDAL) properly to parallel geospatial raster I/O?
CN103761988A (en) SSD (solid state disk) and data movement method
CN103647850A (en) Data processing method, device and system of distributed version control system
CN116561051B (en) Hardware acceleration card and heterogeneous computing system
KR20110028212A (en) Autonomous subsystem architecture
CN111444134A (en) Parallel PME (pulse-modulated emission) accelerated optimization method and system of molecular dynamics simulation software
US9342564B2 (en) Distributed processing apparatus and method for processing large data through hardware acceleration
CN115033188A (en) Storage hardware acceleration module system based on ZNS solid state disk
CN112200310B (en) Intelligent processor, data processing method and storage medium
US10061747B2 (en) Storage of a matrix on a storage compute device
CN113869495A (en) Method, device and equipment for optimizing convolutional weight layout of neural network and readable medium
CN104598409A (en) Method and device for processing input and output requests
CN108804040B (en) Hadoop map-reduce calculation acceleration method based on kernel bypass technology
CN116074179B (en) High expansion node system based on CPU-NPU cooperation and training method
Li et al. Dual buffer rotation four-stage pipeline for CPU–GPU cooperative computing
US10289447B1 (en) Parallel process scheduling for efficient data access
Ali et al. A New Merging Numerous Small Files Approach for Hadoop Distributed File System
KR20210108487A (en) Storage Device Behavior Orchestration
US9569280B2 (en) Managing resource collisions in a storage compute device
US9183211B1 (en) Cooperative storage of shared files in a parallel computing system with dynamic block size
US11550718B2 (en) Method and system for condensed cache and acceleration layer integrated in servers
US20220107844A1 (en) Systems, methods, and devices for data propagation in graph processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200410

Address after: 200433, No. 15, No. 323, National Road, Shanghai, Yangpu District (centrally registered)

Applicant after: SHANGHAI FUDIAN INTELLIGENT TECHNOLOGY Co.,Ltd.

Address before: 200082 the 15 level (323) of Guo Ding Road, Yangpu District, Shanghai.

Applicant before: SHANGHAI FUDIAN INTELLIGENT TECHNOLOGY Co.,Ltd.

Applicant before: Zhao Jisheng

Applicant before: Wu Yu

GR01 Patent grant