CN105653708A

CN105653708A - Hadoop matrix processing method and system of heterogeneous cluster

Info

Publication number: CN105653708A
Application number: CN201511028067.7A
Authority: CN
Inventors: 刘勇; 喻之斌; 须成忠
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2015-12-31
Filing date: 2015-12-31
Publication date: 2016-06-08

Abstract

A Hadoop matrix processing method for a heterogeneous cluster comprises the following steps: a physical cluster is built, and one Master node and multiple Slaver nodes are set up; a programming environment under the Java development environment is configured on the Master node and on each of the Slaver nodes, and Map and Reduce code for a CUDA version of matrix multiplication is written in advance; the relevant information of a first matrix A and a second matrix B stored in memory is read, and a MapReduce matrix multiplication operation is performed on the stored first matrix A and second matrix B according to the prewritten code; and the operation result is written directly into the distributed file system HDFS, where A = (a_ij) is an m × s matrix and B = (b_ij) is an s × n matrix. The method improves the limited performance of Hadoop matrix multiplication at the algorithm level, improves program performance at a deeper level, and effectively improves the efficiency of the matrix multiplication operation.

Description

Hadoop matrix processing method and system for a heterogeneous cluster

Technical field

The invention belongs to the technical field of data processing, and in particular relates to a Hadoop matrix processing method and system for a heterogeneous cluster.

Background art

High-order matrix operations are widely used in key fields such as industry and science and technology, from image processing and data mining to biological computing, and matrix multiplication is one of the most important matrix operations. As matrices grow in scale, however, it becomes difficult to complete a matrix multiplication within a short time. Classical matrix multiplication uses either serial processing on a single node or parallel processing on a GPU. Although these schemes improve performance to some extent, they are not suitable for massive data processing. Hadoop is a distributed framework capable of processing big data and is the most popular open-source implementation of the MapReduce programming model. It simplifies data distribution, processing, computation and task scheduling, and offers high fault tolerance, high reliability, high scalability and high resource utilization. The programmer only needs to write the Map and Reduce functions; Hadoop automatically assigns the tasks to the nodes of the cluster and executes them, thereby achieving data parallelism. The paper "Big data multiplication processing method based on Hadoop" (Sun Yuanshuai, Chen, Guan Xinjun, Lin Chen) proposes implementing MapReduce matrix multiplication with an inner-product method and an outer-product method.

However: (1) For massive data processing applications, the performance of Hadoop is unsatisfactory. Such applications have two characteristics, being both compute-intensive and data-intensive, whereas Hadoop is mainly suited to data-intensive applications. (2) With the inner-product method, MapReduce can finish the work in a single job, but the intermediate output of the Map stage is very large: the Hadoop framework has to write the intermediate results to local disk in the Map stage, and the Shuffle stage has to copy the intermediate results of the corresponding partition. The method is therefore seldom used in practice. The outer-product method splits the original job into two at the cost of some parallel granularity, which reduces the volume of intermediate results; however, the output of the first job is the input of the second, so the second job cannot start until the first has completed.

Summary of the invention

In view of the above deficiencies of the prior art, the present invention provides a Hadoop matrix processing method for a heterogeneous cluster that effectively improves the efficiency of Hadoop matrix multiplication.

Embodiments of the invention provide a Hadoop matrix processing method for a heterogeneous cluster, comprising the following steps:

building a physical cluster, and setting up one Master node and multiple Slaver nodes;

configuring a programming environment under the Java development environment on the Master node and on each of the Slaver nodes, and writing in advance the Map and Reduce code of a CUDA version of matrix multiplication;

reading the relevant information of a first matrix A and a second matrix B stored in memory, and performing a MapReduce matrix multiplication operation on the stored first matrix A and second matrix B according to the prewritten code;

writing the operation result directly into the distributed file system HDFS;

where A = (a_ij) is an m × s matrix and B = (b_ij) is an s × n matrix.

Preferably, the programming environment under the Java development environment comprises the Java development kit JDK, Hadoop, CUDA (the programming environment for NVIDIA GPUs), JCuda, and Ganglia;

where JCuda provides an API through which Java accesses CUDA directly, and Ganglia monitors the CPU, memory, network and disk utilization of the cluster in real time.

Preferably, the first matrix A and the second matrix B are stored in a triple (three-column) format, the columns being i, j and a_i^T b_j;

where a_i^T is the i-th row of the first matrix A, and b_j is the j-th column of the second matrix B.

Preferably, the MapReduce matrix multiplication operation specifically comprises:

in the Map stage, obtaining emit((i, j), a_i^T · b_j) according to the prewritten code, where a_i^T \cdot b_j = \sum_{k=1}^{s} a_{ik} b_{kj}; in the Reduce stage, taking the result of the Map stage directly.

Preferably, after the step of writing the operation result directly into the distributed file system HDFS, the method further comprises the step of:

building a Web server to display the speedup of the program and the software and hardware configuration information of the physical cluster.

Preferably, if the number of data items to be processed in the Reduce stage is zero, the intermediate output of the Map stage is written directly into the distributed file system HDFS.

Preferably, before the first matrix A and the second matrix B are stored in the triple format, the first matrix A and the second matrix B are first preprocessed, and the relevant information in the first matrix A and the second matrix B is collected according to the triple storage format.

Embodiments of the invention also provide a Hadoop matrix processing system for a heterogeneous cluster, the processing system comprising:

an environment building unit, for building a physical cluster and setting up one Master node and multiple Slaver nodes;

a configuration and code prewriting unit, for configuring the programming environment under the Java development environment on the nodes and writing in advance the Map and Reduce code of a CUDA version of matrix multiplication;

a storage unit, for storing the information of the matrices to be multiplied;

an execution unit, for reading the matrix information stored in the storage unit and performing a MapReduce matrix multiplication operation on the stored matrices according to the prewritten code;

an output unit, for writing the operation result directly into the distributed file system HDFS.

Preferably, the processing system further comprises a performance monitoring and display unit, for displaying the speedup of the program and the software and hardware configuration information of the physical cluster.

Preferably, the information of the two matrices on which the MapReduce matrix multiplication operation is performed according to the prewritten code is stored in the storage unit in a triple format, the columns being i, j and a_i^T b_j;

where a_i^T is the i-th row of the first matrix, and b_j is the j-th column of the second matrix.

In the above technical scheme, one Master node and multiple Slaver nodes process the Hadoop matrix multiplication task in parallel, and the prewritten Map and Reduce code of the CUDA version of matrix multiplication lets the GPU accelerate the task. The limited performance of Hadoop matrix multiplication is thus improved at the algorithm level, program performance is improved at a deeper level, and the efficiency of the matrix multiplication operation is effectively increased.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the Hadoop matrix processing method for a heterogeneous cluster according to an embodiment of the present invention.

Fig. 2 is a structural block diagram of the Hadoop matrix processing system for a heterogeneous cluster according to an embodiment of the present invention.

Fig. 3 is an architecture diagram of a Hadoop matrix processing system for a heterogeneous cluster according to the present invention.

Detailed description of the embodiments

To make the technical problem solved by the invention, its technical scheme and its beneficial effects clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

As shown in Fig. 1, embodiments of the invention provide a Hadoop matrix multiplication algorithm for a heterogeneous cluster, comprising the following steps:

Step S100: build a physical cluster, and set up one Master node and multiple Slaver nodes;

Step S200: configure the programming environment under the Java development environment on the Master node and on each of the Slaver nodes, and write in advance the Map and Reduce code of a CUDA version of matrix multiplication;

Step S300: read the relevant information of the first matrix A and the second matrix B stored in memory, and perform a MapReduce matrix multiplication operation on the stored first matrix A and second matrix B according to the prewritten code;

Step S400: write the operation result directly into the distributed file system HDFS.

Preferably, in step S200, the Java development kit JDK, Hadoop, CUDA (the programming environment for NVIDIA GPUs), JCuda and Ganglia are deployed, installed and configured on the Master node and on every Slaver node. JCuda provides an API through which Java accesses CUDA directly, and Ganglia monitors the CPU, memory, network and disk utilization of the cluster in real time.

Hadoop is implemented in Java, whereas GPU code is written in CUDA (NVIDIA GPUs) or OpenCL (AMD GPUs). For Hadoop tasks to run seamlessly on a GPU, the problem of integrating the two kinds of code must be solved. Hadoop provides two programming interfaces, Pipes and Streaming, to support other programming languages, and Java itself offers JNI for the same purpose. JCuda connects the CUDA runtime and driver APIs with Java, so that a Java program can call GPU resources.

Pipes wraps a process and passes key-value pairs over a socket, which incurs considerable network transmission overhead and makes the program hard to debug. Streaming wraps a process and passes key-value pairs through the standard input and output streams; network transmission is again the main performance bottleneck, but testing the program is simple. JNI programming is complex and not very practical to implement. Weighing programming difficulty, program performance and debugging difficulty, this scheme therefore adopts JCuda.

Because the Map stage of Hadoop processes data one line at a time, storing the data as a two-dimensional table would mean that reading a single row or column of the matrix requires reading the whole matrix into memory and then extracting the required elements, which greatly reduces performance. For matrix computation on Hadoop, the triple storage format of Table 1 is therefore adopted.

Table 1: triple storage format of a matrix

rowIndex	colIndex	value
…	…	…
i	j	a_ij
…	…	…
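To illustrate the layout of Table 1, here is a minimal Python sketch that turns a dense matrix into (rowIndex, colIndex, value) records and recovers one row from them. The function names `to_triples` and `row_from_triples` are hypothetical helpers introduced for this illustration only; the patent does not prescribe an implementation, and indices are assumed to be 1-based as in the table.

```python
from typing import Iterable, Iterator, List, Tuple

Triple = Tuple[int, int, float]


def to_triples(matrix: List[List[float]]) -> Iterator[Triple]:
    """Yield one (rowIndex, colIndex, value) record per matrix element (1-based)."""
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            yield (i, j, value)


def row_from_triples(triples: Iterable[Triple], i: int) -> List[float]:
    """Rebuild row i by filtering on rowIndex; no full 2-D table is needed."""
    entries = sorted((c, v) for (r, c, v) in triples if r == i)
    return [v for _, v in entries]
```

Each record is self-describing, so a line-oriented Map task can process any record without knowing the rest of the matrix.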

MapReduce matrix multiplication:

When the inner-product method is used for matrix multiplication, the elements of the result matrix C = AB are computed independently of one another, so a parallel granularity of m × n can be reached. The MapReduce flow is as follows:

Map:

For each element a_ij of matrix A, i.e. the record (i, j, a_ij), emit((i, k), a_ij) for k ∈ [1, n].

For each element b_jk of matrix B, i.e. the record (j, k, b_jk), emit((i, k), b_jk) for i ∈ [1, m].

Reduce:

For each key (i, k), calculate the value:

c_{ik} = \sum_{j=1}^{s} a_{ij} b_{jk}

The calculation flow shows that the intermediate output consists of m × s × n elements of matrix A and s × n × m elements of matrix B: each element of A is replicated n times and each element of B is replicated m times. The intermediate output is therefore expanded by a factor of roughly m relative to the original matrix data, which causes a huge network transmission overhead in the Shuffle stage.
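The inner-product flow and its intermediate data volume can be reproduced with a small in-memory simulation. This is a sketch under stated assumptions, not Hadoop code: the function name `inner_product_mapreduce` is hypothetical, and each emitted value carries a matrix tag and the index j so that the reducer can pair a_ij with b_jk (the text above only shows the bare values).

```python
from collections import defaultdict


def inner_product_mapreduce(a_triples, b_triples, m, n):
    """Simulate Map, Shuffle and Reduce; return (result, intermediate record count)."""
    intermediate = []
    # Map: every a_ij is replicated to the n keys (i, 1..n).
    for (i, j, a_ij) in a_triples:
        for k in range(1, n + 1):
            intermediate.append(((i, k), ("A", j, a_ij)))
    # Map: every b_jk is replicated to the m keys (1..m, k).
    for (j, k, b_jk) in b_triples:
        for i in range(1, m + 1):
            intermediate.append(((i, k), ("B", j, b_jk)))

    # Shuffle: group the values by key (i, k).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: c_ik is the sum over j of a_ij * b_jk.
    result = {}
    for key, values in groups.items():
        a_vals = {j: v for tag, j, v in values if tag == "A"}
        b_vals = {j: v for tag, j, v in values if tag == "B"}
        result[key] = sum(a_vals[j] * b_vals[j] for j in a_vals if j in b_vals)
    return result, len(intermediate)
```

For a 2 × 3 matrix A and a 3 × 2 matrix B the simulation shuffles 2 × m × s × n = 24 records to produce only 4 output elements, which is the blow-up the paragraph above describes.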

Therefore, as a preferred scheme, in step S300 the first matrix A and the second matrix B of the present invention are stored in a triple format, the columns being i, j and a_i^T b_j.

Specifically, the first matrix A and the second matrix B are stored in the triple format shown in Table 2. Before they are stored in this format, the first matrix A and the second matrix B are first preprocessed, and the relevant information in the two matrices is collected according to the triple storage format.

Table 2: data storage format after preprocessing

rowIndex	colIndex	value
…	…	…
i	j	a_i^T b_j
…	…	…
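One possible reading of Table 2 is that each record carries the i-th row of A together with the j-th column of B, so that a Map task has everything it needs for one output element. The sketch below follows that reading; `preprocess` is a hypothetical helper, and the exact serialization of the value column is an assumption, since the patent does not specify it.

```python
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[int, int, Tuple[float, ...], Tuple[float, ...]]


def preprocess(a: List[Sequence[float]], b: List[Sequence[float]]) -> Iterator[Record]:
    """Yield (i, j, a_i^T, b_j) for every output position of C = AB (1-based)."""
    m, s, n = len(a), len(b), len(b[0])
    if any(len(row) != s for row in a):
        raise ValueError("A must have as many columns as B has rows")
    # Transpose B once so that each column b_j is available as a tuple.
    columns = [tuple(b[k][j] for k in range(s)) for j in range(n)]
    for i in range(m):
        for j in range(n):
            yield (i + 1, j + 1, tuple(a[i]), columns[j])
```

The preprocessed input is larger than the original matrices (m × n records of 2s values each), but it is read once from storage rather than passed through the Shuffle stage.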

Further, in step S300, the MapReduce matrix multiplication operation specifically comprises:

In the Map stage, emit((i, j), a_i^T · b_j) is obtained according to the prewritten code, where

a_i^T \cdot b_j = \sum_{k=1}^{s} a_{ik} b_{kj}

In the Reduce stage, the result of the Map stage is taken directly, so the Reduce stage requires no operation.

The calculation flow shows that, although the original data volume is large, this has no effect on the program, and the intermediate output consists of only m × n matrix elements. The intermediate data is only 1/(2s) of that of the previous scheme (m × n elements versus 2 × m × s × n, where s is the shared dimension of matrices A and B), so both the disk I/O overhead of the Map stage and the network I/O overhead of the Shuffle stage before Reduce are significantly reduced.
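The Map-only computation and the 1/(2s) ratio can be checked with a short sketch. The names `map_record`, `map_only_multiply` and `intermediate_ratio` are hypothetical, the input records are assumed to have the form (i, j, a_i^T, b_j), and the dot product runs on the CPU here, whereas the patent hands this part to the GPU through JCuda.

```python
from fractions import Fraction


def map_record(record):
    """Map: emit ((i, j), a_i^T . b_j) for one preprocessed record."""
    i, j, row, col = record
    return ((i, j), sum(x * y for x, y in zip(row, col)))


def map_only_multiply(records):
    """Collect the Map output; no Reduce step is applied."""
    return dict(map_record(r) for r in records)


def intermediate_ratio(m, s, n):
    """Intermediate elements of this scheme (m*n) over the inner-product scheme (2*m*s*n)."""
    return Fraction(m * n, 2 * m * s * n)
```

For m = 2, s = 3, n = 2 the ratio is 1/6, that is 4 intermediate elements instead of 24.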

Furthermore, Hadoop itself optimizes MapReduce for this case: if the number of data items to be processed in the Reduce stage is zero, the intermediate output of the Map stage is written directly into the distributed file system HDFS, which improves performance considerably.
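The effect of skipping the Reduce stage can be pictured with a toy job runner. This is an in-memory sketch, not Hadoop: `run_job` and `dot_mapper` are hypothetical names, a dictionary of file names stands in for HDFS, and the `part-m-` and `part-r-` prefixes merely mirror Hadoop's usual naming of map and reduce output files. In Hadoop's Java API a Map-only job is typically requested with `Job.setNumReduceTasks(0)`.

```python
from collections import defaultdict


def dot_mapper(record):
    """Map one (i, j, a_i^T, b_j) record to ((i, j), dot product)."""
    i, j, row, col = record
    return ((i, j), sum(x * y for x, y in zip(row, col)))


def run_job(splits, mapper, num_reducers=0, reducer=None):
    """Return {output file name: list of records} for a simulated job."""
    outputs = {}
    if num_reducers == 0:
        # Map-only: each map task writes its output file directly, with no shuffle.
        for idx, split in enumerate(splits):
            outputs[f"part-m-{idx:05d}"] = [mapper(rec) for rec in split]
        return outputs
    if reducer is None:
        raise ValueError("a reducer is required when num_reducers > 0")
    # Otherwise map output is partitioned by key and shuffled to the reducers.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for rec in split:
            key, value = mapper(rec)
            partitions[hash(key) % num_reducers][key].append(value)
    for idx, part in enumerate(partitions):
        outputs[f"part-r-{idx:05d}"] = [reducer(k, vs) for k, vs in sorted(part.items())]
    return outputs
```

With `num_reducers=0` the map results never pass through the partition and shuffle path at all, which is where the saving comes from.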

More preferably, with reference to Fig. 2, after step S400, in which the operation result is written directly into the distributed file system HDFS, the method further comprises step S500: building a Web server to display the speedup of the program and the software and hardware configuration information of the physical cluster.
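A very small version of such a status page could look like the following sketch, which uses only the Python standard library. Everything here is an assumption made for illustration: the patent does not describe the Web server, the speedup is taken to be baseline run time divided by accelerated run time, and the names `speedup`, `status_payload`, `StatusHandler` and the JSON field names are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler


def speedup(baseline_seconds, accelerated_seconds):
    """Speedup ratio: baseline run time divided by accelerated run time."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


def status_payload(baseline_seconds, accelerated_seconds, cluster_config):
    """Serialize the speedup and the cluster configuration as JSON."""
    return json.dumps(
        {
            "speedup": speedup(baseline_seconds, accelerated_seconds),
            "cluster": cluster_config,
        },
        sort_keys=True,
    )


class StatusHandler(BaseHTTPRequestHandler):
    """Serve the current payload on every GET request."""

    payload = "{}"  # set by the caller before the server is started

    def do_GET(self):
        body = self.payload.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To serve it, one would assign `StatusHandler.payload` and start `http.server.HTTPServer(("", 8080), StatusHandler).serve_forever()`; the port number is arbitrary.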

Implementing matrix multiplication with the inner-product method produces a large volume of intermediate data. This scheme preprocesses the data at the source of the matrix multiplication, which significantly reduces the intermediate data of the job without reducing the degree of parallelism.

As shown in Fig. 3, embodiments of the invention also provide a Hadoop matrix processing system for a heterogeneous cluster, comprising:

An environment building unit 001, for building a physical cluster and setting up one Master node and multiple Slaver nodes.

A configuration and code prewriting unit 002, for configuring the programming environment under the Java development environment on the nodes and writing in advance the Map and Reduce code of a CUDA version of matrix multiplication.

The Java development kit JDK, Hadoop, CUDA (the programming environment for NVIDIA GPUs), JCuda and Ganglia are deployed, installed and configured on the Master node and on every Slaver node. JCuda provides an API through which Java accesses CUDA directly, and Ganglia monitors the CPU, memory, network and disk utilization of the cluster in real time.

A storage unit 003, for storing the information of the matrices to be multiplied.

Preferably, the first matrix A and the second matrix B of the present invention are stored in a triple format, the columns being i, j and a_i^T b_j.

An execution unit 004, for reading the matrix information stored in the storage unit and performing a MapReduce matrix multiplication operation on the stored matrices according to the prewritten code.

An output unit 005, for writing the operation result directly into the distributed file system HDFS.

Further, the processing system also comprises a performance monitoring and display unit 006, for displaying the speedup of the program and the software and hardware configuration information of the physical cluster.

The Hadoop matrix processing method and system for a heterogeneous cluster provided by the embodiments of the present invention have the following advantages:

(1) The program adopts JCuda, which performs better than Pipes and Streaming, is easier to program than JNI, and also makes the program convenient to debug and test.

(2) The matrix multiplication is optimized at the application level according to the performance bottlenecks of the Hadoop framework itself, while the computation-heavy part is handed to the GPU, which improves program performance at a deeper level.

(3) An acceleration scheme for Hadoop matrix multiplication is proposed from the perspective of system architecture.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A Hadoop matrix processing method for a heterogeneous cluster, characterized by comprising the following steps: building a physical cluster, and setting up one Master node and multiple Slaver nodes;

writing the operation result directly into the distributed file system HDFS;

2. The Hadoop matrix processing method for a heterogeneous cluster according to claim 1, characterized in that:

the programming environment under the Java development environment comprises the Java development kit JDK, Hadoop, CUDA (the programming environment for NVIDIA GPUs), JCuda, and Ganglia;

3. The Hadoop matrix processing method for a heterogeneous cluster according to claim 1, characterized in that:

the first matrix A and the second matrix B are stored in a triple format, the columns being i, j and a_i^T b_j;

4. The Hadoop matrix processing method for a heterogeneous cluster according to claim 3, characterized in that the MapReduce matrix multiplication operation specifically comprises:

in the Reduce stage, taking the result of the Map stage directly.

5. The Hadoop matrix processing method for a heterogeneous cluster according to claim 1, characterized in that, after the step of writing the operation result directly into the distributed file system HDFS, the method further comprises the step of,

6. The Hadoop matrix processing method for a heterogeneous cluster according to claim 4, characterized in that, if the number of data items to be processed in the Reduce stage is zero, the intermediate output of the Map stage is written directly into the distributed file system HDFS.

7. The Hadoop matrix processing method for a heterogeneous cluster according to claim 3, characterized in that, before the first matrix A and the second matrix B are stored in the triple format, the first matrix A and the second matrix B are first preprocessed, and the relevant information in the first matrix A and the second matrix B is collected according to the triple storage format.

8. A Hadoop matrix processing system for a heterogeneous cluster, characterized in that the processing system comprises:

9. The Hadoop matrix processing system for a heterogeneous cluster according to claim 8, characterized in that:

the processing system further comprises a performance monitoring and display unit, for displaying the speedup of the program and the software and hardware configuration information of the physical cluster.

10. The Hadoop matrix processing system for a heterogeneous cluster according to claim 8, characterized in that:

the information of the two matrices on which the MapReduce matrix multiplication operation is performed according to the prewritten code is stored in the storage unit in a triple format, the columns being i, j and a_i^T b_j;