CN104408047A

CN104408047A - Method for uploading text files to HDFS (Hadoop Distributed File System) in multi-machine parallel mode based on an NFS (Network File System) file server

Info

Publication number: CN104408047A
Application number: CN201410584207.8A
Authority: CN
Inventors: 房体盈; 辛国茂
Original assignee: Langchao Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2014-10-28
Filing date: 2014-10-28
Publication date: 2015-03-11

Abstract

The invention provides a method for uploading text files to HDFS (Hadoop Distributed File System) in parallel from multiple machines, based on an NFS (Network File System) file server. The method comprises the following steps: selecting N hosts from an HDFS cluster, then designating any one of them as the master node and the other N-1 nodes as slave nodes; obtaining, on the master node, the files in the to-be-uploaded directory of the NFS file server; and, for each file, uploading in parallel, that is, all machines in the cluster take part in the upload and each host is responsible for uploading one contiguous data block of 1/N of the file's size. Uploading in parallel in this way increases the upload speed.

Description

A method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server

Technical field

The present invention relates to the technical field of big data storage, and specifically to a method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server.

Background technology

With the development of computer networks, the era of massive data has arrived. The Internet Data Center predicts that global data usage will grow 44-fold by 2020, reaching 35.2 ZB.

Conventional technologies (including traditional relational databases) cannot cope with storing, analysing, managing, and mining data sets of this size, and how best to analyse and understand these data is an urgent task facing everyone. Among existing technologies and tools, the most mature and most successful big data solution is the Hadoop storage and computing framework together with the components built on it. For the large volume of text files generated every day, uploading them to HDFS quickly for subsequent processing is a current problem. To solve the problem of uploading text files quickly, this document proposes a method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server.

HDFS uses a three-replica mechanism by default. When a user writes data to HDFS through a client, and a DataNode runs on that client's machine, the NameNode preferentially places one replica of the written data on that local DataNode and the other two replicas on other DataNodes in the cluster. Consequently, if only one client in the whole cluster is writing, only 3 DataNodes are working while the other DataNodes sit idle, and the performance of the cluster as a whole is not exploited.

Summary of the invention

The object of the invention is to provide a method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server.

This object is achieved as follows. N hosts are chosen from the HDFS cluster; any one of them is selected as the master node, and the other N-1 nodes act as slave nodes. On the master node, the files in the to-be-uploaded directory of the NFS file server are obtained. Each file is uploaded in parallel: all machines in the cluster take part in the upload, and each host is responsible for uploading one contiguous data block of 1/N of the file's size. Uploading in parallel in this way increases the upload speed. The specific steps are:

1) The MainPut program on the master node computes, for each of the N nodes, the start and end byte offsets of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel. On the first run, an executable BlockPut program, which is responsible for uploading the data block assigned to its node, is installed on every node; the master node then sends each slave node a command to start BlockPut;

2) On each node, the BlockPut program uploads its assigned data block to HDFS: BlockPut opens an input stream (InputStream) on the file to be uploaded, positions the InputStream at the start byte, then creates a separate file on HDFS and writes the bytes between the start and end offsets into that file.
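The offset computation that MainPut performs in step 1) can be illustrated with a short sketch. The source does not say how the remainder is distributed when the file size is not divisible by N, so the scheme below (the first few blocks receive one extra byte) is an assumption, and the function name `split_ranges` is hypothetical.

```python
from typing import List, Tuple


def split_ranges(file_size: int, n: int) -> List[Tuple[int, int]]:
    """Return n contiguous (start, end) byte ranges covering [0, file_size).

    start is inclusive and end is exclusive. Assumption (not from the
    source): the first file_size % n ranges get one extra byte, so block
    sizes differ by at most one byte.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    base, extra = divmod(file_size, n)
    ranges = []
    start = 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


if __name__ == "__main__":
    # A 10-byte file split across 3 nodes.
    print(split_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Ranges computed purely by byte count can split a line of text across two blocks; the source does not describe how, or whether, this is handled.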

The directory to be uploaded is mounted at the same default directory on all N nodes.

N is no greater than the number of parallel clients at which the NFS file server reaches its maximum read bandwidth.

The beneficial effects of the invention are as follows: N nodes in the cluster are explicitly chosen as clients, and each file is divided into N data blocks that are uploaded simultaneously. Each client is responsible for one block, and each block is saved as a separate file on HDFS, so the performance of the whole cluster is used to the fullest. Uploading a text file in parallel blocks in this way improves upload efficiency.
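The per-node work of BlockPut (position the input stream at the start byte, then write the byte range into its own file) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a local directory stands in for HDFS so that the sketch runs without a cluster, and the function name `block_put`, the `chunk_size` parameter, and the part-file naming scheme are all hypothetical.

```python
import os
import tempfile


def block_put(src_path: str, start: int, end: int, dest_dir: str,
              node_index: int, chunk_size: int = 1 << 20) -> str:
    """Copy bytes [start, end) of src_path into a separate part file.

    dest_dir is a local stand-in for an HDFS directory (assumption).
    The name "<file>.part-<index>" is a hypothetical naming scheme.
    Returns the path of the part file that was written.
    """
    part_name = f"{os.path.basename(src_path)}.part-{node_index:05d}"
    dest_path = os.path.join(dest_dir, part_name)
    remaining = end - start
    with open(src_path, "rb") as src, open(dest_path, "wb") as dst:
        src.seek(start)  # position the input stream at the start byte
        while remaining > 0:
            buf = src.read(min(chunk_size, remaining))
            if not buf:  # source ended earlier than expected
                break
            dst.write(buf)
            remaining -= len(buf)
    return dest_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "log.txt")
        with open(src, "wb") as f:
            f.write(b"0123456789")
        part = block_put(src, 4, 7, tmp, node_index=1)
        with open(part, "rb") as f:
            print(f.read())  # bytes 4 to 6 of the source file
```

In a real deployment each node would read `src_path` from the shared NFS mount and write through an HDFS client rather than to a local directory; reading in bounded chunks keeps memory use independent of the block size.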

Description of the drawings

Fig. 1 is a framework diagram of the multi-machine parallel upload process.

Embodiment

The method of the invention is described in detail below with reference to the accompanying drawing.

N hosts are chosen from the HDFS cluster; any one of them is selected as the master node, and the other N-1 nodes act as slave nodes. On the master node, the files in the to-be-uploaded directory of the NFS file server are obtained. Each file is uploaded in parallel: all machines in the cluster take part in the upload, and each host is responsible for uploading one contiguous data block of 1/N of the file's size, which increases the upload speed. For the method of the invention for uploading text files to HDFS in parallel from multiple machines based on an NFS file server, the overall flow is:

Apart from the technical features described in this specification, everything else is known to those skilled in the art.

Claims

1. A method for uploading text files to HDFS in parallel from multiple machines based on an NFS file server, characterized in that N hosts are chosen from the HDFS cluster; any one of them is selected as the master node, and the other N-1 nodes act as slave nodes; on the master node, the files in the to-be-uploaded directory of the NFS file server are obtained; each file is uploaded in parallel, that is, all machines in the cluster take part in the upload and each host is responsible for uploading one contiguous data block of 1/N of the file's size, thereby uploading in parallel and increasing the upload speed; the specific steps being:

The MainPut program on the master node computes, for each of the N nodes, the start and end byte offsets of the data block that node is to upload, and starts the BlockPut program on the N nodes so that they upload in parallel; on the first run, an executable BlockPut program, which is responsible for uploading the data block assigned to its node, is installed on every node, and the master node then sends each slave node a command to start BlockPut;

2. The method according to claim 1, characterized in that the directory to be uploaded is mounted at the same default directory on all N nodes.

3. The method according to claim 1, characterized in that N is no greater than the number of parallel clients at which the NFS file server reaches its maximum read bandwidth.