CN105022779A - Method for realizing HDFS file access by utilizing Filesystem API - Google Patents


Info

Publication number
CN105022779A
Authority
CN
China
Prior art keywords
file
namenode
hadoop
hdfs
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510229757.2A
Other languages
Chinese (zh)
Inventor
杨莉
王森
沈映泉
赵薇
段嘉杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Electric Power Research Institute of Yunnan Power System Ltd
Original Assignee
Electric Power Research Institute of Yunnan Power System Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Electric Power Research Institute of Yunnan Power System Ltd filed Critical Electric Power Research Institute of Yunnan Power System Ltd
Priority to CN201510229757.2A priority Critical patent/CN105022779A/en
Publication of CN105022779A publication Critical patent/CN105022779A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/13 - File access structures, e.g. distributed indices
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/18 - File system types
    • G06F 16/182 - Distributed file systems

Abstract

A method for realizing HDFS file access by utilizing the FileSystem API is disclosed. The method comprises Hadoop cluster environment construction, HDFS file upload and download methods, and the HDFS file read and write flows, and enables efficient uploading and downloading of HDFS files.

Description

A method for realizing HDFS file access using the FileSystem API
Technical field
The invention belongs to the field of distributed file access in computing, and more particularly relates to a realization of HDFS file access based on the FileSystem API.
Background art
With the rapid development of network technology, many enterprises and organizations store, compute, and exchange data through data storage service providers. Hadoop is a distributed system framework developed by the Apache Foundation. It allows users to develop distributed programs without understanding the low-level details of distribution, making full use of a cluster's power for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System, abbreviated HDFS.
HDFS adopts a master/slave architecture: an HDFS cluster consists of one Namenode and multiple Datanodes. The Namenode is a central server responsible for operations on the file system namespace, such as opening, closing, and renaming files or directories. It maintains the mapping from file paths to data blocks and from data blocks to Datanodes, monitors the heartbeats of the Datanodes, and maintains the number of replicas of each data block. The Namenode keeps its metadata in main memory, where one record occupies roughly 150 bytes. The metadata mainly comprises: (1) the file list; (2) the blocks belonging to each file; (3) the Datanodes holding each block; (4) file attributes such as creation time and replica count. The Datanodes handle read and write requests from clients and, under the unified scheduling of the Namenode, create, delete, and replicate data blocks; in general, one Datanode is deployed per physical node (machine).
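The roughly 150 bytes per metadata record mentioned above allow a back-of-the-envelope estimate of Namenode heap usage. The sketch below is illustrative only and not part of the patent; the accounting of one record per file, per block, and per block replica location, as well as the example numbers, are assumptions:

```java
public class NamenodeMemoryEstimate {
    // Rough heap estimate: one ~150-byte record per file entry, per block
    // entry, and per block-location entry (hypothetical accounting).
    static long estimateBytes(long files, long blocksPerFile, int replicas) {
        long records = files                                // file list entries
                     + files * blocksPerFile                // block entries per file
                     + files * blocksPerFile * replicas;    // block-to-Datanode entries
        return records * 150L;
    }

    public static void main(String[] args) {
        // Example: 1 million files, 2 blocks each, 3 replicas.
        long bytes = estimateBytes(1_000_000L, 2, 3);
        System.out.println(bytes / (1024 * 1024) + " MB");
    }
}
```

This illustrates why the Namenode's memory, rather than disk, bounds the number of files an HDFS cluster can hold.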
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with very large data sets. HDFS accesses data in a streaming fashion, reading or writing a whole block at a time.
Summary of the invention
The object of the present invention is to provide an implementation method for an HDFS file access system based on the FileSystem API, achieving efficient uploading and downloading of HDFS files.
To achieve this goal, the invention provides a method for realizing HDFS file access using the FileSystem API, characterized in that it comprises Hadoop cluster environment construction, HDFS file upload and download methods, and the HDFS file read and write flows; wherein,
Hadoop cluster environment construction comprises the following steps:
Step S1: install a Linux system;
Step S2: create a Hadoop user group and user under the Linux system;
Step S3: install the JDK and configure environment variables;
Step S4: modify the host name of each host and configure the mappings between hosts;
Step S5: install the ssh service and set up passwordless ssh access;
Step S6: install and configure Hadoop on each host;
Step S7: start Hadoop after installation and verify that the installation is correct;
The HDFS file upload method comprises the following steps:
Step S1: the user requests to upload a file to the Hadoop distributed file system;
Step S2: create an input stream for the local file with the java.io.FileInputStream class;
Step S3: read the Hadoop file system configuration items with the org.apache.hadoop.conf.Configuration class; the settings configured in core-site.xml take precedence here;
Step S4: org.apache.hadoop.fs.FileSystem is the core class through which users operate HDFS; it is the abstract base class of a generic file system, from which distributed file systems inherit, and through it the HDFS file system corresponding to a file URI is obtained;
Step S5: open an output stream to the Hadoop file in create mode; this output stream points to the target HDFS file;
Step S6: copy the file from the local file system to the target HDFS file with the IOUtils utility;
Step S7: list the directory of all files at the current HDFS target location;
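The upload steps above can be sketched in Java as follows. This is a minimal sketch, not the patent's implementation: the cluster address hdfs://localhost:9000, the local path /tmp/local.txt, and the HDFS path /upload/local.txt are hypothetical, and running it requires the hadoop-client library and a live cluster:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsUpload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // Step S3: reads core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // assumed cluster address
        FileSystem fs = FileSystem.get(conf);               // Step S4: FS for the URI
        InputStream in = new FileInputStream("/tmp/local.txt");        // Step S2
        FSDataOutputStream out = fs.create(new Path("/upload/local.txt")); // Step S5
        IOUtils.copyBytes(in, out, 4096, true);             // Step S6: copy and close streams
        for (FileStatus s : fs.listStatus(new Path("/upload"))) {      // Step S7
            System.out.println(s.getPath());
        }
        fs.close();
    }
}
```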
The HDFS file download method comprises the following steps:
Step S1: the user requests to download a file from the Hadoop distributed file system;
Step S2: read the Hadoop file system configuration items with the org.apache.hadoop.conf.Configuration class; the settings configured in core-site.xml take precedence here;
Step S3: obtain the HDFS file system corresponding to the file URI through the org.apache.hadoop.fs.FileSystem class;
Step S4: let the FileSystem open the FSDataInputStream input stream corresponding to the URI and read the file;
Step S5: save the selected file from the HDFS target location under the specified path of the local file system with the IOUtils utility;
Step S6: close the input and output streams;
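A matching download sketch follows; again the cluster address and the source/destination paths are hypothetical, and the hadoop-client library plus a running cluster are required:

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDownload {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // Step S2: reads core-site.xml
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // assumed cluster address
        FileSystem fs = FileSystem.get(conf);               // Step S3
        FSDataInputStream in = fs.open(new Path("/upload/local.txt")); // Step S4
        OutputStream out = new FileOutputStream("/tmp/copy.txt");      // Step S5 target
        IOUtils.copyBytes(in, out, 4096, true);             // copy; closes both (Step S6)
        fs.close();
    }
}
```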
The HDFS file read flow is as follows:
Step S1: the client sends a request to open the file through the open function of FileSystem;
Step S2: FileSystem sends the request to the Namenode over the RPC protocol;
Step S3: the Namenode checks the metadata and returns the data block locations of the corresponding file;
Step S4: FileSystem returns an FSDataInputStream to the client, from which the client reads the data;
Step S5: the client calls the read function of the FSDataInputStream;
Step S6: the client starts reading data from the Datanode in streaming fashion;
Step S7: after the data block on the current Datanode has been read, the connection between the stream and that Datanode is closed; the nearest Datanode holding the next data block of the file is then connected, and block reading continues;
Step S8: after the client has read all the data, it calls the close function of the FSDataInputStream to close the stream;
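From the client's perspective, steps S1, S5, and S8 of this flow reduce to three calls; the RPC to the Namenode and the Datanode switching of steps S2 through S7 happen inside the library. A sketch under the same hypothetical address and path assumptions as above (hadoop-client and a live cluster required):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadFlow {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");  // assumed cluster address
        FileSystem fs = FileSystem.get(conf);
        // Step S1: open triggers the RPC to the Namenode (steps S2-S4).
        FSDataInputStream in = fs.open(new Path("/upload/local.txt"));
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) > 0) {   // Steps S5-S7: streamed from Datanodes
            System.out.write(buf, 0, n);
        }
        in.close();                        // Step S8
        fs.close();
    }
}
```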
The HDFS file write flow is as follows:
Step S1: the client first splits the file to be uploaded into 64 MB blocks, namely block1, block2, ..., blockN, and at the same time sends a file creation request through the create function of FileSystem;
Step S2: FileSystem sends the request to the Namenode over the RPC protocol;
Step S3: a new file is created inside the Namespace of the Namenode, and the Namenode returns the available Datanodes;
Step S4: FileSystem returns an FSDataOutputStream to the client for writing the data;
Step S5: the client calls the write function of the FSDataOutputStream;
Step S6: the client starts writing block1 to the Datanodes in streaming fashion: (1) the 64 MB block1 is divided into 64 KB packets; (2) the first packet is sent to Datanode1; (3) after Datanode1 receives it, it forwards the first packet to Datanode2, while the client sends the second packet to Datanode1; (4) after Datanode2 receives the first packet, it forwards it to Datanode3, while receiving the second packet from Datanode1; (5) and so on, until block1 has been sent completely;
Step S7: Datanode1, Datanode2, and Datanode3 report the successful receipt of block1 to the NameNode, and Datanode1 reports success to the client; after the client receives the message from Datanode1, it sends a message to the Namenode; at this point the transfer of block1 is complete, and the flow jumps back to step S6 to write the remaining blocks block2, block3, ..., blockN, until blockN has been sent completely;
Step S8: after the client has finished writing the data, it calls the close function of the FSDataOutputStream to close the stream;
Step S9: FileSystem notifies the Namenode that the write is complete;
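The block and packet sizes in steps S1 and S6 fix the arithmetic of the pipeline: a full 64 MB block always yields the same number of 64 KB packets, and the block count follows from the file size. The sketch below is illustrative, not part of the patent:

```java
public class HdfsWriteArithmetic {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB block, as in step S1
    static final long PACKET_SIZE = 64L * 1024;       // 64 KB packet, as in step S6

    // Number of packets that make up one full block.
    static long packetsPerBlock() {
        return BLOCK_SIZE / PACKET_SIZE;
    }

    // Number of blocks for a file of the given size (last block may be partial).
    static long blocksForFile(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        System.out.println("packets per block: " + packetsPerBlock());
        System.out.println("blocks for a 200 MB file: " + blocksForFile(200L * 1024 * 1024));
    }
}
```

So each iteration of step S6 pipelines 1024 packets per full block through the three Datanodes before step S7 acknowledges the block.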
When a file system client performs a write operation, the operation is first recorded in the edit log. The Namenode keeps the file system metadata in memory; after the edit log has been recorded, the Namenode modifies the in-memory data structures. Before each write operation succeeds, the edit log is synced to the file system. The fsimage file is an on-disk checkpoint of the in-memory metadata; it is a serialized form and cannot be modified directly on disk. When the Namenode fails, the metadata of the latest checkpoint is loaded into memory from the fsimage, and the operations in the edit log are then replayed one by one. The SecondaryNamenode exists to help the Namenode checkpoint the in-memory metadata to disk. The checkpoint process is as follows:
Step S1: the SecondaryNamenode notifies the Namenode to generate a new edit log file; subsequent log records are all written to the new file;
Step S2: the SecondaryNamenode fetches the fsimage file and the old edit log from the Namenode via HTTP GET;
Step S3: the SecondaryNamenode loads the fsimage into memory, replays the operations in the edit log, and then generates a new fsimage file;
Step S4: the SecondaryNamenode sends the new fsimage file back to the Namenode via HTTP POST;
Step S5: the Namenode replaces the old fsimage file and the old edit log with the new fsimage file and the new edit log generated in step S1, then updates the fstime file to record the time of this checkpoint;
Step S6: the fsimage file on the Namenode now holds the metadata of the latest checkpoint, and the edit log is emptied and begins recording modifications again.
The FSNamesystem of the present invention is the file system namespace system class, defined as follows:
public class FSNamesystem {
    public FSDirectory dir; // the file tree store
    final BlocksMap blocksMap = new BlocksMap(DEFAULT_INITIAL_MAP_CAPACITY, DEFAULT_MAP_LOAD_FACTOR);
    // BlocksMap maintains the mapping from a block to its metadata; the metadata
    // includes the inode the block belongs to and the Datanodes storing the block.
    public CorruptReplicasMap corruptReplicas = new CorruptReplicasMap(); // map of corrupt block replicas
    NavigableMap<String, DatanodeDescriptor> datanodeMap = new TreeMap<String, DatanodeDescriptor>();
    // mapping from Datanode to blocks
    ArrayList<DatanodeDescriptor> heartbeats = new ArrayList<DatanodeDescriptor>();
    // subset of datanodeMap containing only the DatanodeDescriptors believed alive;
    // the HeartbeatMonitor periodically removes expired elements
    private UnderReplicatedBlocks neededReplications = new UnderReplicatedBlocks();
    // describes blocks whose replica count is insufficient; assigns each block a
    // priority and manages the set of under-replicated blocks with a priority queue
    private PendingReplicationBlocks pendingReplications;
    // list of blocks whose replica replication has not yet completed
    public LeaseManager leaseManager = new LeaseManager(this); // manages file leases
    Daemon hbthread = null;
    // periodically calls the heartbeatCheck method defined by FSNamesystem to monitor
    // and handle the heartbeat status information sent by the Datanode nodes
    public Daemon lmthread = null; // LeaseMonitor thread
    Daemon smmthread = null;
    // periodically checks whether the conditions for leaving safe mode have been met;
    // this thread must therefore be started after entering safe mode (i.e. after the threshold is reached)
    public Daemon replthread = null;
    // periodically calls two methods: compute block replica counts, make a plan, and
    // schedule Datanode processing; and handle replicas whose pipelined block replication is incomplete
    private ReplicationMonitor replmon = null; // replication metrics
    private Host2NodesMap host2DataNodeMap = new Host2NodesMap();
    // keeps the mapping from a Datanode host to its DatanodeDescriptor array
    NetworkTopology clusterMap = new NetworkTopology();
    // represents a computer cluster as a tree network topology; a cluster may consist of
    // multiple data centers, each containing many racks of machines arranged for computation
    private DNSToSwitchMapping dnsToSwitchMapping;
    // a plug-in interface defining the resolver that maps DNS-name/IP-address to RackID
    ReplicationTargetChooser replicator;
    // implementation class that chooses the storage locations for the replicas of a given block
    private HostsFileReader hostsReader;
    // tracks the Datanodes: which Datanodes may connect to the Namenode and which may
    // not, as recorded in the specified include and exclude lists
}
Brief description of the drawings
Fig. 1 is a flow diagram of HDFS file upload based on the FileSystem API;
Fig. 2 is a flow diagram of HDFS file download based on the FileSystem API;
Fig. 3 is a flow diagram of the HDFS file read flow;
Fig. 4 is a flow diagram of the HDFS file write flow;
Fig. 5 is a flow diagram of the checkpoint process.
Detailed description of the embodiments
A method for realizing HDFS file access using the FileSystem API. The present invention comprises Hadoop cluster environment deployment, uploading local Windows files to the Hadoop distributed file system (HDFS), downloading HDFS files to a specified local directory, the file read flow, the file write flow, and the checkpoint process. First, a Linux system is installed and a Hadoop user group and user are created under it; in addition, the JDK is installed and environment variables are configured. Second, the host name of each host is modified and the mappings between hosts are configured. Third, the ssh service is installed and passwordless ssh access is set up. Finally, Hadoop is installed and configured on each host; after installation, Hadoop is started and the installation is verified.
Hadoop cluster environment construction comprises the following steps:
Step S1: install a Linux system;
Step S2: create a Hadoop user group and user under the Linux system;
Step S3: install the JDK and configure environment variables;
Step S4: modify the host name of each host and configure the mappings between hosts;
Step S5: install the ssh service and set up passwordless ssh access;
Step S6: install and configure Hadoop on each host;
Step S7: start Hadoop after installation and verify that the installation is correct;
The HDFS file upload method, shown in Fig. 1, comprises the following steps:
Step S1: the user requests to upload a file to the Hadoop distributed file system;
Step S2: create an input stream for the local file with the java.io.FileInputStream class;
Step S3: read the Hadoop file system configuration items with the org.apache.hadoop.conf.Configuration class; the settings configured in core-site.xml take precedence here;
Step S4: org.apache.hadoop.fs.FileSystem is the core class through which users operate HDFS; it is the abstract base class of a generic file system, from which distributed file systems inherit, and through it the HDFS file system corresponding to a file URI is obtained;
Step S5: open an output stream to the Hadoop file in create mode; this output stream points to the target HDFS file;
Step S6: copy the file from the local file system to the target HDFS file with the IOUtils utility;
Step S7: list the directory of all files at the current HDFS target location;
The HDFS file download method, shown in Fig. 2, comprises the following steps:
Step S1: the user requests to download a file from the Hadoop distributed file system;
Step S2: read the Hadoop file system configuration items with the org.apache.hadoop.conf.Configuration class; the settings configured in core-site.xml take precedence here;
Step S3: obtain the HDFS file system corresponding to the file URI through the org.apache.hadoop.fs.FileSystem class;
Step S4: let the FileSystem open the FSDataInputStream input stream corresponding to the URI and read the file;
Step S5: save the selected file from the HDFS target location under the specified path of the local file system with the IOUtils utility;
Step S6: close the input and output streams;
The HDFS file read flow is shown in Fig. 3 and proceeds as follows:
Step S1: the client sends a request to open the file through the open function of FileSystem;
Step S2: FileSystem sends the request to the Namenode over the RPC protocol;
Step S3: the Namenode checks the metadata and returns the data block locations of the corresponding file;
Step S4: FileSystem returns an FSDataInputStream to the client, from which the client reads the data;
Step S5: the client calls the read function of the FSDataInputStream;
Step S6: the client starts reading data from the Datanode in streaming fashion;
Step S7: after the data block on the current Datanode has been read, the connection between the stream and that Datanode is closed; the nearest Datanode holding the next data block of the file is then connected, and block reading continues;
Step S8: after the client has read all the data, it calls the close function of the FSDataInputStream to close the stream;
The HDFS file write flow is shown in Fig. 4 and proceeds as follows:
Step S1: the client first splits the file to be uploaded into 64 MB blocks, namely block1, block2, ..., blockN, and at the same time sends a file creation request through the create function of FileSystem;
Step S2: FileSystem sends the request to the Namenode over the RPC protocol;
Step S3: a new file is created inside the Namespace of the Namenode, and the Namenode returns the available Datanodes;
Step S4: FileSystem returns an FSDataOutputStream to the client for writing the data;
Step S5: the client calls the write function of the FSDataOutputStream;
Step S6: the client starts writing block1 to the Datanodes in streaming fashion: (1) the 64 MB block1 is divided into 64 KB packets; (2) the first packet is sent to Datanode1; (3) after Datanode1 receives it, it forwards the first packet to Datanode2, while the client sends the second packet to Datanode1; (4) after Datanode2 receives the first packet, it forwards it to Datanode3, while receiving the second packet from Datanode1; (5) and so on, until block1 has been sent completely;
Step S7: Datanode1, Datanode2, and Datanode3 report the successful receipt of block1 to the NameNode, and Datanode1 reports success to the client; after the client receives the message from Datanode1, it sends a message to the Namenode; at this point the transfer of block1 is complete, and the flow jumps back to step S6 to write the remaining blocks block2, block3, ..., blockN, until blockN has been sent completely;
Step S8: after the client has finished writing the data, it calls the close function of the FSDataOutputStream to close the stream;
Step S9: FileSystem notifies the Namenode that the write is complete;
When a file system client performs a write operation, the operation is first recorded in the edit log. The Namenode keeps the file system metadata in memory; after the edit log has been recorded, the Namenode modifies the in-memory data structures. Before each write operation succeeds, the edit log is synced to the file system. The fsimage file is an on-disk checkpoint of the in-memory metadata; it is a serialized form and cannot be modified directly on disk. When the Namenode fails, the metadata of the latest checkpoint is loaded into memory from the fsimage, and the operations in the edit log are then replayed one by one. The SecondaryNamenode exists to help the Namenode checkpoint the in-memory metadata to disk. The checkpoint process is shown in Fig. 5 and proceeds as follows:
Step S1: the SecondaryNamenode notifies the Namenode to generate a new edit log file; subsequent log records are all written to the new file;
Step S2: the SecondaryNamenode fetches the fsimage file and the old edit log from the Namenode via HTTP GET;
Step S3: the SecondaryNamenode loads the fsimage into memory, replays the operations in the edit log, and then generates a new fsimage file;
Step S4: the SecondaryNamenode sends the new fsimage file back to the Namenode via HTTP POST;
Step S5: the Namenode replaces the old fsimage file and the old edit log with the new fsimage file and the new edit log generated in step S1, then updates the fstime file to record the time of this checkpoint;
Step S6: the fsimage file on the Namenode now holds the metadata of the latest checkpoint, and the edit log is emptied and begins recording modifications again.

Claims (1)

1. A method for realizing HDFS file access using the FileSystem API, characterized in that it comprises Hadoop cluster environment construction, HDFS file upload and download methods, and the HDFS file read and write flows; wherein,
Hadoop cluster environment construction comprises the following steps:
Step S1: install a Linux system;
Step S2: create a Hadoop user group and user under the Linux system;
Step S3: install the JDK and configure environment variables;
Step S4: modify the host name of each host and configure the mappings between hosts;
Step S5: install the ssh service and set up passwordless ssh access;
Step S6: install and configure Hadoop on each host;
Step S7: start Hadoop after installation and verify that the installation is correct;
The HDFS file upload method comprises the following steps:
Step S1: the user requests to upload a file to the Hadoop distributed file system;
Step S2: create an input stream for the local file with the java.io.FileInputStream class;
Step S3: read the Hadoop file system configuration items with the org.apache.hadoop.conf.Configuration class; the settings configured in core-site.xml take precedence here;
Step S4: org.apache.hadoop.fs.FileSystem is the core class through which users operate HDFS; it is the abstract base class of a generic file system, from which distributed file systems inherit, and through it the HDFS file system corresponding to a file URI is obtained;
Step S5: open an output stream to the Hadoop file in create mode; this output stream points to the target HDFS file;
Step S6: copy the file from the local file system to the target HDFS file with the IOUtils utility;
Step S7: list the directory of all files at the current HDFS target location;
The HDFS file download method comprises the following steps:
Step S1: the user requests to download a file from the Hadoop distributed file system;
Step S2: read the Hadoop file system configuration items with the org.apache.hadoop.conf.Configuration class; the settings configured in core-site.xml take precedence here;
Step S3: obtain the HDFS file system corresponding to the file URI through the org.apache.hadoop.fs.FileSystem class;
Step S4: let the FileSystem open the FSDataInputStream input stream corresponding to the URI and read the file;
Step S5: save the selected file from the HDFS target location under the specified path of the local file system with the IOUtils utility;
Step S6: close the input and output streams;
The HDFS file read flow is as follows:
Step S1: the client sends a request to open the file through the open function of FileSystem;
Step S2: FileSystem sends the request to the Namenode over the RPC protocol;
Step S3: the Namenode checks the metadata and returns the data block locations of the corresponding file;
Step S4: FileSystem returns an FSDataInputStream to the client, from which the client reads the data;
Step S5: the client calls the read function of the FSDataInputStream;
Step S6: the client starts reading data from the Datanode in streaming fashion;
Step S7: after the data block on the current Datanode has been read, the connection between the stream and that Datanode is closed; the nearest Datanode holding the next data block of the file is then connected, and block reading continues;
Step S8: after the client has read all the data, it calls the close function of the FSDataInputStream to close the stream;
The HDFS file write flow is as follows:
Step S1: the client first splits the file to be uploaded into 64 MB blocks, namely block1, block2, ..., blockN, and at the same time sends a file creation request through the create function of FileSystem;
Step S2: FileSystem sends the request to the Namenode over the RPC protocol;
Step S3: a new file is created inside the Namespace of the Namenode, and the Namenode returns the available Datanodes;
Step S4: FileSystem returns an FSDataOutputStream to the client for writing the data;
Step S5: the client calls the write function of the FSDataOutputStream;
Step S6: the client starts writing block1 to the Datanodes in streaming fashion: (1) the 64 MB block1 is divided into 64 KB packets; (2) the first packet is sent to Datanode1; (3) after Datanode1 receives it, it forwards the first packet to Datanode2, while the client sends the second packet to Datanode1; (4) after Datanode2 receives the first packet, it forwards it to Datanode3, while receiving the second packet from Datanode1; (5) and so on, until block1 has been sent completely;
Step S7: Datanode1, Datanode2, and Datanode3 report the successful receipt of block1 to the NameNode, and Datanode1 reports success to the client; after the client receives the message from Datanode1, it sends a message to the Namenode; at this point the transfer of block1 is complete, and the flow jumps back to step S6 to write the remaining blocks block2, block3, ..., blockN, until blockN has been sent completely;
Step S8: after the client has finished writing the data, it calls the close function of the FSDataOutputStream to close the stream;
Step S9: FileSystem notifies the Namenode that the write is complete;
When a file system client performs a write operation, the operation is first recorded in the edit log. The Namenode keeps the file system metadata in memory; after the edit log has been recorded, the Namenode modifies the in-memory data structures. Before each write operation succeeds, the edit log is synced to the file system. The fsimage file is an on-disk checkpoint of the in-memory metadata; it is a serialized form and cannot be modified directly on disk. When the Namenode fails, the metadata of the latest checkpoint is loaded into memory from the fsimage, and the operations in the edit log are then replayed one by one. The SecondaryNamenode exists to help the Namenode checkpoint the in-memory metadata to disk. The checkpoint process is as follows:
Step S1: the SecondaryNamenode notifies the Namenode to generate a new edit log file; subsequent log records are all written to the new file;
Step S2: the SecondaryNamenode fetches the fsimage file and the old edit log from the Namenode via HTTP GET;
Step S3: the SecondaryNamenode loads the fsimage into memory, replays the operations in the edit log, and then generates a new fsimage file;
Step S4: the SecondaryNamenode sends the new fsimage file back to the Namenode via HTTP POST;
Step S5: the Namenode replaces the old fsimage file and the old edit log with the new fsimage file and the new edit log generated in step S1, then updates the fstime file to record the time of this checkpoint;
Step S6: the fsimage file on the Namenode now holds the metadata of the latest checkpoint, and the edit log is emptied and begins recording modifications again.
CN201510229757.2A 2015-05-07 2015-05-07 Method for realizing HDFS file access by utilizing Filesystem API Pending CN105022779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510229757.2A CN105022779A (en) 2015-05-07 2015-05-07 Method for realizing HDFS file access by utilizing Filesystem API

Publications (1)

Publication Number Publication Date
CN105022779A true CN105022779A (en) 2015-11-04

Family

ID=54412750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510229757.2A Pending CN105022779A (en) 2015-05-07 2015-05-07 Method for realizing HDFS file access by utilizing Filesystem API

Country Status (1)

Country Link
CN (1) CN105022779A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120182891A1 (en) * 2011-01-19 2012-07-19 Youngseok Lee Packet analysis system and method using hadoop based parallel computation
CN102073741A (en) * 2011-01-30 2011-05-25 宇龙计算机通信科技(深圳)有限公司 Method for realizing file reading and/or writing and data server
CN102902716A (en) * 2012-08-27 2013-01-30 苏州两江科技有限公司 Storage system based on Hadoop distributed computing platform
CN103793425A (en) * 2012-10-31 2014-05-14 国际商业机器公司 Data processing method and data processing device for distributed system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHNRLES AANA888: "Using the FileSystem API to perform Hadoop file read and write operations" (in Chinese), 《http://supercharles888.blog.51cto.com/609344/878921》 *
ZANGLU: "Hadoop Cluster (Part 8): A First Exploration of HDFS" (in Chinese), 《http://www.educity.cn/net/1618908.html》 *
风生水起: "A detailed illustrated guide to setting up a single-node Hadoop environment" (in Chinese), 《http://www.cnblogs.com/end/archive/2012/08/13/2636645.html》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106713493A (en) * 2017-01-20 2017-05-24 郑州云海信息技术有限公司 System and method for constructing distributed file system in cluster environment
CN106713493B (en) * 2017-01-20 2020-09-29 苏州浪潮智能科技有限公司 System and method for constructing a distributed file system in a computer cluster environment
CN107294771A (en) * 2017-05-17 2017-10-24 上海斐讯数据通信技术有限公司 Efficient deployment system and usage method for big data clusters
CN112199334A (en) * 2020-10-23 2021-01-08 东北大学 Method and device for storing data stream processing check point file based on message queue
CN112199334B (en) * 2020-10-23 2023-12-05 东北大学 Method and device for storing data stream processing checkpoint files based on a message queue
CN115495057A (en) * 2022-11-16 2022-12-20 江苏智云天工科技有限公司 Method and system for realizing communication between Windows and HDFS
CN115495057B (en) * 2022-11-16 2023-02-28 江苏智云天工科技有限公司 Method and system for realizing communication between Windows and HDFS

Similar Documents

Publication Publication Date Title
US10956601B2 (en) Fully managed account level blob data encryption in a distributed storage environment
US11954002B1 (en) Automatically provisioning mediation services for a storage system
US10764045B2 (en) Encrypting object index in a distributed storage environment
CN104731691B (en) The method and system of duplicate of the document number in dynamic adjustment distributed file system
US10659225B2 (en) Encrypting existing live unencrypted data using age-based garbage collection
US20210019063A1 (en) Utilizing data views to optimize secure data access in a storage system
US9436556B2 (en) Customizable storage system for virtual databases
CN107797767B (en) One kind is based on container technique deployment distributed memory system and its storage method
US20200174671A1 (en) Bucket views
WO2016180055A1 (en) Method, device and system for storing and reading data
CN102420854A (en) Distributed file system facing to cloud storage
US20210055885A1 (en) Enhanced data access using composite data views
CN103166785A (en) Distributed type log analysis system based on Hadoop
CN104462185A (en) Digital library cloud storage system based on mixed structure
JP5868986B2 (en) Recovery by item
CN104050248A (en) File storage system and storage method
CN104281980B (en) Thermal power generation unit remote diagnosis method and system based on Distributed Calculation
CN105095103A (en) Storage device management method and device used for cloud environment
CN105022779A (en) Method for realizing HDFS file access by utilizing Filesystem API
US20220214814A1 (en) Cross-platform replication of logical units
CN102281312A (en) Data loading method and system and data processing method and system
Won et al. Moving metadata from ad hoc files to database tables for robust, highly available, and scalable HDFS
CN109413130A (en) A kind of cloud storage system
EP3349416B1 (en) Relationship chain processing method and system, and storage medium
CN116389233A (en) Container cloud management platform active-standby switching system, method and device and computer equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2015-11-04