Background Art
Cloud storage is a concept that extends and develops from cloud computing (Cloud Computing). Cloud storage refers to a system that, by means of functions such as clustering, grid technology, or distributed file systems, aggregates a large number of storage devices of various different types in a network to work collaboratively through application software, jointly providing external data storage and access functions while ensuring the security of the data.
At present, technologies represented by the Hadoop Distributed File System (HDFS) of the open-source community Apache's Hadoop project and the parallel programming framework Hadoop MapReduce are gradually becoming the mainstream technologies for mass data storage and analytical processing. Among them, HDFS has gradually become one of the most popular distributed file systems and is currently the mainstream file system for building cloud storage.
The HDFS system architecture, as shown in Figure 1, consists primarily of a metadata node (NameNode), data nodes (DataNode), and a client (Client). The NameNode, also called the Master node, manages the HDFS namespace and the data-block mapping information, configures the replication policy, and handles client requests. A DataNode, also called a Slave node, stores the actual data, performs data-block read and write operations, and periodically reports its stored data-block information to the NameNode. The Client splits data files and accesses or manages HDFS through the command line; it interacts with the NameNode to obtain file location information, and interacts with the DataNodes to perform data read and write operations.
At present, HDFS is widely deployed in data centers by numerous enterprises, universities, and research institutions, and has gradually become the basic storage system of the data center, carrying mass data storage tasks. As more and more independent small and medium-sized data centers are established in dispersed locations, how to effectively share the storage resources and data of each data center, and how to provide a unified data access interface to external services, has become one of the core problems restricting the rapid development and widespread application of cloud storage systems. There is currently no concrete published report on HDFS data read/write technology across multiple data centers; this is a technical problem urgently in need of a solution.
Summary of the invention
It is an object of the present invention to provide an HDFS data read/write system and method that provide a unified access interface for data reads and writes and achieve effective sharing of information and resources across multiple data centers.
To achieve the above object, the invention adopts the following technical scheme:
An HDFS data read/write system for multiple data centers, characterized in that it comprises a global metadata server, n data centers, and a client, each data center having one metadata node and multiple data nodes. Wide-area network links connect the global metadata server with the client and with the metadata node of each data center; local-area network links connect the metadata node of each data center with its data nodes. The global metadata server stores and manages global metadata information and is responsible for allocating a metadata namespace to each data center. The metadata node of each data center contains a GMSplugin module, which is responsible for registering with the global metadata server and periodically reporting the data center's resource usage state and metadata information. The global metadata server receives HDFS data read/write access requests from the client and selects a data center that meets the requirements according to a preset scheduling algorithm. The client accesses the metadata node of the selected data center, which schedules the HDFS data read/write; after the client completes the HDFS data read/write, the metadata node of the data center synchronizes the metadata changes to the global metadata server.
An HDFS data read/write method for multiple data centers, characterized in that it comprises two major steps, reading and writing:
Step 1: HDFS data reading, comprising:
(1) A global metadata server is established for storing and managing global metadata information; the global metadata server allocates a namespace to each data center, and each data center reports its metadata information to the global metadata server;
(2) The global metadata server receives a client read request, selects a data center that meets the reading requirements by a preset algorithm, and returns the metadata node information of the selected data center;
(3) The client accesses the metadata node of the data center; the metadata node returns data-block and data-node information to the client according to a preset scheduling algorithm;
(4) The client interacts with the data nodes to read the data, and notifies the metadata node that the reading is complete after the read operation finishes.
Step 2: HDFS data writing, comprising:
(1) Same as step (1) of HDFS data reading;
(2) The global metadata server receives a client write request, selects a data center that meets the writing requirements by a preset algorithm, and returns the metadata node information of the selected data center;
(3) The client accesses the metadata node of the selected HDFS data center; the metadata node creates the metadata information, allocates data nodes according to a preset algorithm, and returns the data-node information to the client;
(4) The client interacts with the data nodes to perform the data write operation and notifies the metadata node after the writing is completed. A block-wise writing mechanism is adopted when the client writes data; data-block replica copying is completed automatically by the data nodes, and only after all data blocks have been written successfully is the metadata node notified that the writing is complete;
(5) After the write process completes, the metadata node of the data center synchronizes the metadata changes to the global metadata server.
In the aforesaid method, the client read request comprises any of the following features: file path, data-block index, and buffer size; the client write request comprises any of the following features: newly created file path, write data size, and access permissions.
The data-center selection algorithm preset in the global metadata server selects a data center according to any of the following features of the read or write request and of each data center: data distribution, system performance, and load condition, adopting policies such as data-distribution-first and performance-first.
The preset scheduling algorithm of the metadata node selects according to any of the following features: data size, number of blocks, distance between the data blocks and the client, and data-block distribution, using policies such as distance-first and distribution-fairness.
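For illustration only, the data-distribution-first and performance-first selection policies described above can be sketched in Python as follows; the `DataCenterState` fields and policy names are illustrative assumptions of this example and are not part of the claimed subject matter:

```python
from dataclasses import dataclass

@dataclass
class DataCenterState:
    """Per-data-center state as reported by its GMSplugin (illustrative fields)."""
    name: str
    holds_data: bool   # data-distribution feature: does this center already hold the file?
    performance: float # performance score, higher is better
    load: float        # current load ratio, 0.0 (idle) .. 1.0 (saturated)

def select_data_center(centers, policy="data-first"):
    """Pick a data center using a data-distribution-first or performance-first policy."""
    if policy == "data-first":
        # Prefer centers that already hold the requested data; break ties by lowest load.
        holders = [c for c in centers if c.holds_data]
        candidates = holders or centers
        return min(candidates, key=lambda c: c.load)
    elif policy == "performance-first":
        # Prefer the best performance score, discounted by the current load.
        return max(centers, key=lambda c: c.performance * (1.0 - c.load))
    raise ValueError(f"unknown policy: {policy}")
```

Under data-distribution-first, a center already holding the data is chosen even if a faster empty center exists; under performance-first, the raw performance score is penalized by load so that a saturated high-end center does not always win.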
The HDFS data read/write system for multiple data centers of the present invention adopts a two-layer, logically separated scheduling architecture. At the global logic layer, the global metadata server is responsible for allocating the namespace of each data center, querying the global metadata, and selecting a data center for data reads and writes; it is the unified core that integrates the otherwise independent data centers. At the business logic layer, the HDFS metadata node is extended with the GMSplugin module and linked to the global metadata server as a subordinate module, thereby forming a multi-data-center HDFS resource-sharing framework that supports metadata synchronization and sharing. The present invention achieves global sharing of metadata while retaining the original functions of an HDFS data center, such as the metadata node's data management. This approach reduces system complexity while keeping the original system efficient and stable, and can effectively and quickly realize data read/write access across multiple HDFS data centers.
Embodiment
In order to illustrate the technical scheme of the present invention more clearly, the present invention is described below in conjunction with the drawings and specific embodiments.
As shown in Figure 2, an HDFS data read/write system for multiple data centers comprises a global metadata server (Global Metadata Server, GMS), n data centers numbered 01 to N, and a client (Client), each data center having one metadata node (NameNode) and multiple data nodes (DataNode). Wide-area network links are adopted between the global metadata server and the client, and between the global metadata server and the metadata node of each data center; local-area network links connect the metadata node of each data center with its data nodes. The global metadata server stores and manages global metadata information and is responsible for allocating a metadata namespace to each data center. The metadata node of each data center contains a GMSplugin (global metadata server middleware) module, which links with the global metadata server, registers with it, and periodically reports the data center's resource usage state and metadata information.
The global metadata server receives HDFS data read/write access requests from the client and selects a data center that meets the requirements according to a preset scheduling algorithm. The client accesses the metadata node of the selected data center, which schedules the HDFS data read/write; after the client completes the HDFS data read/write, the metadata node of the data center synchronizes the metadata changes to the global metadata server.
The global metadata server stores and manages global metadata information; it allocates a metadata namespace to each data center; it receives HDFS data read/write access requests from the client and, according to the preset scheduling algorithm, selects the metadata node of a data center that meets the requirements; and it receives metadata updates from the metadata node of each data center.
The global metadata server consists primarily of three modules: the access interface, the GMS service program, and metadata management. The access interface is the module through which the client interacts with the global metadata server; it handles client requests such as HDFS data reads, writes, and queries. The GMS service program is the service daemon of the global metadata server; it monitors operation and restarts the modules of the global metadata server when needed, ensuring its stable running. The metadata management module is the interface through which the metadata node of each data center interacts with the global metadata server; it manages the metadata nodes of the data centers, receives metadata synchronization update requests from each data center and stores the global metadata information, handles the data read/write requests received by the access interface module, and selects a suitable data center according to the global metadata information and the condition of each data center.
The GMSplugin module is a middleware for communicating with the global metadata server; it registers with the global metadata server and synchronizes the data center's status information and metadata information to the global metadata server in real time.
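As a minimal sketch, the registration and timed-reporting behavior of the GMSplugin module might look as follows; the `gms.register` and `gms.report` methods stand for the otherwise unspecified server interface and are assumptions of this example, not an actual HDFS or GMS API:

```python
import threading

class GMSPlugin:
    """Sketch of the GMSplugin middleware: registers its data center with the
    global metadata server (GMS) and periodically reports resource usage state
    and metadata information. `gms` is any object exposing register/report."""

    def __init__(self, gms, center_id, interval_s=5.0):
        self.gms = gms
        self.center_id = center_id
        self.interval_s = interval_s
        self._stop = threading.Event()

    def register(self):
        # One-time registration of this data center with the GMS.
        self.gms.register(self.center_id)

    def report_once(self, resource_state, metadata_info):
        # Push the current resource usage state and metadata to the GMS.
        self.gms.report(self.center_id, resource_state, metadata_info)

    def run(self, collect_state):
        # Timed reporting loop: collect local state and report until stopped.
        self.register()
        while not self._stop.is_set():
            state, meta = collect_state()
            self.report_once(state, meta)
            self._stop.wait(self.interval_s)

    def stop(self):
        self._stop.set()
```

In a deployment, `run` would execute in a background thread of the NameNode process, with `collect_state` gathering the local resource usage and metadata snapshot.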
The metadata node of each data center (containing the GMSplugin module) manages the directory tree and file metadata information of its own data center; when the metadata of the metadata node changes, the GMSplugin module synchronizes the information to the global metadata server in real time according to a preset algorithm. The metadata node is responsible for managing the data nodes of its own data center and for handling client HDFS data read/write requests; it can select data nodes from the data center it manages according to the parameters of the data to be accessed and the preset scheduling strategy.
The data nodes of each data center manage the storage on the node, the block list, and data reads and writes. A data node creates, deletes, and replicates blocks under the scheduling of the metadata node, and periodically reports its data-block information to the metadata node according to a preset algorithm.
The client interacts with the system; it splits the data to be written into blocks, and interacts respectively with the global metadata server, the metadata node, and the data nodes of a data center to complete HDFS data read/write operations.
In the HDFS data read/write system of Figure 2, the number n of data centers may be chosen from 1 to 200.
Based on the system of Figure 2, the present invention also provides an HDFS data reading method for multiple data centers, described below with reference to Figure 3:
S301: A global metadata server is established for storing and managing global metadata information; the global metadata server allocates a namespace to each HDFS data center, and each data center reports its metadata information to the global metadata server.
S302: The global metadata server receives a client HDFS read request, selects an HDFS data center that meets the reading requirements by a preset algorithm, and returns the metadata node information of the selected data center.
The client read request comprises information such as the file path, data-block index, and buffer size.
The preset scheduling algorithm selects a data center according to information such as the data distribution of the HDFS read request and the data distribution, system performance, and load condition of each data center, adopting policies such as data-distribution-first and performance-first.
S303: The client accesses the metadata node of the HDFS data center; the metadata node returns data-block and data-node information to the client according to a preset scheduling algorithm.
The preset scheduling algorithm of the metadata node provides a recommended reading order according to information such as the distance between the data blocks and the client, the number of blocks, and the data-block distribution, selecting by policies such as distance-first and distribution-fairness; it can be customized as needed by those skilled in the art.
S304: The client interacts with the data nodes to read the data, and notifies the metadata node that the reading is complete after the read operation finishes.
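The read flow S301 to S304 can be sketched end to end as follows; this is a minimal Python illustration in which `select_read_center`, `locate`, `read`, and `read_complete` are assumed interface names chosen for this example, not actual HDFS or GMS APIs:

```python
def hdfs_multi_dc_read(gms, path):
    """Illustrative multi-data-center HDFS read (steps S302-S304).

    Assumed interfaces:
      gms.select_read_center(path)  -> metadata node of the chosen data center
      namenode.locate(path)         -> ordered list of (block_id, datanode)
      datanode.read(block_id)       -> bytes of that block
      namenode.read_complete(path)  -> notify that reading finished
    """
    # S302: the GMS selects a data center and returns its metadata node.
    namenode = gms.select_read_center(path)
    # S303: the metadata node returns blocks and data nodes in recommended order.
    locations = namenode.locate(path)
    # S304: the client reads each block from its data node, concatenating them,
    # then notifies the metadata node that the read is complete.
    data = b"".join(dn.read(block_id) for block_id, dn in locations)
    namenode.read_complete(path)
    return data
```

The client never contacts the GMS again after S302; all block-level scheduling stays inside the selected data center, matching the two-layer separation described above.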
Based on the system of Figure 2, the present invention provides an HDFS data writing method for multiple data centers, described below with reference to Figure 4:
S401: A global metadata server is established for storing and managing global metadata information; the global metadata server allocates a namespace to each data center, and each data center reports its metadata information to the global metadata server.
S402: The global metadata server receives a client write request, selects an HDFS data center that meets the writing requirements by a preset algorithm, and returns the metadata node information of the selected HDFS data center.
The client write request comprises information such as the newly created file path, the write data size, and access permissions.
The preset scheduling algorithm of the global metadata server selects a concrete data center according to information such as the request parameters and the data distribution, system performance, and load condition of each data center, scheduling by strategies such as data-distribution-first and performance-first; the scheduling algorithm can be flexibly customized as needed by those skilled in the art.
S403: The client accesses the metadata node of the selected HDFS data center; the metadata node creates the metadata information, allocates data nodes according to the preset scheduling algorithm, and returns the data-node information to the client.
The preset scheduling algorithm of the metadata node schedules according to information such as the data size, the number of blocks, and the data-block distribution, by strategies such as distance-first and distribution-fairness; it can be customized as needed by those skilled in the art.
S404: The client interacts with the data nodes to perform the data write operation and notifies the metadata node after the writing is completed. A block-wise writing mechanism is adopted when the client writes data; data-block replica copying is completed automatically by the data nodes, and only after all data blocks have been written successfully is the metadata node notified that the writing is complete.
S405: After the write process completes, the metadata node of the HDFS data center synchronizes the metadata changes to the global metadata server.
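The write flow S402 to S405, including the block-wise writing mechanism, can be sketched as follows; the interface names are illustrative assumptions of this example, the tiny `BLOCK_SIZE` is for demonstration only (real HDFS block sizes are configurable and far larger), and replica copying is treated as internal to the data node, as the steps above describe:

```python
BLOCK_SIZE = 4  # illustrative block size; real HDFS blocks are much larger

def hdfs_multi_dc_write(gms, path, data):
    """Illustrative multi-data-center HDFS write (steps S402-S405).

    Assumed interfaces:
      gms.select_write_center(path, size) -> metadata node of the chosen center
      namenode.create(path, num_blocks)   -> one allocated data node per block
      datanode.write(path, block)         -> write block (replicas handled inside)
      namenode.write_complete(path)       -> notify after all blocks succeed
      namenode.sync_to_gms()              -> push metadata changes to the GMS
    """
    # S402: the GMS selects a data center able to accept the write.
    namenode = gms.select_write_center(path, len(data))
    # Block-wise writing mechanism: the client splits the data into blocks.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    # S403: the metadata node creates metadata and allocates data nodes.
    datanodes = namenode.create(path, num_blocks=len(blocks))
    # S404: write every block; replica copying is done by the data nodes.
    for block, dn in zip(blocks, datanodes):
        dn.write(path, block)
    # Only after all blocks succeed is the metadata node notified.
    namenode.write_complete(path)
    # S405: the metadata node syncs the metadata change up to the GMS.
    namenode.sync_to_gms()
```

Note the ordering constraint encoded here: `write_complete` fires only after every block write has returned, and the GMS sync in S405 happens strictly after the write is acknowledged locally.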
In summary, the present invention solves the problems that arise as the independent small and medium-sized data centers dispersed everywhere grow in number, namely the difficulty of effectively sharing the storage resources and data of each data center and of providing a unified data access interface to external services, and achieves an open and stable HDFS data read/write framework and method for multiple data centers with unified management and a unified interface.