CN105824867A

CN105824867A - Mass file management system based on multi-stage distributed metadata

Info

Publication number: CN105824867A
Application number: CN201510929600.0A
Authority: CN
Inventors: 张伟; 何广柏; 许棉耿; 郑丁科
Original assignee: Guangdong Eshore Technology Co Ltd
Current assignee: Guangdong Eshore Technology Co Ltd
Priority date: 2015-12-14
Filing date: 2015-12-14
Publication date: 2016-08-03

Abstract

The invention discloses a mass file management system based on multi-stage distributed metadata. The system comprises a metadata management module, a data operation module and an application interface module. The metadata management module manages metadata through a metadata cluster and realizes distributed metadata; the data operation module uses a data cluster to provide file content storage; the application interface module is used for creating, reading and deleting metadata files. The system realizes LeveledFS management of mass files with a two-level scheme in which physical file storage is provided by HDFS: HBase manages the massive LeveledFS metadata, while HDFS manages a small number of large files. The system thereby inherits the advantages of HDFS and effectively overcomes the drawbacks of managing a mass of small files directly with HDFS.

Description

A mass file management system based on multi-stage distributed metadata

Technical field

The present invention relates to file management systems, and in particular to a mass file management system based on multi-stage distributed metadata.

Background art

In the prior art, paper-based information media are gradually being replaced by digital media. All kinds of information and data are encoded as binary data in various formats and managed by computer systems, and computers store this data through file systems. At present, data is growing exponentially, and the total volume of data worldwide can hardly be counted. As storage demand keeps expanding, storage technology also keeps advancing: the data density and capacity of computer hard disks continue to increase. However it develops, the storage capacity of a single hard disk or a single server remains limited. It can meet personal storage needs but falls far short of enterprise storage needs.

Distributed file systems emerged to meet the storage needs of mass data. A distributed file system uses the local storage capacity of multiple physically distributed computer nodes connected by a network and, through unified metadata management and distributed data placement, realizes a file system that logically provides a single overall storage capacity. Distributed file systems have the following features: 1. Very large storage space: the storage space of a distributed file system completely breaks through the limit of single-machine storage; it is the combined storage capacity of all hosts making up the file system. 2. Scalable storage capacity: the storage capacity of a distributed file system can be extended by increasing the number of host nodes.

At present, the most popular distributed file system is HDFS from the Apache Software Foundation. HDFS has two kinds of components, NameNode and DataNode: the NameNode is responsible for metadata management and the DataNodes are responsible for physical data storage. The NameNode manages the metadata of the whole file system. Each HDFS file is stored as blocks of fixed size, and the physical file corresponding to each block is stored on DataNodes. However, this system has the following defects:

1. Metadata capacity limit. Metadata is managed by the NameNode, which keeps all metadata in memory. Constrained by host memory and by JVM garbage collection behavior, the metadata capacity is limited, so HDFS cannot manage an excessive number of files. According to experience with production systems, NameNode performance becomes unstable once the number of files exceeds 50 million.

2. Metadata operation performance bottleneck. When operating on a file, a client first operates on the metadata through the NameNode and then connects to the specific DataNodes to read or write file data. The NameNode service is provided by a single active NameNode, which supports only a limited number of concurrent requests and cannot serve workloads with massive concurrent file reads and writes.

3. Long failure recovery time. When the number of files in the system is very large, the metadata is also large and requires a large amount of memory to manage. With large metadata the NameNode takes a long time to start, so recovery after a failure also takes a long time.
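As a rough illustration of defect 1, the sketch below estimates the NameNode heap consumed by namespace objects. The figure of about 150 bytes per object (file or block) is a commonly quoted rule of thumb and an assumption here, not a value from this disclosure; the function name is hypothetical.

```python
# Rough estimate of the NameNode heap used by namespace objects.
# ASSUMPTION: ~150 bytes per object is a rule of thumb, not a figure from the patent.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(num_files, blocks_per_file=1):
    """Each file contributes one file object plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    heap = estimate_namenode_heap_bytes(50_000_000)
    print(f"{heap / 1e9:.1f} GB of heap for 50 million single-block files")
```

Under these assumptions, 50 million small files already need on the order of 15 GB of heap, all of which must be reloaded when the NameNode restarts.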

Summary of the invention

The technical problem to be solved by the present invention is, in view of the deficiencies of the prior art, to provide a mass file management system that is based on multi-stage distributed metadata and can break through the metadata capacity limit, optimize metadata operation performance and reduce failure recovery time.

To solve the above technical problem, the present invention adopts the following technical solution.

A mass file management system based on multi-stage distributed metadata comprises: a metadata management module, which manages metadata through a metadata cluster and realizes distributed metadata; a data operation module, which uses a data cluster to provide file content storage; and an application interface module, which is used for creating, reading and deleting metadata files and for reading and writing file content.

Preferably, in the metadata management module, metadata is managed in HBase tables to achieve a very large metadata capacity, and the in-memory attribute is enabled on the metadata table. HBase provides metadata storage in a distributed manner, and the in-memory attribute ensures metadata read/write performance.

Preferably, in the data operation module, HDFS pools the local storage of all nodes and provides a unified storage view; individual file reads and writes are distributed to different HDFS nodes, and all data is stored in multiple replicas.

Preferably, in the application interface module, when a metadata file is created, it is first determined whether the file already exists, and if it does not, the metadata is created; next, an HDFS file is opened in append mode and the file content is written; finally, the content metadata is updated and the file status is changed to "created".

Preferably, in the application interface module, when a metadata file is read, if the file exists, the corresponding HDFS file is opened, the HDFS position corresponding to the file is located, and a file input stream is returned for the application to read the file data.

The present invention uses HBase to realize distributed metadata management and realizes LeveledFS management of mass files with a two-level scheme in which physical file storage is provided by HDFS: HBase manages the massive LeveledFS metadata, and HDFS manages a small number of large files. The system not only inherits the advantages of HDFS but also effectively avoids the drawbacks of managing a mass of small files with HDFS.

Brief description of the drawings

Fig. 1 is a block diagram of the mass file management system of the present invention.

Fig. 2 is a schematic diagram of the interaction between metadata and HDFS files.

Detailed description of the invention

The present invention is described in more detail below with reference to the accompanying drawings and embodiments.

The invention discloses a mass file management system based on multi-stage distributed metadata. As shown in Fig. 1 and Fig. 2, it comprises:

a metadata management module 1, which manages metadata through the metadata cluster HBase and realizes distributed metadata;

a data operation module 2, which uses the data cluster HDFS to provide file content storage;

an application interface module 3, which is used for creating, reading and deleting metadata files and for reading and writing file content.

Further, in the metadata management module 1, metadata is managed in HBase tables to achieve a very large metadata capacity, and the in-memory attribute is enabled on the metadata table. HBase provides metadata storage in a distributed manner, and the in-memory attribute ensures metadata read/write performance.

As a preferred mode, in the data operation module 2, HDFS pools the local storage of all nodes and provides a unified storage view; individual file reads and writes are distributed to different HDFS nodes, and all data is stored in multiple replicas.

In this embodiment, in the application interface module 3, when a metadata file is created, it is first determined whether the file already exists, and if it does not, the metadata is created; next, an HDFS file is opened in append mode and the file content is written; finally, the content metadata is updated and the file status is changed to "created".

Further, in the application interface module 3, when a metadata file is read, if the file exists, the corresponding HDFS file is opened, the HDFS position corresponding to the file is located, and a file input stream is returned for the application to read the file data.

In the mass file management system based on multi-stage distributed metadata disclosed by the present invention, the file system has two levels: file content is stored in HDFS and metadata is stored in HBase. One HDFS file corresponds to multiple files of the present file system, and each file of the file management system corresponds to one contiguous segment of an HDFS file.
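The two-level mapping can be pictured with a minimal sketch. It uses a plain dictionary as a stand-in for the HBase metadata table and an in-memory buffer as a stand-in for one large HDFS file; the record fields (container, offset, length), the paths and the function names are assumptions made for illustration and are not specified in the disclosure.

```python
import io
from dataclasses import dataclass


@dataclass
class FileMeta:
    container: str  # path of the large HDFS file that holds the bytes
    offset: int     # start of the contiguous segment inside the container
    length: int     # segment length in bytes


# Stand-ins (hypothetical): a dict for the HBase table, BytesIO for an HDFS file.
containers = {"/data/container-0001": io.BytesIO()}
metadata = {}


def put_file(path, content, container="/data/container-0001"):
    """Append content to the container and record where it landed."""
    buf = containers[container]
    offset = buf.seek(0, io.SEEK_END)  # append position = current end of container
    buf.write(content)
    metadata[path] = FileMeta(container, offset, len(content))


def get_file(path):
    """Look up the segment in the metadata and read exactly that range."""
    meta = metadata[path]
    buf = containers[meta.container]
    buf.seek(meta.offset)
    return buf.read(meta.length)
```

Storing many small logical files in one container means HDFS itself only tracks a handful of large files, while the per-file bookkeeping moves into the metadata table.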

As a preferred mode, the following processes are specifically included in practical applications:

One. Metadata management: metadata is managed by HBase to realize distributed metadata. This metadata scheme effectively breaks through the performance bottleneck:

Large metadata capacity: metadata is managed in HBase tables to achieve a very large metadata capacity; to ensure metadata read/write performance, the in-memory attribute is enabled on the metadata table. HBase provides metadata storage in a distributed manner, so capacity is no longer limited by the memory of a single machine, and the in-memory attribute ensures metadata read/write performance. High throughput: metadata is distributed across multiple RegionServers, and metadata access requests are distributed to different RegionServer nodes, so the throughput of metadata requests breaks through the single-machine bottleneck. Scalable metadata performance: once metadata is distributed, nodes can be added as needed to increase metadata processing performance. High reliability: HBase has a mature high-reliability mechanism; when some nodes fail, failover is performed while data consistency is guaranteed, which provides the basis for high availability of the file system.
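The throughput argument rests on HBase partitioning a table into regions by row-key range, with each region served by one RegionServer. The sketch below imitates that routing with a sorted list of split keys. Using the file path as the row key, the split points and the server names are all assumptions for illustration.

```python
import bisect

# Hypothetical split points: region i covers keys in [SPLIT_KEYS[i-1], SPLIT_KEYS[i]).
SPLIT_KEYS = ["/g", "/n", "/t"]
# Hypothetical RegionServers, one per region (len(SPLIT_KEYS) + 1 regions).
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs4"]


def region_server_for(row_key):
    """Return the server whose key range contains row_key (start key inclusive)."""
    return REGION_SERVERS[bisect.bisect_right(SPLIT_KEYS, row_key)]


if __name__ == "__main__":
    for key in ["/a/file1", "/logs/x", "/n", "/zoo/y"]:
        print(key, "->", region_server_for(key))
```

Because each lookup touches only the server that owns the key range, adding regions and servers spreads the metadata load instead of concentrating it on one node.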

Two. Storage implementation: HDFS is used to provide file content storage, fully inheriting the advantages of HDFS:

Large data capacity: HDFS pools the local storage of all nodes in the cluster into a single storage view, providing mass storage capacity. High I/O performance: individual file reads and writes are distributed to different HDFS nodes, making full use of the I/O capability of the cluster's many hard disks. High reliability: all data is stored in multiple replicas, so the failure of some nodes does not cause data loss. Scalable storage capacity: when storage capacity is insufficient, it can be expanded very easily.

Three. File read and write operations. File API: the FileSystem API of HDFS is implemented, providing a development API consistent with HDFS.

File creation: 1. Determine whether the file already exists; if it does not, create the metadata, using the HBase lock mechanism to make the operation atomic and guarantee file system consistency. 2. After the metadata has been created successfully, open an HDFS file in append mode (a caching mechanism reduces metadata operations on HDFS) and write the file content. 3. After the file content has been written, update the content metadata and change the file status to "created".
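The three creation steps can be sketched as follows. The MetadataTable class is an in-memory stand-in for the HBase table, with a thread lock standing in for the HBase lock mechanism; the class, its methods, the field names and the intermediate status value are hypothetical and chosen only to make the sequence concrete.

```python
import io
import threading


class MetadataTable:
    """In-memory stand-in for the HBase metadata table (hypothetical API)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, row):
        # Atomic check-then-insert; stands in for the HBase lock mechanism.
        with self._lock:
            if key in self._rows:
                return False
            self._rows[key] = dict(row)
            return True

    def update(self, key, **fields):
        with self._lock:
            self._rows[key].update(fields)

    def get(self, key):
        with self._lock:
            row = self._rows.get(key)
            return dict(row) if row is not None else None


def create_file(table, container, container_name, path, content):
    # Step 1: create the metadata atomically; fail if the file already exists.
    if not table.put_if_absent(path, {"status": "creating"}):
        raise FileExistsError(path)
    # Step 2: append the content to the container (stand-in for an HDFS append).
    offset = container.seek(0, io.SEEK_END)
    container.write(content)
    # Step 3: record where the content lives and mark the file as complete.
    table.update(path, container=container_name, offset=offset,
                 length=len(content), status="created")
```

Writing the metadata row first and flipping its status last means a reader can distinguish a finished file from one whose content is still being written.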

File reading: obtain the file metadata; if the file exists, open the corresponding HDFS file (caching likewise reduces operations on HDFS), seek to the HDFS position corresponding to the file, and return a file input stream for the application to read the file data. Deleting and moving files: only the metadata is updated; no data is moved.
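A minimal sketch of the read, move and delete paths follows, again with a dictionary standing in for the HBase table and an in-memory buffer standing in for an HDFS file. The paths, field names and function names are assumptions for illustration.

```python
import io

# Stand-ins (hypothetical): one container file and the metadata rows pointing into it.
containers = {"/data/c0": io.BytesIO(b"AAAABBBBBBCC")}
metadata = {
    "/in/a.cdr": {"container": "/data/c0", "offset": 0, "length": 4},
    "/in/b.cdr": {"container": "/data/c0", "offset": 4, "length": 6},
}


def open_file(path):
    """Seek to the file's segment and return a stream over just those bytes."""
    meta = metadata.get(path)
    if meta is None:
        raise FileNotFoundError(path)
    stream = containers[meta["container"]]
    stream.seek(meta["offset"])
    return io.BytesIO(stream.read(meta["length"]))


def move_file(src, dst):
    """Moving a file only rewrites its metadata key; no bytes are copied."""
    metadata[dst] = metadata.pop(src)


def delete_file(path):
    """Deleting a file only drops its metadata; the bytes become garbage."""
    del metadata[path]
```

Since delete and move never touch the container, they cost a single metadata update, at the price of leaving unreferenced bytes behind for a later cleanup pass.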

Four. Supporting file system operations:

Health check: a health check command is provided to check the current health of the file system and its data consistency. Command-line tool: a file system command line is provided for file operations and directory operations. Defragmentation: deleting files leaves garbage data in HDFS files; the file system provides a defragmentation tool that compacts the data when the garbage reaches a certain proportion.
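The defragmentation step can be sketched as a garbage-ratio check followed by a compaction that rewrites only live segments. The metadata layout (path mapped to an offset and length), the function names and the 50% threshold are assumptions; the disclosure only states that compaction runs once garbage reaches a certain proportion.

```python
GARBAGE_THRESHOLD = 0.5  # hypothetical; the disclosure gives no specific value


def garbage_ratio(container_size, live_segments):
    """Fraction of container bytes not referenced by any (offset, length) segment."""
    if container_size == 0:
        return 0.0
    live = sum(length for _, length in live_segments)
    return 1 - live / container_size


def compact(data, metadata):
    """Copy live segments into a new container and return it with updated metadata."""
    out = bytearray()
    new_meta = {}
    # Preserve on-disk order so the rewrite is a single sequential pass.
    for path, (offset, length) in sorted(metadata.items(), key=lambda kv: kv[1][0]):
        new_meta[path] = (len(out), length)
        out += data[offset:offset + length]
    return bytes(out), new_meta


def maybe_compact(data, metadata, threshold=GARBAGE_THRESHOLD):
    """Compact only when the garbage proportion has reached the threshold."""
    if garbage_ratio(len(data), metadata.values()) >= threshold:
        return compact(data, metadata)
    return data, dict(metadata)
```

For example, a 12-byte container with 6 live bytes has a garbage ratio of 0.5 and would be rewritten into a 6-byte container with the surviving files' offsets shifted accordingly.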

Based on the above technical solution, the following functional goals are achieved: the number of files in the file system is unlimited, and performance is not affected by the number of files; I/O performance is on the same level as HDFS; programming interfaces in the style of the HDFS API (FileSystem, FileStatus) are provided, compatible with applications developed for HDFS; a POSIX-like permission management model is supported; file operations are supported: creating, deleting and moving files and opening files in read-only mode; directory operations are supported: ls, mkdir, rmr, rename/mv; the MapReduce processing framework is supported; a health check tool is provided; a supporting command-line tool is provided with functions such as ls, mv, delete and du; and a defragmentation function is available. With the development of telecommunication technology, the number and size of the files a telecom billing system must process keep growing. At present there is no better alternative than HDFS, yet managing a huge number of files with HDFS carries considerable risk. The present invention provides telecom billing systems with a complete storage scheme for call detail record (CDR) files, which can meet the requirements of billing CDR files: large data volume, large number of files, frequent access and high reliability.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the technical scope of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A mass file management system based on multi-stage distributed metadata, characterized by comprising:

a metadata management module, which manages metadata through a metadata cluster and realizes distributed metadata;

a data operation module, which uses a data cluster to provide file content storage;

an application interface module, which is used for creating, reading and deleting metadata files and for reading and writing file content.

2. The mass file management system based on multi-stage distributed metadata according to claim 1, characterized in that, in the metadata management module, metadata is managed in HBase tables to achieve a very large metadata capacity, and the in-memory attribute is enabled on the metadata table; HBase provides metadata storage in a distributed manner, and the in-memory attribute ensures metadata read/write performance.

3. The mass file management system based on multi-stage distributed metadata according to claim 1, characterized in that, in the data operation module, HDFS pools the local storage of all nodes and provides a unified storage view; individual file reads and writes are distributed to different HDFS nodes, and all data is stored in multiple replicas.

4. The mass file management system based on multi-stage distributed metadata according to claim 1, characterized in that, in the application interface module, when a metadata file is created, it is first determined whether the file already exists, and if it does not, the metadata is created; next, an HDFS file is opened in append mode and the file content is written; finally, the content metadata is updated and the file status is changed to "created".

5. The mass file management system based on multi-stage distributed metadata according to claim 1, characterized in that, in the application interface module, when a metadata file is read, if the file exists, the corresponding HDFS file is opened, the HDFS position corresponding to the file is located, and a file input stream is returned for the application to read the file data.