CN105824867A - Mass file management system based on multi-stage distributed metadata - Google Patents

Info

Publication number: CN105824867A
Authority: CN (China)
Prior art keywords: file, metadata, hdfs, data, mass
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number: CN201510929600.0A
Other languages: Chinese (zh)
Inventors: 张伟, 何广柏, 许棉耿, 郑丁科
Current assignee: Guangdong Eshore Technology Co Ltd (the listed assignees may be inaccurate)
Original assignee: Guangdong Eshore Technology Co Ltd
Priority date: 2015-12-14 (the priority date is an assumption and is not a legal conclusion)
Filing date: 2015-12-14
Publication date: 2016-08-03

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/11: File system administration, e.g. details of archiving or snapshots
    • G06F16/122: File system administration, e.g. details of archiving or snapshots using management policies
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Abstract

The invention discloses a mass file management system based on multi-stage distributed metadata. The system comprises a metadata management module, a data operation module and an application interface module. The metadata management module manages metadata through a metadata cluster to achieve distributed metadata; the data operation module uses a data cluster to provide file content storage; the application interface module is used to create, read and delete metadata files. The system realizes LeveledFS management of massive numbers of files through a two-level scheme with physical file storage on HDFS: HBase manages the massive LeveledFS metadata, and HDFS manages a small number of large files. The system inherits the advantages of HDFS and effectively overcomes the drawbacks of managing massive numbers of small files with HDFS.

Description

A mass file management system based on multi-stage distributed metadata
Technical field
The present invention relates to file management systems, and in particular to a mass file management system based on multi-stage distributed metadata.
Background
In the prior art, paper-based information media are gradually being replaced by digital media. Information and data of all kinds are encoded as binary data in various formats and managed by computer systems, which store the data through file systems. Data volumes are currently growing exponentially, and the global total is almost impossible to tally. As storage demand keeps expanding, storage technology keeps advancing as well: hard disk data density and capacity continue to increase. However, no matter how the technology develops, the storage capacity of a single hard disk or a single server is limited. It can meet personal storage needs but falls far short of enterprise data storage needs.
Distributed file systems emerged to meet the storage demands of massive data. A distributed file system uses the local storage capacity of multiple physically distributed computer nodes connected by a network and, through unified metadata management and distributed data placement, presents a single logical file system with the combined storage capacity. Distributed file systems have the following features: 1. Very large storage space: the storage space is no longer bounded by a single machine; it is the combined storage capacity of all hosts that make up the file system. 2. Scalable storage capacity: capacity can be extended by adding host nodes.
At present, the most popular distributed file system is HDFS from the Apache Software Foundation. HDFS has two kinds of components, NameNode and DataNode. The NameNode is responsible for metadata management and manages the metadata of the whole file system; DataNodes are responsible for physical data storage. Each HDFS file is stored as blocks of a fixed size, and the physical file corresponding to each block is stored on DataNodes. This system, however, has the following defects:
1. Limited metadata capacity. Metadata is managed by the NameNode, which keeps all of it in memory. Constrained by host memory and by JVM garbage collection behavior, the amount of metadata that can be stored is limited. According to experience with production systems, NameNode performance becomes unstable above 50 million files, so HDFS cannot manage an excessive number of files.
2. Metadata operation performance bottleneck. When a client operates on a file, it first performs the metadata operation through the NameNode and then connects to the specific DataNode to read or write file data. The NameNode service is provided by a single active NameNode, which supports only limited concurrency and cannot handle workloads in which a large number of files are read and written concurrently.
3. Long failure recovery time. When the file system holds a very large number of files, the metadata is also large and needs a great deal of memory to manage. With a large metadata volume the NameNode takes a long time to start, so recovery from a failure also takes a long time.
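To put the first defect in perspective, the following is a rough back-of-the-envelope sketch. The figure of about 150 bytes of NameNode heap per namespace object (file or block) is a commonly quoted rule of thumb and does not come from this document; the one-block-per-file default is likewise an illustrative assumption for small files.

```python
# Rough NameNode heap estimate. BYTES_PER_OBJECT is an assumed rule of thumb,
# not a value stated in this document.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate the heap used by file and block objects in the NameNode."""
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    files = 50_000_000  # the instability threshold cited in the text
    gib = namenode_heap_bytes(files) / 2**30
    print(f"{files:,} small files -> about {gib:.1f} GiB of heap")
```

Under these assumptions the estimate grows linearly with the file count, which is why packing many small files into a few large HDFS files reduces the pressure on the NameNode.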
Summary of the invention
The technical problem to be solved by the present invention is, in view of the deficiencies of the prior art, to provide a mass file management system based on multi-stage distributed metadata that breaks through the metadata capacity limit, optimizes metadata operation performance, and reduces failure recovery time.
To solve the above technical problem, the present invention adopts the following technical scheme.
A mass file management system based on multi-stage distributed metadata, comprising: a metadata management module, which manages metadata through a metadata cluster to achieve distributed metadata; a data operation module, which uses a data cluster to provide file content storage; and an application interface module, which is used to create, read and delete metadata files and to read and write file content.
Preferably, in the metadata management module, metadata is managed in HBase tables to achieve very large metadata capacity, and the in-memory feature is enabled on the metadata table. HBase provides metadata storage in a distributed manner, and the in-memory feature ensures metadata read/write performance.
Preferably, in the data operation module, HDFS pools the local storage of all nodes and presents a unified storage view; individual file reads and writes are distributed to different HDFS nodes, and all data is stored in multiple replicas.
Preferably, in the application interface module, when a metadata file is created, the module first checks whether the file already exists and, if it does not, creates the metadata. It then opens an HDFS file in append mode and writes the file content. Finally it updates the content metadata and changes the file state to "creating".
Preferably, in the application interface module, when a metadata file is read and the file exists, the module opens the corresponding HDFS file, looks up the HDFS position that corresponds to the file, and returns a file input stream from which the application reads the file data.
The present invention uses HBase for distributed metadata management and HDFS for physical file storage, and this two-level scheme realizes LeveledFS management of massive numbers of files: HBase manages the massive LeveledFS metadata, while HDFS manages a small number of large files. The system thus inherits the advantages of HDFS while avoiding the drawbacks of using HDFS to manage massive numbers of small files.
Brief description of the drawings
Fig. 1 is a block diagram of the mass file management system of the present invention.
Fig. 2 is a schematic diagram of the interaction between metadata and HDFS files.
Detailed description of the invention
The present invention is described in more detail below with reference to the accompanying drawings and embodiments.
The invention discloses a mass file management system based on multi-stage distributed metadata. As shown in Fig. 1 and Fig. 2, it comprises:
Metadata management module 1, which manages metadata through the metadata cluster HBase to achieve distributed metadata;
Data operation module 2, which uses the data cluster HDFS to provide file content storage;
Application interface module 3, which is used to create, read and delete metadata files and to read and write file content.
Further, in metadata management module 1, metadata is managed in HBase tables to achieve very large metadata capacity, and the in-memory feature is enabled on the metadata table. HBase provides metadata storage in a distributed manner, and the in-memory feature ensures metadata read/write performance.
As a preferred implementation, in data operation module 2, HDFS pools the local storage of all nodes and presents a unified storage view; individual file reads and writes are distributed to different HDFS nodes, and all data is stored in multiple replicas.
In this embodiment, in application interface module 3, when a metadata file is created, the module first checks whether the file already exists and, if it does not, creates the metadata. It then opens an HDFS file in append mode and writes the file content. Finally it updates the content metadata and changes the file state to "creating".
Further, in application interface module 3, when a metadata file is read and the file exists, the module opens the corresponding HDFS file, looks up the HDFS position that corresponds to the file, and returns a file input stream from which the application reads the file data.
In the mass file management system based on multi-stage distributed metadata disclosed by the invention, the file system has two levels: file content is stored in HDFS, and metadata is stored in HBase. One HDFS file corresponds to multiple files of the present file system, and each file of the file management system corresponds to one contiguous segment of an HDFS file.
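The mapping of many logical files onto contiguous segments of one HDFS file can be sketched as follows. This is a minimal in-memory illustration, not the patented implementation: the `pack` helper, the `(offset, length)` index layout and the example paths are all assumptions introduced here.

```python
from typing import Dict, Tuple


def pack(files: Dict[str, bytes]) -> Tuple[bytes, Dict[str, Tuple[int, int]]]:
    """Concatenate small files into one container and index each segment."""
    container = bytearray()
    index: Dict[str, Tuple[int, int]] = {}
    for path, content in files.items():
        # Each logical file occupies one contiguous (offset, length) segment.
        index[path] = (len(container), len(content))
        container += content
    return bytes(container), index


# Three small logical files share a single container (hypothetical paths).
container, index = pack({
    "/cdr/a": b"alpha",
    "/cdr/b": b"beta",
    "/cdr/c": b"gamma",
})

# Reading a logical file only needs its metadata entry and a slice.
offset, length = index["/cdr/b"]
segment = container[offset:offset + length]
```

In the described system the index would live in the HBase metadata table and the container would be an HDFS file, so the NameNode tracks one large file rather than three small ones.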
The present invention uses HBase for distributed metadata management and HDFS for physical file storage, and this two-level scheme realizes LeveledFS management of massive numbers of files: HBase manages the massive LeveledFS metadata, while HDFS manages a small number of large files. The system thus inherits the advantages of HDFS while avoiding the drawbacks of using HDFS to manage massive numbers of small files.
As a preferred implementation, practical use specifically involves the following:
1. Metadata management: metadata is managed by HBase to achieve distributed metadata. This metadata scheme effectively breaks through the performance bottleneck:
Large metadata capacity: metadata is managed in HBase tables, giving very large metadata capacity. To ensure metadata read/write performance, the in-memory feature is enabled on the metadata table. HBase stores metadata in a distributed manner, so capacity is no longer limited by the memory of a single machine, while the in-memory feature keeps metadata reads and writes fast.
High throughput: metadata is distributed across multiple RegionServers, and metadata access requests are spread over different RegionServer nodes, so metadata request throughput is no longer bounded by a single machine.
Scalable metadata performance: once metadata is distributed, nodes can be added on demand to increase metadata processing capacity.
High reliability: HBase has mature high-reliability mechanisms; when some nodes fail, it can fail over while preserving data consistency, which provides the basis for high availability of the file system.
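How metadata requests might spread over several RegionServers can be sketched with a toy router. HBase partitions a table into regions by sorted row-key range; everything else here is an assumption made for illustration, including the split points, the server names and the hashed row-key prefix, none of which is specified in the text.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical layout: four regions split on the first hex digit of the row key,
# each served by a different RegionServer.
SPLITS = ["4", "8", "c"]
SERVERS = ["rs-0", "rs-1", "rs-2", "rs-3"]


def row_key(path: str) -> str:
    """Prefix the path with two hex digits of its MD5 hash to spread keys evenly."""
    prefix = hashlib.md5(path.encode("utf-8")).hexdigest()[:2]
    return f"{prefix}:{path}"


def server_for(key: str) -> str:
    """Find the region whose key range contains `key` and return its server."""
    return SERVERS[bisect.bisect_right(SPLITS, key)]


if __name__ == "__main__":
    load = Counter(
        server_for(row_key(f"/cdr/2015/file-{i}")) for i in range(10_000)
    )
    print(dict(load))
```

A hashed prefix spreads load but gives up ordered scans by path, so a design that must list directories efficiently would likely need a different key layout.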
2. Storage implementation: HDFS provides file content storage, fully inheriting the advantages of HDFS:
Large data capacity: HDFS pools the local storage of all cluster nodes into a unified storage view, providing massive storage capacity.
High I/O performance: individual file reads and writes are distributed to different HDFS nodes, making full use of the I/O capacity of the cluster's many hard disks.
High reliability: all data is stored in multiple replicas, so the failure of some nodes does not cause data loss.
Scalable storage capacity: when storage capacity runs short, it can be expanded easily.
3. File read and write operations. File API: the system implements the HDFS FileSystem API, providing a development API consistent with HDFS.
File creation: 1. Check whether the file already exists; if it does not, create the metadata, using the HBase lock mechanism to make the operation atomic and keep the file system consistent. 2. After the metadata is created successfully, open an HDFS file in append mode (a caching mechanism reduces metadata operations against HDFS) and write the file content. 3. After the file content is written, update the content metadata and change the file state to "creating".
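The three creation steps can be sketched with in-memory stand-ins. This is an illustration under stated assumptions rather than the actual HBase or HDFS API: `MetaTable`, `Container` and `create_file` are hypothetical names, a `threading.Lock` stands in for the HBase lock mechanism, the initial "pending" state is invented here, and the final state label is taken verbatim from the text.

```python
import threading

FINAL_STATE = "creating"   # state label as quoted in the description
INITIAL_STATE = "pending"  # hypothetical label for a file not yet written


class MetaTable:
    """In-memory stand-in for the metadata table; not the HBase API."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def put_if_absent(self, path, row):
        """Atomically insert a row; return False if the path already exists."""
        with self._lock:
            if path in self._rows:
                return False
            self._rows[path] = dict(row)
            return True

    def update(self, path, **fields):
        with self._lock:
            self._rows[path].update(fields)

    def get(self, path):
        with self._lock:
            row = self._rows.get(path)
            return dict(row) if row is not None else None


class Container:
    """Append-only buffer standing in for an HDFS file opened in append mode."""

    def __init__(self, name):
        self.name = name
        self.data = bytearray()

    def append(self, content):
        offset = len(self.data)  # new content starts at the current end
        self.data += content
        return offset


def create_file(meta, container, path, content):
    # Step 1: atomic existence check plus metadata creation.
    if not meta.put_if_absent(path, {"state": INITIAL_STATE}):
        raise FileExistsError(path)
    # Step 2: append the content to the shared container file.
    offset = container.append(content)
    # Step 3: record where the content lives and set the final state.
    meta.update(path, container=container.name, offset=offset,
                length=len(content), state=FINAL_STATE)
```

Creating the metadata row before writing the content means a crash between steps 1 and 3 leaves a row in the initial state, which a consistency check could later detect.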
File reading: obtain the file metadata; if the file exists, open the corresponding HDFS file (caching reduces operations against HDFS), seek to the HDFS position that corresponds to the file, and return a file input stream from which the application reads the file data. Deleting and moving files: only the metadata is updated; no data is moved.
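Reading by seek, and deleting or moving by touching only metadata, can be sketched as follows. The dictionaries below are hypothetical in-memory stand-ins for the HBase metadata table and the HDFS files, and the function names are introduced here for illustration only.

```python
import io

# Hypothetical stand-ins: metadata rows map a logical path to
# (container path, offset, length); containers hold the packed bytes.
containers = {"/data/container-0001": io.BytesIO(b"helloworld")}
meta = {
    "/cdr/a": ("/data/container-0001", 0, 5),
    "/cdr/b": ("/data/container-0001", 5, 5),
}


def open_file(path: str) -> io.BytesIO:
    """Look up the metadata, seek to the segment, and return an input stream."""
    row = meta.get(path)
    if row is None:
        raise FileNotFoundError(path)
    container, offset, length = row
    stream = containers[container]
    stream.seek(offset)
    return io.BytesIO(stream.read(length))


def move_file(src: str, dst: str) -> None:
    """Rename by rewriting the metadata key; the stored bytes are untouched."""
    if src not in meta:
        raise FileNotFoundError(src)
    if dst in meta:
        raise FileExistsError(dst)
    meta[dst] = meta.pop(src)


def delete_file(path: str) -> None:
    """Drop the metadata row; the segment stays behind as garbage."""
    if meta.pop(path, None) is None:
        raise FileNotFoundError(path)
```

Because a delete leaves the bytes in place, the container accumulates dead segments over time, which is what the defragmentation tool described in the text is meant to reclaim.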
4. Supporting tools for file system operation:
Health check: a health check command inspects the current health of the file system and its data consistency.
Command-line tool: a file system command line provides file operations and directory operations.
Defragmentation: deleting files leaves garbage data in HDFS files; the file system provides a defragmentation tool that compacts the data once garbage reaches a certain proportion.
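The defragmentation step can be sketched as a compaction that runs once the garbage proportion crosses a threshold. The text does not state the threshold or the compaction procedure, so the 0.5 default, the function names and the `(offset, length)` metadata layout below are all assumptions.

```python
from typing import Dict, Tuple

Segments = Dict[str, Tuple[int, int]]  # logical path -> (offset, length)


def garbage_ratio(container: bytes, live: Segments) -> float:
    """Fraction of the container not referenced by any live file."""
    if not container:
        return 0.0
    live_bytes = sum(length for _, length in live.values())
    return 1 - live_bytes / len(container)


def defragment(container: bytes, live: Segments,
               threshold: float = 0.5) -> Tuple[bytes, Segments]:
    """Rewrite the container without dead bytes once garbage reaches `threshold`."""
    if garbage_ratio(container, live) < threshold:
        return container, live  # not enough garbage to justify a rewrite
    compacted = bytearray()
    new_live: Segments = {}
    # Copy live segments in their original order and record the new offsets.
    for path, (offset, length) in sorted(live.items(), key=lambda kv: kv[1][0]):
        new_live[path] = (len(compacted), length)
        compacted += container[offset:offset + length]
    return bytes(compacted), new_live
```

For example, a 16-byte container with two live 4-byte files has a garbage ratio of 0.5 and would be rewritten to 8 bytes with the default threshold, while a threshold of 0.6 would leave it untouched.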
The technical scheme achieves the following functional objectives:
The number of files in the file system is unlimited, and performance is not affected by the number of files.
I/O performance is on a par with HDFS.
The programming interfaces FileSystem and FileStatus are provided in the same manner as the HDFS API, so applications developed on HDFS remain compatible.
A POSIX-like permission management model is supported.
File operations are supported: create file, delete file, move file, open file in read-only mode.
Directory operations are supported: ls, mkdir, rmr, rename/mv.
The MapReduce processing framework is supported.
A health check tool is provided.
A supporting command-line tool is provided, with functions such as ls, mv, delete and du.
Defragmentation is supported.
With the development of telecommunication technology, both the number and the size of the files that telecom billing systems must process keep growing. At present there is no better alternative to HDFS, yet managing a huge number of files with HDFS carries considerable risk. The present invention provides telecom billing systems with a complete CDR file storage scheme that meets the demands of large billing data volumes, large numbers of files, frequent access and high reliability.
The above are only preferred embodiments of the present invention and do not limit it. Any modification, equivalent replacement or improvement made within the technical scope of the present invention shall fall within its scope of protection.

Claims (5)

1. A mass file management system based on multi-stage distributed metadata, characterized in that it comprises:
a metadata management module, which manages metadata through a metadata cluster to achieve distributed metadata;
a data operation module, which uses a data cluster to provide file content storage;
an application interface module, which is used to create, read and delete metadata files and to read and write file content.
2. The mass file management system based on multi-stage distributed metadata as claimed in claim 1, characterized in that, in the metadata management module, metadata is managed in HBase tables to achieve very large metadata capacity, and the in-memory feature is enabled on the metadata table; HBase provides metadata storage in a distributed manner, and the in-memory feature ensures metadata read/write performance.
3. The mass file management system based on multi-stage distributed metadata as claimed in claim 1, characterized in that, in the data operation module, HDFS pools the local storage of all nodes and presents a unified storage view, individual file reads and writes are distributed to different HDFS nodes, and all data is stored in multiple replicas.
4. The mass file management system based on multi-stage distributed metadata as claimed in claim 1, characterized in that, in the application interface module, when a metadata file is created, the module first checks whether the file already exists and, if it does not, creates the metadata; it then opens an HDFS file in append mode and writes the file content; finally it updates the content metadata and changes the file state to "creating".
5. The mass file management system based on multi-stage distributed metadata as claimed in claim 1, characterized in that, in the application interface module, when a metadata file is read and the file exists, the module opens the corresponding HDFS file, looks up the HDFS position that corresponds to the file, and returns a file input stream from which the application reads the file data.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201510929600.0A | 2015-12-14 | 2015-12-14 | Mass file management system based on multi-stage distributed metadata

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201510929600.0A | 2015-12-14 | 2015-12-14 | Mass file management system based on multi-stage distributed metadata

Publications (1)

Publication Number | Publication Date
CN105824867A | 2016-08-03

Family

ID=56513473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510929600.0A Pending CN105824867A (en) 2015-12-14 2015-12-14 Mass file management system based on multi-stage distributed metadata

Country Status (1)

Country Link
CN (1) CN105824867A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503198A (en) * 2016-11-02 2017-03-15 北京集奥聚合科技有限公司 A kind of cold data recognition methodss and system based on hadoop metadata
CN106960055A (en) * 2017-04-01 2017-07-18 广东浪潮大数据研究有限公司 A kind of file delet method and device
CN114780022A (en) * 2022-03-25 2022-07-22 北京百度网讯科技有限公司 Method and device for realizing write-addition operation, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070276838A1 (en) * 2006-05-23 2007-11-29 Samy Khalil Abushanab Distributed storage
CN102708165A (en) * 2012-04-26 2012-10-03 华为软件技术有限公司 Method and device for processing files in distributed file system
CN103106286A (en) * 2013-03-04 2013-05-15 曙光信息产业(北京)有限公司 Method and device for managing metadata
CN103595791A (en) * 2013-11-14 2014-02-19 中国科学院深圳先进技术研究院 Cloud accessing method for mass remote sensing data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503198A (en) * 2016-11-02 2017-03-15 北京集奥聚合科技有限公司 A kind of cold data recognition methodss and system based on hadoop metadata
CN106960055A (en) * 2017-04-01 2017-07-18 广东浪潮大数据研究有限公司 A kind of file delet method and device
CN106960055B (en) * 2017-04-01 2020-08-04 广东浪潮大数据研究有限公司 File deletion method and device
CN114780022A (en) * 2022-03-25 2022-07-22 北京百度网讯科技有限公司 Method and device for realizing write-addition operation, electronic equipment and storage medium
CN114780022B (en) * 2022-03-25 2023-01-06 北京百度网讯科技有限公司 Method and device for realizing additional writing operation, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US9672267B2 (en) Hybrid data management system and method for managing large, varying datasets
CN104731921B (en) Storage and processing method of the Hadoop distributed file systems for log type small documents
JP6046260B2 (en) Table format for MapReduce system
CN102158546B (en) Cluster file system and file service method thereof
CN105787093B (en) A kind of construction method of the log file system based on LSM-Tree structure
CN101187901B (en) High speed cache system and method for implementing file access
CN107391391B (en) Method, system and the solid state hard disk of data copy are realized in the FTL of solid state hard disk
CN102012933B (en) Distributed file system and method for storing data and providing services by utilizing same
JP2016505935A (en) Separation of content and metadata in a distributed object storage ecosystem
CN101488153A (en) Method for implementing high-capacity flash memory file system in embedded type Linux
CN106775446A (en) Based on the distributed file system small documents access method that solid state hard disc accelerates
CN105868286A (en) Parallel adding method and system for merging small files on basis of distributed file system
CN103037004A (en) Implement method and device of cloud storage system operation
CN105045850B (en) Junk data recovery method in cloud storage log file system
Menon et al. CaSSanDra: An SSD boosted key-value store
CN104462185A (en) Digital library cloud storage system based on mixed structure
CN104054071A (en) Method for accessing storage device and storage device
CN105653720A (en) Database hierarchical storage optimization method capable of achieving flexible configuration
CN105824867A (en) Mass file management system based on multi-stage distributed metadata
CN109933564A (en) File system management method, device, terminal, the medium of quick rollback are realized based on chained list and N-ary tree construction
CN103942301A (en) Distributed file system oriented to access and application of multiple data types
CN108776690A (en) The method of HDFS Distribution and Centralization blended data storage systems based on separated layer handling
CN109669916A (en) A kind of distributed objects storage architecture and platform based on CMSP and KUDU
CN105631010A (en) Optimization method based on HDFS small file storage
CN105701179A (en) Windows access method of distributed file system based on UniWhale

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160803