CN103501319A - Low-delay distributed storage system for small files - Google Patents

Low-delay distributed storage system for small files

Info

Publication number: CN103501319A
Application number: CN 201310429804
Authority: CN
Grant status: Application
Prior art keywords: cluster, client, dataserver, topological information, write
Other languages: Chinese (zh)
Inventors: 王鲁俊, 龙翔, 王雷
Original assignee: 北京航空航天大学
Priority/filing date: 2013-09-18
Publication date: 2014-01-08

Abstract

The invention discloses a low-delay distributed storage system for small files. All DataServers of the system are logically organized into a ring. The system adopts a consistent hashing scheme: the ID of each DataServer is hashed with a specific hash algorithm, and the DataServers are distributed on a ring spanning the whole hash value range according to their hash values. A central CV node is arranged in the cluster; the cluster topology information managed by the CV node comprises the list of all active DataServers in the cluster and the version number of the current cluster topology information. The client caches the cluster topology information locally: on its first access to the cluster, the client contacts the CV node to obtain the topology information and caches it locally, and subsequent reads and writes use the locally cached copy. When the client reads or writes, it hashes the filename according to the consistent hashing scheme to determine the DataServer where the small file resides; the version number of the cluster topology information stored on that DataServer is then compared with the version number stored on the client, and when the version numbers are the same, the actual read or write operation is carried out on the DataServer.

Description

A Low-Latency Distributed Storage System for Small Files

TECHNICAL FIELD

[0001] The present invention relates to the field of distributed storage and massive small-file storage, and in particular to a low-latency distributed storage system for small files.

BACKGROUND

[0002] A small file usually refers to a file smaller than the default HDFS block size (i.e., 64 MB). In current applications, photo files, music files, email bodies, microblog posts, and the like can all be considered small files.

[0003] The small-file problem has gradually attracted attention in both academia and industry. The well-known social networking site Facebook stores 260 billion images, exceeding 20 PB in capacity, and the vast majority of these files are smaller than 64 MB. In the supercomputing field, for example, applications on ORNL's Cray XT5 cluster (18,688 nodes, 12 processors per node) periodically write application state to files, causing the system to produce a large number of small files. A 2007 research report from the Pacific Northwest National Laboratory showed that the laboratory's system held 12 million files, of which 94% were smaller than 64 MB and 58% were smaller than 64 KB. In specific scientific computing environments, such as certain biological computations, 30 million files may be produced with an average size of only 190 KB. The music website 巨鲸网 has indexed 3.6 million MP3 music files. Other literature likewise shows that the data accessed on the Internet consists mostly of small files with high access frequency.

[0004] Sean Quinlan, the GFS technical leader, mentioned in a GFS interview that one of BigTable's application scenarios is small files. A report on the Small File Problem published by Cloudera, a well-known Hadoop company, also points out that Hadoop has problems handling massive numbers of small files.

[0005] Hadoop itself provides Hadoop Archive (HAR) to merge small files into large files. A HAR file works by building a hierarchical file system on top of HDFS. A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the HAR file.

[0006] GIGA+ studied applications with massive numbers of small files in a single directory and proposed a directory design, GIGA+, with good scalability. By spreading the index across different server nodes in the cluster and avoiding synchronization and serialization, GIGA+ implements an asynchronous, eventually consistent directory design that tolerates stale index state. The design complements existing cluster file systems, and existing cluster applications need no changes.

[0007] Facebook designed the Haystack storage system for its photo storage application. The Haystack Store packs photos into large volumes (100 GB each) and keeps in memory a mapping from photo ID to the retrieval information (offset and size) of the photo within its volume. The system adds a Haystack Cache component to cache newly added photos, and a Haystack Directory handles volume mapping and load balancing. Index files are built to speed up reconstruction of the in-memory mapping information.

[0008] TFS is an open-source storage system for massive numbers of small files, widely used at companies such as Taobao and Renren. TFS consists of three parts: the TFS cluster, the Meta service cluster, and the client library.

[0009] The TFS cluster mainly comprises a NameServer and multiple DataServers. In the same manner as Facebook's Haystack manages blocks, TFS merges a large number of small files into one large file, called a block; each block has a unique ID, and blocks are distributed across the DataServers. The NameServer is responsible for DataServer state management and maintains the mapping between blocks and DataServers. The NameServer does not handle actual data reads and writes; those are performed by the DataServers. The TFS cluster stores files under TFS filenames; a TFS filename is a string that encodes the block number, offset, and file size.

[0010] The Meta service cluster comprises one master node, the RootServer, and multiple service nodes, the MetaServers. The RootServer mainly manages all MetaServers; the MetaServers manage the mapping between custom filenames and TFS filenames. TFS currently uses a MySQL database as the persistent back-end store.

[0011] TFS can be deployed without the Meta service cluster, in which case TFS supports only TFS filenames, not custom filenames. This configuration is denoted TFS-noname; the configuration with a Meta service cluster is denoted TFS-name.

[0012] TFS can store and retrieve small files under user-defined filenames, but it still has three problems.

[0013] First, TFS needs to establish multiple network connections per file read or write. When a client writes a small file, the client library first contacts the NameServer of the TFS cluster, which designates a writable block for the small file; the client then contacts the DataServer holding that block to perform the actual write and receives a TFS filename; finally the client contacts the Meta service cluster to record the mapping between the custom filename and the newly obtained TFS filename. When a client reads a small file, it first contacts the Meta service cluster to look up the TFS filename corresponding to the user-defined filename; it then contacts the NameServer, which parses the TFS filename to obtain the block number and, using its block-to-DataServer mapping, returns the DataServer to be accessed; the client then contacts that DataServer to perform the actual read. Moreover, if the client library has not cached the MetaServer information, the client must first contact the RootServer to obtain the currently active MetaServer before proceeding. TFS therefore needs at least three network connections to complete one read or write, and four on first access. This is one reason TFS reads and writes files inefficiently.

[0014] Second, TFS uses the heavyweight MySQL as the back-end store for the mapping between TFS filenames and custom filenames; compared with a lightweight NoSQL database, the latency overhead is relatively large.

[0015] Third, the TFS NameServer records the information of all blocks and maintains the mapping from block numbers to DataServers. If the NameServer fails and recovery is performed, the block information and mappings must be rebuilt. As the TFS architecture and its read/write flows make clear, the NameServer is a single point of failure of the TFS cluster: when the NameServer fails, reads and writes are unavailable for the whole cluster. TFS availability therefore still has room for improvement.

SUMMARY

[0016] To address the problems of TFS, the present invention designs a new distributed storage system for small files. The system architecture is shown in Figure 1: all DataServers are logically organized into a ring (nodes S1 to S8 in the figure). The system adopts a consistent hashing scheme. The ID of each DataServer is hashed with a specific hash algorithm, and the DataServers are distributed on a ring spanning the whole hash value range according to their hash values.
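The ring placement and lookup can be sketched as follows. The patent does not fix the hash algorithm, so MD5 below is purely illustrative, and refinements such as virtual nodes are omitted:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Illustrative stand-in for the patent's unspecified hash algorithm.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """DataServers placed on the ring by hashing their IDs; a file belongs to
    the first DataServer clockwise from the hash of its filename."""

    def __init__(self, server_ids):
        self.ring = sorted((ring_hash(sid), sid) for sid in server_ids)
        self.points = [p for p, _ in self.ring]

    def locate(self, filename: str) -> str:
        i = bisect.bisect_right(self.points, ring_hash(filename)) % len(self.ring)
        return self.ring[i][1]
```

Repeated lookups of the same filename always hit the same DataServer, and removing one server only remaps the keys that server owned, which is the property the design relies on.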

[0017] A central CV (Central Version) node is arranged in the cluster. Each DataServer periodically sends heartbeat messages to the CV node; the CV node receives these messages and uses them to manage the cluster topology information. The cluster topology information managed by the CV node comprises the list of all active DataServers in the cluster and the version number of the current cluster topology. For each active DataServer, the list stores its ID and the IP address and port it listens on. The topology version number is a monotonically increasing timestamp. Whenever a new DataServer joins the cluster or an existing one leaves, the CV node regenerates the cluster topology information, sets its version number to the current timestamp, and sends the new topology to all currently active DataServers, so that all DataServers hold the same global cluster information.

[0018] The client caches the cluster topology information locally. On its first access to the cluster, the client contacts the CV node to obtain the topology information and caches it locally; subsequent reads and writes use the locally cached copy.

[0019] When the client reads or writes, it first hashes the filename according to the consistent hashing scheme and determines the DataServer on which the small file falls. The version number of the cluster topology held by that DataServer is then compared with the version number held by the client; if the version numbers match, the actual read or write is performed on the DataServer.

[0020] A DataServer has two main components, as shown in Figure 2: a block management component and a retrieval-information management component. The block management component merges small files into large blocks. The system preallocates large file blocks, and newly written small files are written into these blocks. Given the retrieval information of a small file, namely the block number, the offset within the block, and the file size, the small file can be retrieved from a DataServer. The system uses a key-value store to manage the mapping from filename to retrieval information, namely:

[0021] Key: filename → Value: (BlockId, Offset, Size)

[0022] The system designs and implements a key-value store similar to Redis, with persistence, and uses it to manage the retrieval information.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] Figure 1 is the system architecture diagram.

[0024] Figure 2 is the structure diagram of a DataServer in the system.

DETAILED DESCRIPTION

[0025] Step 1: design a block management component.

[0026] Small files are stored in large blocks, each of which is preallocated. Newly written small files are appended sequentially to a block. The block management component provides interfaces for writing a small file into a block and reading a small file from a block. After a small file is written, the component returns its retrieval information, comprising the block number, offset, and size. Given retrieval information, the component can read the corresponding small file back out.
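A minimal in-memory sketch of such a component follows; real blocks would live on disk, and the class name and 64 MB capacity are illustrative assumptions:

```python
class BlockManager:
    """Preallocated large blocks; small files are appended sequentially and
    addressed by (block_id, offset, size) retrieval info."""

    BLOCK_SIZE = 64 * 1024 * 1024  # illustrative block capacity

    def __init__(self):
        self.blocks = [bytearray()]          # index in list = block_id

    def write(self, content: bytes):
        block = self.blocks[-1]
        if len(block) + len(content) > self.BLOCK_SIZE:
            self.blocks.append(bytearray())  # current block full: open a new one
            block = self.blocks[-1]
        offset = len(block)
        block.extend(content)                # sequential append into the block
        return (len(self.blocks) - 1, offset, len(content))

    def read(self, block_id: int, offset: int, size: int) -> bytes:
        return bytes(self.blocks[block_id][offset:offset + size])
```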

[0027] Step 2: design a key-value management component.

[0028] Implement an in-memory key-value store in which every stored key-value pair is simultaneously written to disk. The key is the small file's filename and the value is its retrieval information. The in-memory key-value store is implemented with a hash table, using the murmurhash algorithm as the hash function. Every newly inserted key-value pair is written to disk sequentially.
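This component can be sketched as a dict backed by an append-only log. Python's built-in dict stands in for the murmurhash table, and the JSON log format is an assumption for illustration, not the patent's on-disk format:

```python
import json
import os

class RetrievalKV:
    """In-memory hash table of filename -> (block_id, offset, size); every
    insert is also appended sequentially to a log file on disk."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.table = {}
        if os.path.exists(log_path):              # rebuild state from the log
            with open(log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.table[key] = tuple(value)

    def put(self, key: str, value: tuple):
        self.table[key] = value
        with open(self.log_path, "a") as f:       # sequential write to disk
            f.write(json.dumps([key, list(value)]) + "\n")

    def get(self, key: str):
        return self.table.get(key)
```

Because every insert is replayed from the log on startup, a restarted process recovers the full filename-to-retrieval-info mapping.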

[0029] Step 3: design the CV node.

[0030] The CV node receives DataServer heartbeat messages and maintains the topology information of the whole cluster. When the cluster topology changes, the CV node sends the latest topology information to every node. The CV node runs a listening service that receives heartbeat messages from every DataServer connected to it, and manages all active DataServers with a std::vector. For each incoming heartbeat, if the sender DataServer is not in the vector, it is added and the topology is marked as updated; if it already exists, it is removed from the vector and re-appended at the end. The first DataServer in the vector is then checked: if the time since its last heartbeat exceeds a threshold, that DataServer is cleaned up (its network connection is also cleaned up when a connection access error occurs) and the topology is marked as updated. If the topology has been updated, the new topology is sent to all active DataServers.
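The bookkeeping above can be sketched as follows; a plain Python list replaces std::vector, integer timestamps stand in for real clocks, and the class and method names are illustrative:

```python
class CVNode:
    """Tracks active DataServers via heartbeats; the topology version is the
    timestamp of the last membership change."""

    def __init__(self, timeout: int):
        self.timeout = timeout
        self.alive = []       # (server_id, last_heartbeat), stalest entry first
        self.version = 0

    def on_heartbeat(self, server_id: str, now: int) -> bool:
        changed = False
        for i, (sid, _) in enumerate(self.alive):
            if sid == server_id:
                del self.alive[i]     # re-append so the head stays the stalest
                break
        else:
            changed = True            # previously unknown DataServer joined
        self.alive.append((server_id, now))
        if now - self.alive[0][1] > self.timeout:
            self.alive.pop(0)         # head timed out: evict it
            changed = True
        if changed:
            self.version = now        # new topology version = current timestamp
        return changed                # caller would broadcast the new topology

    def topology(self):
        return [sid for sid, _ in self.alive], self.version
```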

[0031] Step 4: read and write flows.

[0032] The system writes a small file as follows:

[0033] 1. If this is the client's first access to the system, the client contacts the CV node, requests the cluster topology information, and records it locally. Under continuous access, if this is not the first access, the client already has the cluster topology cached locally.

[0034] 2. The client hashes the filename and, per the consistent hashing algorithm, determines which DataServer should handle the small file.

[0035] 3. The client contacts the DataServer obtained in step 2 and sends it the client's cached cluster topology information, the small file's filename, and the buffer holding the file contents.

[0036] 4. The DataServer first checks whether the client's cached topology is out of date, i.e., whether the topology version number in the client's write request matches the version number recorded by the DataServer itself. If they match, go to step 5. If not, the DataServer compares the topology in the client's write request and determines whether the difference affects this write; if it does not, the DataServer sets the NEED_UPDATE flag and goes to step 5; otherwise it tells the client the write failed, sends the new topology to the client, and the write ends.

[0037] 5. The DataServer queries the retrieval-information management component to check whether the small file's filename already exists; if so, it tells the client the filename exists. Otherwise go to step 6.

[0038] 6. The DataServer writes the file contents into a block via the block management component, and writes the retrieval information returned by the block management component, together with the filename, into the retrieval-information management component as a key-value pair. It returns a write-success message to the client; if the NEED_UPDATE flag is set, it also sends the new topology to the client. The write ends.
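Steps 4 to 6 can be condensed into a sketch of the DataServer-side decision. All names are illustrative, the network layer is elided, and the NEED_UPDATE branch is simplified so that any version mismatch rejects the write:

```python
def handle_write(server, client_version, filename, content):
    """DataServer-side write handling (steps 4-6 above, simplified).
    server: {'version': ts, 'kv': {...}, 'block': bytearray()}."""
    if client_version != server["version"]:
        # Stale client topology: reject and ship the fresh version back.
        return ("TOPOLOGY_CHANGED", server["version"])
    if filename in server["kv"]:
        return ("ALREADY_EXISTS", None)          # step 5: name must be unique
    block = server["block"]
    info = (0, len(block), len(content))         # step 6: append into the block
    block.extend(content)
    server["kv"][filename] = info                # record the retrieval info
    return ("OK", info)
```

Note that in the matching case a single client connection carries the topology version, the filename, and the contents, which is what keeps the common path to one network round trip.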

[0039] The system reads a small file as follows:

[0040] 1. If this is the client's first access to the system, the client contacts the CV node, requests the cluster topology information, and records it locally. Under continuous access, if this is not the first access, the client already has the cluster topology cached locally.

[0041] 2. The client hashes the filename and, per the consistent hashing algorithm, determines which DataServer should handle the small file.

[0042] 3. The client contacts the DataServer obtained in step 2. The DataServer checks whether the topology version number attached to the client's read request matches the version recorded locally by the DataServer. If they match, go to step 4. If not, set the NEED_UPDATE flag and continue.

[0043] 4. The DataServer queries the retrieval-information management component for the small file's filename and checks whether it exists. If it exists, the retrieval information is read out; go to step 5. If it does not exist, a file-not-found message is sent to the client; if NEED_UPDATE was set in step 3, the new topology is attached to that message, notifying the client to update the cached topology. The read ends.

[0044] 5. Using the retrieval information obtained in step 4, the DataServer reads the file contents from the block management component and sends them to the client; if NEED_UPDATE is set, the new topology is attached to the message. The read ends.

[0045] Taobao's open-source TFS requires at least three network connections per read or write; if the TFS client has not cached the MetaServer information, the client library must additionally connect to the TFS RootServer. [0046] As the read/write flows of our system show, under continuous access the client reads the cluster topology on its first access to the CV node and caches it. While the topology does not change, the client's subsequent reads and writes determine the target DataServer directly from the cached topology, and a single connection to that DataServer completes a read or write request.

[0047] When the cluster topology changes, the client's subsequent read or write requests first determine the target DataServer from the stale topology cached earlier.

[0048] If the client connects successfully and the read or write completes correctly, the DataServer has determined that the topology change (a node joining or leaving) does not affect this request; after completing the request, the client reads the latest topology attached to the response and refreshes its stale local cache, so further requests use the latest topology. If the connection fails, the target node has failed and left the cluster; the client then needs one extra access to the CV node to obtain the current topology and retries the read or write. If the connection succeeds but the DataServer determines that the topology change affects this request, the DataServer replies that the read or write failed and attaches the latest topology; the client retries with it.
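The client-side recovery path of paragraphs [0046] to [0048] amounts to one retry with fresh topology. In this sketch server selection is reduced to a simple modulo hash, the failed-connection case is modeled as the target being absent from the cluster map, and all names are assumptions:

```python
import hashlib

def pick_server(server_ids, filename):
    # Stand-in for the consistent-hash lookup on the cached topology.
    digest = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return sorted(server_ids)[digest % len(server_ids)]

def client_read(cache, cluster, fetch_topology, filename):
    """cache: {'servers': [...], 'version': ts}; cluster maps live server ids
    to their state; fetch_topology() contacts the CV node."""
    target = pick_server(cache["servers"], filename)
    if target not in cluster:                    # connection failed: node left
        cache["servers"], cache["version"] = fetch_topology()
        target = pick_server(cache["servers"], filename)
    server = cluster[target]
    if server["version"] != cache["version"]:    # piggybacked fresh topology
        cache["servers"], cache["version"] = server["servers"], server["version"]
    return server["files"].get(filename)
```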

[0049] The system designed in the present invention therefore simplifies the read/write flow and reduces the number of network connections per read or write. Experimental results show that this improvement effectively reduces latency.

[0050] In addition, the system uses a more lightweight key-value store to manage the retrieval information. Tests show that with a large amount of retrieval information written continuously, key-value stores tend to exhibit lower latency and higher throughput than traditional databases such as MySQL.

[0051] Comparing TFS and the system of the present invention from three angles (central-node load, failure recovery speed, and system robustness) shows that the design of the present invention has an advantage in availability.

[0052] The central CV node of the present system is only responsible for monitoring whether DataServers are still active. The CV node maintains a list in which each entry is a DataServer and the arrival time of its most recent heartbeat. When the CV node receives a heartbeat from a DataServer, it adds the DataServer to the list if absent, otherwise it updates that DataServer's last heartbeat time. It also checks whether any DataServer in the list has timed out without a heartbeat and, if so, removes it from the list. If the list gains or loses a DataServer, the CV node updates the topology version number to the current timestamp and distributes the latest topology to all active DataServers. Since the CV node merely receives DataServer heartbeats and maintains a list, the load on the CV node process is very low. The literature shows that the more heavily loaded a cluster node is, the higher its probability of failure, and that nodes involving more I/O fail more easily. Therefore, compared with the TFS NameServer, the CV node of the present system carries a much lower load and performs only network I/O with no disk I/O, so under the same operating environment the probability of CV node failure is lower than that of NameServer failure in TFS.

[0053] Just as the NameServer is a single point of failure in TFS, the CV node is, in a certain sense, a single point of failure of the present system. Because the TFS NameServer maintains the mapping from all blocks to DataServers, this mapping must be rebuilt after the NameServer fails, and the data structures maintained in this process are complex. The CV node of the present system, by contrast, only tracks whether DataServers are still active, so after a CV node failure the restarted process can recover within seconds. Under the same failures, the present system's unavailable time is therefore shorter. Availability is computed as:

[0054] Availability = E[Uptime] / (E[Uptime] + E[Downtime])

[0055] where E[Uptime] and E[Downtime] are the expected available time (the system can provide service) and the expected unavailable time (the system cannot provide service), respectively. Under the same operating environment, the present system's E[Downtime] is smaller, so system availability is higher.
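As a worked example of the formula (the numbers are illustrative, not from the patent): with identical uptime, shrinking recovery time from minutes to seconds directly raises availability.

```python
def availability(e_uptime: float, e_downtime: float) -> float:
    """Availability = E[Uptime] / (E[Uptime] + E[Downtime])."""
    return e_uptime / (e_uptime + e_downtime)

month = 30 * 24 * 3600                 # one month of uptime, in seconds
slow = availability(month, 600)        # a 10-minute rebuild per failure
fast = availability(month, 5)          # recovery within seconds per failure
```

Here `fast > slow` for any positive uptime, which is the sense in which a seconds-scale CV-node restart improves on a NameServer-style state rebuild.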

[0056] In TFS, every read or write must pass through the NameServer, so whenever the NameServer fails, none of the clients' requests can complete. In the present system, under continuous access to the cluster, even if the CV node fails or some DataServers fail, part of the clients' read/write requests can still complete correctly. The CV node is therefore not a single point of failure of the present system in the strict sense.

Claims (2)

1. The present invention designs a new system architecture, characterized in that: when the cluster topology is unchanged, in the continuous-access mode, each access requires only one network connection. Compared with similar systems, such as TFS, access is therefore more efficient.
2. The CV node used by the system architecture of the present invention is extremely lightly loaded, characterized in that: the probability of CV node failure is low, and recovery within seconds can be achieved.
CN 201310429804 2013-09-18 2013-09-18 Low-delay distributed storage system for small files CN103501319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201310429804 CN103501319A (en) 2013-09-18 2013-09-18 Low-delay distributed storage system for small files

Publications (1)

Publication Number Publication Date
CN103501319A true true CN103501319A (en) 2014-01-08

Family

ID=49866489

Country Status (1)

Country Link
CN (1) CN103501319A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104780201A (en) * 2015-03-02 2015-07-15 新浪网技术(中国)有限公司 Data packet processing method and device for use in IPVS (Internet Protocol Virtual Server) cluster
CN105162891A (en) * 2015-10-14 2015-12-16 四川中科腾信科技有限公司 Data storage method based on IP network
CN105187565A (en) * 2015-10-14 2015-12-23 四川携创信息技术服务有限公司 Method for utilizing network storage data
CN106210151A (en) * 2016-09-27 2016-12-07 深圳市彬讯科技有限公司 Zedis distributed cache and server cluster monitoring method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020147815A1 (en) * 2001-04-09 2002-10-10 Alexander Tormasov Distributed network data storage system and method
CN101741731A (en) * 2009-12-03 2010-06-16 中兴通讯股份有限公司 Content metadata storing, inquiring method and managing system in content delivery network (CDN)
CN102664914A (en) * 2012-03-22 2012-09-12 北京英孚斯迈特信息技术有限公司 IS/DFS-Image distributed file storage query system
CN103176754A (en) * 2013-04-02 2013-06-26 浪潮电子信息产业股份有限公司 Reading and storing method for massive amounts of small files

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fu Songling (付松龄) et al.: "FlatLFS: a lightweight file system optimized for massive small-file processing", Journal of National University of Defense Technology (《国防科技大学学报》), vol. 35, no. 2, 30 April 2013 (2013-04-30) *
"A survey of research on file systems for massive small-file storage", Computer Applications and Software (《计算机应用与软件》), vol. 29, no. 8, 31 August 2012 (2012-08-31) *

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)