CN103501319A - Low-delay distributed storage system for small files - Google Patents

Low-delay distributed storage system for small files

Info

Publication number: CN103501319A
Application number: CN 201310429804
Authority: CN
Grant status: Application
Prior art keywords: cluster, client, dataserver, topological information, write
Other languages: Chinese (zh)
Inventors: 王鲁俊, 龙翔, 王雷
Original assignee: 北京航空航天大学
Priority/filing date: 2013-09-18
Publication date: 2014-01-08

Abstract

The invention discloses a low-delay distributed storage system for small files. All DataServers of the system are logically organized into a ring. The system adopts a consistent hashing scheme: the ID of each DataServer is hashed with a specific hash algorithm, and the DataServers are distributed on a ring spanning the whole hash value range according to their hash values. A central CV node is arranged in the cluster; the cluster topology information managed by the CV node comprises the list of all active DataServers in the cluster and the version number of the current cluster topology information. The client caches the cluster topology information locally: on its first access to the cluster, the client contacts the CV node to obtain the topology information and caches it locally, and subsequent reads and writes use the locally cached copy. When the client reads or writes, it hashes the filename according to the consistent hashing scheme to determine the DataServer where the small file resides; the version number of the cluster topology information stored on that DataServer is then compared with the version number stored on the client, and when the version numbers are the same, the actual read or write operation is carried out on the DataServer.

Description

A Low-Latency Distributed Storage System for Small Files

TECHNICAL FIELD

[0001] The present invention relates to the field of distributed storage and massive small-file storage, and in particular to a low-latency distributed storage system for small files.

BACKGROUND

[0002] A small file usually refers to a file smaller than the default HDFS block size (i.e., 64 MB). In current applications, photo files, music files, email bodies, microblog posts, and the like can all be considered small files.

[0003] The small-file problem has gradually attracted attention in both academia and industry. The well-known social networking site Facebook stores 260 billion images, exceeding 20 PB in capacity, and the vast majority of these files are smaller than 64 MB. In the supercomputing field, for example, applications on ORNL's Cray XT5 cluster (18,688 nodes, 12 processors per node) periodically write application state to files, causing the system to produce a large number of small files. A 2007 research report from the Pacific Northwest National Laboratory showed that the laboratory's system held 12 million files, of which 94% were smaller than 64 MB and 58% were smaller than 64 KB. In specific scientific computing environments, such as certain biological computations, 30 million files may be produced with an average size of only 190 KB. The music website 巨鲸网 has indexed 3.6 million MP3 music files. Other literature likewise shows that the data accessed on the Internet consists mostly of small files with high access frequency.

[0004] Sean Quinlan, the GFS technical leader, mentioned in a GFS interview that one of BigTable's application scenarios is small files. A report on the Small File Problem published by Cloudera, a well-known Hadoop company, also points out that Hadoop has problems handling massive numbers of small files.

[0005] Hadoop itself provides Hadoop Archive (HAR) to merge small files into large files. A HAR file works by building a hierarchical file system on top of HDFS. A HAR file is created with Hadoop's archive command, which actually runs a MapReduce job to pack the small files into the HAR file.

[0006] GIGA+ studied applications with massive numbers of small files in a single directory and proposed a directory design, GIGA+, with good scalability. By spreading the index across different server nodes in the cluster and avoiding synchronization and serialization, GIGA+ implements an asynchronous, eventually consistent directory design that tolerates stale index state. The design complements existing cluster file systems, and existing cluster applications need no changes.

[0007] Facebook designed the Haystack storage system for its photo storage application. The Haystack Store packs photos into large volumes (100 GB each) and keeps in memory a mapping from photo ID to the retrieval information (offset and size) of the photo within its volume. The system adds a Haystack Cache component to cache newly added photos, and a Haystack Directory handles volume mapping and load balancing. Index files are built to speed up reconstruction of the in-memory mapping information.

[0008] TFS is an open-source storage system for massive numbers of small files, widely used at companies such as Taobao and Renren. TFS consists of three parts: the TFS cluster, the Meta service cluster, and the client library.

[0009] The TFS cluster mainly comprises a NameServer and multiple DataServers. In the same manner as Facebook's Haystack manages blocks, TFS merges a large number of small files into one large file, called a block; each block has a unique ID, and blocks are distributed across the DataServers. The NameServer is responsible for DataServer state management and maintains the mapping between blocks and DataServers. The NameServer does not handle actual data reads and writes; those are performed by the DataServers. The TFS cluster stores files under TFS filenames; a TFS filename is a string that encodes the block number, offset, and file size.

[0010] The Meta service cluster comprises one master node, the RootServer, and multiple service nodes, the MetaServers. The RootServer mainly manages all MetaServers; the MetaServers manage the mapping between custom filenames and TFS filenames. TFS currently uses a MySQL database as the persistent back-end store.

[0011] TFS can be deployed without the Meta service cluster, in which case TFS supports only TFS filenames, not custom filenames. This configuration is denoted TFS-noname; the configuration with a Meta service cluster is denoted TFS-name.

[0012] TFS can store and retrieve small files under user-defined filenames, but it still has three problems.

[0013] First, TFS needs to establish multiple network connections per file read or write. When a client writes a small file, the client library first contacts the NameServer of the TFS cluster, which designates a writable block for the small file; the client then contacts the DataServer holding that block to perform the actual write and receives a TFS filename; finally the client contacts the Meta service cluster to record the mapping between the custom filename and the newly obtained TFS filename. When a client reads a small file, it first contacts the Meta service cluster to look up the TFS filename corresponding to the user-defined filename; it then contacts the NameServer, which parses the TFS filename to obtain the block number and, using its block-to-DataServer mapping, returns the DataServer to be accessed; the client then contacts that DataServer to perform the actual read. Moreover, if the client library has not cached the MetaServer information, the client must first contact the RootServer to obtain the currently active MetaServer before proceeding. TFS therefore needs at least three network connections to complete one read or write, and four on first access. This is one reason TFS reads and writes files inefficiently.

[0014] Second, TFS uses the heavyweight MySQL as the back-end store for the mapping between TFS filenames and custom filenames; compared with a lightweight NoSQL database, the latency overhead is relatively large.

[0015] Third, the TFS NameServer records the information of all blocks and maintains the mapping from block numbers to DataServers. If the NameServer fails and recovery is performed, the block information and mappings must be rebuilt. As the TFS architecture and its read/write flows make clear, the NameServer is a single point of failure of the TFS cluster: when the NameServer fails, reads and writes are unavailable for the whole cluster. TFS availability therefore still has room for improvement.

SUMMARY

[0016] To address the problems of TFS, the present invention designs a new distributed storage system for small files. The system architecture is shown in Figure 1: all DataServers are logically organized into a ring (nodes S1 to S8 in the figure). The system adopts a consistent hashing scheme. The ID of each DataServer is hashed with a specific hash algorithm, and the DataServers are distributed on a ring spanning the whole hash value range according to their hash values.
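The ring placement and lookup can be sketched as follows. The patent does not fix the hash algorithm, so MD5 below is purely illustrative, and refinements such as virtual nodes are omitted:

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # Illustrative stand-in for the patent's unspecified hash algorithm.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """DataServers placed on the ring by hashing their IDs; a file belongs to
    the first DataServer clockwise from the hash of its filename."""

    def __init__(self, server_ids):
        self.ring = sorted((ring_hash(sid), sid) for sid in server_ids)
        self.points = [p for p, _ in self.ring]

    def locate(self, filename: str) -> str:
        i = bisect.bisect_right(self.points, ring_hash(filename)) % len(self.ring)
        return self.ring[i][1]
```

Repeated lookups of the same filename always hit the same DataServer, and removing one server only remaps the keys that server owned, which is the property the design relies on.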

[0017] A central CV (Central Version) node is arranged in the cluster. Each DataServer periodically sends heartbeat messages to the CV node; the CV node receives these messages and uses them to manage the cluster topology information. The cluster topology information managed by the CV node comprises the list of all active DataServers in the cluster and the version number of the current cluster topology. For each active DataServer, the list stores its ID and the IP address and port it listens on. The topology version number is a monotonically increasing timestamp. Whenever a new DataServer joins the cluster or an existing one leaves, the CV node regenerates the cluster topology information, sets its version number to the current timestamp, and sends the new topology to all currently active DataServers, so that all DataServers hold the same global cluster information.

[0018] The client caches the cluster topology information locally. On its first access to the cluster, the client contacts the CV node to obtain the topology information and caches it locally; subsequent reads and writes use the locally cached copy.

[0019] When the client reads or writes, it first hashes the filename according to the consistent hashing scheme and determines the DataServer on which the small file falls. The version number of the cluster topology held by that DataServer is then compared with the version number held by the client; if the version numbers match, the actual read or write is performed on the DataServer.

[0020] A DataServer has two main components, as shown in Figure 2: a block management component and a retrieval-information management component. The block management component merges small files into large blocks. The system preallocates large file blocks, and newly written small files are written into these blocks. Given the retrieval information of a small file, namely the block number, the offset within the block, and the file size, the small file can be retrieved from a DataServer. The system uses a key-value store to manage the mapping from filename to retrieval information, namely:

[0021] Key: filename → Value: (BlockId, Offset, Size)

[0022] The system designs and implements a key-value store similar to Redis, with persistence, and uses it to manage the retrieval information.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] Figure 1 is the system architecture diagram.

[0024] Figure 2 is the structure diagram of a DataServer in the system.

DETAILED DESCRIPTION

[0025] Step 1: design a block management component.

[0026] Small files are stored in large blocks, each of which is preallocated. Newly written small files are appended sequentially to a block. The block management component provides interfaces for writing a small file into a block and reading a small file from a block. After a small file is written, the component returns its retrieval information, comprising the block number, offset, and size. Given retrieval information, the component can read the corresponding small file back out.
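A minimal in-memory sketch of such a component follows; real blocks would live on disk, and the class name and 64 MB capacity are illustrative assumptions:

```python
class BlockManager:
    """Preallocated large blocks; small files are appended sequentially and
    addressed by (block_id, offset, size) retrieval info."""

    BLOCK_SIZE = 64 * 1024 * 1024  # illustrative block capacity

    def __init__(self):
        self.blocks = [bytearray()]          # index in list = block_id

    def write(self, content: bytes):
        block = self.blocks[-1]
        if len(block) + len(content) > self.BLOCK_SIZE:
            self.blocks.append(bytearray())  # current block full: open a new one
            block = self.blocks[-1]
        offset = len(block)
        block.extend(content)                # sequential append into the block
        return (len(self.blocks) - 1, offset, len(content))

    def read(self, block_id: int, offset: int, size: int) -> bytes:
        return bytes(self.blocks[block_id][offset:offset + size])
```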

[0027] Step 2: design a key-value management component.

[0028] Implement an in-memory key-value store in which every stored key-value pair is simultaneously written to disk. The key is the small file's filename and the value is its retrieval information. The in-memory key-value store is implemented with a hash table, using the murmurhash algorithm as the hash function. Every newly inserted key-value pair is written to disk sequentially.
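This component can be sketched as a dict backed by an append-only log. Python's built-in dict stands in for the murmurhash table, and the JSON log format is an assumption for illustration, not the patent's on-disk format:

```python
import json
import os

class RetrievalKV:
    """In-memory hash table of filename -> (block_id, offset, size); every
    insert is also appended sequentially to a log file on disk."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.table = {}
        if os.path.exists(log_path):              # rebuild state from the log
            with open(log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.table[key] = tuple(value)

    def put(self, key: str, value: tuple):
        self.table[key] = value
        with open(self.log_path, "a") as f:       # sequential write to disk
            f.write(json.dumps([key, list(value)]) + "\n")

    def get(self, key: str):
        return self.table.get(key)
```

Because every insert is replayed from the log on startup, a restarted process recovers the full filename-to-retrieval-info mapping.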

[0029] Step 3: design the CV node.

[0030] The CV node receives DataServer heartbeat messages and maintains the topology information of the whole cluster. When the cluster topology changes, the CV node sends the latest topology information to every node. The CV node runs a listening service that receives heartbeat messages from every DataServer connected to it, and manages all active DataServers with a std::vector. For each incoming heartbeat, if the sender DataServer is not in the vector, it is added and the topology is marked as updated; if it already exists, it is removed from the vector and re-appended at the end. The first DataServer in the vector is then checked: if the time since its last heartbeat exceeds a threshold, that DataServer is cleaned up (its network connection is also cleaned up when a connection access error occurs) and the topology is marked as updated. If the topology has been updated, the new topology is sent to all active DataServers.
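The bookkeeping above can be sketched as follows; a plain Python list replaces std::vector, integer timestamps stand in for real clocks, and the class and method names are illustrative:

```python
class CVNode:
    """Tracks active DataServers via heartbeats; the topology version is the
    timestamp of the last membership change."""

    def __init__(self, timeout: int):
        self.timeout = timeout
        self.alive = []       # (server_id, last_heartbeat), stalest entry first
        self.version = 0

    def on_heartbeat(self, server_id: str, now: int) -> bool:
        changed = False
        for i, (sid, _) in enumerate(self.alive):
            if sid == server_id:
                del self.alive[i]     # re-append so the head stays the stalest
                break
        else:
            changed = True            # previously unknown DataServer joined
        self.alive.append((server_id, now))
        if now - self.alive[0][1] > self.timeout:
            self.alive.pop(0)         # head timed out: evict it
            changed = True
        if changed:
            self.version = now        # new topology version = current timestamp
        return changed                # caller would broadcast the new topology

    def topology(self):
        return [sid for sid, _ in self.alive], self.version
```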

[0031] Step 4: read and write flows.

[0032] The system writes a small file as follows:

[0033] 1. If this is the client's first access to the system, the client contacts the CV node, requests the cluster topology information, and records it locally. Under continuous access, if this is not the first access, the client already has the cluster topology cached locally.

[0034] 2. The client hashes the filename and, per the consistent hashing algorithm, determines which DataServer should handle the small file.

[0035] 3. The client contacts the DataServer obtained in step 2 and sends it the client's cached cluster topology information, the small file's filename, and the buffer holding the file contents.

[0036] 4. The DataServer first checks whether the client's cached topology is out of date, i.e., whether the topology version number in the client's write request matches the version number recorded by the DataServer itself. If they match, go to step 5. If not, the DataServer compares the topology in the client's write request and determines whether the difference affects this write; if it does not, the DataServer sets the NEED_UPDATE flag and goes to step 5; otherwise it tells the client the write failed, sends the new topology to the client, and the write ends.

[0037] 5. The DataServer queries the retrieval-information management component to check whether the small file's filename already exists; if so, it tells the client the filename exists. Otherwise go to step 6.

[0038] 6. The DataServer writes the file contents into a block via the block management component, and writes the retrieval information returned by the block management component, together with the filename, into the retrieval-information management component as a key-value pair. It returns a write-success message to the client; if the NEED_UPDATE flag is set, it also sends the new topology to the client. The write ends.
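Steps 4 to 6 can be condensed into a sketch of the DataServer-side decision. All names are illustrative, the network layer is elided, and the NEED_UPDATE branch is simplified so that any version mismatch rejects the write:

```python
def handle_write(server, client_version, filename, content):
    """DataServer-side write handling (steps 4-6 above, simplified).
    server: {'version': ts, 'kv': {...}, 'block': bytearray()}."""
    if client_version != server["version"]:
        # Stale client topology: reject and ship the fresh version back.
        return ("TOPOLOGY_CHANGED", server["version"])
    if filename in server["kv"]:
        return ("ALREADY_EXISTS", None)          # step 5: name must be unique
    block = server["block"]
    info = (0, len(block), len(content))         # step 6: append into the block
    block.extend(content)
    server["kv"][filename] = info                # record the retrieval info
    return ("OK", info)
```

Note that in the matching case a single client connection carries the topology version, the filename, and the contents, which is what keeps the common path to one network round trip.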

[0039] The system reads a small file as follows:

[0040] 1. If this is the client's first access to the system, the client contacts the CV node, requests the cluster topology information, and records it locally. Under continuous access, if this is not the first access, the client already has the cluster topology cached locally.

[0041] 2. The client hashes the filename and, per the consistent hashing algorithm, determines which DataServer should handle the small file.

[0042] 3. The client contacts the DataServer obtained in step 2. The DataServer checks whether the topology version number attached to the client's read request matches the version recorded locally by the DataServer. If they match, go to step 4. If not, set the NEED_UPDATE flag and continue.

[0043] 4. The DataServer queries the retrieval-information management component for the small file's filename and checks whether it exists. If it exists, the retrieval information is read out; go to step 5. If it does not exist, a file-not-found message is sent to the client; if NEED_UPDATE was set in step 3, the new topology is attached to that message, notifying the client to update the cached topology. The read ends.

[0044] 5. Using the retrieval information obtained in step 4, the DataServer reads the file contents from the block management component and sends them to the client; if NEED_UPDATE is set, the new topology is attached to the message. The read ends.

[0045] Taobao's open-source TFS requires at least three network connections per read or write; if the TFS client has not cached the MetaServer information, the client library must additionally connect to the TFS RootServer. [0046] As the read/write flows of our system show, under continuous access the client reads the cluster topology on its first access to the CV node and caches it. While the topology does not change, the client's subsequent reads and writes determine the target DataServer directly from the cached topology, and a single connection to that DataServer completes a read or write request.

[0047] When the cluster topology changes, the client's subsequent read or write requests first determine the target DataServer from the stale topology cached earlier.

[0048] If the client connects successfully and the read or write completes correctly, the DataServer has determined that the topology change (a node joining or leaving) does not affect this request; after completing the request, the client reads the latest topology attached to the response and refreshes its stale local cache, so further requests use the latest topology. If the connection fails, the target node has failed and left the cluster; the client then needs one extra access to the CV node to obtain the current topology and retries the read or write. If the connection succeeds but the DataServer determines that the topology change affects this request, the DataServer replies that the read or write failed and attaches the latest topology; the client retries with it.
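The client-side recovery path of paragraphs [0046] to [0048] amounts to one retry with fresh topology. In this sketch server selection is reduced to a simple modulo hash, the failed-connection case is modeled as the target being absent from the cluster map, and all names are assumptions:

```python
import hashlib

def pick_server(server_ids, filename):
    # Stand-in for the consistent-hash lookup on the cached topology.
    digest = int(hashlib.md5(filename.encode()).hexdigest(), 16)
    return sorted(server_ids)[digest % len(server_ids)]

def client_read(cache, cluster, fetch_topology, filename):
    """cache: {'servers': [...], 'version': ts}; cluster maps live server ids
    to their state; fetch_topology() contacts the CV node."""
    target = pick_server(cache["servers"], filename)
    if target not in cluster:                    # connection failed: node left
        cache["servers"], cache["version"] = fetch_topology()
        target = pick_server(cache["servers"], filename)
    server = cluster[target]
    if server["version"] != cache["version"]:    # piggybacked fresh topology
        cache["servers"], cache["version"] = server["servers"], server["version"]
    return server["files"].get(filename)
```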

[0049] The system designed in the present invention therefore simplifies the read/write flow and reduces the number of network connections per read or write. Experimental results show that this improvement effectively reduces latency.

[0050] In addition, the system uses a more lightweight key-value store to manage the retrieval information. Tests show that with a large amount of retrieval information written continuously, key-value stores tend to exhibit lower latency and higher throughput than traditional databases such as MySQL.

[0051] Comparing TFS and the system of the present invention from three angles (central-node load, failure recovery speed, and system robustness) shows that the design of the present invention has an advantage in availability.

[0052] The central CV node of the present system is only responsible for monitoring whether DataServers are still active. The CV node maintains a list in which each entry is a DataServer and the arrival time of its most recent heartbeat. When the CV node receives a heartbeat from a DataServer, it adds the DataServer to the list if absent, otherwise it updates that DataServer's last heartbeat time. It also checks whether any DataServer in the list has timed out without a heartbeat and, if so, removes it from the list. If the list gains or loses a DataServer, the CV node updates the topology version number to the current timestamp and distributes the latest topology to all active DataServers. Since the CV node merely receives DataServer heartbeats and maintains a list, the load on the CV node process is very low. The literature shows that the more heavily loaded a cluster node is, the higher its probability of failure, and that nodes involving more I/O fail more easily. Therefore, compared with the TFS NameServer, the CV node of the present system carries a much lower load and performs only network I/O with no disk I/O, so under the same operating environment the probability of CV node failure is lower than that of NameServer failure in TFS.

[0053] Just as the NameServer is a single point of failure in TFS, the CV node is, in a certain sense, a single point of failure of the present system. Because the TFS NameServer maintains the mapping from all blocks to DataServers, this mapping must be rebuilt after the NameServer fails, and the data structures maintained in this process are complex. The CV node of the present system, by contrast, only tracks whether DataServers are still active, so after a CV node failure the restarted process can recover within seconds. Under the same failures, the present system's unavailable time is therefore shorter. Availability is computed as:

[0054] Availability = E[Uptime] / (E[Uptime] + E[Downtime])

[0055] where E[Uptime] and E[Downtime] are the expected available time (the system can provide service) and the expected unavailable time (the system cannot provide service), respectively. Under the same operating environment, the present system's E[Downtime] is smaller, so system availability is higher.
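As a worked example of the formula (the numbers are illustrative, not from the patent): with identical uptime, shrinking recovery time from minutes to seconds directly raises availability.

```python
def availability(e_uptime: float, e_downtime: float) -> float:
    """Availability = E[Uptime] / (E[Uptime] + E[Downtime])."""
    return e_uptime / (e_uptime + e_downtime)

month = 30 * 24 * 3600                 # one month of uptime, in seconds
slow = availability(month, 600)        # a 10-minute rebuild per failure
fast = availability(month, 5)          # recovery within seconds per failure
```

Here `fast > slow` for any positive uptime, which is the sense in which a seconds-scale CV-node restart improves on a NameServer-style state rebuild.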

[0056] In TFS, every read or write must pass through the NameServer, so whenever the NameServer fails, none of the clients' requests can complete. In the present system, under continuous access to the cluster, even if the CV node fails or some DataServers fail, part of the clients' read/write requests can still complete correctly. The CV node is therefore not a single point of failure of the present system in the strict sense.

Claims (2)

1. The present invention designs a new system architecture, characterized in that: when the cluster topology is unchanged, in the continuous-access mode, each access requires only one network connection. Compared with similar systems, such as TFS, access is therefore more efficient.
2. The CV node used by the system architecture of the present invention is extremely lightly loaded, characterized in that: the probability of CV node failure is low, and recovery within seconds can be achieved.
CN 201310429804 2013-09-18 2013-09-18 Low-delay distributed storage system for small files CN103501319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201310429804 CN103501319A (en) 2013-09-18 2013-09-18 Low-delay distributed storage system for small files

Publications (1)

Publication Number Publication Date
CN103501319A true true CN103501319A (en) 2014-01-08

Family

ID=49866489

Country Status (1)

Country Link
CN (1) CN103501319A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104780201A (en) * 2015-03-02 2015-07-15 新浪网技术(中国)有限公司 Data packet processing method and device for use in IPVS (Internet Protocol Virtual Server) cluster
CN105162891A (en) * 2015-10-14 2015-12-16 四川中科腾信科技有限公司 Data storage method based on IP network
CN105187565A (en) * 2015-10-14 2015-12-23 四川携创信息技术服务有限公司 Method for utilizing network storage data
CN106210151A (en) * 2016-09-27 2016-12-07 深圳市彬讯科技有限公司 Zedis distributed cache and server cluster monitoring method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020147815A1 (en) * 2001-04-09 2002-10-10 Alexander Tormasov Distributed network data storage system and method
CN101741731A (en) * 2009-12-03 2010-06-16 中兴通讯股份有限公司 Content metadata storing, inquiring method and managing system in content delivery network (CDN)
CN102664914A (en) * 2012-03-22 2012-09-12 北京英孚斯迈特信息技术有限公司 IS/DFS-Image distributed file storage query system
CN103176754A (en) * 2013-04-02 2013-06-26 浪潮电子信息产业股份有限公司 Reading and storing method for massive amounts of small files

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Fu Songling (付松龄) et al.: "FlatLFS: a lightweight file system optimized for massive small-file processing", Journal of National University of Defense Technology (《国防科技大学学报》), vol. 35, no. 2, 30 April 2013 (2013-04-30) *
"A survey of research on file systems for massive small-file storage", Computer Applications and Software (《计算机应用与软件》), vol. 29, no. 8, 31 August 2012 (2012-08-31) *

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)