WO2022028033A1 - Hierarchical mapping-based automatic balancing storage method for ceph storage system - Google Patents

Publication number
WO2022028033A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage
osd
sub
level
migration
Application number
PCT/CN2021/094042
Inventor
陈宁江
卢煜
Original Assignee
广西大学
Application filed by 广西大学
Priority to JP2023503089A (patent/JP2023536693A/en)
Publication of WO2022028033A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0647 Migration mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A hierarchical mapping-based automatic balancing storage method for a Ceph storage system. The method comprises: adding a level attribute to every object storage device (OSD) in a storage cluster and dividing the OSDs into multiple sub-storage pools by level; adding a level attribute to placement groups (PGs) based on the OSD levels, so that each PG looks for an OSD combination only in the OSD sub-storage pool of its own level; adding a random factor and an impact factor to guide how PGs select OSDs; and, when the usage of a single OSD in the overall storage pool is too high, determining the PG migration direction from the usage of the sub-storage pool where the PG resides and of the other sub-storage pools, while adjusting the migration using the combination of PG level, random factor and impact factor. With this method, an OSD with excessively high usage can migrate its PGs in a reasonable way, which keeps system storage balanced and improves system stability.

Description

A Ceph Storage System Automatic Balancing Storage Method Based on Hierarchical Mapping
Technical Field
The invention belongs to the technical field of distributed storage and, more particularly, relates to an automatic balancing storage method for a Ceph storage system based on hierarchical mapping.
Background
The Ceph storage system is an object-based storage system (OBSS). Unlike a traditional OBSS, however, Ceph has no independent metadata server that records which OSD (Object Storage Device) holds each object shard. Instead, it uses the CRUSH (Controlled Replication Under Scalable Hashing) algorithm to determine where an object and its replicas are stored. When data needs to be looked up again or modified, the read/write addressing can be computed independently on each OSD, so there is no single-node bottleneck. This scheduling relies on software rather than on manual work: when a device is replaced or added, the software computes object locations on its own and keeps data balanced during recovery and expansion, without manual intervention.
Ceph's original CRUSH algorithm takes a PG (Placement Group) as input, performs the corresponding hash operation, and selects one primary storage node and several replica nodes. As long as the PG is unchanged, the selected OSD combination is unchanged, which completes the initial addressing for reads and writes; if an OSD in the combination changes, data is recovered automatically from the other nodes. A storage request is split into small objects of equal size, and the PGs (logical groups formed from those objects) are distributed across the OSDs according to preset OSD weights, so that the system and its operators do not have to track the state of each OSD. However, weights cannot precisely reflect the differences among OSDs: a weight only biases a probabilistic choice and is not a fixed ratio. Moreover, even if PGs are evenly distributed across the OSDs at the macro level, this assumes that every PG holds the same amount of data and ignores the differences among PGs. Although a PG is a logical collection of objects rather than a data entity, the PG is the smallest unit of data migration and storage selection. Objects are mapped to PGs by taking the remainder of a hash value, so PGs do not all receive the same number of objects and PG sizes are not uniform. If storage is allocated unevenly and a single node becomes overloaded, the entire storage system becomes unavailable.
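The point about uneven PG sizes can be illustrated with a minimal sketch. It assumes a toy object-to-PG mapping (SHA-256 followed by a modulo); the function names and the hash choice are illustrative assumptions and are not Ceph's actual hashing code.

```python
import hashlib
from collections import Counter


def object_to_pg(object_name: str, pg_count: int) -> int:
    """Toy stand-in for the object-to-PG step: stable hash, then remainder."""
    digest = hashlib.sha256(object_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % pg_count


def pg_object_counts(object_names, pg_count):
    """Count how many objects land in each PG."""
    counts = Counter(object_to_pg(name, pg_count) for name in object_names)
    return [counts.get(pg, 0) for pg in range(pg_count)]


# 1000 equally sized objects spread over 16 PGs: the per-PG counts differ,
# so PG sizes differ even though every object has the same size.
counts = pg_object_counts([f"obj-{i}" for i in range(1000)], pg_count=16)
print(min(counts), max(counts))
```

Because a PG moves as a whole, a spread between the smallest and largest PG translates directly into uneven OSD usage even when the number of PGs per OSD is equal.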
Because Ceph's storage selection and mapping differ from those of a traditional storage system that uses an MDS (MetaData Server), the existing weight-based adjustment cannot precisely control how much data is migrated out or where it goes. It also cannot predict whether an adjustment will trigger a data avalanche, in which migrating data out of one overloaded OSD overloads further OSDs. A new automatic balancing storage method for Ceph is therefore needed, one that migrates data in real time according to actual PG usage and ensures that each migration improves the balance of single-node usage across the system.
Summary of the Invention
The technical problem to be solved by the present invention: in view of the above problems of the prior art, an automatic balancing storage method for a Ceph storage system based on hierarchical mapping is provided. The invention balances storage automatically in distributed workloads built on Ceph, so that a heavily loaded single node can rebalance itself and the direction and amount of migrated data are precisely controlled, thereby keeping the system stable.
To solve the above technical problem, the present invention adopts the following technical solution:
An automatic balancing storage method for a Ceph storage system based on hierarchical mapping, whose implementation steps include:
(1) Give PGs and OSDs a new level attribute. OSDs of the same level are aggregated logically, which divides the whole storage pool into multiple sub-storage pools. PG levels correspond one-to-one with OSD levels, and a PG can select OSDs only from the sub-storage pool of its own level; changing a PG's level therefore lets it migrate freely. In addition, a random factor is added as a new parameter of Ceph's original CRUSH algorithm to steer the selection of a new OSD combination, which gives PG migration more choices;
(2) When data is inserted, obtain the difference between the usage of a single OSD and the average usage of the system and compare it with a preset threshold. If the difference exceeds the threshold, go to step (3) to trigger the balanced storage strategy; otherwise, insert the data normally;
(3) Obtain the queue of PGs in that OSD sorted by PG size and select the PG of median size for analysis. Sort the sub-storage pool where this PG resides and the sub-storage pools of adjacent levels by usage, and take the level of the sub-storage pool with the lowest usage as the PG's new level. Then, based on the configuration of this new level, use the level as a seed to generate multiple random numbers as random factors. Each random factor is passed to the CRUSH algorithm as a parameter and perturbs the selection, producing multiple different OSD combinations for storing the data, and an impact factor is generated for each combination according to its effect on the balance of the system. Finally, sort by impact factor and select the combination of level and impact factor whose impact factor is smallest, that is, whose effect on system balance is smallest, and assign it to the PG as its new grouping attribute.
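The first half of step (3), choosing the median-sized PG and its new level, can be sketched as follows. This is only an illustration under stated assumptions: the `PG` record, the function names, the use of the lower median for an even number of PGs, and the reading of "adjacent levels" as level ± 1 are all hypothetical choices not fixed by the text.

```python
from dataclasses import dataclass


@dataclass
class PG:
    pgid: str
    size: int   # amount of data held by the PG (any consistent unit)
    level: int  # level of the sub-storage pool the PG currently uses


def median_pg(pgs):
    """Return the PG of median size (lower median for even counts: an assumption)."""
    ordered = sorted(pgs, key=lambda pg: pg.size)
    return ordered[(len(ordered) - 1) // 2]


def pick_target_level(current_level, pool_usage):
    """Pick the least-used sub-pool among the current level and its neighbours.

    pool_usage maps level -> usage ratio of that sub-storage pool.
    """
    candidates = [
        lvl
        for lvl in (current_level - 1, current_level, current_level + 1)
        if lvl in pool_usage
    ]
    return min(candidates, key=lambda lvl: pool_usage[lvl])


pgs = [PG("1.a", 10, 2), PG("1.b", 50, 2), PG("1.c", 30, 2)]
chosen = median_pg(pgs)
new_level = pick_target_level(chosen.level, {1: 0.7, 2: 0.9, 3: 0.4, 4: 0.1})
print(chosen.pgid, new_level)
```

In this sketch level 4 is ignored even though it is the least used, because only the PG's own sub-pool and the adjacent levels are compared.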
In the initialization of step (1), OSD level attributes are initialized manually. PG level attributes are initialized by distributing the PGs evenly across the sub-storage pools with a consistent hashing algorithm: because PG sizes cannot be predicted, PGs are initially spread fairly evenly by count, which avoids a large amount of balancing migration as soon as the system goes into use. Optionally, the PG levels can be initialized with a consistent hashing algorithm.
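One way to picture the initial PG level assignment is a small consistent-hash ring. The sketch below is an assumption about how such an initializer could look; the class name `LevelRing`, the virtual-node count, and the SHA-256 hash are illustrative and are not taken from the patent or from Ceph.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash used for both ring points and PG ids."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class LevelRing:
    """Consistent-hash ring that maps a PG id to a sub-storage pool level."""

    def __init__(self, levels, vnodes=64):
        # Each level owns several virtual points so PGs spread fairly evenly.
        self._points = sorted(
            (_hash(f"level-{lvl}#{v}"), lvl) for lvl in levels for v in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def level_for(self, pgid: str) -> int:
        # First ring point clockwise from the PG's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, _hash(pgid)) % len(self._keys)
        return self._points[idx][1]


ring = LevelRing(levels=[0, 1, 2])
initial_levels = {f"1.{i:x}": ring.level_for(f"1.{i:x}") for i in range(200)}
```

A ring of this kind has the usual consistent-hashing property: adding a sub-storage pool level only moves PGs onto the new level and leaves all other assignments unchanged.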
The random factor in step (1) steers the output of the CRUSH algorithm. It changes the original CRUSH selection process to:
$R_i\langle OSD\rangle = \mathrm{CRUSH}(PGID, r_i)$
In the above formula, $R_i\langle OSD\rangle$ is the i-th selected OSD combination; the inputs to the CRUSH algorithm are PGID and $r_i$, where PGID is the unique identifier of the PG and $r_i$ is the random factor. With this, step (3) can produce multiple OSD combinations and select from them the most suitable one, the one with the least effect on system balance.
In step (2), the check that triggers the balancing strategy is made when data is inserted, that is, while the CRUSH algorithm is being executed; this requires introducing global monitoring.
In step (3), the impact factor compares the balance of the target sub-storage pool before the PG is migrated with its balance after the PG would be migrated under the new level and the new impact factor. The balance of a sub-storage pool is quantified by the following expression:
[Formula image PCTCN2021094042-appb-000001: expression for the balance parameter β, not reproduced in this text]
In the above formula, M is the average usage of the sub-storage pool, $x_j$ is the usage of each OSD in the sub-storage pool, and n is the number of OSDs in the sub-storage pool.
From the value $\beta_r$ of a PG before a given migration and the value $\beta_j$ after it, the impact factor δ of this PG on the system's storage balance for that migration is:
[Formula image PCTCN2021094042-appb-000002: expression for the impact factor δ, not reproduced in this text]
If, after the PG is migrated, the usage of any OSD in a combination exceeds 1, the impact factor of that combination is set to -1, which ensures that the migration does not overload another OSD or make it completely unavailable.
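The two formulas are available here only as images, so the sketch below substitutes explicit assumptions for them: β is taken to be the population standard deviation of the OSD usage ratios around the mean M, and δ is taken to be the ratio $\beta_j / \beta_r$. Only the variables M, $x_j$, n and the -1 overload rule come from the text; the function names and both formula choices are hypothetical.

```python
import math


def balance(usages):
    """Assumed balance parameter: standard deviation of OSD usage around the mean M."""
    n = len(usages)
    mean = sum(usages) / n
    return math.sqrt(sum((x - mean) ** 2 for x in usages) / n)


def impact_factor(usages_before, usages_after):
    """Assumed impact factor: beta after migration divided by beta before.

    Returns -1.0 when any OSD would exceed a usage of 1 after the migration,
    as the text prescribes for overloaded combinations.
    """
    if any(x > 1.0 for x in usages_after):
        return -1.0
    beta_r = balance(usages_before)
    beta_j = balance(usages_after)
    if beta_r == 0.0:
        # Already perfectly balanced: unchanged if still balanced, else worst case.
        return 1.0 if beta_j == 0.0 else math.inf
    return beta_j / beta_r


print(impact_factor([0.8, 0.2], [0.6, 0.4]))       # below 1: balance improves
print(impact_factor([0.9, 0.3, 0.3], [0.9, 0.3, 1.1]))  # -1.0: an OSD would overflow
```

Under these assumptions a value below 1 means the migration tightens the usage spread, and the sentinel -1 never collides with a real ratio because ratios are non-negative.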
Step (1) also includes planning the hardware in the system and configuring the sub-storage pools:
① Classify and organize the existing storage devices so that the newly divided sub-storage pools are reasonably sized. Because PG levels are assigned randomly, data is written to PGs randomly, and sub-storage pools are compared by usage, the sub-storage pools should ideally be close in size.
② Configure each sub-storage pool; each can have its own threshold and number of random factors.
After step (3) is completed, if no suitable migration object is found, jump back to step (2).
Compared with the prior art, the present invention has the following advantages. Balancing happens in real time rather than only after an overload has occurred, and it requires no additional computing or human resources for monitoring. Every OSD in a storage cluster is given a level attribute, which divides the OSDs into sub-storage pools of multiple levels; PGs are given a level attribute based on the OSD levels, and a PG can look for an OSD combination only in the OSD sub-storage pool of its own level. A random factor steers the PG's OSD selection so that more candidate combinations are produced, and an impact factor quantifies how the balanced storage of the system is affected when a change to a PG's attributes changes the selection result. When the usage of a single OSD in the overall storage pool is too high, the PG of median size in that OSD is selected, the migration direction is determined from the usage of its sub-storage pool and the adjacent sub-storage pools, the impact factors are generated from the PG level and random factors, and the best combination of level and impact factor is chosen for the balancing adjustment. The invention applies the idea of risk transfer: hierarchical mapping divides the whole storage system into storage areas, so that when a local storage device is overloaded its data can be moved to a low-risk (low-usage) area. This relieves heavily loaded storage nodes, uses storage resources sensibly, and makes the system more stable.
Description of Drawings
Fig. 1 is a schematic diagram of the basic flow of the method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the random factor generation process according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the impact factor selection process according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here only explain the invention and do not limit it. In addition, the technical features of the embodiments described below can be combined with one another as long as they do not conflict.
As shown in Fig. 1, the automatic balancing storage flow of this embodiment for a Ceph storage system based on hierarchical mapping includes:
(1) Give PGs and OSDs a new level attribute; aggregating OSDs of the same level logically divides the whole storage pool into multiple sub-storage pools.
(2) PG levels correspond one-to-one with OSD levels, and a PG can select OSDs only from the sub-storage pool of its own level; changing a PG's level lets it migrate freely.
(3) When data is inserted, compare the OSD's usage with the average usage of the system and check whether the difference exceeds the preset threshold. If it does, trigger the PG migration method of the invention to restore balance; if it does not, write the data normally.
(4) When a single OSD's usage, compared with the system average, exceeds the threshold, the balanced storage strategy of the invention is triggered. The strategy evens out the storage distribution of the system and lowers the outlying usage of the local OSD. If, after the strategy has been applied, the usage of this OSD is at the preset value (for example 100), meaning that it is already full, the write is rejected; if it is below 100, the data is written normally.
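The write-time decision in steps (3) and (4) can be sketched as a single function. The function name, the string results, and the `rebalance` callback (a hypothetical hook that performs the PG migration and returns the OSD's usage afterwards) are assumptions made for illustration; usage is expressed in percent to match the example value of 100.

```python
def handle_write(osd_usage_pct, system_avg_pct, threshold_pct, rebalance, full_pct=100.0):
    """Decide what happens to a write aimed at one OSD.

    rebalance: hypothetical callback that migrates PGs away from the OSD
    and returns the OSD's usage (in percent) after the migration.
    """
    if osd_usage_pct - system_avg_pct > threshold_pct:
        # Threshold exceeded: trigger the balanced storage strategy first.
        osd_usage_pct = rebalance(osd_usage_pct)
        if osd_usage_pct >= full_pct:
            # Still full after balancing: the write is rejected.
            return "rejected"
    return "written"


# An OSD far above the average is relieved by 20 points and accepts the write.
print(handle_write(95.0, 60.0, 10.0, rebalance=lambda usage: usage - 20.0))
# An OSD that stays at 100 after balancing rejects the write.
print(handle_write(100.0, 60.0, 10.0, rebalance=lambda usage: usage))
```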
As shown in Fig. 2, the steps for generating random factors in this embodiment include:
(1) Obtain the configuration of the target sub-storage pool and read the maximum number of random factors. Once the pool is divided into sub-storage pools, each calculation covers fewer OSDs, so a check is needed here to avoid producing identical combinations. A low maximum lets the selection of random factors finish quickly, which keeps balancing efficient; a high maximum allows enough random trials to keep the system highly available.
(2) In this embodiment, the random factor is a parameter that perturbs the selection process of the CRUSH algorithm, so it can simply be generated with the C language's built-in random number generator, seeded with the PG's level. The OSD combination is selected with the random factor as follows:
$R_i\langle OSD\rangle = \mathrm{CRUSH}(PGID, r_i)$
In the above formula, $R_i\langle OSD\rangle$ is the i-th selected OSD combination; the inputs to the CRUSH algorithm are PGID and $r_i$, where PGID is the unique identifier of the PG and $r_i$ is the random factor.
(3) Each time an OSD combination is selected, first check whether it has been selected before. If the combination is already in the selection results, skip it and select again; if not, save it.
(4) If the number of combinations has reached the number required for this OSD sub-storage pool, end the selection process; otherwise return to step (2) and continue selecting.
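The generation loop of Fig. 2 can be sketched in a few lines. This is not Ceph's CRUSH: `toy_crush` is a hypothetical stand-in that ranks OSDs by a hash of (PGID, random factor, OSD), and Python's `random.Random` replaces the C random number generator mentioned in the text. The attempt limit is an added safeguard, assumed here so that a very small sub-pool cannot cause an endless loop.

```python
import hashlib
import random


def toy_crush(pgid, r, osds, replicas):
    """Hypothetical stand-in for CRUSH(PGID, r_i): deterministic for fixed inputs."""
    def score(osd):
        return hashlib.sha256(f"{pgid}:{r}:{osd}".encode("utf-8")).digest()
    return tuple(sorted(osds, key=score)[:replicas])


def candidate_combinations(pgid, level, osds, replicas, max_factors, max_attempts=1000):
    """Return {OSD combination: random factor} with at most max_factors entries."""
    rng = random.Random(level)       # the PG level is the seed
    seen = {}
    for _ in range(max_attempts):
        if len(seen) >= max_factors:  # step (4): enough combinations collected
            break
        r = rng.getrandbits(32)       # step (2): next random factor
        combo = toy_crush(pgid, r, osds, replicas)
        if combo not in seen:         # step (3): skip combinations already selected
            seen[combo] = r
    return seen


osds = [f"osd.{i}" for i in range(6)]
combos = candidate_combinations("1.2a", level=3, osds=osds, replicas=3, max_factors=4)
for combo, r in combos.items():
    print(r, combo)
```

Seeding with the level makes the candidate set reproducible for a given PG and level, which keeps addressing deterministic once a random factor has been stored with the PG.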
As shown in Fig. 3, in this embodiment the calculation and selection of the impact factor guides the change of the PG's attributes. The impact factor measures the balance of the target sub-storage pool before the PG is migrated, and the guiding steps include:
(1) Load the PG's new level and the OSD combinations corresponding to the random factors.
(2) Iterate over these OSD combinations. If all have been evaluated, exit this flow; if any combination has not yet been evaluated, go to the next step.
(3) Calculate the balance parameter of the current system. The balance of a sub-storage pool is quantified by the following expression:
[Formula image PCTCN2021094042-appb-000003: expression for the balance parameter β, not reproduced in this text]
In the above formula, M is the average usage of the sub-storage pool, $x_j$ is the usage of each OSD in the sub-storage pool, and n is the number of OSDs in the sub-storage pool.
(4) Using the above formula, calculate the system balance parameter $\beta_j$ that would result if the PG were migrated according to the current random factor.
(5) From the β value of a PG before a given migration and the β value after it, the impact factor δ of this PG on the system's storage balance for that migration is:
[Formula image PCTCN2021094042-appb-000004: expression for the impact factor δ, not reproduced in this text]
If, after the PG is migrated, the usage of any OSD in a combination exceeds 1, the impact factor of that combination is set to -1, which ensures that the migration does not overload another OSD or make it completely unavailable. In this embodiment, a combination whose impact factor is -1 is discarded outright rather than used.
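The evaluation loop of Fig. 3 can be sketched as below. Because the formulas for β and δ are available only as images, the sketch assumes β is the population standard deviation of OSD usage and δ is the ratio of β after migration to β before it; it also assumes that the PG adds the same usage share to every OSD of the target combination. All names are hypothetical.

```python
from statistics import pstdev

OVERLOADED = -1.0  # sentinel from the text: some OSD would exceed a usage of 1


def impact(usage, combo, pg_share):
    """Assumed impact factor of placing the PG on combo within the target sub-pool."""
    after = dict(usage)
    for osd in combo:
        after[osd] += pg_share
    if any(u > 1.0 for u in after.values()):
        return OVERLOADED
    beta_before = pstdev(list(usage.values()))
    beta_after = pstdev(list(after.values()))
    if beta_before == 0.0:
        return 1.0 if beta_after == 0.0 else float("inf")
    return beta_after / beta_before


def best_combination(usage, candidates, pg_share):
    """Discard overloaded combinations, then pick the smallest impact factor."""
    scored = [(impact(usage, combo, pg_share), combo) for combo in candidates]
    usable = [(delta, combo) for delta, combo in scored if delta != OVERLOADED]
    if not usable:
        return None  # no suitable migration target in this sub-pool
    return min(usable, key=lambda pair: pair[0])[1]


usage = {"osd.0": 0.2, "osd.1": 0.8, "osd.2": 0.5, "osd.3": 0.95}
candidates = [("osd.1", "osd.2"), ("osd.0", "osd.2"), ("osd.0", "osd.3")]
print(best_combination(usage, candidates, pg_share=0.1))
```

In this example the third candidate is dropped because osd.3 would exceed full usage, and the remaining candidate that raises the least-used OSD wins because it narrows the usage spread the most.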
The method of this embodiment aims to solve the problem that overloading a single OSD in a storage cluster makes the whole system unavailable. Because of how Ceph works, data appears to be evenly distributed across the OSDs, so with equal weights, differences among the OSDs can be used to simulate local overload. This embodiment is initialized with several kinds of OSDs and several ways of dividing sub-storage pools, and uses the maximum write volume as the evaluation criterion: the effectiveness of the invention is judged by the total amount of data written before the system fails. The results show that the invention effectively mitigates system failures caused by the overload of a single OSD.
Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (8)

  1. An automatic balancing storage method for a Ceph storage system based on hierarchical mapping, characterized in that the implementation steps include:
    (1) giving PGs and OSDs a new level attribute, wherein OSDs of the same level are aggregated logically to divide the whole storage pool into multiple sub-storage pools, PG levels correspond one-to-one with OSD levels, a PG can select OSDs only from the sub-storage pool of its own level, and changing a PG's level lets it migrate freely; and adding a random factor as a new parameter of Ceph's original CRUSH algorithm to steer the selection of a new OSD combination, giving PG migration more choices;
    (2) when data is inserted, obtaining the difference between the usage of a single OSD and the average usage of the system and comparing it with a preset threshold; if the difference exceeds the threshold, going to step (3) to trigger the balanced storage strategy; otherwise, inserting the data normally;
    (3) obtaining the queue of PGs in that OSD sorted by PG size and selecting the PG of median size for analysis; sorting the sub-storage pool where this PG resides and the sub-storage pools of adjacent levels by usage, and taking the level of the sub-storage pool with the lowest usage as the PG's new level; based on the configuration of this new level, using the level as a seed to generate multiple random numbers as random factors, each random factor being passed to the CRUSH algorithm as a parameter to perturb the selection and produce multiple different OSD combinations for storing the data; generating an impact factor for each combination according to its effect on the balance of the system; and finally sorting by impact factor and selecting the combination of level and impact factor whose impact factor is smallest, that is, whose effect on system balance is smallest, and assigning it to the PG as its new grouping attribute.
  2. 根据权利要求1所述的基于分级映射的Ceph存储系统自动均衡存储方法,其特征在于,在步骤(1)中初始化OSD的分级属性的时候,由人工进行初始化。The method for automatically balancing storage in a Ceph storage system based on hierarchical mapping according to claim 1, characterized in that, when initializing the hierarchical attribute of the OSD in step (1), the initialization is performed manually.
  3. The hierarchical mapping-based automatic balancing storage method for a Ceph storage system according to claim 1 or 2, characterized in that, when the hierarchical attributes of the PGs are initialized in step (1), the PGs are distributed evenly across the storage pools according to a consistent hashing algorithm; because PG sizes are unpredictable, the PGs are initially distributed fairly evenly by count, so as to avoid a large number of balancing migrations as soon as the system starts to be used.
  4. The hierarchical mapping-based automatic balancing storage method for a Ceph storage system according to claim 1 or 2, characterized in that the random factor in step (1) is used to guide the output of the CRUSH algorithm, its effect being to change the original CRUSH selection process to:
    $R_i\langle\mathrm{OSD}\rangle = \mathrm{CRUSH}(\mathrm{PGID},\, r_i)$
    In the above formula, $R_i\langle\mathrm{OSD}\rangle$ is the i-th selected OSD combination; the input parameters for calling the CRUSH algorithm are PGID and $r_i$, where PGID is the unique identifier of the PG and $r_i$ is a random factor.
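The idea of feeding a level-seeded random factor into the placement function can be illustrated with the sketch below. It does not call Ceph: `toy_crush` is a hypothetical hash-based stand-in for CRUSH(PGID, r_i), and the names `random_factors` and `candidate_combinations`, the 32-bit factor width, and the default replica and factor counts are all assumptions made for illustration.

```python
import hashlib
import random


def random_factors(level, count):
    """Generate `count` reproducible random factors r_i, using the level as the seed."""
    rng = random.Random(level)
    return [rng.getrandbits(32) for _ in range(count)]


def toy_crush(pgid, r, osds, replicas):
    """Stand-in for CRUSH(PGID, r): rank OSDs by a hash of (pgid, r, osd)
    and return the first `replicas` of them. This is NOT the real CRUSH algorithm."""
    def score(osd):
        digest = hashlib.sha256(f"{pgid}:{r}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(osds, key=score)[:replicas]


def candidate_combinations(pgid, level, osds, replicas=3, count=4):
    """Produce one candidate OSD combination R_i per random factor r_i."""
    return [toy_crush(pgid, r, osds, replicas) for r in random_factors(level, count)]
```

Because the generator is seeded with the level, the same PG and level always yield the same list of candidate combinations, which keeps the selection reproducible across nodes in this sketch.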
  5. The hierarchical mapping-based automatic balancing storage method for a Ceph storage system according to claim 1 or 2, characterized in that, in step (2), the trigger for the balancing strategy is evaluated when data is inserted, that is, during execution of the CRUSH algorithm, and global monitoring needs to be introduced to implement it.
  6. The hierarchical mapping-based automatic balancing storage method for a Ceph storage system according to claim 2, characterized in that, in step (3), the impact factor is used to measure the balanced storage condition of the target sub-storage pool before the PG is migrated, and the balanced storage condition of the target sub-storage pool after the PG is migrated according to this new level and new impact factor, specifically:
    for the balanced storage condition of a sub-storage pool, the quantitative expression is:
    Figure PCTCN2021094042-appb-100001
    where M is the average usage rate of the sub-storage pool, $x_j$ is the usage rate of each OSD in the sub-storage pool, and n is the number of OSDs in the sub-storage pool;
    using the value $\beta_r$ of a given PG before a given migration and the value $\beta_j$ after the migration, the impact factor $\delta$ of this PG on the storage balance value of the system in this migration is obtained as:
    Figure PCTCN2021094042-appb-100002
    where, if any usage rate in a group exceeds 1 after the PG migration, the impact factor of that group is set to -1, thereby ensuring that this PG migration does not cause a new OSD to become overloaded or completely unavailable.
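The overload guard described above can be sketched as follows. The formula images for the balance value and the impact factor are not reproduced in this text, so the sketch deliberately takes both as caller-supplied functions (`balance` and `delta`, hypothetical names) instead of guessing them; only the rule that a post-migration usage rate above 1 yields -1 comes from the claim.

```python
OVERLOADED = -1  # sentinel impact factor stated in the claim


def impact_factor(usage_before, usage_after, balance, delta):
    """Return the impact factor of one candidate migration.

    usage_before / usage_after: OSD usage rates (0..1) of the target sub-storage
    pool before and after the hypothetical migration.
    balance: function mapping a list of usage rates to a balance value (beta).
    delta: function combining (beta_before, beta_after) into an impact factor.
    Both functions are placeholders for the formulas given in the figures.
    """
    if any(u > 1 for u in usage_after):
        # The migration would overfill at least one OSD.
        return OVERLOADED
    return delta(balance(usage_before), balance(usage_after))
```

A caller might, purely as a placeholder, pass `statistics.pstdev` as `balance` and a simple difference as `delta`; neither choice should be read as the formula claimed in the patent.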
  7. The hierarchical mapping-based automatic balancing storage method for a Ceph storage system according to claim 1, characterized in that step (1) further comprises planning the hardware in the storage system and configuring the sub-storage pools, specifically:
    classifying and organizing the existing storage devices to ensure that the newly divided sub-storage pools are reasonably sized; in view of the randomness of PG level assignment and of PG data writes, and because comparisons between storage pools use the usage rate as the reference, the sub-storage pools are close in size;
    configuring each sub-storage pool, where each sub-storage pool may have its own threshold and its own number of random factors.
  8. The hierarchical mapping-based automatic balancing storage method for a Ceph storage system according to claim 1, characterized in that, after step (3) is completed, if there is no suitable migration target, the PG is removed from the sorted queue and the method returns to step (2).
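The fallback in claim 8 amounts to a retry loop over the sorted PG queue. The sketch below is one possible reading under stated assumptions: `find_migration` and `is_overloaded` are hypothetical callbacks standing in for step (3) and the threshold check of step (2), and `find_migration` is assumed to return `None` when no suitable migration target exists.

```python
def rebalance_osd(pg_sizes, find_migration, is_overloaded):
    """Try median-sized PGs until one can be migrated or the queue is exhausted.

    pg_sizes: maps a PG ID to its size for the overloaded OSD (assumed layout).
    Returns (pg_id, plan) for the first PG with a suitable target, else None.
    """
    queue = sorted(pg_sizes, key=pg_sizes.get)
    while queue and is_overloaded():      # back to the step (2) check
        pg = queue[len(queue) // 2]       # median-sized PG of the remaining queue
        plan = find_migration(pg)         # step (3)
        if plan is None:
            queue.remove(pg)              # claim 8: drop this PG and retry
            continue
        return pg, plan
    return None
```

Removing the rejected PG shifts the median, so each retry examines a different PG and the loop terminates once the queue is empty.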
PCT/CN2021/094042 2020-08-01 2021-05-17 Hierarchical mapping-based automatic balancing storage method for ceph storage system WO2022028033A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2023503089A JP2023536693A (en) 2020-08-01 2021-05-17 Automatic Balancing Storage Method for Ceph Storage Systems Based on Hierarchical Mapping

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010763358.5 2020-08-01
CN202010763358.5A CN111880747B (en) 2020-08-01 2020-08-01 Automatic balanced storage method of Ceph storage system based on hierarchical mapping

Publications (1)

Publication Number Publication Date
WO2022028033A1 (en) 2022-02-10

Family

ID=73205010

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094042 WO2022028033A1 (en) 2020-08-01 2021-05-17 Hierarchical mapping-based automatic balancing storage method for ceph storage system

Country Status (3)

Country Link
JP (1) JP2023536693A (en)
CN (1) CN111880747B (en)
WO (1) WO2022028033A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115202589A (en) * 2022-09-14 2022-10-18 浪潮电子信息产业股份有限公司 Placement group member selection method, device, equipment and readable storage medium
CN115796636A (en) * 2022-10-19 2023-03-14 江苏领悟信息技术有限公司 Double random extraction method for detection and inspection
CN116737380A (en) * 2023-06-16 2023-09-12 深圳市青葡萄科技有限公司 Balanced storage method, device and equipment for distributed memory and storage medium
CN116761177A (en) * 2023-08-21 2023-09-15 云镝智慧科技有限公司 Data acquisition method based on 5G gateway and related device thereof

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111880747B (en) * 2020-08-01 2022-11-08 广西大学 Automatic balanced storage method of Ceph storage system based on hierarchical mapping
CN112463043B (en) * 2020-11-20 2023-01-10 苏州浪潮智能科技有限公司 Storage cluster capacity expansion method, system and related device
CN112231137B (en) * 2020-12-14 2021-03-30 广东睿江云计算股份有限公司 Rebalancing method and system for distributed storage data
CN115277736A (en) * 2022-07-25 2022-11-01 中国工商银行股份有限公司 Automatic data balancing method and device for distributed block storage

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991170A (en) * 2017-04-01 2017-07-28 广东浪潮大数据研究有限公司 A kind of method and apparatus of distributed document capacity equilibrium
CN108509157A (en) * 2018-04-13 2018-09-07 郑州云海信息技术有限公司 A kind of data balancing method and device applied to distributed file system
CN109344143A (en) * 2018-10-25 2019-02-15 电子科技大学成都学院 A kind of distributed type assemblies Data Migration optimization method based on Ceph
WO2020107829A1 (en) * 2018-11-28 2020-06-04 平安科技(深圳)有限公司 Fault processing method, apparatus, distributed storage system, and storage medium
CN111880747A (en) * 2020-08-01 2020-11-03 广西大学 Automatic balanced storage method of Ceph storage system based on hierarchical mapping

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101074454B1 (en) * 2009-08-18 2011-10-18 연세대학교 산학협력단 Adaptive equalization device and equalizing method
US9116630B2 (en) * 2013-08-30 2015-08-25 Nimble Storage, Inc. Method and system for migrating data between storage devices of a storage array
CN103645860B (en) * 2013-11-27 2017-01-25 华为技术有限公司 Memory space management method and memory management device
CN106055277A (en) * 2016-05-31 2016-10-26 重庆大学 Decentralized distributed heterogeneous storage system data distribution method
US11119654B2 (en) * 2018-07-10 2021-09-14 International Business Machines Corporation Determining an optimal storage environment for data sets and for migrating data sets
US10713155B2 (en) * 2018-07-19 2020-07-14 Micron Technology, Inc. Biased sampling methodology for wear leveling
US10795810B2 (en) * 2018-09-10 2020-10-06 Micron Technology, Inc. Wear-leveling scheme for memory subsystems

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115202589A (en) * 2022-09-14 2022-10-18 浪潮电子信息产业股份有限公司 Placement group member selection method, device, equipment and readable storage medium
CN115202589B (en) * 2022-09-14 2023-02-24 浪潮电子信息产业股份有限公司 Placement group member selection method, device and equipment and readable storage medium
CN115796636A (en) * 2022-10-19 2023-03-14 江苏领悟信息技术有限公司 Double random extraction method for detection and inspection
CN116737380A (en) * 2023-06-16 2023-09-12 深圳市青葡萄科技有限公司 Balanced storage method, device and equipment for distributed memory and storage medium
CN116737380B (en) * 2023-06-16 2024-02-09 深圳市青葡萄科技有限公司 Balanced storage method, device and equipment for distributed memory and storage medium
CN116761177A (en) * 2023-08-21 2023-09-15 云镝智慧科技有限公司 Data acquisition method based on 5G gateway and related device thereof
CN116761177B (en) * 2023-08-21 2023-10-20 云镝智慧科技有限公司 Data acquisition method based on 5G gateway and related device thereof

Also Published As

Publication number Publication date
CN111880747B (en) 2022-11-08
CN111880747A (en) 2020-11-03
JP2023536693A (en) 2023-08-29

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21852182

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2023503089

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21852182

Country of ref document: EP

Kind code of ref document: A1