WO2016101751A1 - Master-slave balancing method and apparatus in a distributed storage system - Google Patents

Master-slave balancing method and apparatus in a distributed storage system

Info

Publication number
WO2016101751A1
WO2016101751A1 (application PCT/CN2015/095461)
Authority
WO
WIPO (PCT)
Prior art keywords
master
slave
machines
copy
same slice
Prior art date
Application number
PCT/CN2015/095461
Other languages
English (en)
French (fr)
Inventor
宋昭
陈宗志
王超
李明昊
陈营
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司 filed Critical 北京奇虎科技有限公司
Publication of WO2016101751A1 publication Critical patent/WO2016101751A1/zh

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 - Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 - Support for services or applications
    • H04L67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L67/01 - Protocols
    • H04L67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L67/1001 - Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers

Definitions

  • the present invention relates to the field of distributed storage technologies, and in particular, to a master-slave balancing method and apparatus in a distributed storage system.
  • data sharding can distribute the whole data on multiple machines.
  • the overall data is divided into 1024 shards.
  • Assuming the number of machines to share the data is N, each machine stores 1024/N shards, which satisfies the performance requirements of the distributed storage system.
  • The machine holding the master copy provides read and write services for the shard, and when that machine goes down, one of the slave copies can be promoted to master copy through a master-slave switch.
  • To balance load across machines, the traditional scheme manually performs master-slave adjustment of the shards whenever the masters and slaves are unevenly distributed, which undoubtedly increases the cost of manual operation and maintenance in a distributed environment. Such adjustment mainly consists of migrating all shards on all machines between multiple machines, and all machines must suspend service during the migration, so an unreasonable adjustment scheme easily produces redundant migration operations.
  • Redundant migration operations may specifically include migrations repeatedly performed on a particular shard during the master-slave adjustment process; such redundant migrations undoubtedly lengthen the service suspension time and thereby hurt the throughput and other performance of the distributed storage system.
  • In view of the above problems, the present invention is proposed in order to provide a master-slave balancing method and apparatus in a distributed storage system that overcomes the above problems or at least partially solves them.
  • According to one aspect of the present invention, a master-slave balancing method in a distributed storage system is provided, including: generating combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard; for each combination, counting the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and, when that distribution information meets a preset switching condition, performing master-slave adjustment on the copies of the same shards.
  • According to another aspect, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to perform the master-slave balancing method in a distributed storage system described above.
  • According to a further aspect, a master-slave balancing apparatus in a distributed storage system is provided, including:
  • a combination generation module, configured to generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
  • a distribution statistics module, configured to count, for each combination, the master-slave distribution information of the same shards in the corresponding M machines; and
  • a master-slave adjustment module, configured to perform master-slave adjustment on the copies of the same shards when the master-slave distribution information of the same shards meets the preset switching condition.
  • Accordingly, the master-slave distribution information of the same shards in the M machines can be counted in each combination, and when that distribution information meets the preset switching condition, master-slave adjustment is performed on the copies of the same shards.
  • Because each combination consists of M machines taken from all machines storing shards, with M equal to the number of copies per shard, a particular shard shared by the M machines (for example, shard 1) appears in exactly one combination and in no other; the embodiments of the present invention therefore guarantee that shard 1 is adjusted in only one combination, achieving master-slave adjustment without redundancy.
  • Compared with the redundant migration operations that the traditional scheme repeatedly performs on a given shard during master-slave adjustment, this greatly reduces the portion of service suspension time caused by redundant migrations, thereby improving the throughput and other performance of the distributed storage system.
  • FIG. 1 is a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of the storage structure of shards in a distributed storage system according to an example of the present invention;
  • FIG. 3 is a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
  • FIG. 4 is a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
  • FIG. 5 is a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a master-slave balancing apparatus in a distributed storage system according to an embodiment of the present invention;
  • FIG. 7 schematically shows a block diagram of a computing device for performing the method according to the invention; and
  • FIG. 8 schematically shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • Referring to FIG. 1, a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention is shown, which may include the following steps:
  • Step 101: generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard (a minimal sketch of this step follows the list);
  • Step 102: for each combination, count the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
  • Step 103: when the master-slave distribution information of the same shards meets a preset switching condition, perform master-slave adjustment on the copies of the same shards.
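  • As an illustration of step 101 only, the following Python sketch enumerates the C(N, M) combinations of machines; the function and variable names are assumptions introduced here, not the patent's own implementation.

```python
from itertools import combinations

def generate_combinations(machines, copies_per_shard):
    """Return every combination of M machines, where M equals the number of
    copies kept for each shard."""
    return list(combinations(sorted(machines), copies_per_shard))

# Example from this description: 6 machines, 3 copies per shard -> C(6, 3) = 20.
machines = ["A", "B", "C", "D", "E", "F"]
print(len(generate_combinations(machines, 3)))  # 20
```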
  • the embodiments of the present invention can be applied to various distributed storage systems.
  • In a distributed storage system, although the master copies can be distributed evenly across all machines during the initialization phase, over time factors such as the machine holding a master copy going down cause the distribution of master copies to become more and more uneven, even to the point where every copy on some machine is a master copy.
  • The embodiments of the present invention can automatically balance the master and slave copies allocated in the distributed storage system, reducing the cost of manual operation and maintenance in the distributed environment, reducing the number of machines whose service is suspended because of master-slave balancing, and improving the performance of the distributed storage system.
  • Taking a combination of three machines as an example, the master-slave distribution information of the same shards in the three machines can be counted: if the number of same shards in the three machines is 100, then the master-slave distribution information of those 100 shards needs to be counted, that is, on which of the three machines the master copy and the slave copies of each of the 100 shards are located.
  • The master-slave distribution information of the same shards may specifically include: the maximum number of master copies of the same shards on any one of the M machines together with the corresponding first machine, and the minimum number of master copies of the same shards together with the corresponding second machine; the master-slave distribution information of the same shards meeting the preset switching condition may then specifically include: the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than a threshold.
  • For example, the numbers of master copies of the above 100 shards on the three machines can be sorted in descending order, yielding the maximum, intermediate and minimum master-copy counts of the 100 shards, that is, the order 50, 30, 20.
  • The threshold may, for example, be 2, so that the preset switching condition is met as soon as the difference between the maximum and minimum master-copy counts of the same shards exceeds 2; for instance, if the maximum, intermediate and minimum master-copy counts of the above 100 shards are 35, 33 and 32 respectively, then because the difference between 35 and 32 is greater than 2, the preset switching condition can be considered met.
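  • A minimal sketch of this switching condition, assuming the per-machine master counts are already available as a mapping (the helper name and data shape are assumptions for illustration):

```python
def needs_adjustment(master_counts, threshold=2):
    """master_counts maps each machine in one combination to the number of
    master copies it holds among the shards shared by that combination."""
    return max(master_counts.values()) - min(master_counts.values()) > threshold

print(needs_adjustment({"A": 35, "B": 32, "C": 33}, threshold=2))  # True: 35 - 32 = 3 > 2
print(needs_adjustment({"A": 34, "B": 33, "C": 33}, threshold=2))  # False
```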
  • By traversing all combinations and adjusting the copies of the same shards in each of them, the master-slave balancing of all shards can be completed, so that the master and slave copies of all shards are balanced across all machines, that is, global balance is achieved.
  • In summary, the master-slave distribution information of the same shards in the M machines can be counted in each combination, and when that distribution information meets the preset switching condition, master-slave adjustment is performed on the copies of the same shards.
  • Because each combination consists of M machines taken from all machines storing shards, with M equal to the number of copies per shard, a particular shard shared by the M machines (for example, shard 1) appears in exactly one combination and in no other, so the embodiments of the present invention guarantee that shard 1 is adjusted in only one combination, achieving master-slave adjustment without redundancy.
  • Compared with the redundant migration operations that the traditional scheme repeatedly performs on a given shard during master-slave adjustment, this greatly reduces the portion of service suspension time caused by redundant migrations, thereby improving the throughput and other performance of the distributed storage system.
  • Referring to FIG. 3, a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention is shown.
  • Step 301: generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
  • Step 302: for each combination, count the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines;
  • Step 303: when the master-slave distribution information of the same shards meets a preset switching condition, determine the optimal master-slave distribution sequence corresponding to the M machines; and
  • Step 304: according to the optimal master-slave distribution sequence, perform migration operations for the copies of the same shards between the M machines.
  • Compared with the embodiment of FIG. 1, this embodiment embodies the step of performing master-slave adjustment on the copies of the same shards as: determining the optimal master-slave distribution sequence corresponding to the M machines and, according to that sequence, performing migration operations for the master copies of the same shards between the M machines. The optimal master-slave distribution sequence guarantees that the master-slave adjustment of the same shards in each combination is achieved with the fewest migration operations, which greatly reduces the service suspension time caused by master-slave adjustment and further improves the throughput and other performance of the distributed storage system.
  • In an application example, the optimal master-slave distribution sequence may be obtained from the average of the number of same shards. For example, if the number of same shards is 100, the optimal master-copy distribution sequence of the 100 shards may be determined as 34, 33 and 33, where 34 corresponds to the first machine with the maximum master-copy count, so as to guarantee that the master-slave adjustment of the same shards in each combination is achieved with the fewest migrations.
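  • One way to derive such a sequence from the average, shown as a hedged sketch; the split rule (the remainder goes to the machines currently holding the most masters) is an assumption consistent with the 34/33/33 example above:

```python
def optimal_master_sequence(same_shard_count, num_machines):
    """Split the shared shards as evenly as possible; the remainder goes to
    the machines listed first (here: those holding the most master copies)."""
    base, remainder = divmod(same_shard_count, num_machines)
    return [base + 1 if i < remainder else base for i in range(num_machines)]

print(optimal_master_sequence(100, 3))  # [34, 33, 33]
```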
  • The optimal master-slave distribution sequence may specifically include an optimal master-copy distribution sequence and an optimal slave-copy distribution sequence.
  • The step of performing migration operations for the copies of the same shards between the M machines may then specifically include: migrating the master copies of the same shards according to the optimal master-copy distribution sequence, and migrating the slave copies of the same shards according to the optimal slave-copy distribution sequence.
  • Migrating the master copies of the same shards according to the optimal master-copy distribution sequence may include: according to that sequence, migrating the master copies of the same shards from machines with more master copies to machines with fewer master copies.
  • Machines with more master copies and machines with fewer master copies may be determined by comparing the master-copy counts between machines, or according to the optimal master-copy distribution sequence; the embodiments of the present invention do not restrict how they are determined.
  • In the above example, the maximum, intermediate and minimum master-copy counts are 50, 30 and 20 respectively. According to the optimal master-copy distribution sequence of 34, 33 and 33, it can first be determined that the final master-copy counts of the first machine (holding the maximum), the third machine (holding the intermediate) and the second machine (holding the minimum) are 34, 33 and 33; the master copies of the 100 shards are then migrated from the first machine to the third machine and to the second machine, where the termination condition of the first migration operation, from the first machine to the third machine, is that the master-copy count of the third machine reaches 33, and the termination condition of the second migration operation, from the first machine to the second machine, is that the master-copy count of the second machine reaches 33.
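  • A minimal sketch of planning these master-copy migrations, assuming the current counts and the targets from the optimal sequence are given as mappings; the function name and the tuple format of a move are assumptions for illustration:

```python
def plan_master_migrations(master_counts, targets):
    """Both arguments map machine -> master-copy count. Returns a list of
    (source, destination, copies_to_move) tuples; each transfer terminates
    once the destination reaches its target."""
    surplus = {m: c - targets[m] for m, c in master_counts.items() if c > targets[m]}
    deficit = {m: targets[m] - c for m, c in master_counts.items() if c < targets[m]}
    moves = []
    for src in surplus:
        for dst in deficit:
            n = min(surplus[src], deficit[dst])
            if n > 0:
                moves.append((src, dst, n))
                surplus[src] -= n
                deficit[dst] -= n
    return moves

# Example from this description: counts 50/30/20 against targets 34/33/33.
print(plan_master_migrations({"first": 50, "third": 30, "second": 20},
                             {"first": 34, "third": 33, "second": 33}))
# [('first', 'third', 3), ('first', 'second', 13)]
```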
  • Alternatively, the reversibility between the migration operations performed for the slave copies of the same shards and the migration operations performed for the master copies of the same shards can be exploited to carry out the slave-copy migrations.
  • For example, migrating the master copy of shard i from the first machine to the second machine and migrating the slave copy of shard i from the second machine to the first machine may be mutually reversible operations; likewise, migrating the master copy of shard i from the first machine to the third machine and migrating the slave copy of shard i from the third machine to the first machine may also be mutually reversible operations, and so on.
  • The migration operations for the slave copies of the same shards and for the master copies of the same shards may be executed in parallel or sequentially; the embodiments of the present invention do not restrict their execution order.
  • Migrating the master copy of shard i from the first machine to the second machine, and from the first machine to the third machine, are only examples; those skilled in the art may also, according to the actual situation, migrate the master copy of shard i from the third machine to the second machine. For example, if the maximum, intermediate and minimum master-copy counts of the 100 shards are 50, 40 and 10 respectively, the operation of migrating master copies from the third machine to the second machine needs to be performed.
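  • A sketch of exploiting this reversibility: each master-copy move is simply mirrored to obtain the slave-copy move, so no separate optimal slave sequence has to be computed (the move representation is an assumption introduced here):

```python
def inverse_slave_moves(master_moves):
    """master_moves: list of (shard_id, src_machine, dst_machine) describing
    master-copy migrations. Each is paired with the reverse slave-copy move."""
    return [(shard, dst, src) for shard, src, dst in master_moves]

master_moves = [(1, "A", "C"), (7, "A", "B")]
print(inverse_slave_moves(master_moves))  # [(1, 'C', 'A'), (7, 'B', 'A')]
```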
  • Referring to FIG. 4, a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention is shown.
  • Step 401: generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
  • Step 402: obtain the master-slave distribution information of all shards from the metadata information maintained by each machine;
  • Step 403: according to the master-slave distribution information of all shards, count, for each combination, the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines;
  • Step 404: when the master-slave distribution information of the same shards meets the preset switching condition, perform master-slave adjustment on the copies of the same shards.
  • Compared with the embodiment of FIG. 1, this embodiment embodies the step of counting, for each combination, the master-slave distribution information of the same shards in the corresponding M machines as: obtaining the master-slave distribution information of all shards from the metadata information maintained by each machine, and counting from it the master-slave distribution information of the same shards in the M machines of each combination. Because each machine maintains the latest master-slave distribution information of the shards in its metadata, the information obtained this way is the latest real-time information, which guarantees the convenience of obtaining the master-slave distribution information of all shards, improves acquisition efficiency, and guarantees the timeliness of that information.
  • The method may further include a step in which each machine maintains metadata information.
  • The step of each machine maintaining metadata information may specifically include:
  • Sub-step A1: when a machine's own state changes, updating the metadata information according to the change of its own state; and
  • Sub-step A2: broadcasting the update of the metadata information to all machines.
  • In practice, each machine can maintain the master-slave distribution information of all shards through metadata information; the master-slave distribution information of a shard may mainly include the identification information of the machines corresponding to the three copies P1,i, P2,i and P3,i of shard i.
  • For example, machine A originally stores the master copy P1,i of shard i; after machine A goes down, the machine corresponding to the master copy P1,i can be updated to machine B, and the machine corresponding to P2,i, originally stored on machine B, can be updated to machine A.
  • The update of the metadata information can then be broadcast to all machines, so that all machines perform the corresponding synchronous update, which guarantees the consistency and timeliness of the metadata information maintained by all machines.
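  • A toy sketch of this maintenance loop, assuming an in-process "broadcast" stands in for whatever RPC the real system would use; the class, field names and placement layout are assumptions for illustration:

```python
class MetadataNode:
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster   # list of all MetadataNode objects
        self.metadata = {}       # shard_id -> {"P1": machine, "P2": machine, "P3": machine}

    def on_state_change(self, shard_id, new_placement):
        """Update local metadata after a local state change, then broadcast."""
        self.metadata[shard_id] = dict(new_placement)
        self.broadcast(shard_id, new_placement)

    def broadcast(self, shard_id, placement):
        for node in self.cluster:
            if node is not self:
                node.metadata[shard_id] = dict(placement)

cluster = []
nodes = {n: MetadataNode(n, cluster) for n in ["A", "B", "C"]}
cluster.extend(nodes.values())
# Machine A goes down: the master of shard 5 fails over from A to B.
nodes["B"].on_state_change(5, {"P1": "B", "P2": "A", "P3": "C"})
print(nodes["C"].metadata[5])  # {'P1': 'B', 'P2': 'A', 'P3': 'C'}
```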
  • Referring to FIG. 5, a flowchart of the steps of a master-slave balancing method in a distributed storage system according to an example of the present invention is shown, which may specifically include the following steps:
  • Step 501: generate the 20 combinations of 3 machines taken from the 6 machines storing 1024 shards, where the number of copies per shard is 3;
  • Step 502: for each combination, count the master-slave distribution information of the same shards in the corresponding 3 machines, where the same shards are shards whose master copy and slave copies are all stored on the 3 machines;
  • Step 503: when the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than 1, determine the optimal master-copy distribution sequence corresponding to the 3 machines;
  • Step 504: according to the optimal master-copy distribution sequence, perform migration operations for the master copies of the same shards; and
  • Step 505: using the reversibility between the slave-copy migrations and the master-copy migrations for the same shards, perform migration operations for the slave copies of the same shards.
  • Taking the combination corresponding to machine A, machine B and machine C as an example, the master-slave distribution information of the same shards in the three machines can be counted; if the number of same shards in the three machines is 100, then the master-slave distribution information of those 100 shards needs to be counted, that is, on which of the three machines the master copy and the slave copies of each of the 100 shards are located.
  • Assuming the numbers of master copies of the 100 shards on the three machines are 50, 30 and 20, that is, the maximum, intermediate and minimum master-copy counts are 50, 30 and 20 respectively, then, because the difference between 50 and 20 is greater than 1, the optimal master-copy distribution sequence of the 100 shards can first be obtained from the average of the number of same shards: 34, 33 and 33, where 34 corresponds to machine A, 33 to machine B and 33 to machine C.
  • The master copies of the 100 shards can then be migrated from machine A to machine B and to machine C, where the termination condition of the first migration operation from machine A to machine B is that the master-copy count of machine B reaches 33, and the termination condition of the second migration operation from machine A to machine C is that the master-copy count of machine C reaches 33; the first and second migration operations may be executed in parallel to improve the efficiency of master-slave adjustment.
  • Because the third operation of migrating the master copy of shard i from machine A to machine C and the fourth operation of migrating the slave copy of shard i from machine C to machine A are mutually reversible, the third and fourth operations can be executed simultaneously, or one immediately after the other in either order; exploiting this reversibility to migrate the slave copies of the same shards saves the operation of determining an optimal slave-copy distribution sequence and therefore improves the efficiency of master-slave adjustment.
  • Referring to FIG. 6, a schematic structural diagram of a master-slave balancing apparatus in a distributed storage system according to an embodiment of the present invention is shown, which may specifically include the following modules:
  • a combination generation module 601, configured to generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
  • a distribution statistics module 602, configured to count, for each combination, the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
  • a master-slave adjustment module 603, configured to perform master-slave adjustment on the copies of the same shards when the master-slave distribution information of the same shards meets the preset switching condition.
  • The master-slave distribution information of the same shards may specifically include: the maximum number of master copies of the same shards on any one of the M machines, and the minimum number of master copies of the same shards.
  • The master-slave distribution information of the same shards meeting the preset switching condition may then include: the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than a threshold.
  • The master-slave adjustment module 603 may specifically include:
  • a determination submodule, configured to determine the optimal master-slave distribution sequence corresponding to the M machines; and
  • a migration submodule, configured to perform migration operations for the copies of the same shards between the M machines according to the optimal master-slave distribution sequence.
  • The optimal master-slave distribution sequence may specifically include: an optimal master-copy distribution sequence and an optimal slave-copy distribution sequence.
  • The migration submodule may further include:
  • a first migration unit, configured to perform migration operations for the master copies of the same shards according to the optimal master-copy distribution sequence; and
  • a second migration unit, configured to perform migration operations for the slave copies of the same shards according to the optimal slave-copy distribution sequence;
  • where the first migration unit may be specifically configured to migrate the master copies of the same shards from machines with more master copies to machines with fewer master copies according to the optimal master-copy distribution sequence.
  • The distribution statistics module 602 may specifically include:
  • a distribution acquisition submodule, configured to obtain the master-slave distribution information of all shards from the metadata information maintained by each machine; and
  • a statistics submodule, configured to count, according to the master-slave distribution information of all shards, the master-slave distribution information of the same shards in the M machines corresponding to each combination.
  • the device may further include:
  • a maintenance module configured to maintain metadata information through each machine
  • the maintenance module may specifically include:
  • an update submodule, configured to update the metadata information according to the change of the machine's own state when the machine's state changes; and
  • a broadcast submodule configured to broadcast an update of the metadata information to all machines.
  • As for the apparatus embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for relevant parts, refer to the description of the method embodiment.
  • the various component embodiments of the present invention may be implemented in hardware, or in a software module running on one or more processors, or in a combination thereof.
  • A microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the master-slave balancing method and apparatus in a distributed storage system according to embodiments of the present invention.
  • The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein.
  • Such a program implementing the invention may be stored on a computer readable medium or may take the form of one or more signals; such signals may be downloaded from an Internet platform, provided on a carrier signal, or provided in any other form.
  • Figure 7 illustrates a computing device, such as a search engine server, that can implement the above described method in accordance with the present invention.
  • the computing device conventionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 730.
  • the memory 730 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • Memory 730 has a storage space 750 that stores program code 751 for performing any of the method steps described above.
  • storage space 750 storing program code may include various program code 751 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • Such computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit such as that shown in FIG. 8.
  • The storage unit may have storage segments, storage spaces, and the like arranged similarly to the memory 730 in the computing device of FIG. 7.
  • The program code can, for example, be compressed in an appropriate form.
  • Typically, the storage unit comprises computer readable code 751' for performing the steps of the method according to the invention, that is, code that can be read by a processor such as 710, which, when run by a server, causes the server to execute the various steps of the method described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Hardware Redundancy (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a master-slave balancing method and apparatus in a distributed storage system. The method specifically includes: generating combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard; for each of the combinations, counting the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and, when the master-slave distribution information of the same shards meets a preset switching condition, performing master-slave adjustment on the copies of the same shards. Embodiments of the present invention reduce the cost of manual operation and maintenance in a distributed environment and reduce the portion of service suspension time caused by redundant migration operations.

Description

Master-slave balancing method and apparatus in a distributed storage system
Technical Field
The present invention relates to the field of distributed storage technologies, and in particular to a master-slave balancing method and apparatus in a distributed storage system.
Background
In a distributed storage system, data sharding spreads the overall data across multiple machines. Assuming that the overall data is divided into 1024 shards and the number of machines to share the load is N, each machine stores 1024/N shards, which satisfies the performance requirements of the distributed storage system.
Further, to prevent data loss caused by operator error or machine failure on a single machine, multiple copies can be configured for each shard. If the number of copies per shard is M, one of them is the master copy and the other M-1 are slave copies. The machine holding the master copy provides read and write services for the shard; when that machine goes down, one of the slave copies can be promoted to master copy through a master-slave switch.
While the above distributed storage system provides read and write services for the shards, the following problem can arise: all the copies stored on one machine are master copies while all the copies stored on the other machines are slave copies. In that case the heavily loaded machine is prone to going down while the other machines sit idle, and load balancing across the machines cannot be achieved.
To balance the load across machines, the traditional scheme manually performs master-slave adjustment of the shards when the masters and slaves are unevenly distributed, which undoubtedly increases the cost of manual operation and maintenance in a distributed environment. Moreover, such master-slave adjustment mainly consists of migration operations for all shards on all machines between multiple machines, and all machines must suspend service during these migrations. If the adjustment scheme is unreasonable, redundant migration operations easily occur; such redundant migration operations may specifically include migrations repeatedly performed on the same shard during the adjustment, and they inevitably lengthen the service suspension time and thereby hurt the throughput and other performance of the distributed storage system.
Summary of the Invention
In view of the above problems, the present invention is proposed in order to provide a master-slave balancing method and apparatus in a distributed storage system that overcomes the above problems or at least partially solves them.
According to one aspect of the present invention, a master-slave balancing method in a distributed storage system is provided, including:
generating combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
for each of the combinations, counting the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
when the master-slave distribution information of the same shards meets a preset switching condition, performing master-slave adjustment on the copies of the same shards.
According to another aspect of the present invention, a computer program is provided, comprising computer readable code which, when run on a computing device, causes the computing device to execute the master-slave balancing method in a distributed storage system described above.
According to a further aspect of the present invention, a computer readable medium is provided, in which the above computer program is stored.
According to yet another aspect of the present invention, a master-slave balancing apparatus in a distributed storage system is provided, including:
a combination generation module, configured to generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
a distribution statistics module, configured to count, for each of the combinations, the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
a master-slave adjustment module, configured to perform master-slave adjustment on the copies of the same shards when the master-slave distribution information of the same shards meets the preset switching condition.
With the master-slave balancing method and apparatus in a distributed storage system according to embodiments of the present invention, the master-slave distribution information of the same shards in the M machines can be counted for each combination, and when that distribution information meets the preset switching condition, master-slave adjustment is performed on the copies of the same shards.
First, because the master-slave balancing scheme of the embodiments of the present invention is executed automatically, compared with the manual master-slave adjustment of the traditional scheme it greatly reduces the cost of manual operation and maintenance in a distributed environment.
In addition, because each combination of the embodiments of the present invention consists of M machines taken from all machines storing shards, with M equal to the number of copies per shard, a particular shard shared by the M machines (for example, shard 1) appears in exactly one combination and in no other. The embodiments of the present invention therefore guarantee that shard 1 is adjusted in only one combination, ultimately achieving master-slave adjustment without redundancy. Compared with the redundant migration operations that the traditional scheme repeatedly performs on a given shard during master-slave adjustment, this greatly reduces the portion of service suspension time caused by redundant migrations and thereby improves the throughput and other performance of the distributed storage system.
The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly, it can be implemented according to the contents of the specification; and in order to make the above and other objects, features and advantages of the present invention more obvious and understandable, specific embodiments of the present invention are set forth below.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the optional embodiments. The drawings are only for the purpose of illustrating the optional embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the storage structure of shards in a distributed storage system according to an example of the present invention;
FIG. 3 is a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a master-slave balancing apparatus in a distributed storage system according to an embodiment of the present invention;
FIG. 7 schematically shows a block diagram of a computing device for executing the method according to the present invention; and
FIG. 8 schematically shows a storage unit for holding or carrying program code implementing the method according to the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Referring to FIG. 1, a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 101: generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
Step 102: for each of the combinations, count the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
Step 103: when the master-slave distribution information of the same shards meets a preset switching condition, perform master-slave adjustment on the copies of the same shards.
Embodiments of the present invention can be applied to various distributed storage systems. In a distributed storage system, although the master copies can be distributed evenly across all machines by configuration during the initialization phase, over time factors such as the machine holding a master copy going down cause the distribution of master copies to become more and more uneven, even to the point where every copy on some machine is a master copy. Embodiments of the present invention can automatically balance the master and slave copies allocated in the distributed storage system, reducing the cost of manual operation and maintenance in a distributed environment, reducing the number of machines whose service is suspended because of master-slave balancing, and improving the performance of the distributed storage system.
Referring to FIG. 2, a schematic diagram of the storage structure of shards in a distributed storage system according to an example of the present invention is shown. The example uses 6 machines to store 1024 shards, where each shard has 3 copies, that is, 1 master copy and 2 slave copies. If the 6 machines are denoted machine A, machine B, machine C, machine D, machine E and machine F, and the 3 copies of shard i are denoted P1,i, P2,i and P3,i, where 1 ≤ i ≤ 1024, then the number of shard copies stored on each machine is 1024 * 3 / 6 = 512.
For the above example, the combinations of M machines taken from all machines storing shards are the combinations of 3 machines taken from the 6 machines, namely C(6,3) = 20 combinations.
For the above example, every shard has 3 copies distributed on different machines, so globally any 3 machines share some of the same shards. Performing master-slave adjustment for each combination therefore adjusts the master and slave copies of the same shards on any 3 machines and achieves balance of the master and slave copies across all machines, that is, global balance. Here, the same shards are shards whose master copy and slave copies are all stored on the M machines. For example, the 3 copies P1,1, P2,1 and P3,1 of shard 1 are stored on machine A, machine B and machine C respectively, so shard 1 is a same shard of the combination corresponding to machine A, machine B and machine C; similarly, the 3 copies P1,501, P2,501 and P3,501 of shard 501 are stored on machine C, machine D and machine E respectively, so shard 501 is a same shard of the combination corresponding to machine C, machine D and machine E.
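As an illustration of identifying the same shards of one combination and counting their master copies per machine, the following sketch assumes a placement structure of shard -> (master_machine, [slave_machines]); the structure and names are assumptions introduced here, not taken from the patent.

```python
def count_masters_for_combination(placement, combo):
    """placement: shard_id -> (master_machine, [slave_machines]).
    Returns (number of shards shared by this combination,
             master-copy count per machine in the combination)."""
    combo_set = set(combo)
    counts = {m: 0 for m in combo}
    shared = 0
    for shard, (master, slaves) in placement.items():
        if set([master, *slaves]) == combo_set:  # all copies live on exactly these machines
            shared += 1
            counts[master] += 1
    return shared, counts

placement = {
    1: ("A", ["B", "C"]),
    2: ("A", ["C", "B"]),
    3: ("B", ["A", "C"]),
    501: ("C", ["D", "E"]),  # belongs to a different combination
}
print(count_masters_for_combination(placement, ("A", "B", "C")))
# (3, {'A': 2, 'B': 1, 'C': 0})
```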
Taking the combination corresponding to machine A, machine B and machine C as an example, the master-slave distribution information of the same shards in these 3 machines can be counted. Assuming the number of same shards in these 3 machines is 100, the master-slave distribution information of these 100 shards needs to be counted, that is, on which of the 3 machines the master copy and the slave copies of each of the 100 shards are located.
In embodiments of the present invention, the preset switching condition can express various conditions under which the uneven master-slave distribution within the local scope of the same shards of these 3 machines triggers switching. For example, if the numbers of master copies of the above 100 shards on the 3 machines are 50, 30 and 20 respectively, the master-slave distribution of the 100 shards is clearly uneven, and in this case the preset switching condition can be considered met.
In an optional embodiment of the present invention, the master-slave distribution information of the same shards may specifically include: the maximum number of master copies of the same shards on any one of the M machines together with the corresponding first machine, and the minimum number of master copies of the same shards together with the corresponding second machine. The master-slave distribution information of the same shards meeting the preset switching condition may then specifically include: the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than a threshold.
For example, the numbers of master copies of the above 100 shards on the 3 machines can be sorted in descending order, yielding the order of the maximum, intermediate and minimum master-copy counts of the 100 shards, that is, the order 50, 30, 20. The threshold may, for example, be 2, so that as soon as the difference between the maximum and minimum master-copy counts of the same shards exceeds 2, the preset switching condition is met; if the maximum, intermediate and minimum master-copy counts of the above 100 shards are 35, 33 and 32 respectively, then because the difference between 35 and 32 is greater than 2, the preset switching condition can be considered met. Of course, the above value is only an example, and those skilled in the art can determine the threshold according to the actual situation. For example, when there is no restriction on how frequently the copies of the same shards are adjusted, a smaller threshold such as 1 or 2 can be used; conversely, considering that master-slave adjustment of the copies of the same shards causes service suspension, it may be desirable to adjust them less frequently, and a larger threshold such as 3, 4 or 5 can be used.
By following the flow of steps 101 to 103 above, traversing all combinations and adjusting the copies of the same shards in every combination, the master-slave balancing of all shards can be completed, so that the master and slave copies of all shards are balanced across all machines, that is, global balance is achieved.
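A hedged end-to-end sketch of such a global pass, reusing the illustrative helpers sketched earlier in this document (count_masters_for_combination, needs_adjustment, optimal_master_sequence, plan_master_migrations); all names and the placement layout are assumptions, not the patent's implementation.

```python
from itertools import combinations

def balance_all(machines, placement, copies_per_shard=3, threshold=2):
    """One full balancing pass over every C(N, M) combination."""
    for combo in combinations(sorted(machines), copies_per_shard):
        shared, counts = count_masters_for_combination(placement, combo)
        if shared == 0 or not needs_adjustment(counts, threshold):
            continue
        # Machines sorted by current master count receive the targets 34, 33, 33, ...
        ordered = sorted(counts, key=counts.get, reverse=True)
        targets = dict(zip(ordered, optimal_master_sequence(shared, copies_per_shard)))
        for src, dst, n in plan_master_migrations(counts, targets):
            migrate_masters(placement, set(combo), src, dst, n)

def migrate_masters(placement, combo_set, src, dst, n):
    """Move n master copies of the combination's shared shards from src to dst;
    each master move is paired with the inverse slave move (dst -> src)."""
    for shard, (master, slaves) in list(placement.items()):
        if n == 0:
            break
        if master == src and set([master, *slaves]) == combo_set:
            placement[shard] = (dst, [src if s == dst else s for s in slaves])
            n -= 1
```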
In summary, with the master-slave balancing scheme in a distributed storage system according to embodiments of the present invention, the master-slave distribution information of the same shards in the M machines can be counted for each combination, and when that distribution information meets the preset switching condition, master-slave adjustment is performed on the copies of the same shards.
First, because the master-slave balancing scheme of the embodiments of the present invention is executed automatically, compared with the manual master-slave adjustment of the traditional scheme it greatly reduces the cost of manual operation and maintenance in a distributed environment.
In addition, because each combination of the embodiments of the present invention consists of M machines taken from all machines storing shards, with M equal to the number of copies per shard, a particular shard shared by the M machines (for example, shard 1) appears in exactly one combination and in no other. The embodiments of the present invention therefore guarantee that shard 1 is adjusted in only one combination, ultimately achieving master-slave adjustment without redundancy. Compared with the redundant migration operations that the traditional scheme repeatedly performs on a given shard during master-slave adjustment, this greatly reduces the portion of service suspension time caused by redundant migrations and thereby improves the throughput and other performance of the distributed storage system.
Referring to FIG. 3, a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 301: generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
Step 302: for each of the combinations, count the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines;
Step 303: when the master-slave distribution information of the same shards meets a preset switching condition, determine the optimal master-slave distribution sequence corresponding to the M machines; and
Step 304: according to the optimal master-slave distribution sequence, perform migration operations for the copies of the same shards between the M machines.
Compared with the embodiment shown in FIG. 1, this embodiment embodies the step of performing master-slave adjustment on the copies of the same shards when their master-slave distribution information meets the preset switching condition as: determining the optimal master-slave distribution sequence corresponding to the M machines, and, according to that sequence, performing migration operations for the master copies of the same shards between the M machines. The optimal master-slave distribution sequence guarantees that the master-slave adjustment of the same shards in each combination is achieved with the fewest migration operations, which greatly reduces the service suspension time caused by master-slave adjustment and further improves the throughput and other performance of the distributed storage system.
In an application example of the present invention, the optimal master-slave distribution sequence can be obtained from the average of the number of same shards. For example, the number of same shards in the above example is 100, so the optimal master-copy distribution sequence of the 100 shards can be determined as 34, 33 and 33, where 34 corresponds to the first machine with the maximum master-copy count, so as to guarantee that the master-slave adjustment of the same shards in each combination is achieved with the fewest migrations.
In an optional embodiment of the present invention, the optimal master-slave distribution sequence may specifically include an optimal master-copy distribution sequence and an optimal slave-copy distribution sequence. The step of performing migration operations for the copies of the same shards between the M machines according to the optimal master-slave distribution sequence may then specifically include: migration operations performed for the master copies of the same shards according to the optimal master-copy distribution sequence, and migration operations performed for the slave copies of the same shards according to the optimal slave-copy distribution sequence.
The process of migrating the master copies of the same shards according to the optimal master-copy distribution sequence may specifically include: according to the optimal master-copy distribution sequence, migrating the master copies of the same shards from machines with more master copies to machines with fewer master copies.
In a specific implementation, machines with more master copies and machines with fewer master copies can be determined by comparing the master-copy counts between machines, or according to the optimal master-copy distribution sequence; embodiments of the present invention do not restrict how they are determined.
In the above example, the maximum, intermediate and minimum master-copy counts are 50, 30 and 20 respectively. According to the optimal master-copy distribution sequence 34, 33 and 33, it can first be determined that the final master-copy counts of the first machine (maximum), the third machine (intermediate) and the second machine (minimum) are 34, 33 and 33 respectively; the master copies of the 100 shards are then migrated from the first machine to the third machine and to the second machine, where the termination condition of the first migration operation, from the first machine to the third machine, is that the master-copy count of the third machine reaches 33, and the termination condition of the second migration operation, from the first machine to the second machine, is that the master-copy count of the second machine reaches 33. It should be noted that the first migration operation and the second migration operation can be executed in parallel to improve the efficiency of master-slave adjustment.
It should be noted that, besides migrating the slave copies of the same shards according to the optimal slave-copy distribution sequence, the reversibility between the migration operations performed for the slave copies of the same shards and the migration operations performed for the master copies of the same shards can also be exploited to perform the slave-copy migrations. For example, the operation of migrating the master copy of shard i from the first machine to the second machine and the operation of migrating the slave copy of shard i from the second machine to the first machine can be mutually reversible operations; likewise, the operation of migrating the master copy of shard i from the first machine to the third machine and the operation of migrating the slave copy of shard i from the third machine to the first machine can also be mutually reversible operations, and so on.
In a specific implementation, the migration operations for the slave copies of the same shards and the migration operations for the master copies of the same shards can be executed in parallel or sequentially; embodiments of the present invention do not restrict the execution order between them.
It should be noted that migrating the master copy of shard i from the first machine to the second machine, and migrating the master copy of shard i from the first machine to the third machine, are only examples. In practice, those skilled in the art can also, according to the actual situation, perform the operation of migrating the master copy of shard i from the third machine to the second machine; for example, assuming the maximum, intermediate and minimum master-copy counts of the 100 shards are 50, 40 and 10 respectively, the operation of migrating the master copy of shard i from the third machine to the second machine needs to be performed.
Referring to FIG. 4, a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an embodiment of the present invention is shown, which may specifically include the following steps:
Step 401: generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
Step 402: obtain the master-slave distribution information of all shards from the metadata information maintained by each machine;
Step 403: according to the master-slave distribution information of all shards, count, for each of the combinations, the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines;
Here, various statistical methods can be used to obtain the master-slave distribution information of the same shards in the M machines corresponding to each combination; embodiments of the present invention do not restrict the specific statistical method.
Step 404: when the master-slave distribution information of the same shards meets the preset switching condition, perform master-slave adjustment on the copies of the same shards.
Compared with the embodiment shown in FIG. 1, this embodiment embodies the step of counting, for each combination, the master-slave distribution information of the same shards in the corresponding M machines as: obtaining the master-slave distribution information of all shards from the metadata information maintained by each machine, and, according to the master-slave distribution information of all shards, counting the master-slave distribution information of the same shards in the M machines corresponding to the combination. Because each machine maintains the latest master-slave distribution information of the shards through metadata information, the master-slave distribution information of all shards obtained from the metadata maintained by each machine is the latest real-time information; this guarantees the convenience of obtaining the master-slave distribution information of all shards, improves acquisition efficiency, and guarantees the timeliness of the shards' master-slave distribution information.
In an optional embodiment of the present invention, the method may further include a step in which each machine maintains metadata information;
where the step of each machine maintaining metadata information may specifically include:
Sub-step A1: when a machine's own state changes, updating the metadata information according to the change of its own state; and
Sub-step A2: broadcasting the update of the metadata information to all machines.
In practical applications, each machine can maintain the master-slave distribution information of all shards through metadata information. The master-slave distribution information of a shard may mainly include the identification information of the machines corresponding to the 3 copies P1,i, P2,i and P3,i of shard i.
In this way, when a machine changes because of downtime or other reasons, the metadata information it maintains can first be updated. For example, machine A originally stores the master copy P1,i of shard i; after machine A goes down, the machine corresponding to the master copy P1,i can be updated to machine B, and the machine corresponding to P2,i, originally stored on machine B, can be updated to machine A.
Moreover, after the update is completed, the update of the metadata information can be broadcast to all machines, so that all machines perform the corresponding synchronous update, which guarantees the consistency and timeliness of the metadata information maintained by all machines.
To enable those skilled in the art to better understand the present invention, referring to FIG. 5, a schematic flowchart of the steps of a master-slave balancing method in a distributed storage system according to an example of the present invention is shown, which may specifically include the following steps:
Step 501: generate the 20 combinations of 3 machines taken from the 6 machines storing 1024 shards, where the number of copies per shard is 3;
Step 502: for each of the combinations, count the master-slave distribution information of the same shards in the corresponding 3 machines, where the same shards are shards whose master copy and slave copies are all stored on the 3 machines;
Step 503: when the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than 1, determine the optimal master-copy distribution sequence corresponding to the 3 machines;
Step 504: according to the optimal master-copy distribution sequence, perform migration operations for the master copies of the same shards; and
Step 505: using the reversibility between the migration operations performed for the slave copies of the same shards and the migration operations performed for the master copies of the same shards, perform migration operations for the slave copies of the same shards.
Taking the combination corresponding to machine A, machine B and machine C in FIG. 2 as an example, the master-slave distribution information of the same shards in these 3 machines can be counted. Assuming the number of same shards in these 3 machines is 100, the master-slave distribution information of these 100 shards needs to be counted, that is, on which of the 3 machines the master copy and slave copies of each of the 100 shards are located.
Further assuming that the numbers of master copies of the above 100 shards on the 3 machines are 50, 30 and 20 respectively, that is, the maximum, intermediate and minimum master-copy counts of the 100 shards are 50, 30 and 20, then, because the difference between 50 and 20 is greater than 1, the optimal master-copy distribution sequence of the 100 shards can first be obtained from the average of the number of same shards: 34, 33 and 33, where 34 corresponds to machine A, 33 to machine B and 33 to machine C.
The master copies of the 100 shards can then be migrated from machine A to machine B and to machine C, where the termination condition of the first migration operation from machine A to machine B is that the master-copy count of machine B reaches 33, and the termination condition of the second migration operation from machine A to machine C is that the master-copy count of machine C reaches 33. It should be noted that the first and second migration operations can be executed in parallel to improve the efficiency of master-slave adjustment.
Because the third operation of migrating the master copy of shard i from machine A to machine C and the fourth operation of migrating the slave copy of shard i from machine C to machine A are mutually reversible, the third and fourth operations can be executed simultaneously, or the fourth operation can be executed immediately after the third operation completes, or the third operation immediately after the fourth. It can be seen that using the reversibility between the slave-copy migrations and the master-copy migrations for the same shards to migrate the slave copies saves the operation of determining an optimal slave-copy distribution sequence and therefore improves the efficiency of master-slave adjustment.
As for the method embodiments, for simplicity of description they are expressed as a series of action combinations, but those skilled in the art should know that embodiments of the present invention are not limited by the described order of actions, because according to embodiments of the present invention some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions involved are not necessarily required by embodiments of the present invention.
Referring to FIG. 6, a schematic structural diagram of a master-slave balancing apparatus in a distributed storage system according to an embodiment of the present invention is shown, which may specifically include the following modules:
a combination generation module 601, configured to generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
a distribution statistics module 602, configured to count, for each of the combinations, the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
a master-slave adjustment module 603, configured to perform master-slave adjustment on the copies of the same shards when the master-slave distribution information of the same shards meets the preset switching condition.
In an optional embodiment of the present invention, the master-slave distribution information of the same shards may specifically include: the maximum number of master copies of the same shards on any one of the M machines, and the minimum number of master copies of the same shards;
and the master-slave distribution information of the same shards meeting the preset switching condition may specifically include: the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than a threshold.
In another optional embodiment of the present invention, the master-slave adjustment module 603 may specifically include:
a determination submodule, configured to determine the optimal master-slave distribution sequence corresponding to the M machines; and
a migration submodule, configured to perform migration operations for the copies of the same shards between the M machines according to the optimal master-slave distribution sequence.
In yet another optional embodiment of the present invention, the optimal master-slave distribution sequence may specifically include: an optimal master-copy distribution sequence and an optimal slave-copy distribution sequence;
and the migration submodule may further include:
a first migration unit, configured to perform migration operations for the master copies of the same shards according to the optimal master-copy distribution sequence; and
a second migration unit, configured to perform migration operations for the slave copies of the same shards according to the optimal slave-copy distribution sequence;
where the first migration unit may be specifically configured to migrate the master copies of the same shards from machines with more master copies to machines with fewer master copies according to the optimal master-copy distribution sequence.
In a further optional embodiment of the present invention, the distribution statistics module 602 may specifically include:
a distribution acquisition submodule, configured to obtain the master-slave distribution information of all shards from the metadata information maintained by each machine; and
a statistics submodule, configured to count, according to the master-slave distribution information of all shards, the master-slave distribution information of the same shards in the M machines corresponding to each of the combinations.
In an optional embodiment of the present invention, the apparatus may further include:
a maintenance module, configured to maintain metadata information through each machine;
where the maintenance module may specifically include:
an update submodule, configured to update the metadata information according to the change of the machine's own state when the machine's state changes; and
a broadcast submodule, configured to broadcast the update of the metadata information to all machines.
As for the apparatus embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for relevant parts, refer to the partial description of the method embodiment.
The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the master-slave balancing method and apparatus in a distributed storage system according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for executing part or all of the method described here. Such a program implementing the present invention may be stored on a computer readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet platform, provided on a carrier signal, or provided in any other form.
For example, FIG. 7 shows a computing device, such as a search engine server, that can implement the method according to the present invention described above. The computing device conventionally includes a processor 710 and a computer program product or computer readable medium in the form of a memory 730. The memory 730 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or a ROM. The memory 730 has a storage space 750 storing program code 751 for executing any of the method steps described above. For example, the storage space 750 storing program code may include individual pieces of program code 751 for implementing the various steps of the above methods respectively. The program code can be read from or written into one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards or floppy disks. Such a computer program product is usually a portable or fixed storage unit, for example as shown in FIG. 8. The storage unit may have storage segments, storage space and so on arranged similarly to the memory 730 in the computing device of FIG. 7. The program code may, for example, be compressed in an appropriate form. Usually, the storage unit includes computer readable code 751' for executing the steps of the method according to the present invention, that is, code that can be read by a processor such as 710, which, when run by a server, causes the server to execute the various steps of the method described above.
Reference in this specification to "one embodiment", "an embodiment" or "one or more embodiments" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Also, note that instances of the phrase "in one embodiment" here do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the specification provided here. However, it can be understood that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this specification.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several apparatuses, several of these apparatuses may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
Furthermore, it should also be noted that the language used in this specification has been principally selected for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Therefore, many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the present invention, the disclosure made herein is illustrative rather than restrictive, and the scope of the present invention is defined by the appended claims.

Claims (14)

  1. A master-slave balancing method in a distributed storage system, comprising:
    generating combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
    for each of the combinations, counting the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
    when the master-slave distribution information of the same shards meets a preset switching condition, performing master-slave adjustment on the copies of the same shards.
  2. The method according to claim 1, wherein the master-slave distribution information of the same shards comprises: the maximum number of master copies of the same shards on any one of the M machines, and the minimum number of master copies of the same shards;
    and the master-slave distribution information of the same shards meeting the preset switching condition comprises: the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than a threshold.
  3. The method according to claim 1, wherein the step of performing master-slave adjustment on the copies of the same shards when the master-slave distribution information of the same shards meets the preset switching condition comprises:
    determining the optimal master-slave distribution sequence corresponding to the M machines;
    according to the optimal master-slave distribution sequence, performing migration operations for the copies of the same shards between the M machines.
  4. The method according to claim 3, wherein the optimal master-slave distribution sequence comprises: an optimal master-copy distribution sequence and an optimal slave-copy distribution sequence;
    and the step of performing migration operations for the copies of the same shards between the M machines according to the optimal master-slave distribution sequence comprises: migration operations performed for the master copies of the same shards according to the optimal master-copy distribution sequence, and migration operations performed for the slave copies of the same shards according to the optimal slave-copy distribution sequence;
    wherein the process of migrating the master copies of the same shards according to the optimal master-copy distribution sequence comprises: according to the optimal master-copy distribution sequence, migrating the master copies of the same shards from machines with more master copies to machines with fewer master copies.
  5. The method according to claim 1, 2, 3 or 4, wherein the step of counting, for each of the combinations, the master-slave distribution information of the same shards in the corresponding M machines comprises:
    obtaining the master-slave distribution information of all shards from the metadata information maintained by each machine;
    according to the master-slave distribution information of all shards, counting the master-slave distribution information of the same shards in the M machines corresponding to each of the combinations.
  6. The method according to claim 5, further comprising: a step in which each machine maintains metadata information;
    wherein the step of each machine maintaining metadata information comprises:
    when a machine's own state changes, updating the metadata information according to the change of its own state;
    broadcasting the update of the metadata information to all machines.
  7. A computer program, comprising computer readable code which, when run on a computing device, causes the computing device to execute the master-slave balancing method in a distributed storage system according to any one of claims 1 to 6.
  8. A computer readable medium, in which the computer program according to claim 7 is stored.
  9. A master-slave balancing apparatus in a distributed storage system, comprising:
    a combination generation module, configured to generate combinations of M machines taken from all machines storing shards, where M equals the number of copies per shard;
    a distribution statistics module, configured to count, for each of the combinations, the master-slave distribution information of the same shards in the corresponding M machines, where the same shards are shards whose master copy and slave copies are all stored on the M machines; and
    a master-slave adjustment module, configured to perform master-slave adjustment on the copies of the same shards when the master-slave distribution information of the same shards meets the preset switching condition.
  10. The apparatus according to claim 9, wherein the master-slave distribution information of the same shards comprises: the maximum number of master copies of the same shards on any one of the M machines, and the minimum number of master copies of the same shards;
    and the master-slave distribution information of the same shards meeting the preset switching condition comprises: the difference between the maximum number of master copies of the same shards and the minimum number of master copies of the same shards is greater than a threshold.
  11. The apparatus according to claim 9, wherein the master-slave adjustment module comprises:
    a determination submodule, configured to determine the optimal master-slave distribution sequence corresponding to the M machines; and
    a migration submodule, configured to perform migration operations for the copies of the same shards between the M machines according to the optimal master-slave distribution sequence.
  12. The apparatus according to claim 11, wherein the optimal master-slave distribution sequence comprises: an optimal master-copy distribution sequence and an optimal slave-copy distribution sequence;
    and the migration submodule comprises:
    a first migration unit, configured to perform migration operations for the master copies of the same shards according to the optimal master-copy distribution sequence; and
    a second migration unit, configured to perform migration operations for the slave copies of the same shards according to the optimal slave-copy distribution sequence;
    wherein the first migration unit is specifically configured to migrate the master copies of the same shards from machines with more master copies to machines with fewer master copies according to the optimal master-copy distribution sequence.
  13. The apparatus according to claim 9, 10, 11 or 12, wherein the distribution statistics module comprises:
    a distribution acquisition submodule, configured to obtain the master-slave distribution information of all shards from the metadata information maintained by each machine; and
    a statistics submodule, configured to count, according to the master-slave distribution information of all shards, the master-slave distribution information of the same shards in the M machines corresponding to each of the combinations.
  14. The apparatus according to claim 13, wherein the apparatus further comprises:
    a maintenance module, configured to maintain metadata information through each machine;
    wherein the maintenance module comprises:
    an update submodule, configured to update the metadata information according to the change of the machine's own state when the machine's state changes; and
    a broadcast submodule, configured to broadcast the update of the metadata information to all machines.
PCT/CN2015/095461 2014-12-27 2015-11-24 一种分布式存储系统中的主从平衡方法和装置 WO2016101751A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410832191.8 2014-12-27
CN201410832191.8A CN104580427B (zh) 2014-12-27 2014-12-27 一种分布式存储系统中的主从平衡方法和装置

Publications (1)

Publication Number Publication Date
WO2016101751A1 true WO2016101751A1 (zh) 2016-06-30

Family

ID=53095584

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/095461 WO2016101751A1 (zh) 2014-12-27 2015-11-24 一种分布式存储系统中的主从平衡方法和装置

Country Status (2)

Country Link
CN (1) CN104580427B (zh)
WO (1) WO2016101751A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398371A (zh) * 2022-01-13 2022-04-26 九有技术(深圳)有限公司 数据库集群系统多副本分片方法、装置、设备及存储介质

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104580427B (zh) * 2014-12-27 2018-09-04 北京奇虎科技有限公司 一种分布式存储系统中的主从平衡方法和装置
CN106302702B (zh) 2016-08-10 2020-03-20 华为技术有限公司 数据的分片存储方法、装置及系统
CN110007866B (zh) * 2019-04-11 2020-03-31 苏州浪潮智能科技有限公司 一种存储单元性能优化方法、装置、存储设备及存储介质
CN112711376B (zh) * 2019-10-25 2022-12-23 北京金山云网络技术有限公司 对象存储系统中对象主副本文件的确定方法及装置
CN113867928A (zh) * 2020-06-30 2021-12-31 北京金山云网络技术有限公司 负载均衡的方法、装置及服务器

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090066291A1 (en) * 2007-09-10 2009-03-12 Jenn-Yang Tien Distributed energy storage control system
CN102984184A (zh) * 2011-09-05 2013-03-20 上海可鲁系统软件有限公司 一种分布式系统的服务负载均衡方法及装置
CN103023932A (zh) * 2011-09-21 2013-04-03 鸿富锦精密工业(深圳)有限公司 服务器负载平衡方法及系统
CN104580427A (zh) * 2014-12-27 2015-04-29 北京奇虎科技有限公司 一种分布式存储系统中的主从平衡方法和装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101419600A (zh) * 2007-10-22 2009-04-29 深圳市亚贝电气技术有限公司 基于面向对象文件系统的数据副本映射方法及装置
US8893131B2 (en) * 2008-04-11 2014-11-18 Yahoo! Inc. System and/or method for bulk loading of records into an ordered distributed database
CN103294787A (zh) * 2013-05-21 2013-09-11 成都市欧冠信息技术有限责任公司 分布式数据库系统的多副本存储方法和系统
CN103838860A (zh) * 2014-03-19 2014-06-04 华存数据信息技术有限公司 一种基于动态副本策略的文件存储系统及其存储方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090066291A1 (en) * 2007-09-10 2009-03-12 Jenn-Yang Tien Distributed energy storage control system
CN102984184A (zh) * 2011-09-05 2013-03-20 上海可鲁系统软件有限公司 一种分布式系统的服务负载均衡方法及装置
CN103023932A (zh) * 2011-09-21 2013-04-03 鸿富锦精密工业(深圳)有限公司 服务器负载平衡方法及系统
CN104580427A (zh) * 2014-12-27 2015-04-29 北京奇虎科技有限公司 一种分布式存储系统中的主从平衡方法和装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398371A (zh) * 2022-01-13 2022-04-26 九有技术(深圳)有限公司 数据库集群系统多副本分片方法、装置、设备及存储介质
CN114398371B (zh) * 2022-01-13 2024-06-04 深圳九有数据库有限公司 数据库集群系统多副本分片方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN104580427B (zh) 2018-09-04
CN104580427A (zh) 2015-04-29

Similar Documents

Publication Publication Date Title
WO2016101751A1 (zh) 一种分布式存储系统中的主从平衡方法和装置
CN110287197B (zh) 一种数据存储方法、迁移方法及装置
CN104516678B (zh) 用于数据存储的方法和设备
US20160048342A1 (en) Reducing read/write overhead in a storage array
WO2019001017A1 (zh) 集群间数据迁移方法、系统、服务器及计算机存储介质
WO2013090640A1 (en) Load balancing in cluster storage systems
CN106933823B (zh) 数据同步方法及装置
WO2017101642A1 (zh) 分布式系统的数据节点升级方法及装置
JP6805816B2 (ja) 情報処理装置、情報処理システム、情報処理方法及びプログラム
US11474715B2 (en) Storage system configuration change tracking for root cause/troubleshooting
KR20160100216A (ko) 대량 오디오 지문 데이터베이스의 온라인 실시간 업데이트를 구축하는 방법과 장치
CN106775470B (zh) 一种数据存储的方法及系统
CN106909556B (zh) 内存集群的存储均衡方法及装置
WO2015154415A1 (zh) 一种实现升级包制作的方法及装置
CN105528381A (zh) 数据库数据迁移方法及系统
WO2018107887A1 (zh) 机顶盒Flash数据存储方法、系统和电子设备
US11250001B2 (en) Accurate partition sizing for memory efficient reduction operations
US11934927B2 (en) Handling system-characteristics drift in machine learning applications
US9600517B2 (en) Convert command into a BULK load operation
US8700583B1 (en) Dynamic tiermaps for large online databases
US9053100B1 (en) Systems and methods for compressing database objects
US9747299B2 (en) Heterogeneous storing server and file storing method thereof
WO2016091072A1 (zh) 分布式数据存储方法及分布式数据集群系统
US10712959B2 (en) Method, device and computer program product for storing data
CN107391755A (zh) 一种数据分布调节、查询的方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15871820

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15871820

Country of ref document: EP

Kind code of ref document: A1