WO2017162179A1 - Load rebalancing method and apparatus for a storage system - Google Patents

Load rebalancing method and apparatus for a storage system

Info

Publication number
WO2017162179A1
WO2017162179A1 (PCT Application No. PCT/CN2017/077758)
Authority
WO
WIPO (PCT)
Prior art keywords
storage
node
nodes
storage node
managed
Prior art date
Application number
PCT/CN2017/077758
Other languages
English (en)
French (fr)
Inventor
王东临
金友兵
莫仲华
齐宇
Original Assignee
北京书生国际信息技术有限公司
书生云公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京书生国际信息技术有限公司 and 书生云公司
Publication of WO2017162179A1
Priority to US16/139,712 (US10782898B2)
Priority to US16/378,076 (US20190235777A1)

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/061 - Improving I/O performance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0646 - Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F 3/0647 - Migration mechanisms
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00 - Arrangements for monitoring or testing data switching networks
    • H04L 43/16 - Threshold monitoring
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 65/00 - Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L 65/40 - Support for services or applications
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 - Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 - Server selection for load balancing
    • H04L 67/1008 - Server selection for load balancing based on parameters of servers, e.g. available memory or workload

Definitions

  • the present invention relates to the technical field of data storage systems, and more particularly to a load re-equalization method and apparatus for a storage system.
  • FIG. 1 shows a schematic diagram of the architecture of a prior art storage system.
  • each storage node S is connected to a TCP/IP network (through a core switch) through an access network switch.
  • Each storage node is a separate physical server, and each server has its own storage medium.
  • Each storage node is connected by a storage network such as an IP network to form a storage pool.
  • each compute node C is also connected to the TCP/IP network (through the core network switch) through the access network switch to access the entire storage pool over the TCP/IP network.
  • When data is written by a user, the data is generally distributed evenly across the storage nodes, so that storage node load and data occupancy remain relatively balanced.
  • However, data imbalance will occur in the following situations: (1) because of the characteristics of the data allocation algorithm and of the user data itself, the data may fail to be distributed evenly across the storage nodes, leaving some storage nodes heavily loaded and others lightly loaded; and (2) capacity expansion, which is generally performed by adding a new node whose load is initially 0, so that part of the data on the existing storage nodes must be physically migrated to the expansion node to achieve load rebalancing between the storage nodes.
  • FIG. 2 shows the data migration that occurs in a conventional TCP/IP-network-based storage system 1 when load rebalancing is performed between storage nodes.
  • In that example, part of the data stored on the more heavily loaded storage node S1 is migrated to the less heavily loaded storage node S2; this specifically involves data migration between the storage media of the two storage nodes, as shown by dotted arrow 201. It can be seen that rebalancing load between storage nodes over a TCP/IP network consumes a large amount of disk read/write performance and network bandwidth, which degrades the read/write performance of normal service data.
  • one of the objects of embodiments of the present invention is to provide an efficient load re-equalization scheme for a storage system.
  • the storage system may include a storage network, at least two storage nodes, and at least one storage device, the at least two storage nodes and the at least one storage device being respectively connected to the storage network,
  • Each of the at least one storage device includes at least one storage medium, all storage media included in the storage system form a storage pool, and the storage network is configured such that each storage node can access each storage medium without the aid of other storage nodes; the storage pool is divided into at least two storage areas in units of storage media, and each storage node is responsible for managing zero or more storage areas.
  • a load re-equalization method for the aforementioned storage system includes: monitoring a load status between the at least two storage nodes; and, when the load of one storage node is monitored to exceed a predetermined threshold, adjusting the storage area managed by the relevant storage node among the at least two storage nodes.
  • a load re-equalization apparatus for the aforementioned storage system.
  • the device includes: a monitoring module, configured to monitor a load status between the at least two storage nodes; and an adjustment module, configured to, when the unbalanced state of the monitored load exceeds a predetermined threshold, the at least two The storage area managed by the relevant storage node in the storage node is adjusted.
  • monitoring the load status between the at least two storage nodes may include monitoring one or more of the following performance parameters of the at least two storage nodes: the number of IOPS requests of the storage node; the throughput of the storage node; the CPU usage of the storage node; the memory usage of the storage node; and the storage space usage of the storage media managed by the storage node.
  • the predetermined threshold may be represented by a combination of one or more of the respective specified thresholds of the performance parameters.
  • the respective specified thresholds of the performance parameters may include: the deviation between the parameter value of the storage node having the highest value of a performance parameter and the parameter value of the storage node having the lowest value of that parameter; the deviation between the parameter value of the storage node having the highest value of a performance parameter and the average value of that parameter across the storage nodes; or a specified value for each performance parameter.
  • the predetermined threshold may be set to one or more of the following: the deviation between the number of IOPS requests of the storage node with the largest IOPS number and the number of IOPS requests of the storage node with the smallest IOPS number reaches 30% of the latter; the deviation between the number of IOPS requests of the storage node with the largest IOPS number and the average number of IOPS requests across all storage nodes reaches 20% of that average; the storage space usage of any storage medium is 0%; the storage space usage of any storage medium is 90%; or the difference in storage space usage between the storage medium with the highest usage and the storage medium with the lowest usage among the media managed by any storage node is greater than 20%.
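  • As an illustration only, the example threshold logic above can be captured in a short sketch. This is not the patented implementation; the data structure, field names, and helper function are assumptions made for illustration, and only the 30%/20%/0%/90%/20% constants come from the text above.

```python
# Hypothetical sketch of the example imbalance thresholds described above.
# NodeStats and its field names are assumptions, not part of the patent text.
from dataclasses import dataclass
from typing import Dict

@dataclass
class NodeStats:
    iops: float                      # IOPS requests observed on this storage node
    media_usage: Dict[str, float]    # storage medium id -> space usage (0.0 .. 1.0)

def rebalance_needed(nodes: Dict[str, NodeStats]) -> bool:
    """Return True when any of the example thresholds above is exceeded."""
    iops = [n.iops for n in nodes.values()]
    max_iops, min_iops = max(iops), min(iops)
    avg_iops = sum(iops) / len(iops)

    # 30% deviation between the busiest and least busy node (relative to the latter)
    if min_iops > 0 and (max_iops - min_iops) > 0.30 * min_iops:
        return True
    # 20% deviation between the busiest node and the average
    if avg_iops > 0 and (max_iops - avg_iops) > 0.20 * avg_iops:
        return True
    for n in nodes.values():
        usages = list(n.media_usage.values())
        # a newly added (empty) medium, or a nearly full medium
        if any(u == 0.0 for u in usages) or any(u >= 0.90 for u in usages):
            return True
        # >20% spread between the fullest and emptiest medium managed by one node
        if usages and (max(usages) - min(usages)) > 0.20:
            return True
    return False
```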
  • each of the at least two storage areas is composed of at least one storage block, one storage block is a complete storage medium, or one storage block is a part of a storage medium.
  • the adjusting of the storage area may include: adjusting a configuration table of the storage areas managed by the relevant storage nodes, where the at least two storage nodes determine the storage areas they manage according to the configuration table.
  • each of the at least two storage areas is composed of at least one storage block, where one storage block is a complete storage medium, and adjusting the storage area may include: exchanging one storage medium in a first storage area of the at least two storage areas with one storage medium in a second storage area; deleting one storage medium from the first storage area and adding the deleted storage medium to the second storage area; evenly adding a new storage medium or a new storage area that accesses the storage network to the at least two storage areas; or merging some of the at least two storage areas.
  • adjusting the storage area managed by the relevant storage node of the at least two storage nodes comprises: manually determining, by an administrator of the storage system, how the storage area managed by the relevant storage node is adjusted;
  • using a configuration file to determine the adjustment manner of the storage area managed by the relevant storage node; or determining the adjustment manner according to the load condition of the storage nodes.
  • the adjustment manner can include the portion of the storage area to be migrated and the target storage node to which it is to be migrated.
  • the storage network may include at least one storage switching device, and all of the at least two storage nodes and the at least one storage medium are connected to the storage switching device through storage channels.
  • the storage channel can be a SAS channel or a PCI/e channel
  • the storage switching device can be a SAS switch or a PCI/e switch.
  • the storage device may be a JBOD; and/or the storage medium may be a hard disk, a flash memory, an SRAM, or a DRAM.
  • the interface of the storage medium may be a SAS interface, a SATA interface, a PCI/e interface, a DIMM interface, an NVMe interface, a SCSI interface, and an AHCI interface.
  • each storage node may correspond to one or more compute nodes, and each storage node is located at the same server as its corresponding compute node.
  • the storage node may be a virtual machine of the server, a container or a module running directly on the physical operating system of the server; and/or the computing node may be a virtual one of the servers Machine, a container, or a module running directly on the physical operating system of the server.
  • the management by a storage node of the storage areas it manages may include: each storage node can only read and write the storage areas managed by itself; or each storage node can only write the storage areas managed by itself, but can read both its own storage areas and the storage areas managed by other storage nodes.
  • a computer program product embodied in a computer readable storage medium having computer readable program code portions stored therein, the computer readable program code The portion is configured to perform according to the aforementioned method.
  • the computer readable program code portions include: a first executable portion for monitoring the load status between the at least two storage nodes; and a second executable portion for adjusting, when the load of one storage node is monitored to exceed a predetermined threshold, the storage area managed by the relevant storage node among the at least two storage nodes.
  • a storage node load rebalancing scheme supporting migration of a storage area is provided, and load re-equalization of the storage node is directly realized by reallocating control rights of the storage area between the storage nodes, thereby avoiding Impact on normal business data during the migration process.
  • FIG. 1 is a schematic diagram showing the architecture of a prior art storage system
  • FIG. 2 is a schematic diagram showing the principle of implementing load rebalancing between storage nodes in a storage system of the prior art
  • 3A is a block diagram showing the architecture of a specific storage system constructed in accordance with an embodiment of the present invention.
  • 3B is a block diagram showing the architecture of a specific storage system constructed in accordance with another embodiment of the present invention.
  • FIG. 4 shows a flow chart of a load re-equalization method for a storage system in accordance with one embodiment of the present invention
  • FIG. 5 is a schematic diagram showing the principle of realizing load re-equalization according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram showing the principle of implementing load re-equalization according to another embodiment of the present invention.
  • Figure 7 shows a block diagram of a load re-equalization apparatus for a storage system in accordance with one embodiment of the present invention.
  • FIG. 3A shows a block diagram of a storage system in accordance with an embodiment of the present invention.
  • the storage system includes a storage network; a storage node coupled to the storage network; and a storage device coupled to the storage network.
  • Each storage device includes at least one storage medium.
  • a storage device commonly used by the inventors can place 45 storage media.
  • the storage network is configured such that each storage node can access all storage media without resorting to other storage nodes.
  • the storage network is illustrated in FIG. 3A as a SAS switch, but it should be understood that the storage network may also be a SAS set, or other form that will be discussed later.
  • FIG. 3A schematically shows three storage nodes, namely a storage node S1, a storage node S2, and a storage node S3, which are directly connected to a SAS switch, respectively.
  • the storage system shown in FIG. 3A includes physical servers 31, 32, and 33 that are respectively connected to storage devices through a storage network.
  • the physical server 31 includes compute nodes C11 and C12 and storage node S1 co-located on it;
  • the physical server 32 includes compute nodes C21 and C22 and storage node S2 co-located on it;
  • the physical server 33 includes compute nodes C31 and C32 and storage node S3 co-located on it.
  • The storage system of FIG. 3A includes storage devices 34, 35, and 36; storage device 34 contains storage medium 1, storage medium 2, and storage medium 3; storage device 35 contains storage medium 1, storage medium 2, and storage medium 3; and storage device 36 contains storage medium 1, storage medium 2, and storage medium 3.
  • with the storage system provided by the embodiments of the present invention, each storage node can access all storage media without the aid of other storage nodes, so that all storage media are in effect shared by all storage nodes, achieving the effect of a global storage pool.
  • That is, the storage network is configured such that each storage node can access all storage media without the aid of other storage nodes. Further, the storage network is configured such that each storage node is responsible at any given time for managing only a fixed set of storage media, and it is guaranteed that one storage medium is never written by multiple storage nodes at the same time, which would corrupt data; as a result,
  • each storage node can access the storage media it manages without the aid of other storage nodes, and the integrity of the data stored in the storage system can be guaranteed.
  • the constructed storage pool can be divided into at least two storage areas, and each storage node is responsible for managing zero to multiple storage areas.
  • Referring to FIG. 3A, the storage areas managed by the storage nodes are schematically illustrated using different background patterns, where the storage media belonging to the same storage area and the storage node responsible for managing that area are shown with the same background pattern.
  • Specifically, storage node S1 is responsible for managing the first storage area, which includes storage medium 1 in storage device 34, storage medium 1 in storage device 35, and storage medium 1 in storage device 36;
  • storage node S2 is responsible for managing the second storage area, which includes storage medium 2 in storage device 34, storage medium 2 in storage device 35, and storage medium 2 in storage device 36;
  • and storage node S3 is responsible for managing the third storage area, which includes storage medium 3 in storage device 34, storage medium 3 in storage device 35, and storage medium 3 in storage device 36.
  • compared with the prior art, in which the storage node sits on the storage medium side (or, strictly speaking, the storage medium is a built-in disk of the physical machine where the storage node resides),
  • in the embodiments of the present invention the physical machine where the storage node resides is independent of the storage devices, and a storage device serves mainly as a channel connecting its storage media to the storage network.
  • the storage node side further includes a computing node, and the computing node and the storage node are disposed in a physical server, and the physical server is connected to the storage device through the storage network.
  • the aggregated storage system in which the computing node and the storage node are located in the same physical machine constructed by using the embodiment of the present invention can reduce the number of physical devices required, thereby reducing the cost.
  • the compute node can also access the storage resources it wishes to access locally.
  • the data exchange between the two can be as simple as shared memory, and the performance is particularly excellent.
  • in the storage system provided by the embodiments of the present invention, the I/O data path between a compute node and a storage medium consists of: (1) storage medium to storage node; and (2) storage node to the compute node aggregated in the same physical server (a CPU bus path).
  • By contrast, in the prior-art storage system shown in FIG. 1, the I/O data path between a compute node and a storage medium consists of: (1) storage medium to storage node; (2) storage node to storage-network access switch; (3) storage-network access switch to core switch; (4) core switch to compute-network access switch; and (5) compute-network access switch to compute node.
  • Clearly, the total data path of the storage system of the embodiments of the present invention is roughly only item (1) of the conventional storage system. That is, by compressing the I/O data path to the extreme, the storage system provided by the embodiments of the present invention greatly improves I/O channel performance, with an actual running effect very close to the I/O channel of a local hard disk.
  • the storage node may be a virtual machine of a physical server, a container, or a module directly running on a physical operating system of the server, and the computing node may also be a virtual machine of the same physical machine server, A container or a module running directly on the physical operating system of the server.
  • each storage node may correspond to one or more compute nodes.
  • one physical server may be divided into multiple virtual machines, one of which serves as the storage node while the other virtual machines serve as compute nodes; alternatively, a module on the physical OS may be used as the storage node in order to achieve better performance.
  • the virtualization technology forming the virtual machine may be KVM, Xen, VMware, or Hyper-V virtualization technology, and
  • the container technology forming the container may be Docker, Rocket, Odin, Chef, LXC, Vagrant, Ansible, Zone, Jail, or Hyper-V container technology.
  • each storage node is responsible at any given time for managing only a fixed set of storage media, and one storage medium is never written by multiple storage nodes simultaneously, which avoids data conflicts; as a result, each storage node
  • can access the storage media it manages without the aid of other storage nodes, and the integrity of the data stored in the storage system can be guaranteed.
  • all the storage media in the system may be divided according to storage logic.
  • the storage pool of the entire system may be divided into a logical storage hierarchy structure such as a storage area, a storage group, and a storage block.
  • the storage block is the smallest storage unit.
  • the storage pool may be divided into at least two storage areas.
  • each storage area may be divided into at least one storage group. In a preferred embodiment, each storage area is divided into at least two storage groups.
  • the storage area and the storage group can be merged such that one level can be omitted in the storage hierarchy.
  • each storage area may be composed of at least one storage block, wherein the storage block may be a complete storage medium or a part of a storage medium.
  • in order to build redundant storage within a storage area, each storage area (or storage group) may be composed of at least two storage blocks, so that when any one storage block fails, the complete stored
  • data can be reconstructed from the remaining storage blocks in the group.
  • the redundant storage mode can be multi-copy mode, independent redundant disk array (RAID) mode, and erasure code mode.
  • the redundant storage mode can be established by the ZFS file system.
  • in order to guard against hardware failure of a storage device or storage medium, the multiple storage blocks included in each storage area (or storage group) are preferably not located in the same storage medium, or even in the same storage device. In an embodiment of the invention, no two storage blocks included in the same storage area (or storage group) are located in the same storage medium or storage device. In another embodiment of the present invention, the number of storage blocks of the same storage area (or storage group) located in the same storage medium or storage device is preferably less than or equal to the redundancy of the redundant storage.
  • for example, when RAID 5 is used for redundant storage, the redundancy is 1, so at most one storage block of the same storage group may be located on the same storage device; for RAID 6, the redundancy is 2, so at most two storage blocks of the same storage group may be located on the same storage device.
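  • The placement rule above can be expressed as a small validation sketch. This is only an illustrative interpretation; the group/device layout structure and function names below are assumptions and are not taken from the patent.

```python
# Hypothetical check that a storage group respects the redundancy-based
# placement rule described above: the number of blocks of one group that
# share a storage device must not exceed the redundancy of the RAID level.
from collections import Counter
from typing import Dict

REDUNDANCY = {"RAID5": 1, "RAID6": 2}  # failures tolerated per group

def placement_ok(block_to_device: Dict[str, str], raid_level: str) -> bool:
    """block_to_device maps each storage block of one group to its device."""
    per_device = Counter(block_to_device.values())
    return max(per_device.values()) <= REDUNDANCY[raid_level]

# Example: a RAID 6 group with two blocks on JBOD-1 is acceptable,
# but a RAID 5 group with two blocks on the same device is not.
group = {"blk0": "JBOD-1", "blk1": "JBOD-1", "blk2": "JBOD-2", "blk3": "JBOD-3"}
assert placement_ok(group, "RAID6") is True
assert placement_ok(group, "RAID5") is False
```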
  • in one embodiment, each storage node can only read and write the storage areas it manages. Because read operations on the same storage block by multiple storage nodes do not conflict with one another, whereas simultaneous writes to one storage block by multiple storage nodes easily cause conflicts, in another embodiment each storage node can only write the storage areas it manages but can read both its own storage areas and the storage areas managed by other storage nodes; that is, write operations are local while read operations can be global.
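  • A minimal sketch of the "local writes, global reads" policy just described follows; the node/area ownership table and function names are assumptions made only for illustration.

```python
# Hypothetical access check for the write-local / read-global policy above.
OWNER = {"area1": "S1", "area2": "S2", "area3": "S3"}  # storage area -> managing node

def allowed(node: str, area: str, op: str, global_reads: bool = True) -> bool:
    """op is 'read' or 'write'. Writes are only allowed on areas the node manages;
    reads are allowed globally when global_reads is enabled, otherwise only locally."""
    if op == "write":
        return OWNER[area] == node
    return global_reads or OWNER[area] == node

assert allowed("S1", "area1", "write")          # write to own area: allowed
assert not allowed("S1", "area2", "write")      # write to another node's area: denied
assert allowed("S1", "area2", "read")           # global read: allowed
```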
  • the storage system may further include a storage control node coupled to the storage network for determining a storage area managed by each storage node.
  • Each storage node may include a storage allocation module for determining a storage area managed by the storage node, which may be implemented by a communication and coordination processing algorithm between respective storage allocation modules included in each storage node.
  • when a storage node is detected to have failed, some or all of the other storage nodes may be configured so that those storage nodes take over the storage areas previously managed by the failed storage node.
  • for example, one of the other storage nodes may take over the storage areas managed by the failed storage node, or the takeover may be shared by at least two other storage nodes, each of which takes over part of the storage areas managed by the failed node, for example with at least two other storage nodes respectively taking over different storage groups within a storage area.
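  • The takeover behaviour described above can be sketched as a simple redistribution of the failed node's storage groups among the surviving nodes; the round-robin assignment, data structures, and names below are illustrative assumptions only, not the patented mechanism.

```python
# Hypothetical redistribution of a failed node's storage groups to survivors.
from typing import Dict, List

def take_over(owner: Dict[str, str], failed: str, survivors: List[str]) -> Dict[str, str]:
    """owner maps storage group -> managing node. Groups of the failed node are
    reassigned round-robin across the surviving nodes; other groups are untouched."""
    new_owner = dict(owner)
    orphaned = sorted(g for g, n in owner.items() if n == failed)
    for i, group in enumerate(orphaned):
        new_owner[group] = survivors[i % len(survivors)]
    return new_owner

owner = {"g1": "S1", "g2": "S1", "g3": "S2", "g4": "S3"}
print(take_over(owner, failed="S1", survivors=["S2", "S3"]))
# -> {'g1': 'S2', 'g2': 'S3', 'g3': 'S2', 'g4': 'S3'}
```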
  • the storage medium may include, but is not limited to, a hard disk, a flash memory, an SRAM, a DRAM, an NVME, or the like.
  • the access interface of the storage medium may include, but is not limited to, a SAS interface, a SATA interface, a PCI/e interface, a DIMM interface, NVMe interface, SCSI interface, and AHCI interface.
  • the storage network may include at least one storage switching device, and the storage node accesses the storage medium through data exchange between the storage switching devices included therein.
  • the storage node and the storage medium are respectively connected to the storage switching device through the storage channel.
  • the storage switching device may be a SAS switch or a PCI/e switch.
  • the storage channel may be a SAS (Serial Attached SCSI) channel or a PCI/e channel.
  • the SAS-based switching solution has the advantages of high performance, large bandwidth, and a large number of disks per device.
  • when used in combination with a host bus adapter (HBA) or a SAS interface on the server motherboard, the storage provided by a SAS fabric can easily be accessed simultaneously by multiple connected servers.
  • the SAS switch is connected to the storage device through a SAS line, and the storage device and the storage medium are also connected by a SAS interface.
  • inside the storage device, the SAS channel is connected to each storage medium (for example, a SAS switch chip may be provided inside the storage device). The bandwidth of a SAS network can reach 24Gb or 48Gb, which is tens of times that of Gigabit Ethernet and several times that of expensive 10 Gigabit Ethernet. At the link layer, SAS offers roughly an order-of-magnitude improvement over an IP network; at the transport layer, the TCP three-way handshake and four-way close impose a high overhead, and TCP's delayed-acknowledgement mechanism and slow start can sometimes cause delays on the order of 100 milliseconds, whereas the latency of the SAS protocol is only a small fraction of that of TCP.
  • In short, SAS networks offer significant advantages in bandwidth and latency over Ethernet-based TCP/IP. Those skilled in the art will appreciate that the performance of a PCI/e channel can likewise meet the needs of the system.
  • the storage network may include at least two storage switching devices, and each storage node can be connected to any storage device, and thus to its storage media, through any one of the storage switching devices;
  • when one storage switching device fails, the storage nodes can still read and write the data on the storage devices through the other storage switching device(s).
  • the storage devices in the storage system 30 are constructed as a plurality of JBODs 307-310, which are respectively connected to two SAS switches 305 and 306 through SAS data lines, which constitute the switching core of the storage network included in the storage system.
  • the front end is at least two servers 301 and 302, each of which is connected to the two SAS switches 305 and 306 via an HBA device (not shown) or a SAS interface on the motherboard.
  • Each server has a storage node that manages some or all of the disks in all JBOD disks using information obtained from the SAS links.
  • the storage area, the storage group, and the storage block described above in the application file may be used to divide the JBOD disk into different storage groups.
  • Each storage node manages one or more sets of such storage groups.
  • redundant storage is used inside each storage group, redundantly stored metadata can exist on the disk, so that redundant storage can be directly recognized from the disk by other storage nodes.
  • the storage node can install a monitoring and management module that is responsible for monitoring the status of local storage and other servers.
  • when a JBOD as a whole or an individual disk on a JBOD becomes abnormal, data reliability is ensured by the redundant storage.
  • When a server fails, the management module in the storage node on another, pre-configured server will locally identify and take over the disks managed by the storage node of the failed server, based on the data on those disks.
  • The storage service originally provided by the storage node of the failed server is also brought up on the storage node of the new server. In this way, a new highly available global storage pool structure is realized.
  • the exemplary storage system 30 is constructed to provide a multi-point, controllable, globally accessible storage pool.
  • the hardware uses multiple servers to provide external services, and uses JBOD to store disks.
  • Multiple JBODs are connected to two SAS switches, and the two switches are respectively connected to the server's HBA cards, thereby ensuring that all disks on the JBOD can be accessed by all servers.
  • the SAS redundant link also ensures high availability on the link.
  • each server uses redundant storage technology, selecting the disks of a redundancy group from different JBODs, so that data is not lost even if an entire JBOD fails.
  • When a server fails, the module that monitors the overall state schedules another server to access, through the SAS channel, the disks managed by the storage node of the failed server and quickly take over the disks that the failed node was responsible for, achieving a highly available global storage pool.
  • although a JBOD storage enclosure is used as an example in FIG. 3, it should be understood that the embodiment of the present invention shown in FIG. 3 also supports storage devices other than JBOD.
  • the above is an example in which an entire storage medium is used as one storage block; the same applies when a part of one storage medium is used as one storage block.
  • FIG. 4 shows a flow chart of a load re-equalization method 40 for an exemplary storage system in accordance with an embodiment of the present invention.
  • In step S401, the load status between the at least two storage nodes included in the storage system is monitored.
  • In step S402, when it is detected that the load of one storage node exceeds a predetermined threshold, the storage area managed by the relevant storage node among the at least two storage nodes is adjusted.
  • the associated storage node may be a storage node that causes an unbalanced state of the load, and may be determined depending on an adjustment policy of the storage area.
  • the adjustment of the storage area may be that the storage blocks involved are reallocated among the storage nodes, or may be addition, merge, or deletion of the storage areas.
  • the configuration table of the storage area managed by the relevant storage node may be adjusted, and the at least two storage nodes determine the storage area they manage according to the configuration table.
  • the adjustment of the foregoing configuration table may be performed by a storage control node included in the foregoing storage system or a storage allocation module included in the storage node.
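  • As an illustration of such a configuration table, the sketch below models the mapping from storage areas to managing nodes and a reassignment of one area; the table layout and helper names are assumptions for illustration only, not the patent's data format.

```python
# Hypothetical storage-area configuration table: area -> managing storage node.
# Rebalancing only rewrites this table; no user data is copied.
config_table = {
    "area1": "S1",   # e.g. medium 1 on devices 34, 35, 36
    "area2": "S2",   # e.g. medium 2 on devices 34, 35, 36
    "area3": "S3",   # e.g. medium 3 on devices 34, 35, 36
}

def reassign(table: dict, area: str, new_node: str) -> None:
    """Hand management of `area` to `new_node`; each node re-reads the table
    to decide which areas it currently manages."""
    table[area] = new_node

def areas_of(table: dict, node: str) -> list:
    return [a for a, n in table.items() if n == node]

reassign(config_table, "area1", "S3")     # move area1 from S1 to S3
print(areas_of(config_table, "S3"))       # ['area1', 'area3']
```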
  • monitoring the load status between the at least two storage nodes may be performed for one or more of the following performance parameters: number of read and write operations per second (IOPS) requests for the storage node, storage node Throughput, CPU usage of the storage node, memory usage of the storage node, and occupancy of storage media managed by the storage node.
  • in one embodiment, each node may periodically monitor its own performance parameters and periodically query the data of the other nodes, and then dynamically generate a globally consistent rebalancing scheme, either from a predefined rebalancing scheme or through an algorithm, which each node then applies to the storage areas it manages.
  • in another embodiment, the storage system includes a monitoring node independent of storage nodes S1, S2, and S3, or uses the aforementioned storage control node or storage allocation module, to monitor the performance parameters of the respective storage nodes.
  • the determination of the imbalance may be achieved by a predefined threshold (configurable), such as triggering a rebalancing mechanism when the deviation of the number of IOPS between the various nodes exceeds a certain range.
  • for example, the IOPS number of the storage node with the largest IOPS number can be compared with the IOPS number of the storage node with the smallest IOPS number, and when the deviation between the two is greater than 30% of the latter, adjustment of the storage areas is triggered.
  • for example, one storage medium managed by the storage node having the largest IOPS number may be exchanged with one storage medium managed by the storage node having the smallest IOPS number; for instance, the storage medium with the highest occupancy rate managed by the node with the largest IOPS number may be selected and exchanged with a storage medium managed by the node with the smallest IOPS number.
  • alternatively, the IOPS number of the storage node with the largest IOPS number can be compared with the average IOPS number across the storage nodes, and when the deviation between the two is greater than 20% of that average, adjustment of the storage areas is triggered, so that the adjusted storage area allocation does not immediately trigger rebalancing again.
  • it should be understood that the predetermined thresholds of 20% and 30% used above to indicate an imbalanced load state are merely exemplary; depending on the application and its requirements, additional or different thresholds may be defined for triggering load rebalancing of the storage nodes.
  • the predetermined threshold for determining the imbalance discussed above may be represented by the specified threshold of a single one of a plurality of performance parameters, such as the IOPS number, but the inventors also envision that the predetermined threshold may be represented by a combination of the specified thresholds of several of the performance parameters. For example, load rebalancing of a storage node is triggered when the number of IOPS of the storage node reaches its specified threshold and the throughput of the storage node also reaches its specified threshold.
  • when the storage areas are adjusted, a storage medium managed by a heavily loaded storage node may be allocated to the storage area managed by a lightly loaded storage node. For example, the adjustment may include exchanging storage media, deleting a storage medium from a storage area managed by a heavily loaded node and adding it to a storage area managed by a lightly loaded node, evenly adding a new storage medium or a new storage area that accesses the storage network to the at least two storage areas (for example, upon storage system expansion), or merging some of the at least two storage areas (for example, upon failure of one storage node).
  • alternatively, a dynamic algorithm may be used: for example, the various performance parameters of each storage medium and each storage node are weighted and combined into a single load indicator, and a rebalancing solution is then calculated that moves the minimum number of disk groups such that the system no longer exceeds the predetermined threshold.
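  • A minimal sketch of such a dynamic algorithm follows: it combines a few parameters into one weighted load score per node and greedily moves whole disk groups from the most loaded node to the least loaded one until the spread falls under a threshold. The weights, scoring, and greedy strategy are assumptions for illustration, not the claimed algorithm.

```python
# Hypothetical weighted-load rebalancer: move as few disk groups as possible
# until the load spread between nodes drops below `max_spread`.
from typing import Dict

WEIGHTS = {"iops": 0.5, "cpu": 0.3, "space": 0.2}   # assumed weighting

def node_load(groups: Dict[str, Dict[str, float]]) -> float:
    """Sum the weighted load of every disk group a node manages."""
    return sum(sum(WEIGHTS[k] * v for k, v in g.items()) for g in groups.values())

def rebalance(plan: Dict[str, Dict[str, Dict[str, float]]],
              max_spread: float = 0.2, max_moves: int = 100):
    """plan: node -> {disk group -> {parameter -> normalized value}}.
    Returns a list of (group, from_node, to_node) moves."""
    moves = []
    while len(moves) < max_moves:
        loads = {n: node_load(gs) for n, gs in plan.items()}
        hot = max(loads, key=loads.get)
        cold = min(loads, key=loads.get)
        spread = loads[hot] - loads[cold]
        if spread <= max_spread or not plan[hot]:
            break
        # pick the lightest group on the hottest node so we overshoot as little as possible
        group = min(plan[hot], key=lambda g: node_load({g: plan[hot][g]}))
        if node_load({group: plan[hot][group]}) >= spread:
            break                       # moving it would only flip the imbalance
        plan[cold][group] = plan[hot].pop(group)
        moves.append((group, hot, cold))
    return moves
```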
  • in another embodiment, each storage node may periodically monitor the performance parameters of the storage media it manages, periodically query the performance parameters of the storage media managed by other nodes, and define a performance variable per storage medium to represent its load, together with a threshold for the imbalanced state.
  • For example, the threshold may be: the storage space usage of any storage medium is 0% (a new disk has joined); the storage space usage of any storage medium is 90% (the disk is about to be full); or the difference between the storage medium with the highest storage space usage and the storage medium with the lowest storage space usage is greater than 20% of the latter. It should be understood that the aforementioned predetermined thresholds of 0%, 90%, and 20% for indicating an imbalanced load state are also merely exemplary.
  • FIG. 5 is a schematic diagram showing how load re-equalization is implemented in the storage system shown in FIG. 3A according to an embodiment of the present invention. Assume that at a certain moment the load of storage node S1 is high; the storage media it manages include storage medium 1 at storage device 34, storage medium 1 at storage device 35, and storage medium 1 at storage device 36 (as shown in FIG. 3A), and their total storage space will soon be used up, while the load of storage node S3 is low and there is ample free space in the storage media it manages.
  • in the prior art, each storage node can only access the storage media directly attached to it. Therefore, during rebalancing, the data on the heavily loaded storage node must be copied to the lightly loaded node; this involves a large number of data-replication operations, which place additional load on the storage and the network and affect the IO access of normal service data.
  • For example, it is necessary to read data from one or more storage media managed by storage node 1, write the read data to one or more storage media managed by storage node 3, and finally release the disk space occupied by that data on the storage media managed by storage node 1 in order to achieve load balancing.
  • by contrast, in the storage system according to the embodiments of the present invention, each of the storage nodes S1, S2, and S3 can access all storage areas through the storage network,
  • so migration of a storage area between storage nodes can be achieved simply by reassigning access rights to the storage media; that is, the storage areas managed by the relevant storage nodes can be regrouped,
  • and the data in the storage areas no longer needs to be copied. For example, as shown in FIG. 5, the storage medium 2 located at storage device 34 and originally managed by storage node 3 is allocated to storage node 1 for management, while at the same time
  • the storage medium 1 located at storage device 34 and originally managed by storage node 1 is allocated to storage node 3 for management, thereby achieving load balancing of the remaining storage space between storage node 1 and storage node 3. In this process, only the configuration of storage node 1 and storage node 3 needs to be modified, which can be completed in a short time without affecting user data read/write performance.
  • FIG. 6 is a schematic diagram showing the principle of implementing load re-equalization in the storage system shown in FIG. 3A according to another embodiment of the present invention.
  • for example, as shown in FIG. 6, the storage medium 2 located at storage device 35 and managed by storage node 2 can be allocated to
  • storage node 1 for management, while at the same time the storage medium 1 located at storage device 34 and managed by storage node 1 is allocated to storage node 2, thereby achieving load balancing of the remaining storage space between storage node 1 and storage node 2.
  • in the case of storage system expansion, the newly added storage media may be allocated evenly to the storage nodes for management, for example in the order in which they join, thereby maintaining load balance between the storage nodes.
  • although the above two embodiments schedule storage media between different storage nodes to implement load rebalancing, they are equally applicable to scheduling storage areas between storage nodes; for example, in the case of expansion, when it is detected that storage areas have been added, the added storage areas may be allocated to the respective storage nodes in the order in which they join.
  • in addition, the configuration between the compute nodes and the storage nodes in the storage system may be modified, so that one or more compute nodes, such as C12, whose data was originally stored through storage node S1, instead store data through another storage node, such as storage node S2.
  • In that case the compute node, rather than being physically moved, no longer needs to store its data through the storage node located on the same physical server, but accesses the storage area on the remote storage node through a remote access protocol, such as the iSCSI protocol (as shown in FIG. 5); alternatively, the compute node can be migrated along with the adjustment of the storage area managed by the relevant storage node (as shown in FIG. 6), in which case the compute node may first need to be shut down.
  • as described above, the storage system may include at least two storage nodes, a storage network, and at least one storage device connected to the at least two storage nodes through the storage network; each of the at least one storage device may include at least one storage medium, and the storage network may be configured such that each storage node is able to access all storage media without the aid of other storage nodes.
  • Each storage area is managed by one of the plurality of storage nodes; when a storage node starts, it automatically connects to the storage areas it manages and imports them, after which it can provide storage services to the upper-layer compute nodes.
  • there are a variety of implementations for determining the portion of a storage area that needs to be migrated.
  • the management personnel can manually determine which storage areas need to be migrated.
  • alternatively, a configuration file may be used: a migration priority is pre-configured for each storage area, and when migration is required, one or more storage blocks, storage groups, or storage media with the highest priority among the storage areas currently managed by the storage node are selected for migration.
  • the selection may also be made according to the load of the storage blocks, storage groups, or storage media included in the storage area; for example, each storage node may monitor the load of the storage blocks, storage groups, or
  • storage media of the storage areas it manages, collecting information such as IOPS, throughput, and IO latency, and all of this information is weighted and combined to select the portion of the storage area that needs to be migrated.
  • the target storage node to migrate to can likewise be manually determined by the administrator.
  • alternatively, a configuration file may be used: a migration target list, such as a list of storage nodes arranged by priority, is pre-configured for each storage area, and after it is determined that the storage area (or part of it) needs to be migrated, the migration destination is selected from the target list in order. It should be noted that, in this approach, it should be ensured that the target storage node is not overloaded after the migration.
  • the target storage node may also be selected according to the load condition of the storage nodes: the load status of each storage node is monitored, for example by collecting information such as CPU usage, memory usage, and network bandwidth usage, and all of this information is weighted and combined to select the storage node to which the storage area should be migrated. For example, each storage node may report its own load status to the other storage nodes periodically or from time to time.
  • the storage node that needs to migrate data preferentially selects the other storage node with the lowest load as the target storage node for migration.
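  • The target selection just described can be sketched as a weighted score over a few node metrics, with the least loaded node chosen as the migration target; the metric names and weights are assumptions made for illustration only.

```python
# Hypothetical target-node selection: weight CPU, memory and network usage
# into one score and pick the candidate node with the lowest score.
from typing import Dict

METRIC_WEIGHTS = {"cpu": 0.4, "mem": 0.3, "net": 0.3}   # assumed weighting

def load_score(metrics: Dict[str, float]) -> float:
    return sum(METRIC_WEIGHTS[k] * metrics[k] for k in METRIC_WEIGHTS)

def pick_target(reports: Dict[str, Dict[str, float]], exclude: str) -> str:
    """reports: node -> latest metrics reported by that node (each 0.0 .. 1.0).
    The node that needs to migrate data (`exclude`) is never its own target."""
    candidates = {n: m for n, m in reports.items() if n != exclude}
    return min(candidates, key=lambda n: load_score(candidates[n]))

reports = {
    "S1": {"cpu": 0.9, "mem": 0.8, "net": 0.7},
    "S2": {"cpu": 0.4, "mem": 0.5, "net": 0.3},
    "S3": {"cpu": 0.2, "mem": 0.3, "net": 0.2},
}
print(pick_target(reports, exclude="S1"))   # -> S3 (lowest weighted load)
```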
  • the storage system administrator can confirm and start the specific migration process, or the migration process can be started by the program.
  • the migration process should minimize the impact on the upper-layer compute nodes: for example, migration can be scheduled when the application load is lowest, such as at midnight (assuming the load is lowest during that period); if it is determined that a compute node needs to be shut down during the migration, a period of low compute-node usage should be chosen as far as possible; and when it is determined that multiple storage areas, or multiple parts of one storage area, need to be migrated, a pre-configured migration policy can govern the order of migration.
  • before the handover, the relevant storage node can be configured so that it no longer writes or reads the relevant storage area, in order to ensure data integrity, for example by first writing all cached data to disk.
  • after taking over a storage area, the target storage node needs to perform the necessary initialization work on it before the storage area can be accessed by the upper-layer compute nodes.
  • After the migration, the load should be monitored again to confirm whether it is now balanced.
  • the storage system may include a storage control node connected to the storage network for determining the storage areas managed by each of the at least two storage nodes; or each storage node may further include a storage allocation module configured to determine the storage areas managed by that storage node, and data can be shared between the storage allocation modules.
  • the storage control node or the storage allocation module records a list of storage areas that each storage node is responsible for. After the storage node starts, it queries the storage control node or the storage allocation module for the storage area managed by itself, and then scans the storage areas to complete the initialization. When it is determined that the storage area migration needs to occur, the storage control node or the storage allocation module modifies the storage area list of the relevant storage node, and then notifies the storage node to complete the actual switching work as required.
  • the migration process may include the following steps:
  • storage node B scans all the storage media in storage area 1 to complete the initialization work; and
  • the application accesses the data in storage area 1 through storage node B.
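  • The listed steps only show the tail of the handover; a minimal end-to-end sketch under assumed steps (source node flushes and releases the area, the configuration table is updated, the target node scans and serves it) is given below. The step names and ordering beyond those listed are assumptions, not the patent's procedure.

```python
# Hypothetical migration handover of "area1" from storage node A to storage node B.
# Only the last two steps (B scans/initializes the area, applications then use B)
# come from the text above; the earlier steps are assumed for illustration.

class StorageNode:
    def __init__(self, name: str):
        self.name = name
        self.online_areas = set()

    def flush_and_release(self, area: str) -> None:
        # assumed step: quiesce writes, flush cached data to disk, detach the area
        self.online_areas.discard(area)

    def scan_and_initialize(self, area: str) -> None:
        # "storage node B scans all the storage media in storage area 1"
        self.online_areas.add(area)

def migrate_area(area: str, src: StorageNode, dst: StorageNode, config: dict) -> None:
    src.flush_and_release(area)       # assumed source-side handover step
    config[area] = dst.name           # rewrite the configuration table only; no data copied
    dst.scan_and_initialize(area)     # target initializes, then serves upper-layer apps

a, b = StorageNode("A"), StorageNode("B")
config_table = {"area1": "A"}
a.online_areas.add("area1")
migrate_area("area1", a, b, config_table)
print(config_table, b.online_areas)   # {'area1': 'B'} {'area1'}
```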
  • FIG. 7 shows a block diagram of a load re-equalization device 70 for a storage system in accordance with one embodiment of the present invention.
  • the load re-equalization device 70 may include: a monitoring module 701, configured to monitor a load status between the at least two storage nodes; and an adjustment module 702, configured to: when it is detected that the unbalanced state of the load exceeds a predetermined threshold, Adjusting a storage area managed by an associated storage node among the at least two storage nodes.
  • each of the modules recited in device 70 corresponds to a respective step in the method 40 described with reference to FIG. 4.
  • the operations and features described above with respect to FIG. 4 are equally applicable to the apparatus 70 and the modules contained therein, and the repeated content is not described herein.
  • apparatus 70 may be implemented at each storage node or in a scheduling device of a plurality of storage nodes.
  • the teachings of the present invention may also be embodied as a computer program product on a computer-readable storage medium, comprising computer program code which, when executed by a processor, enables the processor to implement the load re-equalization scheme for a storage system described in the embodiments herein.
  • the computer storage medium can be any tangible medium such as a floppy disk, CD-ROM, DVD, hard drive, or even network media.
  • a storage node load rebalancing scheme supporting migration of a storage medium or a storage area is provided, and rebalancing is achieved directly by reallocating control of a storage medium or a storage area between storage nodes. It avoids the impact on normal business data during the migration process, and significantly improves the efficiency of load node rebalance.
  • an implementation form of the embodiments of the present invention described above may be a computer program product
  • the method or apparatus of the embodiments of the present invention may be implemented in software, hardware, or a combination of software and hardware.
  • the hardware portion can be implemented using dedicated logic; the software portion can be stored in memory and executed by a suitable instruction execution system, such as a microprocessor or dedicated design hardware.
  • for example, such software may be provided as processor control code on a carrier medium such as a magnetic disk, CD- or DVD-ROM, in programmable memory such as read-only memory (firmware), or on a data carrier such as an optical or electronic signal carrier.
  • the method and apparatus of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., also It can be implemented by software executed by various types of processors, or by a combination of the above-described hardware circuits and software such as firmware.

Abstract

The present invention relates to a load rebalancing method and apparatus for a storage system. The method includes: monitoring the load status between at least two storage nodes; and, when the load of one storage node is monitored to exceed a predetermined threshold, adjusting the storage areas managed by the relevant storage nodes among the at least two storage nodes. According to embodiments of the present invention, the actual data migration process can be avoided when load is rebalanced between storage areas, so that normal service data is not affected.

Description

Load rebalancing method and apparatus for a storage system

Technical Field

The present invention relates to the technical field of data storage systems, and more particularly to a load rebalancing method and apparatus for a storage system.

Background

As computer applications grow in scale, the demand for storage space increases accordingly. It has therefore become mainstream to consolidate the storage resources (such as storage media) of multiple devices into a single storage pool that provides storage services. A traditional storage system is usually composed of multiple distributed storage nodes connected by a TCP/IP network. FIG. 1 is a schematic diagram of the architecture of a prior-art storage system. As shown in FIG. 1, in a traditional storage system each storage node S is connected to the TCP/IP network (through a core switch) via an access switch. Each storage node is a separate physical server, and each server has its own storage media. The storage nodes are connected by a storage network such as an IP network to form a storage pool.

On the other side of the core switch, each compute node C is likewise connected to the TCP/IP network (through the core switch) via an access switch, so as to access the entire storage pool over the TCP/IP network.

In such a traditional storage system, however, whenever dynamic balancing is involved, the physical data on the storage nodes must be migrated in order to achieve balance.

Furthermore, in such a traditional storage system, when a user writes data, the data is usually distributed evenly across the storage nodes, so that storage node load and data occupancy are relatively balanced. In the following situations, however, data imbalance occurs:

(1) Because of the characteristics of the data allocation algorithm and of the user data itself, the data may fail to be distributed evenly across the storage nodes, so that some storage nodes are heavily loaded while others are lightly loaded;

(2) Capacity expansion: capacity is usually expanded by adding new nodes, and at that moment the load of the newly added storage node is 0. Part of the data on the existing storage nodes must be physically migrated to the expansion node to achieve load rebalancing between the storage nodes.

FIG. 2 is a schematic diagram of the data migration that takes place when load rebalancing is performed between storage nodes in a traditional TCP/IP-network-based storage system 1. In this example, part of the data stored on the more heavily loaded storage node S1 is migrated to the less heavily loaded storage node S2; this specifically involves data migration between the storage media of the two storage nodes, as shown by dotted arrow 201. It can be seen that rebalancing load between storage nodes over a TCP/IP network consumes a large amount of disk read/write performance and network bandwidth, degrading the read/write performance of normal service data.
Summary of the Invention

In view of the above, one of the objects of the embodiments of the present invention is to provide an efficient load rebalancing scheme for a storage system.

According to embodiments of the present invention, the storage system may include a storage network, at least two storage nodes, and at least one storage device, the at least two storage nodes and the at least one storage device being respectively connected to the storage network. Each of the at least one storage device includes at least one storage medium. All storage media included in the storage system form a storage pool, and the storage network is configured such that each storage node can access each storage medium without the aid of other storage nodes. The storage pool is divided into at least two storage areas in units of storage media, and each storage node is responsible for managing zero or more storage areas.

According to one aspect of the present invention, a load rebalancing method for the aforementioned storage system is provided. The method includes: monitoring the load status between the at least two storage nodes; and, when the load of one storage node is monitored to exceed a predetermined threshold, adjusting the storage areas managed by the relevant storage nodes among the at least two storage nodes.

According to another aspect of the present invention, a load rebalancing apparatus for the aforementioned storage system is provided. The apparatus includes: a monitoring module configured to monitor the load status between the at least two storage nodes; and an adjustment module configured to adjust the storage areas managed by the relevant storage nodes among the at least two storage nodes when the monitored load imbalance exceeds a predetermined threshold.

Further, monitoring the load status between the at least two storage nodes may include monitoring one or more of the following performance parameters of the at least two storage nodes: the number of IOPS requests of a storage node; the throughput of a storage node; the CPU usage of a storage node; the memory usage of a storage node; and the storage space usage of the storage media managed by a storage node.

Further, the predetermined threshold may be represented by a combination of one or more of the respective specified thresholds of the performance parameters.

Further, the respective specified thresholds of the performance parameters may include: the deviation between the parameter value of the storage node having the highest value of a performance parameter and the parameter value of the storage node having the lowest value of that parameter; the deviation between the parameter value of the storage node having the highest value of a performance parameter and the average value of that parameter across the storage nodes; or a specified value for each performance parameter.

In one embodiment, the predetermined threshold may be set to one or more of the following: the deviation between the number of IOPS requests of the storage node with the largest IOPS number and the number of IOPS requests of the storage node with the smallest IOPS number is 30% of the latter; the deviation between the number of IOPS requests of the storage node with the largest IOPS number and the average number of IOPS requests across the storage nodes is 20% of that average; the storage space usage of any storage medium is 0%; the storage space usage of any storage medium is 90%; or the difference in storage space usage between the storage medium with the highest usage and the storage medium with the lowest usage among the media managed by any storage node is greater than 20%.

According to embodiments of the present invention, each of the at least two storage areas is composed of at least one storage block, where a storage block is either a complete storage medium or part of a storage medium.

In one embodiment, adjusting the storage areas may include adjusting a configuration table of the storage areas managed by the relevant storage nodes, the at least two storage nodes determining the storage areas they manage according to the configuration table.

In one embodiment, each of the at least two storage areas is composed of at least one storage block, a storage block being a complete storage medium, and adjusting the storage areas may include: exchanging one storage medium in a first storage area of the at least two storage areas with one storage medium in a second storage area; deleting one storage medium from the first storage area and adding the deleted storage medium to the second storage area; evenly adding a new storage medium or a new storage area that accesses the storage network to the at least two storage areas; or merging some of the at least two storage areas.

In one embodiment, adjusting the storage areas managed by the relevant storage nodes among the at least two storage nodes includes: manually determining, by an administrator of the storage system, how the storage areas managed by the relevant storage nodes are to be adjusted; determining the adjustment manner of the storage areas managed by the relevant storage nodes by means of a configuration file; or determining the adjustment manner of the storage areas managed by the relevant storage nodes according to the load condition of the storage nodes. The adjustment manner may include the portion of the storage area to be migrated and the target storage node to which it is to be migrated.

Further, the storage network may include at least one storage switching device, and all of the at least two storage nodes and the at least one storage medium are connected to the storage switching device through storage channels. The storage channel may be a SAS channel or a PCI/e channel, and the storage switching device may be a SAS switch or a PCI/e switch.

Further, the storage device may be a JBOD; and/or the storage medium may be a hard disk, flash memory, SRAM, or DRAM.

Further, the interface of the storage medium may be a SAS interface, a SATA interface, a PCI/e interface, a DIMM interface, an NVMe interface, a SCSI interface, or an AHCI interface.

According to embodiments of the present invention, each storage node may correspond to one or more compute nodes, and each storage node and its corresponding compute nodes are located on the same server.

According to embodiments of the present invention, a storage node may be a virtual machine of the server, a container, or a module running directly on the physical operating system of the server; and/or a compute node may be a virtual machine of the server, a container, or a module running directly on the physical operating system of the server.

According to embodiments of the present invention, a storage node's management of the storage areas it manages may include: each storage node can only read and write the storage areas it manages; or each storage node can only write the storage areas it manages, but can read both its own storage areas and the storage areas managed by other storage nodes.

According to yet another aspect of the present invention, a computer program product embodied in a computer-readable storage medium is provided, the computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions being configured to perform the aforementioned method. For example, the computer-readable program code portions include: a first executable portion for monitoring the load status between the at least two storage nodes; and a second executable portion for adjusting, when the load of one storage node is monitored to exceed a predetermined threshold, the storage areas managed by the relevant storage nodes among the at least two storage nodes.

According to embodiments of the present invention, a storage node load rebalancing scheme that supports migration of storage areas is provided. Load rebalancing of the storage nodes is achieved directly by reallocating control of the storage areas among the storage nodes, thereby avoiding any impact on normal service data during the migration process.

These and other advantages and features of the present invention will become apparent from the detailed description given below in conjunction with the accompanying drawings, in which like elements bear like reference numerals throughout the several figures described below.
Brief Description of the Drawings

FIG. 1 is a schematic diagram of the architecture of a prior-art storage system;

FIG. 2 is a schematic diagram illustrating how load rebalancing between storage nodes is implemented in a prior-art storage system;

FIG. 3A is a schematic diagram of the architecture of a specific storage system constructed in accordance with one embodiment of the present invention;

FIG. 3B is a schematic diagram of the architecture of a specific storage system constructed in accordance with another embodiment of the present invention;

FIG. 4 is a flow chart of a load rebalancing method for a storage system in accordance with one embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating how load rebalancing is implemented in accordance with one embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating how load rebalancing is implemented in accordance with another embodiment of the present invention; and

FIG. 7 is a block diagram of a load rebalancing apparatus for a storage system in accordance with one embodiment of the present invention.
具体实施方式
下文将参考附图更完整地描述本公开内容,其中在附图中显示了本公开内容的实施方式。但是这些实施方式可以用许多不同形式来实现并且不 应该被解释为限于本文所述的实施方式。相反地,提供这些实例以使得本公开内容将是透彻和完整的,并且将全面地向本领域的熟练技术人员表达本公开内容的范围。
下面结合附图以示例的方式详细描述本发明的各种实施方式。
图3A示出根据本发明的实施方式的存储系统的架构示意图。该存储系统包括存储网络;存储节点,连接至所述存储网络;以及存储设备,同样连接至所述存储网络。每个存储设备包括至少一个存储介质。例如,发明人常用的存储设备可以放置45块存储介质。其中,所述存储网络被配置为使得每一个存储节点都能够无需借助其他存储节点而访问所有存储介质。图3A中将存储网络示意为SAS交换机,但是应当理解,存储网络还可以是SAS集合、或者将在后文中讨论的其他形式。图3A示意性地示出了三个存储节点,即存储节点S1、存储节点S2和存储节点S3,分别直接与SAS交换机相连。图3A所示的存储系统包括物理服务器31、32和33,这些物理服务器分别通过存储网络与存储设备连接。物理服务器31包括共处于其的计算节点C11、C12和存储节点S1,物理服务器32包括共处于其的计算节点C21、C22和存储节点S2,物理服务器33包括共处于其的计算节点C31、C32和存储节点S3。图3A所示的存储系统包括存储设备34、35和36,存储设备34包括共处于其的存储介质1、存储介质2和存储介质3,存储设备35包括共处于其的存储介质1、存储介质2和存储介质3,存储设备36包括共处于其的存储介质1、存储介质2和存储介质3。
利用本发明实施例提供的存储系统，每一个存储节点都能够无需借助其他存储节点而访问所有存储介质，从而使得本发明所有的存储介质都实际上被所有的存储节点共享，进而实现了全局存储池的效果。也就是说，存储网络被配置为使得每一个存储节点都能够无需借助其他存储节点而访问所有存储介质。进一步地，存储网络被配置为使得各个存储节点同时只负责管理固定的存储介质，并且保证一个存储介质不会同时被多个存储节点进行写入而导致数据损坏，从而能够实现每一个存储节点都能够无需借助其他存储节点而访问由其管理的存储介质，并且能够保证存储系统中存储的数据的完整性。此外，可以将所构建的存储池划分成至少两个存储区域，每个存储节点负责管理零到多个存储区域。参考图3A，其利用不同背景图案示意性地示出了存储节点管理的存储区域的情形，其中对相同的存储区域包括的存储介质、以及负责管理其的存储节点以相同的背景图案进行表示。具体而言，存储节点S1负责管理第一存储区域，其包括处于存储设备34的存储介质1、处于存储设备35的存储介质1、以及处于存储设备36的存储介质1；存储节点S2负责管理第二存储区域，其包括处于存储设备34的存储介质2、处于存储设备35的存储介质2、以及处于存储设备36的存储介质2；存储节点S3负责管理第三存储区域，其包括处于存储设备34的存储介质3、处于存储设备35的存储介质3、以及处于存储设备36的存储介质3。
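为便于理解上述"以存储介质为单位划分存储区域、每个存储节点管理零到多个存储区域"的组织方式，下面给出一个示意性的Python草图（仅为说明目的的假设实现，并非对本发明的限定），其中的类名与数据结构均为本文为举例而虚构：

```python
# 示意性草图：用"存储区域(由若干存储介质组成)到存储节点"的映射表示全局存储池。
# 所有名称均为说明用途的假设，不代表实际产品接口。

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)
class Medium:
    device_id: str   # 存储设备编号，例如 "device-34"
    slot: int        # 存储介质在设备中的槽位

@dataclass
class StoragePool:
    # region_id -> 组成该存储区域的存储介质列表
    regions: Dict[str, List[Medium]] = field(default_factory=dict)
    # region_id -> 负责管理该存储区域的存储节点
    owner: Dict[str, str] = field(default_factory=dict)

    def regions_of(self, node: str) -> List[str]:
        """返回某个存储节点当前管理的全部存储区域（可能为零到多个）。"""
        return [r for r, n in self.owner.items() if n == node]

# 与图3A对应的示例：三个存储区域分别由S1、S2、S3管理
pool = StoragePool()
for i, node in enumerate(("S1", "S2", "S3"), start=1):
    region = f"region-{i}"
    pool.regions[region] = [Medium(f"device-3{d}", i) for d in (4, 5, 6)]
    pool.owner[region] = node

print(pool.regions_of("S1"))   # ['region-1']
```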
同时,从上述的描述可以看出,相比于现有技术(其中存储节点位于存储介质侧,或者严格来说,存储介质是存储节点所在物理机的内置盘),本发明实施例中,存储节点所在的物理机独立于存储设备,存储设备更多作为连接存储介质与存储网络的一个通道。
这样的方式,使得在需要进行动态平衡时,无需将物理数据在不同的存储介质中进行迁移,只需要通过配置平衡不同的存储节点所管理的存储区域(或者存储介质)即可。
在本发明另一实施例中,存储节点侧进一步包括计算节点,并且计算节点和存储节点设置在一台物理服务器中,该物理服务器通过存储网络与存储设备连接。利用本发明实施方式所构建的将计算节点和存储节点位于同一物理机的聚合式存储系统,从整体结构而言,可以减少所需物理设备的数量,从而降低成本。同时,计算节点也可以在本地访问到其希望访问的存储资源。另外,由于将计算节点和存储节点聚合在同一台物理服务器上,两者之间数据交换可以简单到仅仅是共享内存,性能特别优异。
本发明实施例提供的存储系统中，计算节点到存储介质之间的I/O数据路径包括：（1）存储介质到存储节点；以及（2）存储节点到聚合在同一物理服务器的计算节点（CPU总线通路）。而相比之下，图1所示现有技术的存储系统，其计算节点到存储介质之间的I/O数据路径包括：（1）存储介质到存储节点；（2）存储节点到存储网络接入网交换机；（3）存储网络接入网交换机到核心网交换机；（4）核心网交换机到计算网络接入网交换机；以及（5）计算网络接入网交换机到计算节点。显然，本发明实施方式的存储系统的总数据路径只接近于传统存储系统的第（1）项。即，本发明实施例提供的存储系统，通过对I/O数据路径长度的极致压缩，能够极大地提高存储系统的I/O通道性能，其实际运行效果非常接近于读写本地硬盘的I/O通道。
在本发明一实施例中,存储节点可以是物理服务器的一个虚拟机、一个容器或直接运行在服务器的物理操作系统上的一个模块,计算节点也可以是同一个物理机服务器的一个虚拟机、一个容器或直接运行在所述服务器的物理操作系统上的一个模块。在一个实施例中,每个存储节点可以对应一个或多个计算节点。
具体而言，可以将一台物理服务器分成多个虚拟机，其中一台虚拟机做存储节点用，其它虚拟机做计算节点用；也可以利用物理OS上的一个模块做存储节点用，以便实现更好的性能。
在本发明一实施例中，形成虚拟机的虚拟化技术可以是KVM或Zen或VMware或Hyper-V虚拟化技术，形成所述容器的容器技术可以是Docker或Rockett或Odin或Chef或LXC或Vagrant或Ansible或Zone或Jail或Hyper-V容器技术。
在本发明一实施例中,各个存储节点同时只负责管理固定的存储介质,并且一个存储介质不会同时被多个存储节点进行写入,以避免数据冲突,从而能够实现每一个存储节点都能够无需借助其他存储节点而访问由其管理的存储介质,并且能够保证存储系统中存储的数据的完整性。
在本发明一实施例中,可以将系统中所有的存储介质按照存储逻辑进行划分,具体而言,可以将整个系统的存储池划分为存储区域、存储组、存储块这样的逻辑存储层级架构,其中,存储块为最小存储单位。在本发明一实施例中,可以将存储池划分成至少两个存储区域。
在本发明一实施例中,每一个存储区域可以分为至少一个存储组。在一个较优的实施例中,每个存储区域至少被划分为两个存储组。
在一些实施例中,存储区域和存储组是可以合并的,从而可以在该存储层级架构中省略一个层级。
在本发明一实施例中，每个存储区域（或者存储组）可以由至少一个存储块组成，其中存储块可以是一个完整的存储介质、也可以是一个存储介质的一部分。为了在存储区域内部构建冗余存储，每个存储区域（或者存储组）可以由至少两个存储块组成，当其中任何一个存储块出现故障时，可以从该组中其余存储块中计算出完整的被存储数据。冗余存储方式可以为多副本模式、独立冗余磁盘阵列（RAID）模式、纠删码（erasure code）模式。在本发明一实施例中，冗余存储方式可以通过ZFS文件系统建立。在本发明一实施例中，为了对抗存储设备/存储介质的硬件故障，每个存储区域（或者存储组）所包含的多个存储块不会位于同一个存储介质中，甚至也不位于同一个存储设备中。在本发明一实施例中，每个存储区域（或者存储组）所包含的任何两个存储块都不会位于同一个存储介质/存储设备中。在本发明另一实施例中，同一存储区域（或者存储组）中位于同一存储介质/存储设备的存储块数量最好小于或等于冗余存储的冗余度。举例说明，当存储冗余采取RAID 5方式时，其冗余存储的冗余度为1，那么位于同一存储设备的同一存储组的存储块数量最多为1；对于RAID 6，其冗余存储的冗余度为2，那么位于同一存储设备的同一存储组的存储块数量最多为2。
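对于上述"同一存储组中位于同一存储设备的存储块数量不应超过冗余度"的约束，下面给出一个示意性的检查函数草图（Python，函数名与数据格式均为举例假设）：

```python
# 示意性草图：检查一个存储组的存储块分布是否满足冗余度约束。
# redundancy 为冗余存储的冗余度，例如 RAID 5 为 1，RAID 6 为 2。

from collections import Counter
from typing import List, Tuple

Block = Tuple[str, str]  # (device_id, block_id)，均为假设的标识

def placement_ok(group_blocks: List[Block], redundancy: int) -> bool:
    per_device = Counter(device for device, _ in group_blocks)
    # 任一存储设备上属于同一存储组的存储块数量不超过冗余度
    return all(count <= redundancy for count in per_device.values())

# RAID 5（冗余度1）：同一设备上出现2个块 -> 不满足约束
print(placement_ok([("JBOD-1", "b0"), ("JBOD-1", "b1"), ("JBOD-2", "b2")], 1))  # False
# RAID 6（冗余度2）：同一设备上最多2个块 -> 满足约束
print(placement_ok([("JBOD-1", "b0"), ("JBOD-1", "b1"), ("JBOD-2", "b2")], 2))  # True
```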
在本发明一实施例中,每个存储节点都只能读和写自己管理的存储区域。由于多个存储节点对同一个存储块的读操作并不会互相冲突,而多个存储节点同时写一个存储块容易发生冲突,因此,在另一个实施例中,可以是每个存储节点只能写自己管理的存储区域,但是可以读自己管理的存储区域以及其它存储节点管理的存储区域,即写操作是局域性的,但读操作可以是全局性。
在一个实施方式中，存储系统还可以包括存储控制节点，其连接至存储网络，用于确定每个存储节点管理的存储区域。在另一个实施方式中，每个存储节点可以包括存储分配模块，用于确定该存储节点所管理的存储区域，这可以通过每个存储节点所包括的各个存储分配模块之间的通信和协调处理算法来实现。
在一个实施例中,在监测到一个存储节点发生故障时,可以对其他部分或全部存储节点进行配置,使得这些存储节点接管之前由所述发生故障的存储节点管理的存储区域。例如,可以由其中一个存储节点接管出现故障的存储节点管理的存储区域,或者,可以由其它至少两个存储节点进行接管,其中每个存储节点接管出现故障的存储节点管理的部分的存储区域,比如其他至少两个存储节点分别接管该存储区域内的不同存储组。
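作为上述故障接管逻辑的一个示意性草图（Python，"按负载最低优先接管"的策略仅为举例假设，负载数据来自监控模块）：

```python
# 示意性草图：将故障存储节点管理的存储组分配给其余存储节点。
# load 字典与存储组列表均为举例假设的输入。

from typing import Dict, List

def takeover(failed_node: str,
             groups_of: Dict[str, List[str]],
             load: Dict[str, float]) -> Dict[str, str]:
    """返回 存储组 -> 新管理节点 的分配结果。"""
    survivors = [n for n in groups_of if n != failed_node]
    assignment = {}
    for group in groups_of[failed_node]:
        # 每次选择当前负载最低的存活节点接管一个存储组
        target = min(survivors, key=lambda n: load[n])
        assignment[group] = target
        load[target] += 1.0   # 粗略地把接管计入负载
    return assignment

groups = {"S1": ["g1", "g2"], "S2": ["g3"], "S3": ["g4", "g5"]}
print(takeover("S1", groups, {"S1": 0.0, "S2": 3.0, "S3": 1.0}))
# 输出: {'g1': 'S3', 'g2': 'S3'}
```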
在一个实施例中,存储介质可以包括但不限于硬盘、闪存、SRAM、DRAM、NVME或其它形式,存储介质的访问接口可以包括但不限于SAS接口、SATA接口、PCI/e接口、DIMM接口、NVMe接口、SCSI接口、AHCI接口。
在本发明一实施例中,存储网络可以包括至少一个存储交换设备,通过其中包括的存储交换设备之间的数据交换来实现存储节点对存储介质的访问。具体而言,存储节点和存储介质分别通过存储通道与存储交换设备连接。
在本发明一实施例中,存储交换设备可以是SAS交换机或PCI/e交换机,对应地,存储通道可以是SAS(串行连接SCSI)通道或PCI/e通道。
以SAS通道为例，相比传统的基于IP协议的存储方案，基于SAS交换的方案拥有性能高、带宽大、单台设备磁盘数量多等优点。在与主机适配器（HBA）或者服务器主板上的SAS接口结合使用后，SAS体系所提供的存储能够很容易地被连接的多台服务器同时访问。
具体而言,SAS交换机到存储设备之间通过一根SAS线连接,存储设备与存储介质之间也是由SAS接口连接,比如,存储设备内部将SAS通道连到每个存储介质(可以在存储设备内部设置一个SAS交换芯片)。由于SAS网络的带宽可以达到24Gb或48Gb,是千兆以太网的几十倍,以及昂贵的万兆以太网的数倍;同时在链路层SAS比IP网有大约一个数量级的提升,在传输层,由于TCP协议三次握手四次关闭,开销很高且TCP的延迟确认机制和慢启动有时会导致100毫秒级的延时,SAS协议的延时只有TCP的几十分之一,性能有更大的提升。总之,SAS网络比基于以太网的TCP/IP在带宽、延时性方面具有巨大优势。本领域技术人员可以理解,PCI/e通道的性能也可以适应系统的需求。
在本发明一实施例中，存储网络可以包括至少两个存储交换设备，所述每个存储节点都可以通过任意一个存储交换设备连接到任何一个存储设备，进而连接至存储介质。当任何一个存储交换设备或连接到一个存储交换设备的存储通道出现故障时，存储节点通过其它存储交换设备读写存储设备上的数据。
参考图3B,其示出了根据本发明一个实施方式所构建的一个具体的存储系统30。存储系统30中的存储设备被构建成多台JBOD 307-310,分别通过SAS数据线连接至两个SAS交换机305和306,这两个SAS交换机构成了存储系统所包括的存储网络的交换核心。前端为至少两个服务器301和302,每台服务器通过HBA设备(未示出)或主板上SAS接口连接至这两个SAS交换机305和306。服务器之间存在基本的网络连接用来监控和通信。每台服务器中都有一个存储节点,利用从SAS链路获取的信息,管理所有JBOD磁盘中的部分或全部磁盘。具体而言,可以利用本申请文件以上描述的存储区域、存储组、存储块来将JBOD磁盘划分成不同的存储组。每个存储节点都管理一组或多组这样的存储组。当每个存储组内部采用冗余存储的方式时,可以将冗余存储的元数据存在于磁盘之上,使得冗余存储能够被其他存储节点直接从磁盘识别。
在所示的示例性存储系统30中,存储节点可以安装监控和管理模块,负责监控本地存储和其它服务器的状态。当某台JBOD整体异常,或者JBOD上某个磁盘异常时,数据可靠性由冗余存储来确保。当某台服务器故障时,另一台预先设定好的服务器上的存储节点中的管理模块,将按照磁盘上的数据,在本地识别并接管原来由故障服务器的存储节点所管理的磁盘。故障服务器的存储节点原本对外提供的存储服务,也将在新的服务器上的存储节点得到延续。至此,实现了一种全新的高可用的全局存储池结构。
可见,所构建的示例性存储系统30提供了一种多点可控的、全局访问的存储池。硬件方面使用多台服务器来对外提供服务,使用JBOD来存放磁盘。将多台JBOD各自连接两台SAS交换机,两台交换机再分别连接服务器的HBA卡,从而确保JBOD上所有磁盘,能够被所有服务器访问。SAS冗余链路也确保了链路上的高可用性。
在每台服务器本地,利用冗余存储技术,从每台JBOD上选取磁盘组成冗余存储,避免单台JBOD的损失造成数据不可用。当一台服务器失效时,对整体状态进行监控的模块将调度另一台服务器,通过SAS通道访问失效服务器的存储节点所管理的磁盘,快速接管对方负责的这些磁盘,实现高可用的全局存储。
虽然在图3B中是以JBOD存放磁盘为例进行了说明，但是应当理解，如图3B所示的本发明的实施方式还支持JBOD以外的存储设备。另外，以上是以一块存储介质（整个的）作为一个存储块为例，也同样适用于将一个存储介质的一部分作为一个存储块的情形。
图4示出根据本发明的实施方式的用于示例性存储系统的负载再均衡方法40的流程图。
在步骤S401，监测存储系统所包括的至少两个存储节点之间的负载状态。
在步骤S402,在监测到一个存储节点的负载超出预定阈值时,对至少两个存储节点中的相关存储节点所管理的存储区域进行调整。相关存储节点可以是引起该负载的不均衡状态的存储节点,可能依赖于存储区域的调整策略而确定。对存储区域的调整可以是将涉及到的存储块在存储节点间重新分配,或者可以是存储区域的增加、合并、或者删除等。可以对相关存储节点所管理的存储区域的配置表进行调整,所述至少两个存储节点根据所述配置表来确定其所管理的存储区域。对前述配置表的调整可以通过前述的存储系统包括的存储控制节点、或者存储节点包括的存储分配模块进行。
在一个实施方式中,对至少两个存储节点之间的负载状态的监测可以针对如下性能参数中的一项或多项进行:存储节点的每秒读写操作次数(IOPS)请求数、存储节点的吞吐量、存储节点的CPU使用率、存储节点的内存使用率、以及存储节点管理的存储介质的占用率。
在一个实施方式中，可以使每个节点定期监控自己的性能参数，同时定期查询其他节点的数据，然后通过预先定义的再均衡方案或者通过算法动态产生一个全局统一的再均衡方案，最后各个节点执行该方案。在另一个实施方式中，存储系统中包括独立于存储节点S1、存储节点S2和存储节点S3的监控节点、或者前述的存储控制节点或者存储分配模块，来监控各个存储节点的性能参数。
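下面给出一个与上述监测方式对应的示意性草图（Python），其中各项性能参数的采集函数均为举例的占位实现，实际系统中可由操作系统接口、监控节点或节点间网络提供：

```python
# 示意性草图：每个存储节点定期采集本地性能参数并汇总成全局视图。
# collect_local_metrics / query_peer 均为说明用途的占位函数（假设）。

import random
from typing import Dict

NODES = ["S1", "S2", "S3"]

def collect_local_metrics(node: str) -> Dict[str, float]:
    # 占位：真实实现应读取IOPS、吞吐量、CPU/内存使用率、存储空间使用率等
    return {
        "iops": random.uniform(100, 1000),
        "throughput_mbps": random.uniform(50, 500),
        "cpu": random.uniform(0.1, 0.9),
        "mem": random.uniform(0.1, 0.9),
        "space_used": random.uniform(0.1, 0.9),
    }

def query_peer(node: str) -> Dict[str, float]:
    # 占位：真实实现通过节点间的基本网络连接查询对方的监控数据
    return collect_local_metrics(node)

def global_view() -> Dict[str, Dict[str, float]]:
    """汇总所有节点的性能参数，供统一的再均衡方案使用。"""
    return {n: query_peer(n) for n in NODES}

print(global_view()["S1"]["iops"])
```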
在一个实施例中，对于不均衡的判断可以通过预先定义的阈值（可配置）来实现，比如当各个节点之间的IOPS数的偏差超过一定范围则触发再均衡机制。例如，就IOPS而言，可以将IOPS数最大的存储节点的IOPS数与IOPS数最小的存储节点的IOPS数相比较，在确定二者之间的偏差大于后者的30%时，触发对存储区域进行调整。例如，将IOPS数最大的存储节点所管理的一个存储介质与IOPS数最小的存储节点所管理的一个存储介质相交换，比如选择IOPS数最大的存储节点所管理的占用率最高的存储介质与IOPS数最小的存储节点所管理的占用率最高的存储介质。
备选地,可以将IOPS数最大的存储节点的IOPS数与各个存储节点的IOPS数的平均值相比较,在确定二者之间的偏差大于后者的20%时,触发对存储区域进行调整,使得调整后的存储区域分配方案不会立即触发再均衡。
应当理解，前述的用于表示负载的不均衡状态的预定阈值20%、30%仅是示例性的，还可以根据应用场合和用户需求的不同定义另外的阈值。类似地，对于其他的性能参数，比如存储节点的吞吐量、存储节点的CPU使用率、存储节点的内存使用率、以及存储节点管理的存储介质的占用率，也可以预先定义用于触发存储节点间负载再均衡的阈值。
还应当理解，虽然前述讨论的对于不均衡的判断的预定阈值可以通过多项性能参数中的一项性能参数的指定阈值、比如IOPS数来表示，但是发明人预想到该预定阈值也可以通过多项性能参数各自的指定阈值中的多项的组合来表示。例如，在存储节点的IOPS数达到其指定阈值并且存储节点的吞吐量达到其指定阈值时，才触发存储节点的负载再均衡。
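作为上述阈值判断逻辑的一个示意性草图（Python，其中30%、20%沿用上文的示例阈值，函数名与组合条件均为举例假设）：

```python
# 示意性草图：根据预定阈值判断是否触发负载再均衡。
# metrics: 节点 -> {"iops": ..., "throughput_mbps": ...}，由监控逻辑提供。

from typing import Dict

def iops_imbalanced(metrics: Dict[str, Dict[str, float]],
                    max_min_ratio: float = 0.30,
                    max_avg_ratio: float = 0.20) -> bool:
    iops = [m["iops"] for m in metrics.values()]
    lo, hi, avg = min(iops), max(iops), sum(iops) / len(iops)
    # 条件一：最大IOPS与最小IOPS的偏差超过最小值的30%
    # 条件二：最大IOPS与平均IOPS的偏差超过平均值的20%
    return (hi - lo) > max_min_ratio * lo or (hi - avg) > max_avg_ratio * avg

def combined_trigger(metrics: Dict[str, Dict[str, float]]) -> bool:
    # 多项性能参数阈值的组合：此处举例为IOPS与吞吐量同时超标才触发
    tp = [m["throughput_mbps"] for m in metrics.values()]
    tp_exceeded = (max(tp) - min(tp)) > 0.30 * min(tp)
    return iops_imbalanced(metrics) and tp_exceeded

sample = {"S1": {"iops": 900, "throughput_mbps": 400},
          "S2": {"iops": 300, "throughput_mbps": 120},
          "S3": {"iops": 350, "throughput_mbps": 150}}
print(iops_imbalanced(sample), combined_trigger(sample))  # True True
```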
在一个实施方式中，对于存储区域的调整（再均衡），可以将负载高的存储节点所管理的存储介质分配到负载低的存储节点所管理的存储区域中，例如可以包括存储介质的交换、或者从负载高的存储节点所管理的存储区域中的删除和在负载低的存储节点所管理的存储区域中的增加、或者将接入存储网络的新的存储介质或新的存储区域平均地加入到至少两个存储区域中（比如，存储系统扩容）、或者将至少两个存储区域中的部分存储区域进行合并（比如，一个存储节点故障）。在一个实施方式中，对于存储区域的调整（再均衡），可以开发动态算法，例如，将各个存储介质和各个存储节点的各种负载数据进行加权得到一个单一的负载指标，然后计算出一个再均衡方案，通过移动最少数量的磁盘组，使系统不再超出预定阈值。
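对于上述"将各种负载数据加权成单一负载指标、再以尽量少的磁盘组移动达到均衡"的动态算法思路，下面给出一个贪心式的示意性草图（Python，权重、阈值与每次移动的负载近似值均为举例假设，并非本发明限定的算法）：

```python
# 示意性草图：加权负载指标 + 贪心地移动磁盘组直至不再超出阈值。
# 权重 WEIGHTS、阈值 THRESHOLD、group_cost 均为举例假设。

from typing import Dict, List, Tuple

WEIGHTS = {"iops": 0.4, "cpu": 0.3, "space_used": 0.3}
THRESHOLD = 0.2   # 最高与最低负载指标之差超过该值则继续移动

def load_index(m: Dict[str, float]) -> float:
    return sum(WEIGHTS[k] * m[k] for k in WEIGHTS)

def plan_moves(node_metrics: Dict[str, Dict[str, float]],
               groups_of: Dict[str, List[str]],
               group_cost: float = 0.05) -> List[Tuple[str, str, str]]:
    """返回 (磁盘组, 源节点, 目标节点) 的移动列表；group_cost 是
    假设的"移动一个磁盘组大约转移多少负载指标"的近似值。"""
    idx = {n: load_index(m) for n, m in node_metrics.items()}
    moves = []
    while max(idx.values()) - min(idx.values()) > THRESHOLD:
        src = max(idx, key=idx.get)
        dst = min(idx, key=idx.get)
        if not groups_of[src]:
            break
        group = groups_of[src].pop()
        groups_of[dst].append(group)
        idx[src] -= group_cost
        idx[dst] += group_cost
        moves.append((group, src, dst))
    return moves

metrics = {"S1": {"iops": 0.9, "cpu": 0.8, "space_used": 0.9},
           "S2": {"iops": 0.2, "cpu": 0.3, "space_used": 0.3},
           "S3": {"iops": 0.4, "cpu": 0.4, "space_used": 0.5}}
groups = {"S1": ["g1", "g2", "g3", "g4"], "S2": ["g5"], "S3": ["g6"]}
print(plan_moves(metrics, groups))
```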
在一个实施方式中，可以使每个存储节点定期监控自己所管理的存储介质的性能参数，同时定期查询其他节点所管理的存储介质的性能参数，针对存储介质的性能参数定义用于表示负载的不均衡状态的阈值，例如，该阈值可以为任一存储介质的存储空间使用率为0%（有新的磁盘加入）、任一存储介质的存储空间使用率为90%（有磁盘空间将满）、或者存储系统中存储空间使用率最高的存储介质与存储空间使用率最低的存储介质之差大于后者的20%。应当理解，前述的用于表示负载的不均衡状态的预定阈值0%、90%、20%也仅是示例性的。
图5示出根据本发明一种实施方式的、在图3A所示的存储系统中实现负载再均衡的原理示意图。假设在某一时刻，该存储系统中的存储节点S1的负载很高，其所管理的存储介质包括位于存储设备34处的存储介质1、位于存储设备35处的存储介质1、和位于存储设备36处的存储介质1（如图3A所示），并且其总的存储空间将很快被使用完，同时存储节点S3的负载很低，其所管理的存储介质内的剩余存储空间大。
在传统的存储网络中，各个存储节点只能访问直接连接到本身的存储区域。因此在再平衡过程中，需要将重负载的存储节点上的数据复制到轻负载节点上，在此过程中，会出现大量数据复制操作，对存储区域和网络造成额外的负载，影响正常业务数据的IO访问。例如，需要从存储节点1管理的一个或多个存储介质读取数据，然后将读取的数据写入到存储节点3管理的一个或多个存储介质，最后释放存储节点1管理的存储介质中存储该数据的磁盘空间，实现负载均衡。
然而，根据本发明的实施方式，由于存储系统所包括的各个存储节点S1、S2和S3都可以通过存储网络访问所有存储区域，因此，可以通过转移存储介质访问权的方式来实现存储区域在各个存储节点之间的迁移，即可以对相关存储节点所管理的存储区域重新分组。在再平衡过程中，各个存储区域中的数据不再需要做复制操作。比如，如图5所示的，将位于存储设备34处的、原先由存储节点3管理的存储介质2划分给存储节点1管理，同时将位于存储设备34处的、原先由存储节点1管理的存储介质1划分给存储节点3管理，以此实现存储节点1和存储节点3之间的剩余存储空间的负载均衡。在此过程中，只需要对存储节点1和存储节点3的配置进行修改，可以在很短时间内完成，不会对用户的业务数据读写性能造成影响。
图6示出根据本发明另一种实施方式的、在图3A所示的存储系统中实现负载再均衡的原理示意图。与图5不同，在图6中，在监测到存储节点S1的负载较高而存储节点S2的负载较低时，可以将位于存储设备35处的、原先由存储节点2管理的存储介质2划分给存储节点1管理，同时将位于存储设备34处的、原先由存储节点1管理的存储介质1划分给存储节点2管理，以此实现存储节点1和存储节点2之间的剩余存储空间的负载均衡。
在监测到存储介质扩容的另一种实施方式中，例如，可以将新增存储介质平均分配到各个存储节点上并由其管理，比如按照加入的顺序，以此维持存储节点之间的负载均衡。
应当理解，虽然上述两个实施方式以将存储介质在不同存储节点之间进行调度来实现负载再均衡为例，但是其还可以适用于在存储节点之间调度存储区域以实现负载再均衡，例如，在存储介质扩容的情形下，监测到加入的是一个存储区域的情形时，可以将加入的存储区域按加入顺序分配到各个存储节点。
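对于上述"将新增存储介质或存储区域按加入顺序平均分配到各个存储节点"的做法，一个示意性草图如下（Python，轮转分配策略仅为举例假设）：

```python
# 示意性草图：按加入顺序把新增的存储介质（或存储区域）轮流分配给各个存储节点。

from itertools import cycle
from typing import Dict, List

def assign_new_media(new_media: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    assignment: Dict[str, List[str]] = {n: [] for n in nodes}
    for medium, node in zip(new_media, cycle(nodes)):
        assignment[node].append(medium)
    return assignment

print(assign_new_media(["d1", "d2", "d3", "d4", "d5"], ["S1", "S2", "S3"]))
# {'S1': ['d1', 'd4'], 'S2': ['d2', 'd5'], 'S3': ['d3']}
```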
附加地，如图5和图6所示，在监测到存储节点S1的负载已经很高时，还可以修改存储系统中的计算节点和存储节点之间的配置，使得原先通过存储节点S1存储数据的至少一个计算节点中的一个或多个计算节点、比如C12，可以通过其他存储节点、比如存储节点S2，来存储数据。此时，计算节点可能需要访问其所处的物理服务器之外的存储节点以便存储数据，则可以不在物理上移动计算节点，而是通过远程访问协议、比如iSCSI协议来访问远程存储节点上的存储区域（如图5所示）；或者，可以在对相关存储节点所管理的存储区域进行调整的同时，将计算节点进行迁移（如图6所示），这个过程中可能需要先关闭待移动的计算节点。
应当理解,前述参考图3-图6讨论的存储系统所包括的存储节点、存储设备、存储介质和存储区域的数目仅是示意性的,根据本发明实施方式的存储系统可以包括至少两个存储节点、存储网络以及与至少两个存储节点通过存储网络连接的至少一个存储设备,所述至少一个存储设备中的每个存储设备可以包括至少一个存储介质,存储网络可以被配置为使得每一个存储节点都能够无需借助其他存储节点而访问所有存储介质。
根据本发明的实施方式，每个存储区域被多个存储节点中的一个存储节点所管理，当存储节点启动后，存储节点自动连接受它管理的存储区域，然后进行导入，完成之后就可以向上层计算节点提供存储服务。
当监测到存储节点间出现负载不均衡状态时,需要确定对于负载较高的存储节点、需要迁移的存储区域的部分,以及需要将该存储区域迁移到的存储节点。
对于需要迁移的存储区域的部分的确定,可以有多种实施方式。在一个实施方式中,可以由管理人员人工判断需要迁移哪些存储区域。在一个实施方式中,可以采用配置文件方式,即针对每个存储区域预先配置迁移优先级,当需要迁移的时候选择该存储节点当前管理的存储区域中的优先级最高的一个或者多个存储块、存储组或者存储介质来进行迁移。在一个实施方式中,可以根据存储区域所包括的存储块、存储组或者存储介质的负载情况进行迁移;例如,各个存储节点可以监控受其管理的存储区域的所包括的存储块、存储组或者存储介质的负载情况,比如收集IOPS、吞吐量、IO延时等信息,将所有这些信息进行加权综合,以便选择需要迁移的存储区域部分。
对于需要将该存储区域迁移到的存储节点的确定,可以有多种实施方式。在一个实施方式中,可以由管理人员人工判断迁移到的存储节点。在一个实施方式中,可以采用配置文件方式,即针对每个存储区域预先配置迁移目标列表,比如按照优先级排列的存储节点列表,当确定该存储区域(或者部分)需要被迁移后,按照目标列表依次选择迁移目的地。应当注意,采用此种方式,应当保证迁移后不会造成目标存储节点负载过高。在一个实施方式中,可以根据存储节点的负载情况选择要迁移到的存储节点,可以监控各个存储节点负载情况,例如收集CPU使用率、内存使用率、网络带宽使用率等信息,将所有这些信息进行加权综合,以便选择需要将存储区域迁移到的存储节点。例如,各个存储节点可以定期或者不定期地向其他存储节点报告自身的负载情况,当需要迁移的时候,需要迁移数据的存储节点优先选择负载最低的其他存储节点作为目标存储节点进行迁移。
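结合上述两段，"选择要迁移的存储区域部分"与"选择迁移目标存储节点"的逻辑可以示意为如下Python草图（按负载加权选取仅为举例假设，也可替换为上文所述的人工判断或配置文件方式）：

```python
# 示意性草图：基于负载评分选出要迁移的存储组与目标存储节点。
# region_load / node_load 的来源与加权方式均为举例假设。

from typing import Dict, Tuple

def pick_migration(overloaded_node: str,
                   region_load: Dict[str, float],    # 该节点各存储组的负载评分
                   node_load: Dict[str, float]) -> Tuple[str, str]:
    # 源：负载评分最高的存储组（也可改为按预先配置的迁移优先级选择）
    group = max(region_load, key=region_load.get)
    # 目标：除自身外负载最低的存储节点
    candidates = {n: l for n, l in node_load.items() if n != overloaded_node}
    target = min(candidates, key=candidates.get)
    return group, target

group, target = pick_migration(
    "S1",
    region_load={"g1": 0.7, "g2": 0.9, "g3": 0.4},
    node_load={"S1": 0.9, "S2": 0.3, "S3": 0.5})
print(group, target)   # g2 S2
```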
在确定了需要迁移的存储区域（或者其部分）和其管理权迁移到的目标存储节点后，可以由存储系统的管理人员确认并启动具体迁移过程，或者也可以由程序开启该迁移过程。应当注意，迁移过程需要尽量减少对上层计算节点的影响，例如可以选择在应用负载最小的时候迁移，比如在午夜进行（假设该时间段负载最小）；在确定迁移过程中需要关闭计算节点的情况下，应当尽量在该计算节点的低使用率的情况下进行；可以预先配置迁移策略，以便处理在确定需要对多个存储区域或者一个存储区域的多个部分进行迁移的情况下的迁移的顺序和并发数量的控制；在开始对存储区域进行迁移之际，可以对相关存储节点对相关存储区域的写或者读操作进行必要的配置，以便保证数据的完整性，例如将所有缓存数据写入磁盘；在存储区域迁移到目标存储节点后，目标存储节点需要对该存储区域进行必要的初始化工作，然后该存储区域才可被上层计算节点访问；在迁移过程完成后应当再次监控负载情况，确认负载是否平衡。
如前所述,存储系统可以包括存储控制节点,其连接至所述存储网络,用于确定所述至少两个存储节点中的每个存储节点管理的存储区域;或者,所述存储节点还可以包括存储分配模块,用于确定所述存储节点所管理的存储区域,存储分配模块之间可以共享数据。
在一个实施方式中，存储控制节点或者存储分配模块，记录了各个存储节点负责的存储区域列表。存储节点启动后向存储控制节点或者存储分配模块查询自己管理的存储区域，然后扫描这些存储区域，完成初始化工作。当确定需要进行存储区域迁移时，存储控制节点或者存储分配模块修改相关存储节点的存储区域列表，然后通知存储节点按照要求完成实际的切换工作。
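与上述"存储控制节点/存储分配模块维护存储区域列表、迁移时先改表再通知相关节点"的流程对应，下面给出一个示意性草图（Python，接口名与通知方式均为举例假设）：

```python
# 示意性草图：存储控制节点维护各存储节点的存储区域列表，
# 迁移时先修改列表、再通知相关节点执行实际切换。notify() 为占位函数（假设）。

from typing import Dict, List

class StorageController:
    def __init__(self, region_lists: Dict[str, List[str]]):
        self.region_lists = region_lists   # 节点 -> 其管理的存储区域列表

    def query(self, node: str) -> List[str]:
        """存储节点启动后查询自己管理的存储区域。"""
        return list(self.region_lists.get(node, []))

    def migrate(self, region: str, src: str, dst: str) -> None:
        self.region_lists[src].remove(region)
        self.region_lists[dst].append(region)
        self.notify(src, "release", region)   # 通知源节点释放该存储区域
        self.notify(dst, "takeover", region)  # 通知目标节点接管该存储区域

    def notify(self, node: str, action: str, region: str) -> None:
        print(f"notify {node}: {action} {region}")   # 占位：实际通过节点间网络下发

ctrl = StorageController({"A": ["region-1", "region-2"], "B": ["region-3"]})
ctrl.migrate("region-1", "A", "B")
print(ctrl.query("B"))   # ['region-3', 'region-1']
```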
举例而言,假设在SAS存储系统30中需要将存储区域1从存储节点A迁移到存储节点B,则迁移过程可以包括如下步骤:
1)从存储节点A的已管理存储区域列表中删除存储区域1;
2)在存储节点A上将所有缓存数据强制刷入存储区域1;
3)在存储节点A上通过SAS指令关闭(或者重置)存储节点A和存储区域1中所有存储介质之间的SAS链接;
4)在存储节点B上的已管理存储区域列表中添加存储区域1;
5)在存储节点B上通过SAS指令打开(或者重置)存储节点B和存储区域1中所有存储介质之间的SAS链接;
6)存储节点B扫描存储区域1中的所有存储介质,完成初始化工作;以及
7)应用程序通过存储节点B访问存储区域1中的数据。
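上述7个步骤可以用如下示意性代码草图来概括（Python，其中SAS链路的开闭与磁盘扫描均以占位函数表示，属于举例假设，并非实际的SAS指令接口）：

```python
# 示意性草图：按上文步骤1)~7)把存储区域1从存储节点A迁移到存储节点B。
# flush_cache / set_sas_link / scan_and_import 均为说明用途的占位函数（假设）。

def flush_cache(node: str, region: str) -> None:
    print(f"{node}: 将缓存数据强制刷入 {region}")

def set_sas_link(node: str, region: str, enabled: bool) -> None:
    state = "打开" if enabled else "关闭"
    print(f"{node}: 通过SAS指令{state}到 {region} 中所有存储介质的链接")

def scan_and_import(node: str, region: str) -> None:
    print(f"{node}: 扫描 {region} 中的存储介质并完成初始化")

def migrate_region(region: str, src: str, dst: str,
                   region_lists: dict) -> None:
    region_lists[src].remove(region)          # 1) 从源节点的已管理列表中删除
    flush_cache(src, region)                  # 2) 强制刷入缓存数据
    set_sas_link(src, region, enabled=False)  # 3) 关闭源节点到该区域的SAS链接
    region_lists[dst].append(region)          # 4) 加入目标节点的已管理列表
    set_sas_link(dst, region, enabled=True)   # 5) 打开目标节点到该区域的SAS链接
    scan_and_import(dst, region)              # 6) 目标节点完成初始化
    # 7) 此后应用程序即可通过目标节点访问该存储区域中的数据

lists = {"A": ["存储区域1"], "B": []}
migrate_region("存储区域1", "A", "B", lists)
print(lists)   # {'A': [], 'B': ['存储区域1']}
```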
应当注意，尽管出于简化说明的目的将本发明所述的方法表示和描述为一连串动作，但是应理解和认识到要求保护的主题内容将不受这些动作的执行顺序所限制，因为一些动作可以按照与这里示出和描述的顺序不同的顺序出现或者与其它动作并行地出现，同时一些动作还可能包括若干子步骤，而这些子步骤之间可能在时序上交叉执行。另外，可能并非所有图示的动作是实施根据所附权利要求书所述的方法所必须的。再者，前述步骤的描述不排除该方法还可以包括可能取得附加效果的附加步骤。还应当理解，不同的实施方式或者流程中描述的方法步骤可以相互组合或者替换。
图7示出根据本发明的一个实施方式的用于存储系统的负载再均衡装置70的框图。负载再均衡装置70可以包括:监测模块701,用于监测所述至少两个存储节点之间的负载状态;以及调整模块702,用于在监测到负载的不均衡状态超出预定阈值的情况下,对所述至少两个存储节点中的相关存储节点所管理的存储区域进行调整。
应当理解,装置70中记载的每个模块与参考图4描述的方法40中的每个步骤相对应。由此,上文针对图4描述的操作和特征同样适用于装置70及其中包含的模块,重复的内容在此不再赘述。
根据本发明的实施方式,装置70可以被实现在每个存储节点处,也可以被实现在多个存储节点的调度装置中。
本发明的教导还可以实现为一种计算机可读存储介质的计算机程序产品,包括计算机程序代码,当计算机程序代码由处理器执行时,其使得处理器能够按照本发明实施方式的方法来实现如本文实施方式所述的用于存储系统的负载再均衡方案。计算机存储介质可以为任何有形媒介,例如软盘、CD-ROM、DVD、硬盘驱动器、甚至网络介质等。
根据本发明的实施方式,提供了一种支持存储介质或者存储区域的迁移的存储节点负载再均衡方案,直接通过在各个存储节点之间重新分配存储介质或者存储区域的控制权来实现再均衡,避免了迁移过程中对正常业务数据的影响,显著地提升了存储节点负载再均衡的效率。
应当理解,虽然以上描述了本发明实施方式的一种实现形式可以是计算机程序产品,但是本发明的实施方式的方法或装置可以被依软件、硬件、或者软件和硬件的结合来实现。硬件部分可以利用专用逻辑来实现;软件部分可以存储在存储器中,由适当的指令执行系统,例如微处理器或者专用设计硬件来执行。本领域的普通技术人员可以理解上述的方法和设备可以使用计算机可执行指令和/或包含在处理器控制代码中来实现,例如在诸如磁盘、CD或DVD-ROM的载体介质、诸如只读存储器(固件)的可编程的存储器或者诸如光学或电子信号载体的数据载体上提供了这样的代码。本发明的方法和装置可以由诸如超大规模集成电路或门阵列、诸如逻辑芯片、晶体管等的半导体、或者诸如现场可编程门阵列、可编程逻辑设备等的可编程硬件设备的硬件电路实现,也可以用由各种类型的处理器执行的软件实现,也可以由上述硬件电路和软件的结合例如固件来实现。
应当理解,尽管在上文的详细描述中提及了装置的若干模块或子模块,但是这种划分仅仅是示例性而非强制性的。实际上,根据本发明的示例性实施方式,上文描述的两个或更多模块的特征和功能可以在一个模块中实现。反之,上文描述的一个模块的特征和功能可以进一步划分为由多个模块来实现。
还应当理解，为了不模糊本发明的实施方式，说明书仅对一些关键的、但未必是必要的技术和特征进行了描述，而可能未对一些本领域技术人员能够实现的特征做出说明。
以上所述仅为本发明的较佳实施例而已,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换等,均应包含在本发明的保护范围之内。

Claims (18)

  1. 一种用于存储系统的负载再均衡方法,所述存储系统包括存储网络、至少两个存储节点以及至少一个存储设备,所述至少两个存储节点和所述至少一个存储设备分别连接至所述存储网络,所述至少一个存储设备中的每个存储设备包括至少一个存储介质,其中将所述存储系统所包括的所有存储介质构成一个存储池,所述存储网络被配置为使得每一个存储节点都能够无需借助其他存储节点而访问所有存储介质,并且将所述存储池划分成至少两个存储区域,每个存储节点负责管理零到多个存储区域,
    所述方法包括:
    监测所述至少两个存储节点之间的负载状态;以及
    在监测到一个存储节点的负载超出预定阈值时,对所述至少两个存储节点中的相关存储节点所管理的存储区域进行调整。
  2. 根据权利要求1所述的方法,其中,所述存储系统还包括:
    存储控制节点,连接至所述存储网络,用于确定所述至少两个存储节点中的每个存储节点管理的存储区域;或
    所述存储节点还包括:
    存储分配模块,用于确定所述存储节点所管理的存储区域。
  3. 根据权利要求2所述的方法,其中,所述存储控制节点或者所述存储分配模块记录了所述至少两个存储节点中的每个存储节点管理的存储区域的存储区域列表,并且所述对所述至少两个存储节点中的相关存储节点所管理的存储区域进行调整包括:
    修改相关存储节点的所述存储区域列表。
  4. 根据权利要求1所述的方法,其中,所述监测所述至少两个存储节点之间的负载状态包括监测所述至少两个存储节点的以下性能参数中的一项或多项:
    存储节点的IOPS请求数;
    存储节点的吞吐量;
    存储节点的CPU使用率;
    存储节点的内存使用率;以及
    存储节点管理的存储介质的存储空间使用率。
  5. 根据权利要求4所述的方法,其中,所述预定阈值通过所述性能参数的各自的指定阈值的一项或者多项的组合来表示。
  6. 根据权利要求5所述的方法,其中,所述性能参数的各自的指定阈值包括:
    每项性能参数的参数值最高的存储节点的该项参数值与该项性能参数的参数值最低的存储节点的该项参数值之间的偏差；
    每项性能参数的参数值最高的存储节点的该项参数值与各个存储节点的该项参数的平均值之间的偏差;或者
    针对每项性能参数的指定值。
  7. 根据权利要求1所述的方法,其中,所述至少两个存储区域中的每个存储区域由至少一个存储块组成,一个存储块是一个完整的存储介质,或者一个存储块是一个存储介质的一部分。
  8. 根据权利要求7所述的方法,其中,对存储区域进行的所述调整包括:对相关存储节点所管理的存储区域的配置表进行调整,所述至少两个存储节点根据所述配置表来确定其所管理的存储区域。
  9. 根据权利要求1所述的方法,其中,所述至少两个存储区域中的每个存储区域由至少一个存储块组成,一个存储块是一个完整的存储介质,并且其中对存储区域进行的所述调整包括:
    将所述至少两个存储区域中的第一存储区域中的一个存储介质和第二存储区域中的一个存储介质相交换;
    从所述第一存储区域中删除一个存储介质,并且将该删除的存储介质添加到所述第二存储区域中;
    将接入存储网络的新的存储介质或新的存储区域平均地加入到所述至少两个存储区域中;或者
    将所述至少两个存储区域中的部分存储区域进行合并。
  10. 根据权利要求1-9中任一项所述的方法,其中,所述对所述至少两个存储节点中的相关存储节点所管理的存储区域进行调整包括:由所述存储系统的管理人员人工地确定相关存储节点所管理的存储区域的调整方式;
    采用配置文件方式来确定相关存储节点所管理的存储区域的调整方式;或者
    根据存储节点的负载情况来确定相关存储节点所管理的存储区域的调整方式,
    其中,所述调整方式包括要迁移的存储区域的部分和要迁移到的目标存储节点。
  11. 根据权利要求1-9中任一项所述的方法,其中,所述存储网络包括至少一个存储交换设备,所有至少两个存储节点和所述至少一个存储介质都通过存储通道与存储交换设备连接。
  12. 根据权利要求11所述的方法,其中,所述存储通道是SAS通道或PCI/e通道,所述存储交换设备是SAS交换机或PCI/e交换机。
  13. 根据权利要求1-9中任一项所述的方法,其中,所述存储设备为JBOD;和/或
    所述存储介质是硬盘、闪存、SRAM或DRAM;和/或所述存储介质的接口是SAS接口、SATA接口、PCI/e接口、DIMM接口、NVMe接口、SCSI接口、AHCI接口。
  14. 根据权利要求1-9中任一项所述的方法,其中,每个存储节点对应一个或多个计算节点,并且每个存储节点与其对应的计算节点都位于同一服务器。
  15. 根据权利要求14所述的方法,其中,所述存储节点是所述服务器的一个虚拟机、一个容器或直接运行在所述服务器的物理操作系统上的一个模块;和/或
    所述计算节点是所述服务器的一个虚拟机、一个容器或直接运行在所述服务器的物理操作系统上的一个模块。
  16. 根据权利要求1-9中任一项所述的方法,其中,存储节点对其所管理的存储区域的管理包括:
    每个存储节点只能读写自己管理的存储区域;或
    每个存储节点只能写自己管理的存储区域,但可以读自己管理的存储区域以及其它存储节点管理的存储区域。
  17. 一种用于存储系统的负载再均衡装置,所述存储系统包括存储网络、至少两个存储节点以及至少一个存储设备,所述至少两个存储节点和所述至少一个存储设备分别连接至所述存储网络,
    所述至少一个存储设备中的每个存储设备包括至少一个存储介质,其中所述存储系统所包括的所有存储介质构成一个存储池,所述存储网络被配置为使得每一个存储节点都能够无需借助其他存储节点而访问所有存储介质,并且将所述存储池划分成至少两个存储区域,每个存储节点负责管理零到多个存储区域,
    所述负载再均衡装置包括:
    监测模块,用于监测所述至少两个存储节点之间的负载状态;以及
    调整模块,用于在监测到一个存储节点的负载超出预定阈值时,对所述至少两个存储节点中的相关存储节点所管理的存储区域进行调整。
  18. 一种在计算机可读存储介质中实现的计算机程序产品,所述计算机可读存储介质具有存储于其中的计算机可读程序代码部分,所述计算机可读程序代码部分被配置为执行根据权利要求1-16所述的方法。
PCT/CN2017/077758 2011-10-11 2017-03-22 用于存储系统的负载再均衡方法及装置 WO2017162179A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/139,712 US10782898B2 (en) 2016-02-03 2018-09-24 Data storage system, load rebalancing method thereof and access control method thereof
US16/378,076 US20190235777A1 (en) 2011-10-11 2019-04-08 Redundant storage system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610173784.7 2016-03-23
CN201610173784.7A CN105657066B (zh) 2016-03-23 2016-03-23 用于存储系统的负载再均衡方法及装置

Related Parent Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2017/077753 Continuation-In-Part WO2017162176A1 (zh) 2011-10-11 2017-03-22 存储系统、存储系统的访问方法和存储系统的访问装置
PCT/CN2017/077757 Continuation-In-Part WO2017162178A1 (zh) 2011-10-11 2017-03-22 对存储系统的访问控制方法及装置

Related Child Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2017/071830 Continuation-In-Part WO2017133483A1 (zh) 2011-10-11 2017-01-20 存储系统
PCT/CN2017/077754 Continuation-In-Part WO2017162177A1 (zh) 2011-10-11 2017-03-22 冗余存储系统、冗余存储方法和冗余存储装置

Publications (1)

Publication Number Publication Date
WO2017162179A1 true WO2017162179A1 (zh) 2017-09-28

Family

ID=56495388

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/077758 WO2017162179A1 (zh) 2011-10-11 2017-03-22 用于存储系统的负载再均衡方法及装置

Country Status (2)

Country Link
CN (1) CN105657066B (zh)
WO (1) WO2017162179A1 (zh)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105657066B (zh) * 2016-03-23 2019-06-14 天津书生云科技有限公司 用于存储系统的负载再均衡方法及装置
CN107423301B (zh) * 2016-05-24 2021-02-23 华为技术有限公司 一种数据处理的方法、相关设备及存储系统
CN106375427A (zh) * 2016-08-31 2017-02-01 浪潮(北京)电子信息产业有限公司 一种分布式san存储系统链路冗余优化方法
CN108111566B (zh) * 2016-11-25 2020-11-06 杭州海康威视数字技术股份有限公司 一种云存储系统扩容方法、装置及云存储系统
CN106990919A (zh) * 2017-03-04 2017-07-28 郑州云海信息技术有限公司 自动隔离故障磁盘的存储管理方法及装置
CN107193502B (zh) * 2017-05-27 2021-04-06 郑州云海信息技术有限公司 一种存储服务质量保障方法及装置
CN109788006B (zh) * 2017-11-10 2021-08-24 阿里巴巴集团控股有限公司 数据均衡方法、装置及计算机设备
WO2020086145A1 (en) * 2018-10-22 2020-04-30 Commscope Technologies Llc Load measurement and load balancing for packet processing in a long term evolution evolved node b
CN111381766B (zh) * 2018-12-28 2022-08-02 杭州海康威视系统技术有限公司 一种磁盘动态加载的方法和云存储系统
CN113190167A (zh) * 2020-01-14 2021-07-30 伊姆西Ip控股有限责任公司 用于管理计算设备的方法、电子设备和计算机存储介质
US11061571B1 (en) * 2020-03-19 2021-07-13 Nvidia Corporation Techniques for efficiently organizing and accessing compressible data
CN111552441B (zh) * 2020-04-29 2023-02-28 重庆紫光华山智安科技有限公司 数据存储方法和装置、主节点及分布式系统
CN112747688A (zh) * 2020-12-24 2021-05-04 山东大学 一种基于超声检测定位的离散制造业外观质量信息收集装置及其应用
CN113741819A (zh) * 2021-09-15 2021-12-03 第四范式(北京)技术有限公司 数据分级存储的方法和装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4681374B2 (ja) * 2005-07-07 2011-05-11 株式会社日立製作所 ストレージ管理システム
CN104657316B (zh) * 2015-03-06 2018-01-19 北京百度网讯科技有限公司 服务器
CN104850634A (zh) * 2015-05-22 2015-08-19 中国联合网络通信集团有限公司 一种数据存储节点调整方法及系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582013A (zh) * 2009-06-10 2009-11-18 成都市华为赛门铁克科技有限公司 一种在分布式存储中处理存储热点的方法、装置及系统
CN101827120A (zh) * 2010-02-25 2010-09-08 浪潮(北京)电子信息产业有限公司 一种集群存储方法及系统
US20130223216A1 (en) * 2012-02-26 2013-08-29 Palo Alto Research Center Incorporated QoS AWARE BALANCING IN DATA CENTERS
CN103503414A (zh) * 2012-12-31 2014-01-08 华为技术有限公司 一种计算存储融合的集群系统
CN104238955A (zh) * 2013-06-20 2014-12-24 杭州迪普科技有限公司 一种存储资源虚拟化按需分配的装置和方法
CN105657066A (zh) * 2016-03-23 2016-06-08 天津书生云科技有限公司 用于存储系统的负载再均衡方法及装置

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290699A (zh) * 2018-12-07 2020-06-16 杭州海康威视系统技术有限公司 数据迁移方法、装置及系统
CN111290699B (zh) * 2018-12-07 2023-03-14 杭州海康威视系统技术有限公司 数据迁移方法、装置及系统
CN111078153A (zh) * 2019-12-20 2020-04-28 同方知网(北京)技术有限公司 一种基于文件的分布式存储方法
CN111078153B (zh) * 2019-12-20 2023-08-01 同方知网数字出版技术股份有限公司 一种基于文件的分布式存储方法
CN111464602A (zh) * 2020-03-24 2020-07-28 平安银行股份有限公司 流量处理方法、装置、计算机设备和存储介质
CN111464602B (zh) * 2020-03-24 2023-04-18 平安银行股份有限公司 流量处理方法、装置、计算机设备和存储介质
CN113986522A (zh) * 2021-08-29 2022-01-28 中盾创新数字科技(北京)有限公司 一种基于负载均衡的分布式存储服务器扩容系统
CN117041256A (zh) * 2023-10-08 2023-11-10 深圳市连用科技有限公司 一种网络数据传输存储方法及系统
CN117041256B (zh) * 2023-10-08 2024-02-02 深圳市连用科技有限公司 一种网络数据传输存储方法及系统

Also Published As

Publication number Publication date
CN105657066B (zh) 2019-06-14
CN105657066A (zh) 2016-06-08

Similar Documents

Publication Publication Date Title
WO2017162179A1 (zh) 用于存储系统的负载再均衡方法及装置
US8898385B2 (en) Methods and structure for load balancing of background tasks between storage controllers in a clustered storage environment
US10642704B2 (en) Storage controller failover system
WO2017133483A1 (zh) 存储系统
US7913037B2 (en) Computer system for controlling allocation of physical links and method thereof
WO2017162177A1 (zh) 冗余存储系统、冗余存储方法和冗余存储装置
JP5840548B2 (ja) データセンタ内のリソースの使用効率を改善するための方法及び装置
US8850152B2 (en) Method of data migration and information storage system
US7636801B1 (en) Coordination of quality of service in a multi-layer virtualized storage environment
US7386662B1 (en) Coordination of caching and I/O management in a multi-layer virtualized storage environment
US10782898B2 (en) Data storage system, load rebalancing method thereof and access control method thereof
WO2017162176A1 (zh) 存储系统、存储系统的访问方法和存储系统的访问装置
WO2017162178A1 (zh) 对存储系统的访问控制方法及装置
JP5523468B2 (ja) 直接接続ストレージ・システムのためのアクティブ−アクティブ・フェイルオーバー
US20190235777A1 (en) Redundant storage system
WO2017167106A1 (zh) 存储系统
US20150293708A1 (en) Connectivity-Aware Storage Controller Load Balancing
US9747040B1 (en) Method and system for machine learning for write command selection based on technology feedback
US20180107409A1 (en) Storage area network having fabric-attached storage drives, san agent-executing client devices, and san manager
WO2016190893A1 (en) Storage management
US11768744B2 (en) Alerting and managing data storage system port overload due to host path failures
US11693800B2 (en) Managing IO path bandwidth
JP2021124796A (ja) 分散コンピューティングシステム及びリソース割当方法
US11418594B1 (en) Multi-path layer configured to provide link availability information to storage system for load rebalancing
US11941443B2 (en) Distributed storage workload management

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17769456

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17769456

Country of ref document: EP

Kind code of ref document: A1