CN105657066A

CN105657066A - Load rebalance method and device used for storage system

Info

Publication number: CN105657066A
Application number: CN201610173784.7A
Authority: CN
Inventors: 王东临; 金友兵; 莫仲华; 齐宇
Original assignee: TIANJIN SURSEN CLOUD TECHNOLOGY Co Ltd
Current assignee: Beijing Shusheng Information Technology Co ltd
Priority date: 2016-03-23
Filing date: 2016-03-23
Publication date: 2016-06-08
Anticipated expiration: 2036-03-23
Also published as: CN105657066B; WO2017162179A1

Abstract

The invention relates to a load rebalance method and device used for a storage system. The method includes the steps that a load state between at least two storage nodes is monitored; when it is monitored that a load of one storage node exceeds a preset threshold, storage areas managed by related storage nodes in the at least two storage nodes are adjusted. By means of the load rebalance method and device, the true migration process of data is avoided during load rebalance between the storage areas, and therefore normal business data can not be affected.

Description

For storing load equalization methods again and the device of system

Technical field

The present invention relates to the technical field of data-storage system, more particularly, to the load equalization methods again and the device that are used for the system that stores.

Background technology

Along with computer utility scale is increasing, the demand of memory space is also grown with each passing day. Corresponding, storage resource (such as storage medium) integration of plural number equipment is integrated and provides storage service to become present main flow as a storage pool. In traditional storage system, this storage system is usually connected what multiple distributed storage nodes formed by TCP/IP network. Fig. 1 illustrates the configuration diagram of the storage system of prior art. As it is shown in figure 1, in traditional storage system, each memory node S is connected to TCP/IP network (being realized by core switch) by Access Network switch. Each memory node is an independent physical server, and every station server has the some storage mediums of oneself. Each memory node is coupled together by the such storage network of such as IP network, constitutes a storage pool.

At the opposite side of core switch, each computing node C is connected to TCP/IP network (being realized by core network switches) also by Access Network switch, to access whole storage pool by TCP/IP network.

But, in the storage system that this is traditional, when when it come to arriving dynamic equilibrium, it is necessary to physical data on memory node is migrated, to reach balance purpose.

Further, in the storage system that this is traditional, generally when user's write data, these data are likely to be assigned on memory node fifty-fifty, and it is all that comparison is balanced that now memory node load and data take. But in situations below, it may appear that data unbalanced:

(1) due to data Allocation Algorithms and user data itself, causing that data fail to be evenly distributed to different memory node, the memory node load shown as is high, and some memory node loads are low;

(2) dilatation operation: realizing dilatation typically by increasing new node, the memory node load being now newly added is 0. The Data Physical of existing memory node must be migrated a part to dilatation node, it is achieved the load between memory node is balanced again.

Fig. 2 illustrates the schematic diagram of the Data Migration in traditional process balanced again based on the load realizing between memory node in the storage system 1 of TCP/IP network. In this example, the part data of storage in memory node S1 higher for load are migrated in the relatively low memory node S2 of load, is specifically related to the Data Migration between the storage medium of these two memory nodes, as shown in dotted arrow 201.Visible, in the process of the load equilibrium again between the memory node realizing TCP/IP network, substantial amounts of disk read-write performance and the network bandwidth can be taken, affect the readwrite performance of regular traffic data.

Summary of the invention

In view of this, one of purpose of embodiment of the present invention is in that to provide a kind of equalization scheme again of the high-efficient carrier for storing system.

According to the embodiment of the present invention, described storage system can include storage network, at least two memory node and at least one storage device, described at least two memory node and at least one storage device described are respectively connecting to described storage network, each storage device at least one storage device described includes at least one storage medium, wherein all storage mediums included by described storage system are constituted a storage pool, described storage network is configured such that each memory node can both without accessing each storage medium by other memory nodes, and described storage pool is divided at least two memory area in units of storage medium, each memory node is responsible for zero and is arrived multiple memory areas.

According to an aspect of the present invention, it is provided that a kind of load equalization methods again for aforementioned storage system. Described method includes: monitor the load condition between described at least two memory node; And when the load monitoring a memory node exceeds predetermined threshold, the memory area that the associated storage node in described at least two memory node is managed is adjusted.

According to another aspect of the present invention, it is provided that a kind of load balancer again for aforementioned storage system. Described device includes: monitoring modular, for monitoring the load condition between described at least two memory node; And adjusting module, for exceeding predetermined threshold at the imbalance monitoring load, the memory area that the associated storage node in described at least two memory node is managed is adjusted.

Further, what the load condition monitored between described at least two memory node can include monitoring in the following performance parameter of described at least two memory node is one or more: the IOPS number of request of memory node; The handling capacity of memory node; The CPU usage of memory node; The memory usage of memory node; And the memory space utilization rate of the storage medium of memory node management.

Further, predetermined threshold can be represented by of the respective appointment threshold value of described performance parameter or multinomial combination.

Further, the memory node that the respective appointment threshold value of performance parameter may include that each performance parameter parameter value is the highest and deviation between the parameter value of memory node that the parameter value of this performance parameter is minimum; Deviation between the meansigma methods of this parameter value of the memory node that the parameter value of each performance parameter is the highest and this parameter of each memory node; Or the designated value for each performance parameter.

In one embodiment, what predetermined threshold can be configured so that in the following is one or more: the deviation between deviation between the IOPS number of request of the IOPS number of request of the memory node that IOPS number is maximum and the minimum memory node of IOPS number is the 30% of the IOPS number of request of the minimum memory node of IOPS number; Between deviation between the meansigma methods of the IOPS number of request of the memory node that IOPS number is maximum and the IOPS number of request of each memory node deviation is this meansigma methods 20%;The memory space utilization rate of arbitrary storage medium is 0%; The memory space utilization rate of arbitrary storage medium is 90%; Or the storage medium that memory space utilization rate that arbitrary memory node manages is the highest and memory space use the difference of the memory space utilization rate between the storage medium that utilization rate is minimum more than 20%.

According to the embodiment of the present invention, each memory area in described at least two memory area is made up of at least one memory block, and a memory block is a complete storage medium, or memory block is a part for a storage medium.

In one embodiment, the described adjustment that memory area is carried out may include that the allocation list of the memory area that associated storage node is managed is adjusted, and described at least two memory node determines its memory area managed according to described allocation list.

In one embodiment, each memory area in described at least two memory area is made up of at least one memory block, one memory block is a complete storage medium, and the adjustment wherein memory area carried out may include that and a storage medium in the first memory area in described at least two memory area and a storage medium in the first memory area is exchanged; From described first memory area, delete a storage medium, and the storage medium of this deletion is added in described second memory area; The new storage medium or new memory area that access storage network are joined in described at least two memory area fifty-fifty; Or the territory, partial memory area in described at least two memory area is merged.

In one embodiment, the memory area associated storage node in described at least two memory node managed is adjusted including: artificially determined the adjustment mode of memory area that associated storage node manages by the management personnel of described storage system; Adopt the adjustment mode of the memory area that configuration file mode determines that associated storage node manages; Or the loading condition according to memory node determines the adjustment mode of the memory area that associated storage node manages. Adjustment mode can include the part of memory area to migrate and the target storage node to move to.

Further, storage network can include at least one storage switching equipment, and all at least two memory nodes and at least one storage medium described all pass through memory channel and be connected with storage switching equipment. Memory channel can be SAS passage or PCI/e passage, and storage switching equipment can be SAS switch or PCI/e switch.

Further, storage device can be JBOD; And/or storage medium can be hard disk, flash memory, SRAM or DRAM.

Further, the interface of storage medium can be SAS interface, SATA interface, PCI/e interface, DIMM interface, NVMe interface, scsi interface, ahci interface.

According to the embodiment of the present invention, each memory node can corresponding one or more computing nodes, and the corresponding computing node of each memory node is all located at same server.

According to the embodiment of the present invention, memory node can be a virtual machine of described server, a container or the module running directly on the physical operating system of described server; And/or computing node can be a virtual machine of described server, a container or the module running directly on the physical operating system of described server.

According to the embodiment of the present invention, the management of the memory area that it manages be may include that each memory node can only read and write self-administered memory area by memory node; Or each memory node can only write self-administered memory area, but self-administered memory area and the memory area of other memory node management can be read.

According to a further aspect of the invention, a kind of computer program realized in a computer-readable storage medium is provided, described computer-readable recording medium has the computer readable program code part being stored therein, and described computer readable program code is partially configured as execution according to preceding method. Such as, described computer readable program code part includes: first can perform part, for monitoring the load condition between described at least two memory node; And second can perform part, for when monitoring the load of a memory node beyond predetermined threshold, the memory area that the associated storage node in described at least two memory node is managed is adjusted.

According to the embodiment of the present invention, provide the memory node load equalization scheme again of a kind of migration supporting memory area, either directly through the control redistributing memory area between each memory node balanced again to the load realizing memory node, it is to avoid impact on regular traffic data in transition process.

From the detailed description made below in conjunction with accompanying drawing, these and other advantages of the present invention and feature will be apparent from, wherein in whole some accompanying drawings described below, and the numbering that similar element will have like.

Accompanying drawing explanation

Fig. 1 illustrates the configuration diagram of the storage system of prior art;

Fig. 2 illustrates the principle schematic that the load realizing between memory node in the storage system of prior art is balanced again;

Fig. 3 A illustrates the configuration diagram of the concrete storage system of constructed by an embodiment of the invention;

Fig. 3 B illustrates the configuration diagram of a constructed according to another implementation of the invention concrete storage system;

Fig. 4 illustrates the flow chart of the load equalization methods again for storing system according to an embodiment of the invention;

Fig. 5 illustrates the principle schematic balanced again according to realizing load in one embodiment of the present invention;

Fig. 6 illustrates the principle schematic balanced again according to realizing load in another embodiment of the present invention; And

Fig. 7 illustrates the block diagram of the load balancer again for storing system according to an embodiment of the invention.

Detailed description of the invention

It is described more fully below present disclosure hereinafter with reference to accompanying drawing, wherein shows the embodiment of present disclosure in the accompanying drawings. But these embodiments multi-form can realize and should not be construed as being limited to embodiment as herein described with many. On the contrary, provide these examples so that present disclosure will be thorough and complete, and will scope of the present disclosure to those skilled in the art's expression all sidedly.

The various embodiments of the present invention are described in detail in an illustrative manner below in conjunction with accompanying drawing.

Fig. 3 A illustrates the configuration diagram of storage system according to the embodiment of the present invention. This storage system includes storage network; Memory node, is connected to described storage network; And storage device, it is similarly connected to described storage network. Each storage device includes at least one storage medium.Such as, the storage device that inventor commonly uses can place 45 pieces of storage mediums. Wherein, described storage network is configured such that each memory node can both without accessing all storage mediums by other memory nodes. Storage network is illustrated as SAS switch by Fig. 3 A, but it is to be understood that storage network can also is that SAS set or other forms that will discuss below. Fig. 3 A schematically shows three memory nodes, i.e. memory node S1, memory node S2 and memory node S3, is directly connected with SAS switch respectively. Storage system shown in Fig. 3 A includes physical server 31,32 and 33, and these physical servers are connected with storage device respectively through storage network. Physical server 31 includes computing node C11, the C12 and the memory node S1 that are co-located in it, physical server 32 includes computing node C21, the C22 and the memory node S2 that are co-located in it, and physical server 33 includes computing node C31, the C32 and the memory node S3 that are co-located in it. Storage system shown in Fig. 3 A includes storage device 34,35 and 36, storage device 34 includes being co-located in its storage medium 1, storage medium 2 and storage medium 3, storage device 35 includes being co-located in its storage medium 1, storage medium 2 and storage medium 3, and storage device 36 includes being co-located in its storage medium 1, storage medium 2 and storage medium 3.

Utilize the storage system that the embodiment of the present invention provides, each memory node can both without accessing all storage mediums by other memory nodes, so that all of storage medium of the present invention is all actually shared by all of memory node, and then achieve the effect of pool of global storage. It is to say, storage network is configured such that each memory node can both without accessing all storage mediums by other memory nodes. Further, storage network is configured such that each memory node is only responsible for fixing storage medium simultaneously, and ensure that a storage medium will not be read by multiple memory nodes simultaneously, cause corrupted data, it is thus possible to realize each memory node without accessing by other memory nodes by the storage medium of its management, and the integrity of the data of storage in storage system can both be ensure that. Furthermore, it is possible to constructed storage pool is divided at least two memory area, each memory node is responsible for zero and is arrived multiple memory areas. With reference to Fig. 3 A, it utilizes different background pattern, diagrammatically illustrates the situation of the memory area of memory node management, and the storage medium wherein identical memory area included and the memory node being responsible for it are indicated with identical background patterns. Specifically, memory node S1 is responsible for the first memory area, and it includes being in the storage medium 1 of storage device 34, being in the storage medium 1 of storage device 35 and be in the storage medium 1 of storage device 36; Memory node S2 is responsible for the second memory area, and it includes being in the storage medium 2 of storage device 34, being in the storage medium 2 of storage device 35 and be in the storage medium 2 of storage device 36; Memory node S3 is responsible for the 3rd memory area, and it includes being in the storage medium 3 of storage device 34, being in the storage medium 3 of storage device 35 and be in the storage medium 3 of storage device 36.

Simultaneously, as can be seen from the above description, compared to prior art, (wherein memory node is positioned at storage medium side, or strictly speaking, storage medium is the built-in dish of memory node place physical machine), in the embodiment of the present invention, the physical machine at memory node place is independent of storage device, and storage device is more as the passage connecting storage medium and storage network.

Such mode so that when needs carry out dynamic equilibrium, it is not necessary to physical data is migrated in different storage mediums, it is only necessary to balance the memory area (or storage medium) that different memory nodes manages by configuring.

In an alternative embodiment of the invention, storage-node side farther includes computing node, and computing node and memory node are arranged in a physical server, and this physical server is connected with storage device by storing network. Utilizing the gathering storage system that computing node and memory node are positioned at Same Physical machine constructed by embodiment of the present invention, saying from overall structure, it is possible to reduce the quantity of required physical equipment, thus reducing cost. Meanwhile, the storage resource that computing node can also desire access to it at local IP access. Further, since computing node and memory node are aggregated on same physical server, only shared drive can be simply arrived in data exchange between the two, and performance is excellent especially.

In the storage system that the embodiment of the present invention provides, computing node includes to the I/O data path length between storage medium: (1) storage medium is to memory node; And (2) memory node is to the computing node (cpu bus path) being aggregated in Same Physical server. And by contrast, the storage system of prior art shown in Fig. 1, its computing node includes to the I/O data path length between storage medium: (1) storage medium is to memory node; (2) memory node is to storing network insertion network switch; (3) storage network insertion network switch is to core network switches; (4) core network switches is to computing network Access Network switch; And (5) computing network Access Network switch is to computing node. Obviously, the total data path of the storage system of embodiment of the present invention is only close to (1st) item of heritage storage system. That is, the storage system that the embodiment of the present invention provides, by the ultimate attainment compression of I/O data path length can drastically increase the I/O channel performance of storage system, its actual operational effect is in close proximity to the I/O passage of read-write local hard drive.

In an embodiment of the present invention, memory node can be a virtual machine of physical server, a container or the module running directly on the physical operating system of server, and computing node can also be a virtual machine of same physical machine server, a container or the module running directly on the physical operating system of described server. In one embodiment, each memory node can corresponding one or more computing nodes.

Specifically, it is possible to a physical server is divided into multiple virtual machine, wherein a virtual machine does memory node use, and other virtual machine does computing node and uses; May also be and utilize a module on physics OS to do memory node use, in order to realize better performance.

In an embodiment of the present invention, the Intel Virtualization Technology forming virtual machine can be KVM or Zen or VMware or Hyper-V Intel Virtualization Technology, and the container technique forming described container can be Docker or Rockett or Odin or Chef or LXC or Vagrant or Ansible or Zone or Jail or Hyper-V container technique.

In an embodiment of the present invention, each memory node is only responsible for fixing storage medium simultaneously, and a storage medium will not be read by multiple memory nodes simultaneously, to avoid data collision, it is thus possible to realize each memory node without accessing by other memory nodes by the storage medium of its management, and the integrity of the data of storage in storage system can both be ensure that.

In an embodiment of the present invention, storage medium all of in system can be divided according to storage logic, specifically, it is possible to the storage pool of whole system is divided into memory area, storage group, the such logical storage layers level framework of memory block, wherein, memory block is minimum memory unit.In an embodiment of the present invention, it is possible to storage pool is divided at least two memory area.

In an embodiment of the present invention, each memory area can be divided at least one storage group. In one preferably embodiment, each memory area is at least divided into two storage groups.

In certain embodiments, memory area and storage group can merge, such that it is able to omit a level in this accumulation layer level framework.

In an embodiment of the present invention, each memory area (or storage group) can be made up of at least one memory block, and wherein memory block can be a complete storage medium, can also be a part for a storage medium. In order in memory area internal build redundant storage, each memory area (or storage group) can be made up of at least two memory block, when any of memory block breaks down, it is possible to all the other memory blocks calculate complete stored data from this group. Redundant storage mode can be many copies pattern, raid-array (RAID) pattern, correcting and eleting codes (erasecode) pattern. In an embodiment of the present invention, redundant storage mode can be set up by ZFS file system. In an embodiment of the present invention, in order to resist the hardware fault of storage device/storage medium, multiple memory blocks that each memory area (or storage group) comprises will not be arranged in same storage medium, is even also not in same storage device. In an embodiment of the present invention, any two memory block that each memory area (or storage group) comprises is all without being arranged in same storage medium/storage device. In an alternative embodiment of the invention, the memory block quantity being positioned at same storage medium/storage device in same memory area (or storage group) is preferably less than or equal to the redundancy of redundant storage. Illustrating, when storing the RAID5 mode that redundancy is taked, the redundancy of its redundant storage is 1, then the memory block quantity of the same storage group being positioned at same storage device is up to 1; To RAID6, the redundancy of its redundant storage is 2, then the memory block quantity of the same storage group being positioned at same storage device is up to 2.

In an embodiment of the present invention, each memory node can only the self-administered memory area of read and write. Owing to the read operation of same memory block can't be conflicted mutually by multiple memory nodes, and multiple memory node is write a memory block and is susceptible to conflict simultaneously, therefore, in another embodiment, can be that each memory node can only write self-administered memory area, but can reading self-administered memory area and the memory area of other memory node management, namely write operation is locality, but read operation can be of overall importance.

In one embodiment, storage system can also include storage and control node, and it is connected to storage network, for determining the memory area that each memory node manages. In another embodiment, each memory node can include storage distribution module, for determining the memory area that this memory node manages, this can be realized by the communication between each storage distribution module included by each memory node and Coordination Treatment algorithm.

In one embodiment, when monitoring a memory node and breaking down, it is possible to other part or all of memory nodes are configured so that by the memory area of the described memory node broken down management before the adapter of these memory nodes. Such as, the memory area of the memory node management can broken down by the adapter of one of them memory node, or, can be taken over by other at least two memory node, the memory area of the part of the memory node management that wherein adapter of each memory node is broken down, such as other at least two memory nodes take over the difference storage group in this memory area respectively.

In one embodiment, storage medium can include but not limited to hard disk, flash memory, SRAM, DRAM, NVME or other form, and the access interface of storage medium can include but not limited to SAS interface, SATA interface, PCI/e interface, DIMM interface, NVMe interface, scsi interface, ahci interface.

In an embodiment of the present invention, storage network can include at least one storage switching equipment, by including storage switching equipment between data exchange realize the memory node access to storage medium. Specifically, memory node and storage medium are connected with storage switching equipment respectively through memory channel.

In an embodiment of the present invention, storage switching equipment can be SAS switch or PCI/e switch, and accordingly, memory channel can be SAS (Serial Attached SCSI (SAS)) passage or PCI/e passage.

For SAS passage, compare traditional storage scheme based on IP agreement, based on the scheme of SAS exchange, have performance high, be with roomy, the advantages such as single device number of disks is many. After the SAS interface on host adapter (HBA) or server master board is combined use, the multiple servers that the storage that SAS system provides can be connected easily accesses simultaneously.

Specifically, SAS switch is connected by a SAS line between storage device, also it is be connected by SAS interface between storage device with storage medium, such as, inside storage device, SAS passage is linked each storage medium (can arrange a SAS exchange chip inside storage device). Owing to the bandwidth of SAS network can reach 24Gb or 48Gb, it is tens times of gigabit Ethernet, and the several times of ten thousand mbit ethernets of costliness; There is the lifting of about order of magnitude than IP network at link layer SAS simultaneously, in transport layer, owing to Transmission Control Protocol three-way handshake is closed for four times, significantly high and TCP the machine-processed time delay sometimes resulting in 100 Milliseconds with slow turn-on of delayed acknowledgement of expense, the time delay of SAS protocol only has 1/the tens of TCP, and performance has bigger lifting. In a word, SAS network has huge advantage than the TCP/IP based on Ethernet in bandwidth, time delay. It will be understood by those skilled in the art that the performance of PCI/e passage can also the demand of adaptive system.

In an embodiment of the present invention, storage network can include at least two storage switching equipment, and described each memory node is connected to any one storage device such as through any one storage switching equipment, and then is connected to storage medium. When any one storage switching equipment or be connected to one storage switching equipment memory channel break down time, memory node by other storage switching equipment read-write storage device on data.

With reference to Fig. 3 B, it illustrates the storage system 30 that constructed by one embodiment of the present invention is concrete. Storage device in storage system 30 is built into multiple stage JBOD307-310, is connected to two SAS switch 305 and 306 respectively through SAS data wire, and the two SAS switch constitutes the exchcange core of the storage network included by storage system. Front end is at least two server 301 and 302, and every station server is connected to the two SAS switch 305 and 306 by SAS interface on HBA equipment (not shown) or mainboard. There is basic network between server to connect for monitoring and communicating. Every station server has a memory node, utilizes the information obtained from SAS link, manage the part or all of disk in all JBOD disks. Specifically, it is possible to use JBOD disk is divided into different storage groups by present specification memory area described above, storage group, memory block. Each memory node manages one or more groups such storage group. During the mode adopting redundant storage when each storage group is internal, it is possible to the metadata of redundant storage is present on disk so that redundant storage can by other memory nodes directly from disk identification.

In shown exemplary memory system 30, memory node can install monitoring and management module, is responsible for the locally stored state with other server of monitoring. When certain JBOD entirety is abnormal, or on JBOD during certain disk exception, data reliability is guaranteed by redundant storage. When certain station server fault, the management module in memory node on another pre-set server, according to the data on disk, will identify in this locality and take over the disk originally managed by the memory node of failed server. The storage service that the memory node of failed server externally provides originally, is also continued the memory node on new server. So far, it is achieved that the pool of global storage structure of a kind of brand-new High Availabitity.

Visible, constructed exemplary memory system 30 provides a kind of multiple spot storage pool controlled, global access. Hardware aspect uses multiple servers externally to provide service, uses JBOD to deposit disk. Multiple stage JBOD each connects two SAS switch, and two switches distinguish the HBA card of Connection Service device again, so that it is guaranteed that all disks on JBOD, it is possible to accessed by Servers-all. SAS redundant link also ensure that the high availability on link.

Local at every station server, utilize redundant memory technology, from every JBOD, choose disk composition redundant storage, it is to avoid the loss of separate unit JBOD causes data unavailable. When a station server lost efficacy, the module that integrality is monitored will dispatch another station server, by the disk that the memory node of SAS channel access failed server manages, these disks that rapid pipe connecting the other side is responsible for, it is achieved the overall situation storage of High Availabitity.

Although being deposit disk for JBOD to be illustrated in figure 3, but it is to be understood that embodiments of the present invention as shown in Figure 3 also support the storage device beyond JBOD. It addition, be above for one piece of storage medium (whole) as a memory block, it is applied equally to the part for a storage medium the situation as a memory block.

Fig. 4 illustrates the flow chart of the access control method 40 for exemplary memory system according to the embodiment of the present invention.

Load condition between step S401, monitoring at least two memory node included by storage system.

In step S402, when the load monitoring a memory node exceeds predetermined threshold, the memory area that the associated storage node at least two memory node is managed is adjusted. Associated storage node can be the memory node of the imbalance causing this load, it is possible to depends on the adjustable strategies of memory area and determines. Adjustment to memory area can be that the memory block being involved in is redistributed between memory node, or can be the increase of memory area, merging or deletion etc. The allocation list of the memory area that can associated storage node be managed is adjusted, and described at least two memory node determines its memory area managed according to described allocation list. The adjustment of aforementioned arrangements table can be controlled node by the storage that aforesaid storage system includes or storage distribution module that memory node includes carries out.

In one embodiment, the monitoring of the load condition between at least two memory node can be carried out for one or more in following performance parameter: the occupancy of the storage medium of read-write operation number of times (IOPS) number of request per second of memory node, the handling capacity of memory node, the CPU usage of memory node, the memory usage of memory node and memory node management.

In one embodiment, the performance parameter of each node regular monitoring oneself can be made, regularly inquiring about the data of other nodes simultaneously, then pass through predefined equalization scheme again or dynamically produce a unified equalization scheme again of the overall situation by algorithm, finally each node performs the program. At another embodiment, storage system includes the monitor node independent of memory node S1, memory node S2 and memory node S3 or aforesaid storage controls node or storage distribution module, monitors the performance parameter of each memory node.

In one embodiment, can passing through predefined threshold values (can configure) for unbalanced judgement and realize, the such as deviation of the IOPS number between each node exceedes certain limit and then triggers equilibrating mechanism again. Such as, for IOPS, it is possible to compared with the IOPS number of the memory node minimum with IOPS number of memory node maximum for IOPS number, determine deviation therebetween more than the latter 30% time, trigger memory area is adjusted. Such as, the storage medium that the minimum memory node of the storage medium managed by memory node maximum for IOPS number and IOPS number manages is exchanged, the storage medium that occupancy that storage medium that occupancy that the memory node such as selecting IOPS number maximum manages the is the highest memory node minimum with IOPS number manages is the highest.

Alternatively, can by the IOPS number of memory node maximum for IOPS number compared with the meansigma methods of the IOPS number of each memory node, determine deviation therebetween more than the latter 20% time, trigger and memory area is adjusted so that the memory area allocative decision after adjustment will not trigger balanced immediately again.

Should be appreciated that and be previously described for representing the predetermined threshold 20% of imbalance of load, 30% being merely illustrative of, it is also possible to the threshold value that different definition according to application scenario with for demand is other. Similarly, performance parameter for other, the occupancy of the storage medium of the handling capacity of such as memory node, the CPU usage of memory node, the memory usage of memory node and memory node management, also definition pre-defines for triggering the threshold value that between memory node, load is balanced again.

It should also be understood that, although the predetermined threshold for unbalanced judgement discussed above can specify threshold value, such as IOPS number to represent by one of the respective appointment threshold value in multinomial performance parameter, but it is envisioned that it can also be represented by the combination of the multinomial appointment threshold value of the respective appointment threshold value in multinomial performance parameter to this predetermined threshold. Such as, the IOPS number at memory node reaches its handling capacity specifying threshold value and memory node and reaches it when specifying threshold value, and just the load of triggering memory node is balanced again.

At an embodiment, adjustment (balanced again) for memory area, the storage medium that memory node high for load manages can be assigned in the memory area that the low memory node of load manages, such as can include the exchange of storage medium, or the deletion from the memory area that the memory node that load is high manages and the increase in the memory area that the memory node that load is low manages, or the storage new storage medium of network will be accessed or new memory area joins at least two memory area (such as fifty-fifty, storage System Expansion), or the territory, partial memory area at least two memory area is merged (such as, one memory node fault).At an embodiment, adjustment (balanced again) for memory area, can development behavior algorithm, such as, it is weighted obtaining a single loading index by the various load datas of each storage medium He each memory node, then calculate an equalization scheme again, by mobile minimal number of disk group, make system no longer beyond reservation threshold.

In one embodiment, the performance parameter of the storage medium that each memory node regular monitoring oneself manages can be made, regularly inquire about the performance parameter of the storage medium that other nodes manage simultaneously, performance parameter for storage medium defines the threshold value being used for representing the imbalance of load, such as, the memory space utilization rate that this threshold value can be arbitrary storage medium is 0% (having new disk to add), the memory space utilization rate of arbitrary storage medium is 90% (having disk space by full), or the difference of the storage medium that the storage medium that in storage system, memory space utilization rate is the highest is minimum with memory space utilization rate is more than the 20% of the latter. should be appreciated that the predetermined threshold 0%, 90%, 30% being previously described for representing the imbalance of load is also merely illustrative of.

Fig. 5 illustrates according to one embodiment of the present invention, principle schematic that to realize load in the storage system shown in Fig. 3 A more balanced. Assume at a time, the load of the memory node S1 in this storage system is significantly high, its storage medium managed includes being positioned at the storage medium 1 at storage device 34 place, being positioned at the storage medium 1 at storage device 35 place and be positioned at the storage medium 1 (as shown in Figure 3A) at storage device 36 place, and its total memory space will quickly be used up, the load of memory node 3 is very low simultaneously, and the memory space in its storage medium managed is big.

In traditional storage network, each memory node can only access the memory area being directly connected to itself. Therefore in reequilibrate process, it is necessary to the data on heavy duty memory node are copied on light load node, in the process, it may appear that mass data replicates operation, memory area and network is caused extra load, affects the I O access of regular traffic data. Such as, the one or more storage mediums from memory node 1 management are needed to read data, what then the data of reading be written to memory node 3 manages is one or more, finally stores the disk space of these data in the storage medium of release memory node 1 management, it is achieved load balancing.

But, according to the embodiment of the present invention, each memory node S1, S2 and S3 in included by storage system accesses all memory areas such as through storage network, therefore, can realizing memory area migration between each memory node by the mode of transfer storage medium access right, the memory area that namely can associated storage node be managed is grouped again. In reequilibrate process, the data in each memory area are no longer necessary to do duplication operation. Such as, as shown in Figure 5, storage medium 2 that will be located in storage device 34 place, that originally have memory node 3 to manage is allocated to memory node 1 and manages, storage medium 1 that simultaneously will be located in storage device 34 place, that originally have memory node 1 to manage is allocated to memory node 3 and manages, and realizes the load balancing of residual memory space between memory node 1 and memory node 3 with this. In the process, it is only necessary to the configuration of memory node 1 and memory node 3 is modified, it is possible to complete within very short time, the business datum readwrite performance of user will not be impacted.

Fig. 6 illustrates according to another embodiment of the present invention, principle schematic that to realize load in the storage system shown in Fig. 3 A more balanced. Different from Fig. 5, in figure 6, when monitoring the load of memory node S1 and the load of memory node S2 is relatively low, storage medium 2 that can will be located in storage device 35 place, that originally have memory node 2 to manage is allocated to memory node 1 and manages, storage medium 1 that simultaneously will be located in storage device 34 place, that originally have memory node 1 to manage is allocated to memory node 2 and manages, and realizes the load balancing of residual memory space between memory node 1 and memory node 2 with this.

Monitor be storage medium dilatation another embodiment in, for instance, it is possible to by newly-increased storage medium being evenly distributed on each memory node and by its management, such as according to add order, maintain the load balancing between memory node with this.

It is to be understood that, although above-mentioned two embodiment is balanced again so that storage medium is scheduling realize load between different memory nodes, but it is balanced again to realize load that it can be applicable to dispatch memory area between memory node, such as, when storage medium dilatation, when what monitor addition is the situation of a memory area, it is possible to the memory area of addition is assigned to each memory node by addition sequence.

Additionally, as shown in Figure 5 and Figure 6, significantly high in the load monitoring memory node S1, the configuration between the computing node in storage system and memory node can also be revised, make the one or more computing nodes, the such as C12 that are originally stored at least one computing node of data by memory node S1, other memory nodes, such as memory node S2 can be passed through, store data. Now, computing node can need the memory node accessing its residing physical server part to store data, then can not mobile computing node physically, but accessed the memory area (as shown in Figure 5) on long-range memory node by remote access protocol, such as iSCSI protocol; Or, it is possible to while the adjustment that the memory area that associated storage node is managed carries out, undertaken migrating (as shown in Figure 6) by computing node, this process is likely to need first to close computing node to be moved.

It is to be understood that, the memory node included by storage system that above-mentioned reference Fig. 3-Fig. 6 discusses, storage device, the number of storage medium and memory area is only illustrative, storage system according to embodiment of the present invention can include at least two memory node, storage network and at least one storage device being connected by storage network with at least two memory node, each storage device at least one storage device described can include at least one storage medium, storage network is configured such that each memory node can both without accessing all storage mediums by other memory nodes.

According to the embodiment of the present invention, each memory area is managed by a memory node in multiple memory nodes, and after memory node starts, memory node is from the memory area being dynamically connected by its management, then import, after completing, just can provide storage service to upper strata computing node.

When monitor load imbalance state occurs between memory node time, it is thus necessary to determine that for the higher memory node of load, the part needing the memory area migrated, and need the memory node moved to by this memory area.

For needing the determination of the part of the memory area of migration, it is possible to there is numerous embodiments. In one embodiment, it is possible to needed to migrate which memory area by management personnel's artificial judgment. In one embodiment, configuration file mode can be adopted, namely it is pre-configured with migration priority for each memory area, selects one or more memory block, storage group or storage medium that the priority in the memory area that this memory node currently manages is the highest to migrate in time needing and migrate. In one embodiment, it is possible to the loading condition of memory block included by memory area, storage group or storage medium migrates; Such as, each memory node can monitor the loading condition of the included memory block of memory area at one's disposal, storage group or storage medium, such as collect the information such as IOPS, handling capacity, IO time delay, all these information are weighted comprehensively, in order to select the memory area part needing to migrate.

For needing the determination of the memory node moved to by this memory area, it is possible to there is numerous embodiments. In one embodiment, it is possible to the memory node moved to by management personnel's artificial judgment. In one embodiment, configuration file mode can be adopted, namely it is pre-configured with migration object listing for each memory area, such as according to the memory node list of priority arrangement, after determining that this memory area (or part) needs are migrated, select move target ground successively according to object listing. It should be noted that, adopt this kind of mode, it shall be guaranteed that do not result in target storage node load too high after migration. In one embodiment, the memory node that can move to according to the loading condition selection of memory node, each memory node loading condition can be monitored, such as collect the information such as CPU usage, memory usage, network bandwidth utilization rate, all these information are weighted comprehensively, in order to select the memory node needing to move to memory area. Such as, each memory node can regularly or aperiodically to other memory nodes report the loading condition of self, when needs migrate time, it is necessary to other memory nodes that the memory node prioritizing selection load of migration data is minimum migrate as target storage node.

After the target storage node that the memory area (or its part) and its administrative power that determine needs migration move to, it is possible to confirmed and start concrete transition process by the management personnel of storage system, or this transition process can also be opened by program. It should be noted that, transition process needs the impact reduced upper strata computing node as far as possible, for instance can select to migrate when application load is minimum, such as carry out (assuming that this time period load is minimum) at midnight; When determine need to close computing node at transition process, it should carry out when the low utilization rate of this computing node as far as possible; Migration strategy can be pre-configured with, in order to process when determining the control of order and the concurrent quantity needing migration multiple parts of multiple memory areas or a memory area are migrated; When starting memory area is migrated, it is possible to associated storage node to associated memory region write or read operation carry out necessity configuration, in order to ensure data integrity, for instance by all data cached write disks; After memory area moves to target storage node, memory node needs this memory node is carried out the initial work of necessity, and then this memory area just can be accessed by upper strata computing node; Should again monitor loading condition after transition process completes, confirm whether load balances.

As it was previously stated, storage system can include storage controls node, it is connected to described storage network, for determining the memory area of each memory node management in described at least two memory node;Or, described memory node can also include storage distribution module, for determining the memory area that described memory node manages, can share data between storage distribution module.

In one embodiment, storage controls node or storage distribution module, have recorded the memory area list that each memory node is responsible for. Memory node starts backward storage and controls node or the storage distribution self-administered memory area of module polls, then scans these memory areas, completes initial work. When determine need occur memory area migrate time, storage control node or storage distribution module amendment associated storage node memory area list, then notice memory node complete as requested reality switch operating.

For example, it is assumed that need from memory node A, memory area 1 is moved to memory node B in SAS storage system 30, then transition process may include steps of:

1) from the domain list of management storage region of memory node A, memory area 1 is deleted;

2) on memory node A, all data cached pressures are brushed into memory area 1;

3) on memory node A, the SAS link between all storage mediums in (or replacement) memory node A and memory area 1 is closed by SAS instruction;

4) domain list of management storage region on memory node B adds memory area 1;

5) on memory node B, pass through the SAS link between all storage mediums in SAS instruction unpack (or replacement) memory node B and memory area 1;

6) all storage mediums in memory node B-scan memory area 1, complete initial work; And

7) application program is by the data in territory, memory node B access storage areas 1.

It should be noted that; while for purposes of simplicity of explanation by method representation of the present invention and be described as series of actions; it should be understood that and recognize that not execution sequence by these actions is limited by claimed subject content; because some actions according to occurring with order different shown and described herein or can occur with other action concurrently; simultaneously some actions are also possible that some sub-steps, and are likely to occur in sequential between these sub-steps to intersect the possibility performed. Additionally, it is possible to also the action of not all diagram is to implement necessary to the method according to appended claims. Furthermore, the description of abovementioned steps is not excluded for the additional step that the method can also include being likely to obtain additional effect. It is also understood that the method step described in different embodiments or flow process can be mutually combined or replace.

Fig. 7 illustrates the block diagram of the load balancer 70 again for storing system according to an embodiment of the invention. Load balancer 70 again may include that monitoring modular 701, for monitoring the load condition between described at least two memory node; And adjusting module 702, for exceeding predetermined threshold at the imbalance monitoring load, the memory area that the associated storage node in described at least two memory node is managed is adjusted.

Should be appreciated that each module recorded in device 70 is corresponding with reference to each step in Fig. 4 method 40 described. Thus, being equally applicable to device 70 and the module wherein comprised above with respect to Fig. 4 operation described and feature, the content of repetition does not repeat them here.

According to the embodiment of the present invention, device 70 can be implemented in each memory node place, it is also possible to is implemented in the dispatching device of multiple memory node.

The teachings of the present invention is also implemented as the computer program of a kind of computer-readable recording medium, including computer program code, when computer program code is performed by processor, it enables a processor to realize the load equalization scheme again for storing system as the embodiment described herein according to the method for embodiment of the present invention. Computer-readable storage medium can be any tangible media, for instance floppy disk, CD-ROM, DVD, hard disk drive, even network medium etc.

According to the embodiment of the present invention, provide the memory node load equalization scheme again of a kind of migration supporting storage medium or memory area, realize again balanced either directly through the control redistributing storage medium or memory area between each memory node, avoid the impact on regular traffic data in transition process, considerably enhance the efficiency that memory node load is balanced again.

It is to be understood that, although a kind of way of realization the foregoing describing embodiment of the present invention can be computer program, but the method or apparatus of embodiments of the present invention can by being implemented in combination according to software, hardware or software and hardware. Hardware components can utilize special logic to realize; Software section can store in memory, by suitable instruction execution system, for instance microprocessor or special designs hardware perform. It will be understood by those skilled in the art that above-mentioned method and apparatus can use computer executable instructions and/or be included in processor control routine to realize, for instance on the such as programmable memory of the mounting medium of disk, CD or DVD-ROM, such as read only memory (firmware) or the data medium of such as optics or electrical signal carrier, provide such code. Methods and apparatus of the present invention can be realized by the hardware circuit of the quasiconductor of such as super large-scale integration or gate array, such as logic chip, transistor etc. or the programmable hardware device of such as field programmable gate array, programmable logic device etc., can also realize with the software performed by various types of processors, it is also possible to realized by the combination of above-mentioned hardware circuit and software such as firmware.

Although should be appreciated that the some modules or submodule that are referred to device in detailed descriptions above, but this division being merely exemplary but not enforceable. It practice, according to an illustrative embodiment of the invention, the feature of two or more modules above-described and function can realize in a module. Otherwise, the feature of an above-described module and function can Further Division for be realized by multiple modules.

It is also understood that for fuzzy embodiments of the present invention, description only some are crucial, may not necessary technology and feature be described, and the feature being likely to some those skilled in the art are not capable of is explained.

The foregoing is only presently preferred embodiments of the present invention, not in order to limit the present invention, all within the spirit and principles in the present invention, any amendment of making, equivalent replacement etc., should be included within protection scope of the present invention.

Claims

1. the load equalization methods again being used for the system that stores, described storage system includes storage network, at least two memory node and at least one storage device, described at least two memory node and at least one storage device described are respectively connecting to described storage network, each storage device at least one storage device described includes at least one storage medium, wherein all storage mediums included by described storage system are constituted a storage pool, described storage network is configured such that each memory node can both without accessing all storage mediums by other memory nodes, and described storage pool is divided at least two memory area, each memory node is responsible for zero and is arrived multiple memory areas,

Described method includes:

Monitor the load condition between described at least two memory node;And

When the load monitoring a memory node exceeds predetermined threshold, the memory area that the associated storage node in described at least two memory node is managed is adjusted.

2. method according to claim 1, wherein, described storage system also includes:

Storage controls node, is connected to described storage network, for determining the memory area of each memory node management in described at least two memory node; Or

Described memory node also includes:

Storage distribution module, for determining the memory area that described memory node manages.

3. method according to claim 2, wherein, described storage control node or described storage distribution module have recorded the memory area list of the memory area of each memory node management in described at least two memory node, and the described memory area that associated storage node in described at least two memory node is managed is adjusted including:

The described memory area list of amendment associated storage node.

4. method according to claim 1, wherein, it is one or more that the load condition between described monitoring described at least two memory node includes monitoring in the following performance parameter of described at least two memory node:

The IOPS number of request of memory node;

The handling capacity of memory node;

The CPU usage of memory node;

The memory usage of memory node; And

The memory space utilization rate of the storage medium of memory node management.

5. method according to claim 4, wherein, described predetermined threshold is represented by one of the respective appointment threshold value of described performance parameter or multinomial combination.

6. method according to claim 5, wherein, the respective appointment threshold value of described performance parameter includes:

The memory node that the parameter value of each performance parameter is the highest and deviation between the parameter value of memory node that the parameter value of this performance parameter is minimum;

Deviation between the meansigma methods of this parameter value of the memory node that the parameter value of each performance parameter is the highest and this parameter of each memory node; Or

Designated value for each performance parameter.

7. method according to claim 1, wherein, each memory area in described at least two memory area is made up of at least one memory block, and a memory block is a complete storage medium, or memory block is a part for a storage medium.

8. method according to claim 7, wherein, the described adjustment that memory area is carried out includes: the allocation list of the memory area that associated storage node is managed is adjusted, and described at least two memory node determines its memory area managed according to described allocation list.

9. method according to claim 1, wherein, each memory area in described at least two memory area is made up of at least one memory block, and a memory block is a complete storage medium, and the described adjustment wherein memory area carried out includes:

A storage medium in the first memory area in described at least two memory area and a storage medium in the second memory area are exchanged;

From described first memory area, delete a storage medium, and the storage medium of this deletion is added in described second memory area;

The new storage medium or new memory area that access storage network are joined in described at least two memory area fifty-fifty; Or

Territory, partial memory area in described at least two memory area is merged.

10. the method according to any one of claim 1-9, wherein, the described memory area that associated storage node in described at least two memory node is managed is adjusted including: artificially determined the adjustment mode of memory area that associated storage node manages by the management personnel of described storage system;

Adopt the adjustment mode of the memory area that configuration file mode determines that associated storage node manages; Or

Loading condition according to memory node determines the adjustment mode of the memory area that associated storage node manages,

Wherein, described adjustment mode includes the part of memory area to migrate and the target storage node to move to.

11. the method according to any one of claim 1-9, wherein, described storage network includes at least one storage switching equipment, and all at least two memory nodes and at least one storage medium described all pass through memory channel and be connected with storage switching equipment.

12. method according to claim 11, wherein, described memory channel is SAS passage or PCI/e passage, and described storage switching equipment is SAS switch or PCI/e switch.

13. the method according to any one of claim 1-9, wherein, described storage device is JBOD; And/or

Described storage medium is hard disk, flash memory, SRAM or DRAM; And/or the interface of described storage medium is SAS interface, SATA interface, PCI/e interface, DIMM interface, NVMe interface, scsi interface, ahci interface.

14. the method according to any one of claim 1-9, wherein, the corresponding one or more computing nodes of each memory node, and the corresponding computing node of each memory node is all located at same server.

15. method according to claim 14, wherein, described memory node is a virtual machine of described server, a container or the module running directly on the physical operating system of described server; And/or

Described computing node is a virtual machine of described server, a container or the module running directly on the physical operating system of described server.

16. the method according to any one of claim 1-9, wherein, the management of the memory area that it manages is included by memory node:

Each memory node can only read and write self-administered memory area; Or

Each memory node can only write self-administered memory area, but can read self-administered memory area and the memory area of other memory node management.

17. one kind for storing the load balancer again of system, described storage system includes storage network, at least two memory node and at least one storage device, described at least two memory node and at least one storage device described are respectively connecting to described storage network

Each storage device at least one storage device described includes at least one storage medium, all storage mediums included by wherein said storage system constitute a storage pool, described storage network is configured such that each memory node can both without accessing all storage mediums by other memory nodes, and described storage pool is divided at least two memory area, each memory node is responsible for zero and is arrived multiple memory areas

Described load balancer again includes:

Monitoring modular, for monitoring the load condition between described at least two memory node; And

Adjusting module, during for exceeding predetermined threshold in the load monitoring a memory node, the memory area that the associated storage node in described at least two memory node is managed is adjusted.

18. the computer program realized in a computer-readable storage medium, described computer-readable recording medium has the computer readable program code part being stored therein, and described computer readable program code is partially configured as the method performed according to claim 1-16.