CN114691036A

CN114691036A - Method and system for improving upgrading efficiency of large-scale distributed storage cluster

Info

Publication number: CN114691036A
Application number: CN202210256882.2A
Authority: CN
Inventors: 李凯; 李超; 高传集; 冯建奎; 张锦志
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2022-03-16
Filing date: 2022-03-16
Publication date: 2022-07-01

Abstract

The invention discloses a method and a system for improving the upgrading efficiency of a large-scale distributed storage cluster, belonging to the field of distributed storage. The method comprises the following steps: S1, querying all data storage devices in the current storage cluster and performing a pre-check on each storage node; S2, determining by comparison a list of data storage devices to be upgraded; S3, grouping the data storage devices according to the set fault domain level; S4, performing safety stop detection and upgrading of the data storage devices group by group; S5, rejoining the data storage nodes to the cluster. Compared with upgrading data storage devices one at a time, the method groups the data storage devices according to the fault domain and performs safety stop detection on each group, which greatly shortens the time required to upgrade a large-scale distributed storage cluster.

Description

Method and system for improving upgrading efficiency of large-scale distributed storage cluster

Technical Field

The invention discloses a method and a system for improving upgrading efficiency of a large-scale distributed storage cluster, and relates to the technical field of distributed storage.

Background

With the explosion of information technology and the increasing demand for government, enterprise and personal data storage, distributed storage clusters have become increasingly popular. A distributed storage cluster stores data in a distributed manner across a plurality of independent devices. Compared with cluster-type storage, a distributed storage cluster offers distinct advantages in scalability, reliability, performance and cost. However, as distributed storage systems are continuously used and iterated in production data centers, upgrading the storage cluster has become an urgent problem. Generally, two aspects of an upgrade matter most: whether the service remains available during the upgrade, and how long the upgrade window lasts.

Existing upgrading methods fall roughly into two types. The first is offline upgrading, in which the service must be suspended during the upgrade, so that the user's service is interrupted; it is not described further here. The second is online upgrading, which is transparent to the user: the user can read from and write to the cluster normally during the upgrade. Online upgrading is in turn carried out in two ways. One way upgrades the individual data storage units one by one. Although this has little impact and does not affect normal reads and writes, it is time-consuming: upgrading a large-scale cluster often takes several hours or longer, which is unacceptable to many users and puts considerable pressure on the operation and maintenance personnel responsible for the upgrade. The other common way upgrades one storage node at a time. This is convenient and direct and improves the upgrading efficiency to some extent, but it is of limited use when the number of storage nodes is small or equals the number of copies; it cannot fully exploit the fact that a multi-copy distributed storage cluster continues to work normally as long as the minimum number of available copies is met; and the upgrade still takes a long time when the number of storage nodes is large.

Therefore, the present invention provides a method and a system for improving the upgrading efficiency of a large-scale distributed storage cluster, so as to solve the above problems.

Disclosure of Invention

In view of the problems in the prior art, the invention provides a method and a system for improving the upgrading efficiency of a large-scale distributed storage cluster. The technical scheme adopted is as follows: a method for improving the upgrading efficiency of a large-scale distributed storage cluster comprises the following specific steps:

S1, querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;

S2, determining by comparison a list of data storage devices to be upgraded;

S3, grouping the data storage devices according to the set fault domain level;

S4, performing safety stop detection and upgrading of the data storage devices group by group;

S5, rejoining the data storage nodes to the cluster.

The specific steps of S3, grouping the data storage devices according to the set fault domain level, are as follows:

S301, acquiring the list of data storage devices to be upgraded of each storage node;

S302, grouping the data storage devices according to the fault domain level set in the current cluster.

The specific steps of S4, performing safety stop detection and upgrading of the data storage devices group by group, are as follows:

S401, establishing a key-value data structure for each group of data storage devices;

S402, performing safety stop detection on the data storage devices at the same level of the set fault domain;

S403, upgrading the group of data storage devices that has passed the safety stop detection.

The specific steps of S5, rejoining the data storage nodes to the cluster, are as follows:

S501, setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to return to normal;

S502, regrouping the data storage devices that fail the safety stop detection;

S503, updating the value that identifies whether each data storage device has been upgraded;

S504, rejoining the upgraded data storage devices to the storage cluster.

A system for improving the upgrading efficiency of a large-scale distributed storage cluster specifically comprises an upgrade control module, an information comparison module, a grouping setting module, a grouping processing module and a node processing module:

an upgrade control module: querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;

an information comparison module: determining by comparison a list of data storage devices to be upgraded;

a grouping setting module: grouping the data storage devices according to the set fault domain level;

a grouping processing module: performing safety stop detection and upgrading of the data storage devices group by group;

a node processing module: rejoining the data storage nodes to the cluster.

The grouping setting module specifically comprises a list acquisition module and a device grouping module:

a list acquisition module: acquiring the list of data storage devices to be upgraded of each storage node;

a device grouping module: grouping the data storage devices according to the fault domain level set in the current cluster.

The grouping processing module specifically comprises a structure establishing module, a detection control module and a device upgrading module:

a structure establishing module: establishing a key-value data structure for each group of data storage devices;

a detection control module: performing safety stop detection on the data storage devices at the same level of the set fault domain;

a device upgrading module: upgrading the group of data storage devices that has passed the safety stop detection.

The node processing module specifically comprises a state setting module, a secondary grouping module, a device updating module and a cluster updating module:

a state setting module: setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to return to normal;

a secondary grouping module: regrouping the data storage devices that fail the safety stop detection;

a device updating module: updating the value that identifies whether each data storage device has been upgraded;

a cluster updating module: rejoining the upgraded data storage devices to the storage cluster.

The beneficial effects of the invention are as follows. Compared with upgrading data storage devices one at a time, the method groups the data storage devices according to the fault domain and performs safety stop detection on each group, which greatly shortens the time required to upgrade a large-scale distributed storage cluster. Compared with upgrading one storage node at a time, the method remains effective when there are few storage nodes or when the number of storage nodes equals the number of copies, and it fully exploits the fact that a distributed storage cluster with redundant copies continues to work normally as long as the minimum number of available copies is met, which greatly improves the upgrading efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of the method of the present invention; FIG. 2 is a schematic flow chart of an embodiment of the present invention.

Detailed Description

The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.

The first embodiment is as follows:

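a method for improving the upgrading efficiency of a large-scale distributed storage cluster comprises the following specific steps:

S1, querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;

S2, determining by comparison a list of data storage devices to be upgraded;

S3, grouping the data storage devices according to the set fault domain level;

S4, performing safety stop detection and upgrading of the data storage devices group by group;

S5, rejoining the data storage nodes to the cluster;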

first, an upgrade controller is started and queries all data storage devices in the current storage cluster; a pre-check controller is established on each storage node and collects information on the physical storage devices of that node, including the id of each physical storage device, the mapping relation to and the path of the physical disk device, the type of the distributed storage engine, the metadata storage location and other storage-related information;

the information on the data storage devices in the current storage cluster is then compared with the information on the data storage devices to be upgraded; after the differences are compared, the range of data storage devices to be upgraded is determined and reported to the upgrade controller;
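By way of non-limiting illustration, the comparison step may be sketched as follows. The sketch assumes that the difference being compared is a version string reported by the pre-check; the source does not specify which fields are compared, and `DeviceInfo`, `devices_to_upgrade` and the field names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceInfo:
    """Pre-check record for one data storage device (illustrative fields only)."""
    device_id: int
    node: str
    version: str


def devices_to_upgrade(current: list[DeviceInfo], target_version: str) -> list[DeviceInfo]:
    """Return the devices whose reported version differs from the target version.

    Comparing versions is an assumption of this sketch, not a detail from the source.
    """
    return [d for d in current if d.version != target_version]


inventory = [
    DeviceInfo(0, "node-a", "1.0"),
    DeviceInfo(1, "node-a", "1.1"),
    DeviceInfo(2, "node-b", "1.0"),
]
pending = devices_to_upgrade(inventory, "1.1")
print([d.device_id for d in pending])  # prints [0, 2]
```

The resulting list corresponds to the range of devices that each storage node would report to the upgrade controller.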

Further, the specific steps of S3, grouping the data storage devices according to the set fault domain level, are as follows:

S301, acquiring the list of data storage devices to be upgraded of each storage node;

S302, grouping the data storage devices according to the fault domain level set in the current cluster;

after receiving the list of data storage devices to be upgraded from each storage node, the upgrade controller groups the data storage devices according to the fault domain level set in the current cluster, so that safety stop detection can be performed on the data storage devices at the same level of the set fault domain (for example, where the fault domain is a storage node);

meanwhile, the method supports setting a maximum capacity for each group. If no maximum is set, which is the default, the size of each group is not limited and the devices are grouped according to the number of fault domain partitions. Upgrading consumes resources; if, for some reason such as an underestimate of the resources required by the upgrade, too many data storage devices are upgraded concurrently on the same storage node, the resources available on that node may be exceeded and the upgrade may become abnormal. To avoid this, the number of data storage devices placed in the same group can be reduced by setting the maximum capacity of each group;

usually, each group includes a plurality of data storage devices, and these data storage devices may be distributed on different storage nodes, or the data storage devices on the same storage node may be divided into a plurality of groups;
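By way of non-limiting illustration, grouping by fault domain with an optional maximum group capacity may be sketched as follows. The sketch assumes that each device is labelled with the name of its fault domain (for example a storage node or a rack) and that an oversized group is simply cut into consecutive chunks; `group_by_fault_domain`, `devices` and `max_group_size` are hypothetical names.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Optional


def group_by_fault_domain(
    devices: dict[int, str],
    max_group_size: Optional[int] = None,
) -> list[list[int]]:
    """Group device ids by fault domain, splitting groups above the maximum size.

    `devices` maps a device id to the name of its fault domain. With
    `max_group_size=None` (the default) group sizes are not limited.
    """
    if max_group_size is not None and max_group_size < 1:
        raise ValueError("max_group_size must be at least 1")

    by_domain: dict[str, list[int]] = defaultdict(list)
    for device_id, domain in sorted(devices.items()):
        by_domain[domain].append(device_id)

    groups: list[list[int]] = []
    for domain in sorted(by_domain):
        members = by_domain[domain]
        step = max_group_size or len(members)
        # Cut the fault domain into consecutive chunks of at most `step` devices.
        for start in range(0, len(members), step):
            groups.append(members[start:start + step])
    return groups


layout = {0: "node-a", 1: "node-a", 2: "node-a", 3: "node-b", 4: "node-b"}
print(group_by_fault_domain(layout))                    # [[0, 1, 2], [3, 4]]
print(group_by_fault_domain(layout, max_group_size=2))  # [[0, 1], [2], [3, 4]]
```

With the maximum set to 2, the devices of `node-a` are divided into two groups, which matches the case in which devices on the same storage node fall into several groups.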

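further, the specific steps of S4, performing safety stop detection and upgrading of the data storage devices group by group, are as follows:

S401, establishing a key-value data structure for each group of data storage devices;

S402, performing safety stop detection on the data storage devices at the same level of the set fault domain;

S403, upgrading the group of data storage devices that has passed the safety stop detection;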

establishing a key-value data structure for each group of data storage equipment, wherein the key represents the id of the data storage equipment, and the value identifies whether the data storage equipment is upgraded or not;

next, safety stop detection is performed on the group of data storage devices; the safety stop detection checks whether the devices can be stopped without affecting real-time data availability, that is, whether all data can still be read and written although the number of redundant copies is reduced;
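By way of non-limiting illustration, one possible form of the safety stop detection is sketched below. The sketch assumes that the cluster can report, for each data item, the set of devices holding a copy, and that data remain readable and writable while at least `min_copies` copies are available; `safe_to_stop`, `replicas` and `min_copies` are hypothetical names and the source does not prescribe this particular check.

```python
from __future__ import annotations


def safe_to_stop(
    stopping: set[int],
    replicas: dict[str, set[int]],
    min_copies: int,
) -> bool:
    """Return True if every data item keeps at least `min_copies` live copies.

    `replicas` maps a data item (for example a placement unit) to the ids of
    the devices that hold a copy of it.
    """
    return all(len(holders - stopping) >= min_copies for holders in replicas.values())


replicas = {
    "item-1": {0, 3, 5},
    "item-2": {1, 3, 4},
    "item-3": {2, 4, 5},
}
print(safe_to_stop({0, 1, 2}, replicas, min_copies=2))  # True
print(safe_to_stop({3, 4}, replicas, min_copies=2))     # False
```

In the second call, `item-2` would be left with a single copy, so stopping devices 3 and 4 together fails the detection.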

if the safety stop detection is passed, the data storage devices of the group can be stopped simultaneously without losing data or affecting normal reads and writes; the data storage devices are then stopped and removed from the cluster;

next, the data storage devices of the group are upgraded in parallel, for example by upgrading the system image, modifying configuration parameters, or performing cluster expansion or contraction;
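By way of non-limiting illustration, the key-value data structure and the parallel upgrade of one group may be sketched as follows. The sketch uses a thread pool as one possible way to run the upgrades concurrently; `upgrade_device` is a hypothetical placeholder for the actual upgrade work, which the source leaves open.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def upgrade_device(device_id: int) -> int:
    """Placeholder for the real upgrade work (image update, configuration change)."""
    return device_id


def upgrade_group(group: list[int]) -> dict[int, bool]:
    """Upgrade all devices of a group in parallel and record which are upgraded."""
    # Key: device id; value: whether the device has been upgraded.
    upgraded = {device_id: False for device_id in group}
    with ThreadPoolExecutor(max_workers=len(group) or 1) as pool:
        for device_id in pool.map(upgrade_device, group):
            upgraded[device_id] = True
    return upgraded


print(upgrade_group([0, 1, 2]))  # {0: True, 1: True, 2: True}
```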

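still further, the specific steps of S5, rejoining the data storage nodes to the cluster, are as follows:

S501, setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to return to normal;

S502, regrouping the data storage devices that fail the safety stop detection;

S503, updating the value that identifies whether each data storage device has been upgraded;

S504, rejoining the upgraded data storage devices to the storage cluster;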

after the upgrading is finished, setting the value of the upgraded data storage equipment as upgraded, waiting for the cluster state to recover to normal, and continuing the upgrading of the next group of data storage equipment;

if the group does not pass the safety stop detection, stopping its data storage devices simultaneously would affect normal use of the cluster, for example because of insufficient data copies or an abnormal cluster state; in that case the data storage devices of the group are regrouped, the number of data storage devices stopped at the same time is halved, and safety stop detection is performed again; if the detection still fails, the number is halved again, until it is reduced to a single data storage device;
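By way of non-limiting illustration, the repeated halving may be sketched as follows. The sketch takes the safety stop detection as a caller-supplied function `is_safe`, and it assumes that a single device that still fails the detection is set aside in a `blocked` list; the source does not say how that case is handled, and `plan_batches` is a hypothetical name.

```python
from __future__ import annotations

from collections import deque
from typing import Callable


def plan_batches(
    group: list[int],
    is_safe: Callable[[list[int]], bool],
) -> tuple[list[list[int]], list[int]]:
    """Split a group into batches that each pass the safety stop detection.

    A failing batch is halved and both halves are checked again. A single
    device that still fails is returned in `blocked` (an assumption of this
    sketch).
    """
    batches: list[list[int]] = []
    blocked: list[int] = []
    queue = deque([group] if group else [])
    while queue:
        batch = queue.popleft()
        if is_safe(batch):
            batches.append(batch)
        elif len(batch) == 1:
            blocked.append(batch[0])
        else:
            mid = len(batch) // 2
            # Re-queue the two halves at the front, keeping device order.
            queue.appendleft(batch[mid:])
            queue.appendleft(batch[:mid])
    return batches, blocked


# Toy detection: at most two devices may be stopped at the same time.
print(plan_batches([0, 1, 2, 3, 4], lambda batch: len(batch) <= 2))
# ([[0, 1], [2], [3, 4]], [])
```

In a real cluster the detection result depends on the cluster state at the time of the check, so each batch would be checked immediately before it is stopped rather than planned in advance.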

during this process, the key-value data structure that identifies whether each data storage device in the group has been upgraded must be updated promptly: the value of a device that has passed the detection and been upgraded is set to upgraded, and otherwise the value remains in the to-be-upgraded state; this continues until all data storage devices in the group have been upgraded;

after the upgrade is finished, the upgraded data storage devices are added to the storage cluster again and the cluster waits for the data to rebalance; after the group is completed, its data storage devices are removed from the list to be upgraded;

the upgrade controller then continues with the next group until all groups have been upgraded;
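By way of non-limiting illustration, the group-by-group loop of the upgrade controller may be sketched as follows. The detection, the upgrade and the wait for the cluster to recover are passed in as caller-supplied functions, and the regrouping of a group that fails the detection is omitted for brevity; `run_upgrade` and its parameters are hypothetical names.

```python
from __future__ import annotations

from typing import Callable


def run_upgrade(
    groups: list[list[int]],
    is_safe: Callable[[list[int]], bool],
    upgrade: Callable[[list[int]], None],
    wait_healthy: Callable[[], None],
) -> list[int]:
    """Upgrade the groups one after another and return the devices still pending."""
    to_upgrade = [device_id for group in groups for device_id in group]
    for group in groups:
        if not is_safe(group):
            continue  # regrouping of a failed group is omitted in this sketch
        upgrade(group)   # stop, upgrade and rejoin the devices of the group
        wait_healthy()   # wait for the cluster state to return to normal
        for device_id in group:
            to_upgrade.remove(device_id)
    return to_upgrade


events: list[tuple] = []
remaining = run_upgrade(
    groups=[[0, 1], [2, 3], [4]],
    is_safe=lambda group: 3 not in group,
    upgrade=lambda group: events.append(("upgrade", tuple(group))),
    wait_healthy=lambda: events.append(("healthy",)),
)
print(remaining)  # [2, 3]
```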

compared with upgrading data storage devices one at a time, the method groups the data storage devices according to the fault domain and performs safety stop detection on each group, which greatly shortens the time required to upgrade a large-scale distributed storage cluster; compared with upgrading one storage node at a time, the method remains effective when there are few storage nodes or when the number of storage nodes equals the number of copies, and it fully exploits the fact that a multi-copy distributed storage cluster continues to work normally as long as the minimum number of available copies is met, which greatly improves the upgrading efficiency.

The second embodiment is as follows:

an information comparison module: determining by comparison a list of data storage devices to be upgraded;

a grouping processing module: performing safety stop detection and upgrading of the data storage devices group by group;

a node processing module: rejoining the data storage nodes to the cluster;

further, the grouping setting module specifically includes a list acquisition module and a device grouping module:

a device grouping module: grouping the data storage devices according to the fault domain level set in the current cluster;

further, the grouping processing module specifically includes a structure establishing module, a detection control module and a device upgrading module:

a device upgrading module: upgrading the group of data storage devices that has passed the safety stop detection;

still further, the node processing module specifically includes a state setting module, a secondary grouping module, a device updating module and a cluster updating module:

a secondary grouping module: regrouping the data storage devices that fail the safety stop detection;

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for improving the upgrading efficiency of a large-scale distributed storage cluster is characterized by comprising the following specific steps:

S2, determining by comparison a list of data storage devices to be upgraded;

S3, grouping the data storage devices according to the set fault domain level;

S5, rejoining the data storage nodes to the cluster.

2. The method as claimed in claim 1, wherein step S3 of grouping the data storage devices according to the set fault domain level comprises the following steps:

3. The method as claimed in claim 2, wherein step S4 of performing safety stop detection and upgrading of the data storage devices group by group comprises the following steps:

S403, upgrading the group of data storage devices that has passed the safety stop detection.

4. The method of claim 3, wherein step S5 of rejoining the data storage nodes to the cluster comprises the following steps:

S504, rejoining the upgraded data storage devices to the storage cluster.

5. A system for improving the upgrading efficiency of a large-scale distributed storage cluster, characterized by specifically comprising an upgrade control module, an information comparison module, a grouping setting module, a grouping processing module and a node processing module:

an upgrade control module: querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;

a node processing module: rejoining the data storage nodes to the cluster.

6. The system of claim 5, wherein the grouping setting module specifically comprises a list acquisition module and a device grouping module:

7. The system of claim 6, wherein the grouping processing module specifically comprises a structure establishing module, a detection control module, and a device upgrading module:

8. The system of claim 7, wherein the node processing module specifically comprises a state setting module, a secondary grouping module, a device updating module, and a cluster updating module: