CN114691036A - Method and system for improving upgrading efficiency of large-scale distributed storage cluster - Google Patents


Info

Publication number
CN114691036A
Authority
CN
China
Prior art keywords
data storage
module
cluster
upgraded
grouping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210256882.2A
Other languages
Chinese (zh)
Inventor
李凯
李超
高传集
冯建奎
张锦志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202210256882.2A
Publication of CN114691036A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 - Updating
    • G06F16/2379 - Updates performed during online database operations; commit processing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 - Configuration or reconfiguration of storage systems
    • G06F3/0635 - Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 - Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for improving the upgrading efficiency of a large-scale distributed storage cluster, belonging to the field of distributed storage. The method comprises the following specific steps: S1, querying all data storage devices in the current storage cluster and performing a pre-check on each storage node; S2, comparing and determining a list of data storage devices to be upgraded; S3, grouping the data storage devices according to the set fault domain level; S4, performing safety stop detection and upgrading the data storage devices group by group; S5, rejoining the data storage nodes to the cluster. Compared with upgrading data storage devices one at a time, the method groups the data storage devices by fault domain and performs safety stop detection, which greatly shortens the time required to upgrade a large-scale distributed storage cluster.

Description

Method and system for improving upgrading efficiency of large-scale distributed storage cluster
Technical Field
The invention discloses a method and a system for improving upgrading efficiency of a large-scale distributed storage cluster, and relates to the technical field of distributed storage.
Background
With the rapid development of information technology and the growing data storage needs of governments, enterprises and individuals, distributed storage clusters have become increasingly popular. A distributed storage cluster stores data in a distributed manner across a plurality of independent devices. Compared with cluster-type storage, a distributed storage cluster has distinct advantages in scalability, reliability, performance and cost. However, as distributed storage systems are continuously used and iterated in production data centers, upgrading storage clusters has become an urgent problem. Generally, an upgrade raises two concerns: whether the service remains available during the upgrade, and how long the upgrade window lasts.
Existing upgrading methods fall roughly into two types. One is offline upgrading, in which the service must be suspended during the upgrade, so the user's service is interrupted; it is not described further here. The other is online upgrading, which is imperceptible to the user: the user can read from and write to the cluster normally during the upgrade. Online upgrading is further divided into two approaches. The first upgrades individual data storage units one by one; although its impact is small and users' normal reads and writes are not affected, it is time-consuming: upgrading a large-scale cluster often takes several hours or even longer, which is unacceptable to many users and puts considerable pressure on the operation and maintenance personnel responsible for the upgrade. The second common approach upgrades in units of storage nodes, which is convenient and direct. Although it improves upgrading efficiency to a certain extent, it has limitations when the number of storage nodes is small or equal to the number of copies; it cannot fully exploit the property that a multi-copy distributed storage cluster keeps working normally as long as the minimum number of available copies is met; and the upgrade still takes a long time when the number of storage nodes is large.
Therefore, the present invention provides a method and a system for improving the upgrading efficiency of a large-scale distributed storage cluster, so as to solve the above problems.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a method and a system for improving the upgrading efficiency of a large-scale distributed storage cluster, and the adopted technical scheme is as follows: a method for improving the upgrading efficiency of a large-scale distributed storage cluster comprises the following specific steps:
S1, querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;
S2, comparing and determining a list of data storage devices to be upgraded;
S3, grouping the data storage devices according to the set fault domain level;
S4, performing safety stop detection and upgrading the data storage devices group by group;
S5, rejoining the data storage nodes to the cluster.
The specific steps of S3, grouping the data storage devices according to the set fault domain level, are as follows:
S301, acquiring the list of data storage devices to be upgraded on each storage node;
S302, grouping the data storage devices according to the fault domain level set by the current cluster.
The specific steps of S4, performing safety stop detection and upgrading the data storage devices group by group, are as follows:
S401, establishing a key-value data structure for each group of data storage devices;
S402, performing safety stop detection on the data storage devices at the same level of the set fault domain;
S403, upgrading the data storage devices of the group that passes the safety stop detection.
The specific steps of S5, rejoining the data storage nodes to the cluster, are as follows:
S501, setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to recover to normal;
S502, regrouping the data storage devices that fail the safety stop detection;
S503, updating the value that identifies whether each data storage device has been upgraded;
S504, rejoining the upgraded data storage devices to the storage cluster.
A system for improving the upgrading efficiency of a large-scale distributed storage cluster specifically comprises an upgrading control module, an information comparison module, a grouping setting module, a grouping processing module and a node processing module:
the upgrading control module: querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;
the information comparison module: comparing and determining a list of data storage devices to be upgraded;
the grouping setting module: grouping the data storage devices according to the set fault domain level;
the grouping processing module: performing safety stop detection and upgrading the data storage devices group by group;
the node processing module: rejoining the data storage nodes to the cluster.
The grouping setting module specifically comprises a list acquisition module and a device grouping module:
the list acquisition module: acquiring the list of data storage devices to be upgraded on each storage node;
the device grouping module: grouping the data storage devices according to the fault domain level set by the current cluster.
The grouping processing module specifically comprises a structure establishment module, a detection control module and a device upgrading module:
the structure establishment module: establishing a key-value data structure for each group of data storage devices;
the detection control module: performing safety stop detection on the data storage devices at the same level of the set fault domain;
the device upgrading module: upgrading the data storage devices of the group that passes the safety stop detection.
The node processing module specifically comprises a state setting module, a secondary grouping module, a device updating module and a cluster update module:
the state setting module: setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to recover to normal;
the secondary grouping module: regrouping the data storage devices that fail the safety stop detection;
the device updating module: updating the value that identifies whether each data storage device has been upgraded;
the cluster update module: rejoining the upgraded data storage devices to the storage cluster.
The invention has the following beneficial effects: compared with upgrading data storage devices one at a time, the method groups the data storage devices by fault domain and performs safety stop detection, which greatly shortens the time required to upgrade a large-scale distributed storage cluster; compared with upgrading in units of storage nodes, the method remains effective when there are few storage nodes or when the number of storage nodes equals the number of copies, and it fully exploits the property that a distributed storage cluster with redundant copies keeps working normally as long as the minimum number of available copies is met, which greatly improves upgrading efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the method of the present invention; FIG. 2 is a schematic flow chart of an embodiment of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The first embodiment is as follows:
a method for improving the upgrading efficiency of a large-scale distributed storage cluster comprises the following specific steps:
S1, querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;
S2, comparing and determining a list of data storage devices to be upgraded;
S3, grouping the data storage devices according to the set fault domain level;
S4, performing safety stop detection and upgrading the data storage devices group by group;
S5, rejoining the data storage nodes to the cluster;
first, an upgrade controller is started and queries all data storage devices in the current storage cluster; a pre-check controller is established on each storage node and collects the physical storage device information on that node, including the id of each physical storage device, the mapping relation and path of the physical disk device, the type of the distributed storage engine, the metadata storage location and other storage-related information;
the information of the data storage devices in the current storage cluster is then compared with the information of the data storage devices to be upgraded; after the differences are compared, the range of data storage devices to be upgraded is determined and reported to the upgrade controller;
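The pre-check and comparison steps can be pictured with a minimal Python sketch. The record type `DeviceInfo`, its `version` field, and the helper `devices_to_upgrade` are hypothetical names introduced only for illustration; the description does not state which attributes are compared, so comparing an installed version against a target version is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceInfo:
    """Hypothetical pre-check record for one data storage device."""
    device_id: str   # id of the physical storage device
    node: str        # storage node that hosts the device
    disk_path: str   # path of the mapped physical disk device
    engine: str      # type of the distributed storage engine
    version: str     # assumed attribute: currently installed version


def devices_to_upgrade(current, target_version):
    """Return the sorted ids of devices that differ from the target version."""
    return sorted(d.device_id for d in current if d.version != target_version)


# Illustrative inventory as the pre-check controllers might report it.
inventory = [
    DeviceInfo("osd.0", "node-a", "/dev/sdb", "engine-x", "1.0"),
    DeviceInfo("osd.1", "node-a", "/dev/sdc", "engine-x", "2.0"),
    DeviceInfo("osd.2", "node-b", "/dev/sdb", "engine-x", "1.0"),
]
pending = devices_to_upgrade(inventory, "2.0")
```

Here `pending` holds the ids of the two devices still on the older version, which is the list that would be reported to the upgrade controller.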
Further, the specific steps of S3, grouping the data storage devices according to the set fault domain level, include:
S301, acquiring the list of data storage devices to be upgraded on each storage node;
S302, grouping the data storage devices according to the fault domain level set by the current cluster;
after receiving the list of data storage devices to be upgraded on each storage node, the upgrade controller groups the data storage devices according to the fault domain level set in the current cluster, so that safety stop detection is performed on data storage devices at the same level of the set fault domain (for example, when the fault domain is the storage node);
meanwhile, the method supports setting the maximum capacity of each group; if it is not set, the size of each group is not limited and the devices are grouped according to the number of fault domain partitions, which is the default setting; upgrading consumes resources, and if, for specific reasons such as underestimated upgrade resources, too many data storage devices are upgraded concurrently on the same storage node, the resources that the storage node can provide are exceeded and the upgrade becomes abnormal; to avoid this, the number of data storage devices placed in the same group can be reduced by setting the maximum capacity of each group;
usually, each group includes a plurality of data storage devices; these data storage devices may be distributed on different storage nodes, and the data storage devices on the same storage node may be divided into a plurality of groups;
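The grouping rule described above (one group per fault domain by default, smaller groups when a maximum capacity is set) might be sketched as follows. The function name `group_by_fault_domain`, the `max_group_size` parameter and the rack names are illustrative assumptions, not identifiers from the description.

```python
from collections import defaultdict


def group_by_fault_domain(device_domains, max_group_size=None):
    """Group device ids by fault domain.

    device_domains maps device id -> fault domain name (assumed input shape).
    With no max_group_size, each fault domain forms one group (the default);
    otherwise each domain is split into chunks of at most max_group_size.
    """
    by_domain = defaultdict(list)
    for device_id, domain in device_domains.items():
        by_domain[domain].append(device_id)

    groups = []
    for domain in sorted(by_domain):
        members = sorted(by_domain[domain])
        size = max_group_size or len(members)
        for start in range(0, len(members), size):
            groups.append(members[start:start + size])
    return groups


# Illustrative cluster: three devices in one fault domain, one in another.
devices = {"osd.0": "rack-1", "osd.1": "rack-1", "osd.2": "rack-1", "osd.3": "rack-2"}
default_groups = group_by_fault_domain(devices)
capped_groups = group_by_fault_domain(devices, max_group_size=2)
```

With the cap set to 2, the three devices of `rack-1` are split into two groups, which limits how many devices are upgraded concurrently.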
further, the specific steps of S4, performing safety stop detection and upgrading the data storage devices group by group, are as follows:
S401, establishing a key-value data structure for each group of data storage devices;
S402, performing safety stop detection on the data storage devices at the same level of the set fault domain;
S403, upgrading the data storage devices of the group that passes the safety stop detection;
a key-value data structure is established for each group of data storage devices, wherein the key represents the id of a data storage device and the value identifies whether that data storage device has been upgraded;
next, safety stop detection is performed on the group of data storage devices; the safety stop detection checks whether the devices can be stopped safely without affecting real-time data availability, that is, data redundancy is reduced but all data can still be read and written;
if the safety stop detection passes, stopping the group of data storage devices simultaneously will neither cause data loss nor affect normal reading and writing of data, so the data storage devices are stopped and removed (kicked out) from the cluster;
next, the upgrade operation is performed in parallel on the data storage devices of the group, for example upgrading system images, modifying configuration parameters, or cluster expansion or contraction;
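A rough Python sketch of S401 to S403 follows. It assumes that data placement is known as a list of replica sets and that the check passes when every replica set keeps at least `min_replicas` live devices; this criterion, and the names `safe_to_stop`, `upgrade_device` and `upgrade_group`, are assumptions made for illustration, since the description does not specify how the detection is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def safe_to_stop(group, placements, min_replicas):
    """Assumed check: every replica set keeps >= min_replicas live devices."""
    stopped = set(group)
    return all(len(set(replicas) - stopped) >= min_replicas
               for replicas in placements)


def upgrade_device(device_id):
    """Placeholder for the real upgrade work (image, configuration, ...)."""
    return device_id


def upgrade_group(group, placements, min_replicas):
    """Return the key-value status: device id -> whether it was upgraded."""
    status = {device_id: False for device_id in group}   # S401
    if not safe_to_stop(group, placements, min_replicas):  # S402
        return status
    # S403: upgrade all devices of the group in parallel.
    with ThreadPoolExecutor(max_workers=max(1, len(group))) as pool:
        for done in pool.map(upgrade_device, group):
            status[done] = True
    return status


# Illustrative placement: each tuple lists the devices holding one data set.
placements = [("osd.0", "osd.3", "osd.5"), ("osd.1", "osd.4", "osd.5")]
ok_status = upgrade_group(["osd.0", "osd.1"], placements, min_replicas=2)
blocked_status = upgrade_group(["osd.3", "osd.5"], placements, min_replicas=2)
```

In the second call, stopping `osd.3` and `osd.5` together would leave the first replica set with a single live copy, so the group is not upgraded and every value stays `False`.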
still further, the specific steps of S5, rejoining the data storage nodes to the cluster, are as follows:
S501, setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to recover to normal;
S502, regrouping the data storage devices that fail the safety stop detection;
S503, updating the value that identifies whether each data storage device has been upgraded;
S504, rejoining the upgraded data storage devices to the storage cluster;
after the upgrade is finished, the value of each upgraded data storage device is set to upgraded, the controller waits for the cluster state to recover to normal, and the upgrade continues with the next group of data storage devices;
if the safety stop detection does not pass, stopping the data storage devices simultaneously would affect the normal use of the cluster, for example in scenarios such as insufficient data copies or an abnormal cluster state; in this case, the data storage devices of the group need to be regrouped: the number of data storage devices stopped at the same time is halved and the safety stop detection is performed again, and if it still does not pass, the number is halved again until it is reduced to a single data storage device;
during this process, the key-value data structure that identifies whether each data storage device in the group has been upgraded must be updated promptly: the value of a device that passes detection and is upgraded is updated to upgraded, otherwise the value remains to-be-upgraded; this continues until all data storage devices in the group are upgraded;
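The halving retry can be sketched as below. The callables `is_safe` and `upgrade` stand in for the safety stop detection and the upgrade operation and are hypothetical; the sketch also assumes that `upgrade` returns only after the batch is back in service, and that a single device which still fails detection simply stays in the to-be-upgraded state, a case the description does not spell out.

```python
def upgrade_with_halving(group, is_safe, upgrade):
    """Upgrade a group, halving any batch that fails the safety check.

    Returns the key-value status: device id -> True once upgraded.
    """
    status = {device_id: False for device_id in group}
    pending = [list(group)]
    while pending:
        batch = pending.pop(0)
        if is_safe(batch):
            upgrade(batch)
            for device_id in batch:
                status[device_id] = True
        elif len(batch) > 1:
            # Split the failing batch in two and retry each half in order.
            mid = len(batch) // 2
            pending.insert(0, batch[mid:])
            pending.insert(0, batch[:mid])
        # A single device that fails detection stays to-be-upgraded.
    return status


# Illustrative run: pretend at most two devices may be stopped at once.
upgraded_batches = []
final_status = upgrade_with_halving(
    ["a", "b", "c", "d"],
    is_safe=lambda batch: len(batch) <= 2,
    upgrade=upgraded_batches.append,
)
```

In this run the full group of four fails the check, is split once, and the two halves are upgraded one after the other.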
after the upgrade is finished, the cluster-joining step is performed: the upgraded data storage devices are added to the storage cluster again, and the controller waits for data balancing in the cluster; after the group is completed, its data storage devices are removed from the to-be-upgraded list;
meanwhile, the upgrade controller continues with the next group until all groups have been upgraded;
compared with upgrading data storage devices one at a time, the method groups the data storage devices by fault domain and performs safety stop detection, which greatly shortens the time required to upgrade a large-scale distributed storage cluster; compared with upgrading in units of storage nodes, the method remains effective when there are few storage nodes or when the number of storage nodes equals the number of copies, and it fully exploits the property that a multi-copy distributed storage cluster keeps working normally as long as the minimum number of available copies is met, which greatly improves upgrading efficiency.
The second embodiment is as follows:
a system for improving the upgrading efficiency of a large-scale distributed storage cluster specifically comprises an upgrading control module, an information comparison module, a grouping setting module, a grouping processing module and a node processing module:
the upgrading control module: querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;
the information comparison module: comparing and determining a list of data storage devices to be upgraded;
the grouping setting module: grouping the data storage devices according to the set fault domain level;
the grouping processing module: performing safety stop detection and upgrading the data storage devices group by group;
the node processing module: rejoining the data storage nodes to the cluster;
further, the grouping setting module specifically includes a list acquisition module and a device grouping module:
the list acquisition module: acquiring the list of data storage devices to be upgraded on each storage node;
the device grouping module: grouping the data storage devices according to the fault domain level set by the current cluster;
further, the grouping processing module specifically includes a structure establishment module, a detection control module and a device upgrading module:
the structure establishment module: establishing a key-value data structure for each group of data storage devices;
the detection control module: performing safety stop detection on the data storage devices at the same level of the set fault domain;
the device upgrading module: upgrading the data storage devices of the group that passes the safety stop detection;
still further, the node processing module specifically includes a state setting module, a secondary grouping module, a device updating module and a cluster update module:
the state setting module: setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to recover to normal;
the secondary grouping module: regrouping the data storage devices that fail the safety stop detection;
the device updating module: updating the value that identifies whether each data storage device has been upgraded;
the cluster update module: rejoining the upgraded data storage devices to the storage cluster.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for improving the upgrading efficiency of a large-scale distributed storage cluster, characterized by comprising the following specific steps:
S1, querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;
S2, comparing and determining a list of data storage devices to be upgraded;
S3, grouping the data storage devices according to the set fault domain level;
S4, performing safety stop detection and upgrading the data storage devices group by group;
S5, rejoining the data storage nodes to the cluster.
2. The method as claimed in claim 1, wherein the specific steps of S3, grouping the data storage devices according to the set fault domain level, are as follows:
S301, acquiring the list of data storage devices to be upgraded on each storage node;
S302, grouping the data storage devices according to the fault domain level set by the current cluster.
3. The method as claimed in claim 2, wherein the specific steps of S4, performing safety stop detection and upgrading the data storage devices group by group, are as follows:
S401, establishing a key-value data structure for each group of data storage devices;
S402, performing safety stop detection on the data storage devices at the same level of the set fault domain;
S403, upgrading the data storage devices of the group that passes the safety stop detection.
4. The method of claim 3, wherein the specific steps of S5, rejoining the data storage nodes to the cluster, are as follows:
S501, setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to recover to normal;
S502, regrouping the data storage devices that fail the safety stop detection;
S503, updating the value that identifies whether each data storage device has been upgraded;
S504, rejoining the upgraded data storage devices to the storage cluster.
5. A system for improving the upgrading efficiency of a large-scale distributed storage cluster, characterized by specifically comprising an upgrading control module, an information comparison module, a grouping setting module, a grouping processing module and a node processing module:
the upgrading control module: querying all data storage devices in the current storage cluster, and performing a pre-check on each storage node;
the information comparison module: comparing and determining a list of data storage devices to be upgraded;
the grouping setting module: grouping the data storage devices according to the set fault domain level;
the grouping processing module: performing safety stop detection and upgrading the data storage devices group by group;
the node processing module: rejoining the data storage nodes to the cluster.
6. The system of claim 5, wherein the grouping setting module specifically comprises a list acquisition module and a device grouping module:
the list acquisition module: acquiring the list of data storage devices to be upgraded on each storage node;
the device grouping module: grouping the data storage devices according to the fault domain level set by the current cluster.
7. The system of claim 6, wherein the grouping processing module specifically comprises a structure establishment module, a detection control module and a device upgrading module:
the structure establishment module: establishing a key-value data structure for each group of data storage devices;
the detection control module: performing safety stop detection on the data storage devices at the same level of the set fault domain;
the device upgrading module: upgrading the data storage devices of the group that passes the safety stop detection.
8. The system of claim 7, wherein the node processing module specifically comprises a state setting module, a secondary grouping module, a device updating module and a cluster update module:
the state setting module: setting the value of each upgraded data storage device to upgraded, and waiting for the cluster state to recover to normal;
the secondary grouping module: regrouping the data storage devices that fail the safety stop detection;
the device updating module: updating the value that identifies whether each data storage device has been upgraded;
the cluster update module: rejoining the upgraded data storage devices to the storage cluster.
CN202210256882.2A 2022-03-16 2022-03-16 Method and system for improving upgrading efficiency of large-scale distributed storage cluster Pending CN114691036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210256882.2A CN114691036A (en) 2022-03-16 2022-03-16 Method and system for improving upgrading efficiency of large-scale distributed storage cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210256882.2A CN114691036A (en) 2022-03-16 2022-03-16 Method and system for improving upgrading efficiency of large-scale distributed storage cluster

Publications (1)

Publication Number Publication Date
CN114691036A true CN114691036A (en) 2022-07-01

Family

ID=82139815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210256882.2A Pending CN114691036A (en) 2022-03-16 2022-03-16 Method and system for improving upgrading efficiency of large-scale distributed storage cluster

Country Status (1)

Country Link
CN (1) CN114691036A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination