CN111770158A - Cloud platform recovery method and device, electronic equipment and computer readable storage medium


Info

Publication number
CN111770158A
CN111770158A (application CN202010591210.8A; granted as CN111770158B)
Authority
CN
China
Prior art keywords
rbd
subfile
identification information
storage cluster
subfiles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010591210.8A
Other languages
Chinese (zh)
Other versions
CN111770158B (en)
Inventor
葛凯凯
邬沛君
郑松坚
潘晓东
吴晓清
徐凯
李文达
江鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010591210.8A
Publication of CN111770158A
Application granted
Publication of CN111770158B
Legal status: Active

Classifications

    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G06F11/1464 Management of the backup or restore process for networked environments
    • G06F11/1484 Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • G06F2009/45575 Starting, stopping, suspending or resuming virtual machine instances
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances

Abstract

The application provides a cloud platform recovery method, a cloud platform recovery device, an electronic device and a computer-readable storage medium, and the method comprises the following steps: when one or more storage nodes in a Ceph storage cluster of a cloud platform fail, acquiring identification information of each virtual machine in the cloud platform; determining identification information of each block device rbd corresponding to each virtual machine based on the identification information of the virtual machine; acquiring, based on the identification information of each rbd, the subfiles corresponding to the rbd that are distributed and stored in the Ceph storage cluster; splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to a standby Ceph storage cluster; and controlling the cloud platform to switch from the Ceph storage cluster to the standby Ceph storage cluster so as to restore the cloud platform. According to the scheme, the functional components in the failed Ceph storage cluster do not need to be repaired one by one, and the cloud platform can be recovered in time even if the Ceph storage cluster is severely damaged or carries a large amount of data.

Description

Cloud platform recovery method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a cloud platform recovery method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Many cloud platforms combine an OpenStack management system, which provides the management functions of the cloud platform, with a Ceph storage cluster, which provides unified storage functions including block storage, object storage, and file storage. A widely used deployment today uses Ceph block storage to provide the system disks and data disks of virtual machines.
During use of the cloud platform, as users apply for more and more virtual machines, the volume of data carried by the Ceph storage cluster grows, and the cluster may gradually be expanded to hundreds or even thousands of OSDs (Object Storage Devices). As the scale of the Ceph storage cluster becomes larger, data migration and recovery become challenges. Storage is the foundation of the cloud platform: if the Ceph storage cluster fails, the cloud platform faces the risk of becoming unusable, so the cluster must be repaired in time to restore normal operation of the cloud platform.
At present, the repair scheme for a damaged Ceph storage cluster is to repair the failed components one by one until all components operate normally, and then to restore data consistency using the self-recovery capability of the Ceph storage cluster, finally recovering the whole cloud platform. However, when the Ceph storage cluster is severely damaged or carries a large amount of data, this repair scheme cannot recover the cloud platform in time.
Disclosure of Invention
The purpose of this application is to solve at least one of the above technical defects. The technical solutions provided by the embodiments of this application are as follows:
in a first aspect, an embodiment of the present application provides a cloud platform recovery method, including:
when one or more storage nodes in a Ceph storage cluster of a cloud platform fail, acquiring identification information of each virtual machine in the cloud platform;
determining identification information of each block device rbd corresponding to each virtual machine based on the identification information of each virtual machine;
acquiring each subfile which is distributed and stored in a Ceph storage cluster and corresponds to each rbd based on the identification information of each rbd;
splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to a standby Ceph storage cluster;
and controlling the cloud platform to be switched from the Ceph storage cluster to the standby Ceph storage cluster so as to restore the cloud platform.
In an optional embodiment of the present application, determining, based on the identification information of each virtual machine, the identification information of each block device rbd corresponding to the virtual machine includes:
acquiring the name of each rbd corresponding to each virtual machine based on the identification information of each virtual machine and the corresponding relation between the virtual machine and each data storage pool in the Ceph storage cluster;
identification information of each rbd is obtained from the name of the rbd.
In an optional embodiment of the present application, the file name of each subfile includes identification information of a corresponding rbd, and based on the identification information of each rbd, the method for acquiring each subfile, which is distributed and stored in the Ceph storage cluster and corresponds to the rbd, includes:
comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd;
all subfiles corresponding to each rbd are retrieved.
In an optional embodiment of the present application, comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd includes:
and respectively comparing the identification information of each rbd with the identification information of the rbd contained in each subfile stored in each object storage device OSD in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd.
In an optional embodiment of the present application, after determining the subfile with the file name containing the identification information of the rbd as the subfile corresponding to the rbd, the method further includes:
the method comprises the steps of obtaining storage path information of each subfile corresponding to each rbd, and storing the storage path information of each subfile and identification information of the corresponding rbd into a preset database in a one-to-one correspondence mode, wherein the storage path information comprises a host name of an OSD (on screen display) where the corresponding subfile is located and storage directory information of the subfile in the OSD;
acquiring all subfiles corresponding to each rbd, including:
acquiring storage path information of each corresponding subfile from a preset database based on the identification information of each rbd;
and acquiring all corresponding subfiles from the corresponding OSD based on the path information of each subfile.
In an optional embodiment of the present application, the file name of each subfile includes a location offset of the subfile in the corresponding rbd, where the location offset indicates the location of the corresponding subfile in the corresponding rbd, and splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd includes:
determining a target subfile for each position of the rbd based on the subfiles corresponding to the rbd;
determining the splicing sequence of each target subfile based on the position offset contained in the file name of each target subfile corresponding to each rbd;
and splicing the target subfiles corresponding to each rbd according to the splicing sequence to obtain a local file corresponding to the rbd.
In an optional embodiment of the present application, determining a target subfile at each location of each rbd based on the subfiles corresponding to the rbd includes:
for each position in the rbd, if the position corresponds to a subfile, determining the subfile as a target subfile of the position;
and if the position corresponds to at least two subfiles, selecting one subfile from the at least two subfiles according to a preset strategy to determine the selected subfile as a target subfile of the position.
In an optional embodiment of the present application, selecting one subfile from the at least two subfiles according to a preset policy to determine the target subfile at the location includes:
acquiring the message digest algorithm (MD5) value of each subfile;
if the MD5 values of the subfiles are the same, selecting any one of the subfiles and determining it as the target subfile for the position;
if the MD5 values of the subfiles are different, selecting the subfile with the earliest or latest modification time among the subfiles and determining it as the target subfile for the position.
In an optional embodiment of the present application, after the controlling cloud platform is switched from the Ceph storage cluster to the standby Ceph storage cluster, the method further includes:
formatting the Ceph storage cluster to obtain a formatted Ceph storage cluster, and adding storage nodes in the formatted Ceph storage cluster into a standby Ceph storage cluster to obtain a combined Ceph storage cluster;
and eliminating the storage nodes belonging to the standby Ceph storage cluster system in the combined Ceph storage cluster.
In a second aspect, an embodiment of the present application provides a cloud platform recovery apparatus, including:
the identification information acquisition module of the virtual machine is used for acquiring identification information of each virtual machine in the cloud platform when one or more storage nodes in a Ceph storage cluster of the cloud platform fail;
an rbd identification information acquisition module, configured to determine, based on identification information of each virtual machine, identification information of each block device rbd corresponding to the virtual machine;
the subfile acquisition module is used for acquiring subfiles which are distributed and stored in the Ceph storage cluster and correspond to each rbd based on the identification information of each rbd;
the local file uploading module is used for splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to the standby Ceph storage cluster;
and the storage cluster switching module is used for controlling the cloud platform to be switched from the Ceph storage cluster to the standby Ceph storage cluster so as to recover the cloud platform.
In an optional embodiment of the present application, the identification information obtaining module of rbd is specifically configured to:
acquiring the name of each rbd corresponding to each virtual machine based on the identification information of each virtual machine and the corresponding relation between the virtual machine and each data storage pool in the Ceph storage cluster;
identification information of each rbd is obtained from the name of the rbd.
In an optional embodiment of the present application, a file name of each subfile includes identification information of a corresponding rbd, and the subfile obtaining module is specifically configured to:
comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd;
all subfiles corresponding to each rbd are retrieved.
In an optional embodiment of the present application, the subfile obtaining module is further configured to:
and respectively comparing the identification information of each rbd with the identification information of the rbd contained in each subfile stored in each object storage device OSD in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd.
In an optional embodiment of the present application, the apparatus further includes a storage path information obtaining module, configured to:
after determining the subfiles whose file names contain the identification information of the rbd as the subfiles corresponding to the rbd, acquiring storage path information of each subfile corresponding to each rbd, and storing the storage path information of each subfile and the identification information of the corresponding rbd into a preset database in a one-to-one correspondence, wherein the storage path information comprises the host name of the OSD where the corresponding subfile is located and the storage directory information of the subfile in the OSD;
correspondingly, the subfile acquiring module is specifically configured to:
acquiring storage path information of each corresponding subfile from a preset database based on the identification information of each rbd;
and acquiring all corresponding subfiles from the corresponding OSD based on the path information of each subfile.
In an optional embodiment of the present application, the file name of each subfile includes a location offset of the subfile in the corresponding rbd, where the location offset indicates the location of the corresponding subfile in the corresponding rbd, and the local file uploading module is specifically configured to:
determining a target subfile for each position of the rbd based on the subfiles corresponding to the rbd;
determining the splicing sequence of each target subfile based on the position offset contained in the file name of each target subfile corresponding to each rbd;
and splicing the target subfiles corresponding to each rbd according to the splicing sequence to obtain a local file corresponding to the rbd.
In an optional embodiment of the present application, the local file upload module is further configured to:
for each position in the rbd, if the position corresponds to a subfile, determining the subfile as a target subfile of the position;
and if the position corresponds to at least two subfiles, selecting one subfile from the at least two subfiles according to a preset strategy to determine the selected subfile as a target subfile of the position.
In an optional embodiment of the present application, the local file upload module is further configured to:
acquiring the message digest algorithm (MD5) value of each subfile;
if the MD5 values of the subfiles are the same, selecting any one of the subfiles and determining it as the target subfile for the position;
if the MD5 values of the subfiles are different, selecting the subfile with the earliest or latest modification time among the subfiles and determining it as the target subfile for the position.
In an optional embodiment of the present application, the apparatus further comprises a capacity expansion module configured to:
after the cloud platform is controlled to be switched from the Ceph storage cluster to the standby Ceph storage cluster, formatting the Ceph storage cluster to obtain a formatted Ceph storage cluster, and adding storage nodes in the formatted Ceph storage cluster to the standby Ceph storage cluster to obtain a combined Ceph storage cluster;
and eliminating the storage nodes belonging to the standby Ceph storage cluster system in the combined Ceph storage cluster.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor;
the memory has a computer program stored therein;
a processor configured to execute a computer program to implement the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
The beneficial effects brought by the technical solutions provided in this application are as follows:
the method comprises the steps of obtaining identification information of corresponding rbd through identification information of each virtual machine, obtaining corresponding undamaged subfiles from a failed Ceph storage cluster based on the identification information of each rbd, obtaining corresponding local files based on the undamaged subfiles corresponding to the rbd and uploading the local files to a standby Ceph storage cluster, and finally switching a cloud platform to the standby Ceph storage cluster to achieve recovery of the cloud platform.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is an interaction schematic diagram of an OpenStack management system and a Ceph storage cluster in a cloud platform according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a cloud platform recovery method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating distributed storage of subfiles corresponding to rbd in a Ceph storage cluster in an embodiment of the present application;
fig. 4 is a schematic diagram illustrating switching of a storage cluster of a cloud platform in an embodiment of the present application;
fig. 5 is a schematic flowchart of determining identification information of a corresponding rbd based on identification information of a virtual machine in an embodiment of the present application;
fig. 6 is a schematic diagram illustrating a correspondence relationship between a sub file and a local file in an example according to an embodiment of the present application;
FIG. 7 is a diagram illustrating local files obtained by splicing target subfiles according to an example of the embodiment of the present application;
FIG. 8 is a diagram illustrating parallel comparison of identification information in an example of an embodiment of the present application;
fig. 9 is a schematic diagram illustrating that a corresponding subfile is acquired based on the identification information of rbd in an example of the embodiment of the present application;
fig. 10 is a schematic diagram illustrating that cloud platform recovery is performed after a Ceph storage cluster is divided into a plurality of small Ceph storage clusters in the embodiment of the present application;
fig. 11 is a block diagram illustrating a structure of a cloud platform recovery apparatus according to an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, where like or similar reference numerals refer to like or similar elements or to elements having like or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present application, and are not to be construed as limiting it.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms referred to in this application will first be introduced and explained:
OpenStack: OpenStack is an open source IaaS (Infrastructure as a Service) management platform.
Ceph: ceph is an open source distributed storage system and can simultaneously provide storage services of objects, blocks and files.
OSD: An OSD (Object Storage Device) is the component in Ceph that manages a specific data disk.
Pg: A Pg (Placement Group) is a logical concept in Ceph representing a collection of data objects.
rbd: An rbd (RADOS Block Device) is the block device service of a Ceph cluster.
Fig. 1 is an interaction schematic diagram of an OpenStack management system 101 and a Ceph storage cluster 102 in a cloud platform according to an embodiment of the present application. Data in the cloud platform is generated by the base images and by the system disks and data disks of the virtual machines that users apply for in the OpenStack management system 101, and is stored in the corresponding data storage pools of the Ceph storage cluster 102. A data storage pool can be understood as being formed from multiple storage nodes; that is, data is stored on the storage nodes corresponding to the pool. For example, data corresponding to base images is stored in the images pool, data corresponding to the system disks of image-booted virtual machines is stored in the vms pool, data corresponding to the data disks of image-booted virtual machines is stored in the volumes pool, and data corresponding to both the system disks and data disks of volume-booted virtual machines is stored in the volumes pool. When the Ceph storage cluster 102 fails, the cloud platform may not operate normally, and it needs to be restored in time to ensure that users can use the cloud platform normally.
Fig. 2 is a schematic flowchart of a cloud platform recovery method provided in an embodiment of the present application, and as shown in fig. 2, the method may include:
step S201, when one or more storage nodes in a Ceph storage cluster of the cloud platform are in failure, acquiring identification information of each virtual machine in the cloud platform.
It should be noted that the method may be executed by a cloud platform recovery component deployed in the OpenStack management system of the cloud platform, or by a cloud platform recovery component independent of both the OpenStack management system and the Ceph storage cluster. The component can interact with the OpenStack management system and the Ceph storage cluster respectively, to obtain from them the relevant information required for recovering the cloud platform and to control them to execute the relevant instructions required for recovering the cloud platform.
Specifically, when one or more storage nodes in the Ceph storage cluster of the cloud platform fail, the cloud platform recovery component accesses the OpenStack management system of the cloud platform to acquire the identification information of the virtual machines that users have applied for, for example, the identification information of base images, of image-booted virtual machines, of volume-booted virtual machines, and the like.
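As an illustration of this step, the following is a minimal Python sketch of how a recovery component might collect the virtual machine identification information from OpenStack. It assumes the standard `openstack` command-line client is installed and authenticated; the helper name is hypothetical.

    import json
    import subprocess

    def list_vm_ids():
        """Return the UUIDs of all virtual machines known to OpenStack.

        Calls the python-openstackclient CLI; the "ID" field name follows
        its JSON output format.
        """
        out = subprocess.check_output(
            ["openstack", "server", "list", "--all-projects", "-f", "json"]
        )
        return [server["ID"] for server in json.loads(out)]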
Step S202, based on the identification information of each virtual machine, determines the identification information of each block device rbd corresponding to the virtual machine.
The data generated by each virtual machine is stored in the corresponding data storage pool in the form of rbds, and when each rbd is stored in its data storage pool it carries identification information indicating the corresponding virtual machine and the data storage pool in which it resides, so the identification information of the corresponding rbds can be determined based on the identification information of each virtual machine and its corresponding data storage pool.
Step S203, based on the identification information of each rbd, acquiring each subfile corresponding to the rbd and stored in the Ceph storage cluster in a distributed manner.
When each rbd is stored in the Ceph storage cluster, it is distributed and stored across the storage nodes of the corresponding data storage pool according to a preset algorithm; specifically, each rbd is split into a plurality of subfiles that are distributed and stored across a plurality of OSDs of the Ceph storage cluster, where the preset algorithm may be the CRUSH distribution algorithm. Meanwhile, the file name of each subfile contains the identification information of the rbd to which it belongs.
Specifically, according to the identification information of each rbd, all subfiles whose file names contain that identification information can be searched out from the Ceph storage cluster; these are the subfiles split from the rbd. It should be noted that, in the process of splitting an rbd and storing it across the OSDs of the Ceph storage cluster, each subfile is typically copied multiple times and the copies are stored on different OSDs, so as to ensure the reliability of data storage.
As shown in fig. 3, for the n data storage pools Pool1, Pool2, …, Pooln in the Ceph storage cluster, the rbds in the n pools are mapped (split) into the corresponding Pg files by a hash algorithm, where each Pg file contains a plurality of corresponding subfiles, and the subfiles are distributed and stored across the OSDs of the Ceph storage cluster by the CRUSH algorithm. As shown in fig. 3, each subfile is copied, and the two copies of the same subfile are stored on different OSDs, so as to ensure the reliability of data storage.
And step S204, splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to the standby Ceph storage cluster.
Specifically, as can be seen from the foregoing description, since each subfile is stored as multiple copies, when the Ceph storage cluster fails it is unlikely that all copies of a subfile are damaged, and an undamaged copy can be obtained from among the copies to serve as the subfile, ensuring its accuracy. After an undamaged subfile has been acquired for each position of the rbd, the local file of the rbd can be obtained based on these undamaged subfiles, and the local file of the rbd is uploaded to the standby Ceph storage cluster for storage.
And step S205, controlling the cloud platform to be switched from the Ceph storage cluster to the standby Ceph storage cluster so as to recover the cloud platform.
Specifically, after the local files corresponding to the rbds of all virtual machines are uploaded to the standby Ceph storage cluster, the data of all virtual machines has been restored; switching the storage cluster of the cloud platform from the Ceph storage cluster to the standby Ceph storage cluster then restores normal operation of the cloud platform. Specifically, as shown in fig. 4, which is a schematic diagram of switching the storage cluster of a cloud platform, switching the storage cluster of the cloud platform from the Ceph storage cluster to the standby Ceph storage cluster requires the following three updates: (1) updating the Ceph configuration files on the storage nodes of all OpenStack management systems and the keyrings of the users, where a keyring is used to authenticate the identity of a user and differs from user to user; (2) updating the database of the Glance component, which manages images in the OpenStack management system, where the storage path information of each piece of image data in the Glance database includes the cluster identification (ID) of the Ceph storage cluster, and different Ceph storage clusters have different cluster IDs; (3) updating the database of the nova component, which manages virtual machines in the OpenStack management system, where the nova database stores all disk information and each disk entry includes the IP address of the mon of the corresponding Ceph storage cluster, and different Ceph clusters have different mon IP addresses. Here mon, also called the control node, manages the Ceph storage cluster as a whole, covering, for example, authority authentication, the OSD topology, the Pg topology, and the health status of the OSDs.
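The three updates above can be outlined as follows. This is a hedged sketch only: the table and column names (image_locations.value, block_device_mapping.connection_info) reflect common OpenStack schemas but must be verified against the deployed release, and the fsid and mon address values are placeholders.

    import shutil

    OLD_FSID, NEW_FSID = "failed-cluster-fsid", "standby-cluster-fsid"    # placeholders
    OLD_MON_IP, NEW_MON_IP = "10.0.0.1", "10.0.0.2"                       # placeholders

    # (1) distribute the standby cluster's configuration file and user keyrings
    def push_ceph_config(standby_conf, node_conf_path):
        shutil.copy(standby_conf, node_conf_path)

    # (2) repoint Glance image locations at the standby cluster's ID
    GLANCE_SQL = (
        "UPDATE image_locations "
        "SET value = REPLACE(value, :old_fsid, :new_fsid);"
    )

    # (3) repoint nova disk connection info at the standby cluster's mon address
    NOVA_SQL = (
        "UPDATE block_device_mapping "
        "SET connection_info = REPLACE(connection_info, :old_mon_ip, :new_mon_ip);"
    )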
According to the scheme provided by the embodiments of the present disclosure, the identification information of the corresponding rbds is acquired through the identification information of each virtual machine, the corresponding undamaged subfiles are acquired from the failed Ceph storage cluster based on the identification information of each rbd, the corresponding local files are obtained from the undamaged subfiles of each rbd and uploaded to the standby Ceph storage cluster, and finally the cloud platform is restored by switching it to the standby Ceph storage cluster.
In an optional embodiment of the present application, determining, based on the identification information of each virtual machine, the identification information of each block device rbd corresponding to the virtual machine includes:
acquiring the name of each rbd corresponding to each virtual machine based on the identification information of each virtual machine and the corresponding relation between the virtual machine and each data storage pool in the Ceph storage cluster;
identification information of each rbd is obtained from the name of the rbd.
For example, the images pool stores the data generated by base images, and the name of the corresponding rbd is the identification information of the base image. The vms pool stores the data generated by the system disks of image-booted virtual machines, and the name of the corresponding rbd is the identification information of the image-booted virtual machine combined with "_disk". The volumes pool stores the data generated by the data disks of image-booted virtual machines and by the data disks and system disks of volume-booted virtual machines, and the name of the corresponding rbd is "volume-" followed by the identification information of the image-booted virtual machine or of the volume-booted virtual machine.
After the names of the rbds are determined, the identification information of each rbd as stored in the Ceph storage cluster can be obtained by parsing its name. Specifically, the name of an rbd can be parsed with the "rbd info" command to obtain the corresponding identification information, which takes a form such as "rbd_data.e7626a6b8b4567". As shown in fig. 5, the process of determining the identification information of the rbds corresponding to each virtual machine can be divided into two steps: step 501 obtains the name of each rbd corresponding to the virtual machine according to the preset naming rule of each data storage pool, and step 502 parses the corresponding identification information from the name of each rbd through the "rbd info" command.
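The following Python sketch illustrates the two steps of fig. 5 under the naming rules quoted above. The "rbd info" output format (the block_name_prefix line) is standard Ceph; the helper names are illustrative.

    import subprocess

    def rbd_name(pool, entity_id):
        """Derive an rbd name from a VM/volume/image id using the per-pool rules."""
        if pool == "images":
            return entity_id                 # base image: the image id itself
        if pool == "vms":
            return entity_id + "_disk"       # system disk of an image-booted VM
        if pool == "volumes":
            return "volume-" + entity_id     # data disk / volume-booted VM
        raise ValueError("unknown pool: " + pool)

    def rbd_identification(pool, name):
        """Parse 'rbd info' output for the identification information,
        e.g. "rbd_data.e7626a6b8b4567"."""
        out = subprocess.check_output(["rbd", "info", pool + "/" + name], text=True)
        for line in out.splitlines():
            if "block_name_prefix" in line:
                return line.split(":", 1)[1].strip()
        raise RuntimeError("no block_name_prefix for " + pool + "/" + name)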
In an optional embodiment of the present application, the file name of each subfile includes identification information of a corresponding rbd, and based on the identification information of each rbd, the method for acquiring each subfile, which is distributed and stored in the Ceph storage cluster and corresponds to the rbd, includes:
comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd;
all subfiles corresponding to each rbd are retrieved.
As can be seen from the foregoing description, the file name of each subfile of an rbd includes the identification information of that rbd. To recover the rbd, all subfiles belonging to it must be acquired from the failed Ceph storage cluster; that is, it must be determined which subfiles in the failed Ceph storage cluster belong to the rbd.
Specifically, the identification information of each rbd is compared with the file names of the subfiles stored in the Ceph storage cluster; every subfile whose file name contains the identification information of the rbd belongs to that rbd. The subfiles belonging to the rbd are determined and recorded, and they are acquired before the local file of the rbd is generated.
In an optional embodiment of the present application, comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd includes:
and respectively comparing the identification information of each rbd with the identification information of the rbd contained in each subfile stored in each object storage device OSD in the Ceph storage cluster, and determining the subfile containing the identification information of the rbd in the file name as the subfile corresponding to the rbd.
A Ceph storage cluster contains a plurality of OSDs, and the OSDs are the main data-bearing components of the cluster: the distributed storage of virtual machine data on the Ceph storage cluster is realized through the OSDs. In the process of comparing the identification information of each rbd with the file names of the stored subfiles, the subfiles stored in every OSD of the cluster must be compared, and when the number of OSDs is large, comparing the OSDs one after another is inefficient. To improve the efficiency of determining the subfiles corresponding to each rbd, the embodiments of the present application compare the OSDs in parallel, that is, the identification information of each rbd is compared with the file names of the subfiles stored in all OSDs at the same time.
Specifically, a subfile collecting component (Collect Manager) can control a plurality of OSD running components (OSD Runners) to compare the OSDs in parallel; that is, each OSD is assigned an OSD Runner that compares the identification information against the file names of that OSD's subfiles.
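A minimal sketch of this parallel comparison, with one worker process per OSD data directory in the spirit of the Collect Manager / OSD Runner design; the directory layout follows the default "/var/lib/ceph/osd" location mentioned later, and everything else is illustrative.

    import os
    from multiprocessing import Pool

    OSD_ROOT = "/var/lib/ceph/osd"    # default OSD data location

    def scan_osd(args):
        """One 'OSD Runner': walk one OSD's directory tree and return
        (rbd identification, subfile path) pairs for every match."""
        osd_dir, prefixes = args
        hits = []
        for dirpath, _, filenames in os.walk(osd_dir):
            for fn in filenames:
                for prefix in prefixes:
                    if prefix in fn:
                        hits.append((prefix, os.path.join(dirpath, fn)))
        return hits

    def collect_subfiles(prefixes):
        """The 'Collect Manager': scan all OSDs in parallel."""
        osd_dirs = [os.path.join(OSD_ROOT, d) for d in os.listdir(OSD_ROOT)]
        with Pool(len(osd_dirs)) as workers:
            results = workers.map(scan_osd, [(d, prefixes) for d in osd_dirs])
        return [hit for per_osd in results for hit in per_osd]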
In an optional embodiment of the present application, after determining the subfile with the file name containing the identification information of the rbd as the subfile corresponding to the rbd, the method may further include:
the method comprises the steps of obtaining storage path information of each subfile corresponding to each rbd, and storing the storage path information of each subfile and the identification information of the corresponding rbd into a preset database in a one-to-one correspondence, wherein the storage path information comprises the host name of the OSD where the corresponding subfile is located and the storage directory information of the subfile in the OSD;
acquiring all subfiles corresponding to each rbd, including:
acquiring storage path information of each corresponding subfile from a preset database based on the identification information of each rbd;
and acquiring all corresponding subfiles from the corresponding OSD based on the path information of each subfile.
Specifically, after the subfiles corresponding to each rbd are determined, the storage path information of each subfile is obtained, and the storage path information of each subfile and the identification information of the corresponding rbd are stored into a preset database in a one-to-one correspondence. The storage path information comprises the host name of the OSD where the corresponding subfile is located and the storage directory information of the subfile in the OSD. In the subsequent step of generating the local file from the subfiles of each rbd, the storage path information of all corresponding subfiles can be obtained from the preset database using the identification information of the rbd, and each subfile is then fetched according to its storage path information; that is, according to the host name and the storage directory information, the corresponding subfile is read from the indicated storage directory on the OSD host.
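The "preset database" can be as simple as a three-column table. The following SQLite sketch stores, for each subfile, the identification information of its rbd, the host name of its OSD, and its storage path, and retrieves all paths for an rbd on demand; the schema is an assumption for illustration.

    import sqlite3

    def init_db(path="subfile_paths.db"):
        db = sqlite3.connect(path)
        db.execute(
            "CREATE TABLE IF NOT EXISTS subfile_paths ("
            "  rbd_prefix TEXT,"   # identification information of the rbd
            "  osd_host   TEXT,"   # host name of the OSD holding the copy
            "  file_path  TEXT)"   # storage directory + file name in the OSD
        )
        return db

    def record(db, rbd_prefix, osd_host, file_path):
        db.execute("INSERT INTO subfile_paths VALUES (?, ?, ?)",
                   (rbd_prefix, osd_host, file_path))
        db.commit()

    def paths_for(db, rbd_prefix):
        """All (osd_host, file_path) pairs recorded for one rbd."""
        return db.execute(
            "SELECT osd_host, file_path FROM subfile_paths WHERE rbd_prefix = ?",
            (rbd_prefix,),
        ).fetchall()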
In an optional embodiment of the present application, the file name of each subfile includes a location offset of the subfile in the corresponding rbd, where the location offset indicates the location of the corresponding subfile in the corresponding rbd, and splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd includes:
determining a target subfile for each position of the rbd based on the subfiles corresponding to the rbd;
determining the splicing sequence of each target subfile based on the position offset contained in the file name of each target subfile corresponding to each rbd;
and splicing the target subfiles corresponding to each rbd according to the splicing sequence to obtain a local file corresponding to the rbd.
As can be seen from the foregoing description, each rbd is distributed and stored across the storage nodes of its data storage pool according to a preset algorithm; that is, each rbd is split into multiple subfiles that are distributed across multiple OSDs of the Ceph storage cluster, and these subfiles occupy different positions in the rbd. The file name of each subfile includes the identification information of the rbd it belongs to and a position offset: the identification information indicates the rbd, and the offset indicates the position of the subfile. For example, as shown in fig. 6, the identification information of a certain rbd is "rbd_data.e7626a6b8b4567", and the rbd is split into n (n ≥ 2) subfiles of 4M each during storage. When a subfile with the file name "rbd_data.e7626a6b8b4567.0000000000000002" is acquired, the prefix of the file name (the identification information "rbd_data.e7626a6b8b4567") is the same as that of the rbd, so the subfile is determined to belong to this rbd; and from the suffix of the file name (the position offset "0000000000000002") it is determined that the subfile occupies the 2nd of the n positions in the rbd.
It should be noted that, as can be seen from the foregoing description, when an rbd is split and stored across multiple OSDs of the Ceph storage cluster, each subfile is generally copied multiple times and the copies are stored on different OSDs. After the subfiles corresponding to each rbd are obtained, the target subfile at each position in the rbd must therefore be determined, that is, an undamaged target subfile must be chosen from the copies corresponding to each position.
Specifically, after all subfiles corresponding to each rbd are obtained, a target subfile at each position is determined first, and then the subfiles are spliced according to the arrangement sequence of the subfiles indicated by the position offset, so that a local file corresponding to the rbd is obtained.
For example, fig. 7 is a schematic diagram of the splicing process for the local file of an rbd composed of 61 target subfiles: the target subfiles are fetched from their respective OSDs and spliced in the splicing order to obtain the local file of the rbd.
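A splice sketch under the 4M-object assumption of the example above: the hexadecimal suffix of each target subfile's name gives its position, so the byte offset in the local file is position * 4 MiB. Because each part is written at its own offset, the write order does not affect the result; the helper name is illustrative.

    import os

    OBJECT_SIZE = 4 * 1024 * 1024    # 4 MiB per subfile, as in the example above

    def splice(target_subfiles, local_file):
        """Assemble the local file of an rbd from its target subfiles."""
        with open(local_file, "wb") as out:
            for path in target_subfiles:
                # the suffix after the last '.' is the position offset, in hex
                position = int(os.path.basename(path).rsplit(".", 1)[1], 16)
                with open(path, "rb") as part:
                    out.seek(position * OBJECT_SIZE)
                    out.write(part.read())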
Next, referring to fig. 8 and fig. 9, the recovery process of a cloud platform is illustrated. The failed Ceph storage cluster in a certain cloud platform includes OSD1, OSD2 and OSD3; when the Collect Manager searches for the subfiles corresponding to an rbd, a corresponding OSD Runner is configured for each OSD, that is, OSD Runner1 for OSD1, OSD Runner2 for OSD2, and OSD Runner3 for OSD3.
Fig. 8 illustrates the parallel comparison process. Specifically, each OSD Runner compares the subfiles of its OSD in parallel, and the storage path information of the matched subfiles, together with the identification information of the corresponding rbd, is stored in a preset database (DataBase, DB). The subfiles of an OSD are typically stored under the designated directory "/var/lib/ceph/osd". Fig. 9 illustrates the process of acquiring the subfiles corresponding to an rbd. Specifically, the storage path information of each corresponding subfile is obtained from the DB according to the identification information of the rbd, and the corresponding subfiles are fetched from the designated directories of the corresponding OSDs according to the storage path information. Then, for the multiple subfiles (copies) at the same position in the rbd, an undamaged subfile is selected as the target subfile through a preset strategy; all target subfiles corresponding to the rbd are uploaded to an aggregation storage node, where they are spliced in the splicing order to obtain the local file corresponding to the rbd; and the local file is uploaded to the standby Ceph storage cluster, thereby recovering the data of each virtual machine.
In an optional embodiment of the present application, determining a target subfile at each location of each rbd based on the subfiles corresponding to the rbd includes:
for each position in the rbd, if the position corresponds to a subfile, determining the subfile as a target subfile of the position;
and if the position corresponds to at least two subfiles, selecting one subfile from the at least two subfiles according to a preset strategy to determine the selected subfile as a target subfile of the position.
Specifically, if the subfile corresponding to a position in the rbd was not copied during storage, the acquired subfile for that position is directly determined as the target subfile for the position. If the subfile corresponding to a position was copied during storage, that is, the position corresponds to multiple subfiles, the target subfile for the position must be determined from the multiple subfiles according to a preset strategy.
For example, for an rbd whose identification information is "rbd_data.e7626a6b8b4567", the 3 subfiles (copies) corresponding to its 1st position are acquired, and the storage paths of the 3 subfiles are respectively:
“/var/lib/ceph/osd/ceph-1/current/9.f7_head/rbd_data.e7626a6b8b4567.0000000000001”;
“/var/lib/ceph/osd/ceph-4/current/9.f7_head/rbd_data.e7626a6b8b4567.0000000000001”;
“/var/lib/ceph/osd/ceph-7/current/9.f7_head/rbd_data.e7626a6b8b4567.0000000000001”。
Before the subfiles are spliced, the target subfile for the first position must be determined from these 3 subfiles.
In an optional embodiment of the present application, selecting one subfile from the at least two subfiles according to a preset policy to determine the target subfile at the location includes:
acquiring the message digest algorithm (MD5) value of each subfile;
if the MD5 values of the subfiles are the same, selecting any one of the subfiles and determining it as the target subfile for the position;
if the MD5 values of the subfiles are different, selecting the subfile with the earliest or latest modification time among the subfiles and determining it as the target subfile for the position.
Specifically, if the same position of an rbd corresponds to multiple subfiles, the MD5 values of those subfiles are obtained first. If the MD5 values are all the same, the subfiles can be considered identical and undamaged by the Ceph storage cluster failure, and any one of them can be selected as the target subfile for the position. If the MD5 values are not all the same, some subfiles may be damaged, and an undamaged target subfile must be determined from them: if a deletion operation was in progress on the subfiles when the Ceph storage cluster failed, the subfile with the earliest modification time is determined as the target subfile; if a new addition operation was in progress, the subfile with the latest modification time is determined as the target subfile.
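The copy-selection policy just described can be sketched as follows; whether a deletion or an addition was in flight when the cluster failed is supplied by the caller, since the text does not specify how that is detected.

    import hashlib
    import os

    def _md5(path):
        with open(path, "rb") as f:
            return hashlib.md5(f.read()).hexdigest()

    def pick_target(copies, deletion_in_flight=False):
        """Select the target subfile among the copies at one rbd position."""
        if len({_md5(p) for p in copies}) == 1:
            return copies[0]                  # all copies identical: any will do
        # differing MD5 values: choose by modification time
        key = os.path.getmtime
        return min(copies, key=key) if deletion_in_flight else max(copies, key=key)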
In an optional embodiment of the present application, after the controlling cloud platform is switched from the Ceph storage cluster to the standby Ceph storage cluster, the method may further include:
formatting the Ceph storage cluster to obtain a formatted Ceph storage cluster, and adding storage nodes in the formatted Ceph storage cluster into a standby Ceph storage cluster to obtain a combined Ceph storage cluster;
and eliminating the storage nodes belonging to the standby Ceph storage cluster system in the combined Ceph storage cluster.
Specifically, after the cloud platform is switched from the Ceph storage cluster to the standby Ceph storage cluster, the data of the virtual machines has been fully recovered and the cloud platform has been restored; the cloud platform can then be switched back from the standby Ceph storage cluster to the original Ceph storage cluster. Specifically, the failed Ceph storage cluster is first formatted and emptied to obtain a formatted Ceph storage cluster, and the storage nodes of the formatted cluster are added to the standby Ceph storage cluster to obtain a combined Ceph storage cluster; this process can be understood as expanding the capacity of the cloud platform's Ceph storage cluster. Then the storage nodes belonging to the standby cluster are removed from the combined Ceph storage cluster; during removal, the data stored on the standby cluster's storage nodes is automatically migrated to the storage nodes of the formatted Ceph storage cluster, and this process can be understood as reducing the capacity of the cloud platform's Ceph storage cluster. Through this expansion and reduction, the cloud platform is switched from the standby Ceph storage cluster back to the original Ceph storage cluster, and the repair of the failed cluster is completed at the same time.
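The shrink half of this expand-then-shrink switch-back can be outlined with standard Ceph OSD-removal commands (verify against the deployed Ceph release; adding the formatted nodes back is deployment-tool specific and omitted here). A hedged sketch:

    import subprocess

    def run(*cmd):
        subprocess.check_call(cmd)

    def drain_standby_osds(standby_osd_ids):
        """Remove the standby cluster's OSDs from the merged cluster.
        Marking each OSD out triggers Ceph's automatic data migration onto
        the remaining (formatted original) nodes before the OSD is purged."""
        for osd_id in standby_osd_ids:
            run("ceph", "osd", "out", str(osd_id))
            # wait for rebalancing to finish (e.g. poll `ceph health`) ...
            run("ceph", "osd", "purge", str(osd_id), "--yes-i-really-mean-it")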
In an optional embodiment of the present application, to further improve recovery efficiency when the Ceph storage cluster of the cloud platform fails, the cluster may be divided into a plurality of small Ceph storage clusters, so that when one small cluster fails, only that small cluster needs to be processed with the scheme described in the above embodiments to restore normal operation of the cloud platform. Meanwhile, the standby Ceph storage cluster can be built in an integrated cabinet that contains the standby Ceph storage cluster and the aggregation storage node, at lower cost.
As shown in fig. 10, the Ceph storage cluster of a cloud platform is divided into two small clusters, Ceph storage cluster 1 and Ceph storage cluster 2. When Ceph storage cluster 1 fails, it is processed with the method described in the above embodiments to recover the cloud platform: the target subfiles corresponding to each rbd are obtained from Ceph storage cluster 1, the local file corresponding to each rbd is obtained by splicing its target subfiles, the local files are uploaded to a standby Ceph storage cluster built from an integrated cabinet for storage, and finally the cloud platform is switched from Ceph storage cluster 1 to the standby Ceph storage cluster, recovering the cloud platform.
Fig. 11 is a block diagram illustrating a structure of a cloud platform recovery apparatus according to an embodiment of the present application. As shown in fig. 11, the apparatus 1100 may include: an identification information acquisition module 1101 of the virtual machine, an identification information acquisition module 1102 of the rbd, a subfile acquisition module 1103, a local file upload module 1104 and a storage cluster switching module 1105, wherein:
the identification information acquisition module 1101 of the virtual machine is configured to acquire identification information of each virtual machine in an OpenStack management system of a cloud platform when one or more storage nodes in a Ceph storage cluster of the cloud platform fail;
the identification information obtaining module 1102 of the rbd is configured to determine identification information of each block device rbd corresponding to each virtual machine based on the identification information of the virtual machine;
the subfile obtaining module 1103 is configured to obtain, based on the identification information of each rbd, subfiles corresponding to the rbd and stored in the Ceph storage cluster in a distributed manner;
the local file uploading module 1104 is configured to splice the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and to upload the local file to the standby Ceph storage cluster;
the storage cluster switching module 1105 is configured to control the cloud platform to switch from the Ceph storage cluster to the standby Ceph storage cluster, so as to restore the cloud platform.
In an optional embodiment of the present application, the identification information obtaining module of rbd is specifically configured to:
acquiring the name of each rbd corresponding to each virtual machine based on the identification information of each virtual machine and the corresponding relation between the virtual machine and each data storage pool in the Ceph storage cluster;
and obtain the identification information of each rbd from the name of the rbd.
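By way of a hedged example, under common OpenStack defaults a Nova system disk is an rbd named `<instance-uuid>_disk` and a Cinder volume is named `volume-<volume-uuid>`; the identification information can then be read from the image's `block_name_prefix` with the standard `rbd info` command, provided the cluster's monitors still respond. The pool names used below are assumptions.

```python
# A sketch assuming default OpenStack rbd naming and illustrative pool
# names ("vms" and "volumes"); adjust both for the actual deployment.
import json
import subprocess

def rbd_ident(pool: str, image: str) -> str:
    """Return the rbd's identification information, i.e. the id embedded
    in block_name_prefix ("rbd_data.<id>"), via `rbd info`."""
    out = subprocess.check_output(
        ["rbd", "info", f"{pool}/{image}", "--format", "json"])
    return json.loads(out)["block_name_prefix"].split(".")[-1]

def rbd_idents_for_vm(instance_uuid: str, volume_uuids: list[str]) -> list[str]:
    idents = [rbd_ident("vms", f"{instance_uuid}_disk")]        # system disk
    idents += [rbd_ident("volumes", f"volume-{u}") for u in volume_uuids]
    return idents
```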
According to the above scheme, the identification information of each corresponding rbd is obtained from the identification information of each virtual machine; the corresponding undamaged subfiles are obtained from the failed Ceph storage cluster based on the identification information of each rbd; the corresponding local files are reconstructed from those undamaged subfiles and uploaded to the standby Ceph storage cluster; and finally the cloud platform is switched to the standby Ceph storage cluster, thereby recovering the cloud platform.
In an optional embodiment of the present application, a file name of each subfile includes identification information of a corresponding rbd, and the subfile obtaining module is specifically configured to:
comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining a subfile whose file name contains the identification information of the rbd as a subfile corresponding to that rbd;
all subfiles corresponding to each rbd are retrieved.
In an optional embodiment of the present application, the subfile obtaining module is further configured to:
comparing the identification information of each rbd with the identification information contained in the file name of each subfile stored on each object storage device (OSD) in the Ceph storage cluster, and determining a subfile whose file name contains the identification information of the rbd as a subfile corresponding to that rbd.
In an optional embodiment of the present application, the apparatus further includes a storage path information obtaining module, configured to:
after determining subfiles whose file names contain the identification information of the rbd as the subfiles corresponding to the rbd, acquiring storage path information of each subfile corresponding to each rbd, and storing the storage path information of each subfile and the identification information of the corresponding rbd into a preset database in a one-to-one correspondence manner, wherein the storage path information comprises the host name of the OSD where the corresponding subfile is located and the storage directory information of the subfile on that OSD;
correspondingly, the subfile acquiring module is specifically configured to:
acquiring storage path information of each corresponding subfile from a preset database based on the identification information of each rbd;
and acquiring all corresponding subfiles from the corresponding OSD based on the path information of each subfile.
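The "preset database" can be as simple as a small relational table mapping each rbd's identification information to the host and directory of each of its subfiles. The SQLite schema below is only an illustrative assumption; the embodiment does not prescribe a concrete database.

```python
# An illustrative sketch of the preset database; the schema and field
# names are assumptions.
import sqlite3

conn = sqlite3.connect("subfile_paths.db")
conn.execute("""CREATE TABLE IF NOT EXISTS subfile_path (
    rbd_ident TEXT NOT NULL,   -- identification information of the rbd
    osd_host  TEXT NOT NULL,   -- host name of the OSD holding the subfile
    directory TEXT NOT NULL    -- storage directory of the subfile on that OSD
)""")

def record_path(rbd_ident: str, osd_host: str, directory: str) -> None:
    conn.execute("INSERT INTO subfile_path VALUES (?, ?, ?)",
                 (rbd_ident, osd_host, directory))
    conn.commit()

def paths_for(rbd_ident: str) -> list[tuple[str, str]]:
    rows = conn.execute(
        "SELECT osd_host, directory FROM subfile_path WHERE rbd_ident = ?",
        (rbd_ident,))
    return rows.fetchall()
```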
In an optional embodiment of the present application, the file name of each subfile includes a position offset of the subfile in the corresponding rbd, where the position offset indicates the location of the subfile within the rbd, and the local file uploading module is specifically configured to:
determining a target subfile for each position of each rbd based on the subfiles corresponding to the rbd;
determining the splicing sequence of each target subfile based on the position offset contained in the file name of each target subfile corresponding to each rbd;
and splicing the target subfiles corresponding to each rbd according to the splicing sequence to obtain a local file corresponding to the rbd.
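A hedged sketch of the splicing step follows. It assumes each target subfile's name ends with its hexadecimal position offset (as in `rbd_data.<id>.<offset>`) and that the image uses the default 4 MiB object size; both are assumptions to verify against the actual cluster. The assembled local file can then be pushed to the standby cluster with `rbd import`.

```python
# A sketch under the naming and object-size assumptions stated above.
import subprocess

OBJECT_SIZE = 4 * 1024 * 1024  # default rbd object size; an assumption

def splice(target_subfiles: dict[str, str], local_path: str) -> None:
    """target_subfiles maps a subfile's hex offset to a local copy of it."""
    with open(local_path, "wb") as out:
        for hex_off in sorted(target_subfiles, key=lambda h: int(h, 16)):
            with open(target_subfiles[hex_off], "rb") as part:
                out.seek(int(hex_off, 16) * OBJECT_SIZE)  # gaps stay zero
                out.write(part.read())

def upload(local_path: str, pool: str, image: str) -> None:
    """Upload the spliced local file to the standby cluster as an rbd."""
    subprocess.run(["rbd", "import", local_path, f"{pool}/{image}"],
                   check=True)
```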
In an optional embodiment of the present application, the local file upload module is further configured to:
for each position in the rbd, if the position corresponds to a subfile, determining the subfile as a target subfile of the position;
and if the position corresponds to at least two subfiles, selecting one subfile from the at least two subfiles according to a preset strategy to determine the selected subfile as a target subfile of the position.
In an optional embodiment of the present application, the local file upload module is further configured to:
acquiring the message digest algorithm (MD5) value of each subfile;
if the MD5 values of the subfiles are all the same, selecting any one of the subfiles and determining it as the target subfile of the position;
and if the MD5 values of the subfiles are not all the same, selecting the subfile with the earliest or latest modification time, as appropriate, and determining it as the target subfile of the position.
In an optional embodiment of the present application, the apparatus further comprises a capacity expansion module configured to:
after the cloud platform is controlled to be switched from the Ceph storage cluster to the standby Ceph storage cluster, formatting the Ceph storage cluster to obtain a formatted Ceph storage cluster, and adding storage nodes in the formatted Ceph storage cluster to the standby Ceph storage cluster to obtain a combined Ceph storage cluster;
and removing the storage nodes belonging to the standby Ceph storage cluster from the combined Ceph storage cluster.
In the embodiments of the present application, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model; it allows resources to be pooled and used on demand, flexibly and conveniently. Cloud computing technology will become an important support: the background services of technical network systems, such as video websites, picture websites and other web portals, require large amounts of computing and storage resources. With the further development and application of the internet industry, each article may come to have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels will be processed separately, and all kinds of industry data need strong system background support, which can only be realized through cloud computing.
A database can, in short, be regarded as an electronic filing cabinet, that is, a place for storing electronic files, in which a user can add, query, update and delete data. A database is a collection of data that is stored together in a way that can be shared by multiple users, has as little redundancy as possible, and is independent of applications.
A database management system (DBMS) is computer software designed for managing databases, generally providing basic functions such as storage, retrieval, security assurance and backup. Database management systems can be classified according to the database model they support, such as relational or XML (Extensible Markup Language); according to the type of computer supported, such as server cluster or mobile phone; according to the query language used, such as SQL (Structured Query Language) or XQuery; according to performance emphasis, such as maximum size or maximum operating speed; or in other ways.
Based on the same principle, an embodiment of the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the method provided in any optional embodiment of the present application is implemented, specifically as follows:
when one or more storage nodes in a Ceph storage cluster of a cloud platform are in failure, acquiring identification information of each virtual machine in the cloud platform; determining identification information of each block device rbd corresponding to each virtual machine based on the identification information of each virtual machine; acquiring each subfile which is distributed and stored in a Ceph storage cluster and corresponds to each rbd based on the identification information of each rbd; splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to a standby Ceph storage cluster; and controlling the cloud platform to be switched from the Ceph storage cluster to the standby Ceph storage cluster so as to restore the cloud platform.
The embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method shown in any embodiment of the present application.
It is understood that the medium may store a computer program corresponding to the cloud platform recovery method.
Fig. 12 is a schematic structural diagram of an electronic device to which the embodiments of the present application are applied. As shown in fig. 12, the electronic device 1200 includes: a processor 1201 and a memory 1203, where the processor 1201 is coupled to the memory 1203, for example through a bus 1202. Further, the electronic device 1200 may also include a transceiver 1204, through which the electronic device 1200 may interact with other electronic devices. It should be noted that in practical applications the number of transceivers 1204 is not limited to one, and the structure of the electronic device 1200 does not constitute a limitation on the embodiments of the present application.
The processor 1201 applied in this embodiment of the application may be configured to implement the function of the cloud platform recovery apparatus shown in fig. 11.
The processor 1201 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules and circuits described in connection with this disclosure. The processor 1201 may also be a combination of devices implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 1202 may include a path that conveys information between the aforementioned components. The bus 1202 may be a PCI bus or an EISA bus, etc. The bus 1202 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 12, but this is not intended to represent only one bus or type of bus.
The memory 1203 may be, but is not limited to, a ROM or other type of static storage device capable of storing static information and instructions, a RAM or other type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 1203 is used for storing application program codes for executing the scheme of the application, and the execution is controlled by the processor 1201. The processor 1201 is configured to execute the application program code stored in the memory 1203 to implement the actions of the cloud platform recovery apparatus provided in the embodiment shown in fig. 11.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the following:
when one or more storage nodes in a Ceph storage cluster of a cloud platform are in failure, acquiring identification information of each virtual machine in the cloud platform; determining identification information of each block device rbd corresponding to each virtual machine based on the identification information of each virtual machine; acquiring each subfile which is distributed and stored in a Ceph storage cluster and corresponds to each rbd based on the identification information of each rbd; splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to a standby Ceph storage cluster; and controlling the cloud platform to be switched from the Ceph storage cluster to the standby Ceph storage cluster so as to restore the cloud platform.
It should be understood that although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.

Claims (12)

1. A cloud platform recovery method is characterized by comprising the following steps:
when one or more storage nodes in a Ceph storage cluster of the cloud platform are in failure, acquiring identification information of each virtual machine in the cloud platform;
determining identification information of each block device rbd corresponding to each virtual machine based on the identification information of each virtual machine;
acquiring each subfile which is distributed and stored in the Ceph storage cluster and corresponds to each rbd based on the identification information of each rbd;
splicing the subfiles corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to a standby Ceph storage cluster;
controlling the cloud platform to switch from the Ceph storage cluster to the standby Ceph storage cluster to restore the cloud platform.
2. The method according to claim 1, wherein the determining, based on the identification information of each virtual machine, the identification information of the respective block device rbd corresponding to the virtual machine comprises:
acquiring the name of each rbd corresponding to each virtual machine based on the identification information of each virtual machine and the corresponding relation between the virtual machine and each data storage pool in the Ceph storage cluster;
and obtaining the identification information of each rbd from the name of the rbd.
3. The method according to claim 1, wherein a file name of each subfile includes identification information of a corresponding rbd, and the obtaining of the subfiles corresponding to the rbd and stored in the Ceph storage cluster in a distributed manner based on the identification information of each rbd comprises:
comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining a subfile whose file name contains the identification information of the rbd as a subfile corresponding to that rbd;
all subfiles corresponding to each rbd are retrieved.
4. The method according to claim 3, wherein the comparing the identification information of each rbd with the file name of each subfile stored in the Ceph storage cluster, and determining the subfile with the file name containing the identification information of the rbd as the subfile corresponding to the rbd comprises:
comparing the identification information of each rbd with the identification information contained in the file name of each subfile stored on each object storage device (OSD) in the Ceph storage cluster, and determining a subfile whose file name contains the identification information of the rbd as a subfile corresponding to that rbd.
5. The method according to claim 3, wherein after determining the subfile with the file name containing the identification information of the rbd as the subfile corresponding to the rbd, the method further comprises:
the method comprises the steps of obtaining storage path information of each subfile corresponding to each rbd, and storing the storage path information of each subfile and identification information of the corresponding rbd into a preset database in a one-to-one correspondence mode, wherein the storage path information comprises a host name of an OSD (on screen display) where the corresponding subfile is located and storage directory information of the subfile on the OSD;
the obtaining all subfiles corresponding to each rbd comprises:
acquiring storage path information of each corresponding subfile from the preset database based on the identification information of each rbd;
and acquiring all corresponding subfiles from the corresponding OSD based on the path information of each subfile.
6. The method according to claim 1, wherein the file name of each subfile includes a position offset of the subfile in the corresponding rbd, the position offset indicating the location of the subfile within the rbd, and the splicing of the subfiles corresponding to each rbd to obtain the local file corresponding to the rbd comprises:
determining a target subfile for each position of each rbd based on the subfiles corresponding to the rbd;
determining the splicing sequence of each target subfile based on the position offset contained in the file name of each target subfile corresponding to each rbd;
and splicing the target subfiles corresponding to each rbd according to the splicing sequence to obtain a local file corresponding to the rbd.
7. The method of claim 6, wherein the determining a target subfile for each position of each rbd based on the subfiles corresponding to the rbd comprises:
for each position in the rbd, if the position corresponds to a subfile, determining the subfile as a target subfile of the position;
and if the position corresponds to at least two subfiles, selecting one subfile from the at least two subfiles according to a preset strategy and determining it as the target subfile of the position.
8. The method according to claim 7, wherein the selecting one subfile from the at least two subfiles according to a preset strategy and determining it as the target subfile of the position comprises:
acquiring the message digest algorithm (MD5) value of each subfile;
if the MD5 values of the subfiles are all the same, selecting any one of the subfiles and determining it as the target subfile of the position;
and if the MD5 values of the subfiles are not all the same, selecting the subfile with the earliest or latest modification time, as appropriate, and determining it as the target subfile of the position.
9. The method of claim 1, wherein after controlling the cloud platform to switch from the Ceph storage cluster to the backup Ceph storage cluster, the method further comprises:
formatting the Ceph storage cluster to obtain a formatted Ceph storage cluster, and adding storage nodes in the formatted Ceph storage cluster into the standby Ceph storage cluster to obtain a combined Ceph storage cluster;
and removing the storage nodes belonging to the standby Ceph storage cluster from the combined Ceph storage cluster.
10. A cloud platform recovery apparatus, comprising:
the identification information acquisition module of the virtual machine is used for acquiring identification information of each virtual machine in the cloud platform when one or more storage nodes in a Ceph storage cluster of the cloud platform are in failure;
an rbd identification information acquisition module, configured to determine, based on identification information of each virtual machine, identification information of each block device rbd corresponding to the virtual machine;
the subfile acquisition module is used for acquiring subfiles which are distributed and stored in the Ceph storage cluster and correspond to each rbd based on the identification information of each rbd;
the local file uploading module is used for splicing the sub-files corresponding to each rbd to obtain a local file corresponding to the rbd, and uploading the local file to the standby Ceph storage cluster;
and the storage cluster switching module is used for controlling the cloud platform to be switched from the Ceph storage cluster to the standby Ceph storage cluster so as to recover the cloud platform.
11. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor for executing the computer program to implement the method of any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1 to 9.
CN202010591210.8A 2020-06-24 2020-06-24 Cloud platform recovery method and device, electronic equipment and computer readable storage medium Active CN111770158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010591210.8A CN111770158B (en) 2020-06-24 2020-06-24 Cloud platform recovery method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111770158A true CN111770158A (en) 2020-10-13
CN111770158B CN111770158B (en) 2023-09-19

Family

ID=72722069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010591210.8A Active CN111770158B (en) 2020-06-24 2020-06-24 Cloud platform recovery method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111770158B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230136274A1 (en) * 2021-11-04 2023-05-04 Softiron Limited Ceph Media Failure and Remediation

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20070061088A (en) * 2005-12-08 2007-06-13 한국전자통신연구원 File management method in file system and metadata server for the same
US20080126834A1 (en) * 2006-08-31 2008-05-29 Dell Products, Lp On-demand provisioning of computer resources in physical/virtual cluster environments
US20130007506A1 (en) * 2011-07-01 2013-01-03 Microsoft Corporation Managing recovery virtual machines in clustered environment
CN106095527A (en) * 2016-06-07 2016-11-09 国云科技股份有限公司 A kind of storage pool implementation method being applicable to cloud platform virtual machine
CN106662983A (en) * 2015-12-31 2017-05-10 华为技术有限公司 Method, apparatus and system for data reconstruction in distributed storage system
CN107608826A (en) * 2017-09-19 2018-01-19 郑州云海信息技术有限公司 A kind of fault recovery method, device and the medium of the node of storage cluster
CN109324927A (en) * 2018-09-06 2019-02-12 郑州云海信息技术有限公司 A kind of virtual machine backup method and system based on distributed memory system
CN109729129A (en) * 2017-10-31 2019-05-07 华为技术有限公司 Configuration modification method, storage cluster and the computer system of storage cluster
US20190384678A1 (en) * 2018-06-14 2019-12-19 Nutanix, Inc. System and method for managing backup and restore of objects over cloud platforms
US20200026618A1 (en) * 2018-07-23 2020-01-23 EMC IP Holding Company LLC Efficient restore of synthetic full backup based virtual machines that include user checkpoints
CN111124755A (en) * 2019-12-06 2020-05-08 中国联合网络通信集团有限公司 Cluster node fault recovery method and device, electronic equipment and storage medium
US20200183802A1 (en) * 2018-12-06 2020-06-11 Commvault Systems, Inc. Assigning backup resources based on failover of partnered data storage servers in a data storage management system

Also Published As

Publication number Publication date
CN111770158B (en) 2023-09-19

Similar Documents

Publication Publication Date Title
US11741046B2 (en) Method and apparatus for creating system disk snapshot of virtual machine
US9727273B1 (en) Scalable clusterwide de-duplication
US10725976B2 (en) Fast recovery using self-describing replica files in a distributed storage system
CN107528872B (en) Data recovery method and device and cloud storage system
CN111078667B (en) Data migration method and related device
CN111614733B (en) Deployment method, device and storage medium of distributed multi-fragmentation cluster
CN106484820B (en) Renaming method, access method and device
CN110795399B (en) Method, device and system for generating machine ID for application
CN102938784A (en) Method and system used for data storage and used in distributed storage system
CN105677736A (en) Method and apparatus for increasing and deleting server nodes
US11960363B2 (en) Write optimized, distributed, scalable indexing store
CN110162429A (en) System repair, server and storage medium
CN106445643A (en) Method and device for cloning and updating virtual machine
CN111917834A (en) Data synchronization method and device, storage medium and computer equipment
CN109165112B (en) Fault recovery method, system and related components of metadata cluster
CN111966631A (en) Mirror image file generation method, system, equipment and medium capable of being rapidly distributed
US11880284B2 (en) Storage restore system, storage restore method, and storage medium
US9684668B1 (en) Systems and methods for performing lookups on distributed deduplicated data systems
US9904600B2 (en) Generating initial copy in replication initialization
CN111770158B (en) Cloud platform recovery method and device, electronic equipment and computer readable storage medium
CN112579550B (en) Metadata information synchronization method and system of distributed file system
US20210240350A1 (en) Method, device, and computer program product for recovering based on reverse differential recovery
EP3349416B1 (en) Relationship chain processing method and system, and storage medium
CN111274004B (en) Process instance management method and device and computer storage medium
CN113965582B (en) Mode conversion method and system, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030083

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant