CN111708488B - Distributed memory disk-based Ceph performance optimization method and device

Distributed memory disk-based Ceph performance optimization method and device

Info

Publication number
CN111708488B
Authority
CN
China
Prior art keywords
storage
storage node
virtual
node
osd
Prior art date
Legal status
Active
Application number
CN202010452359.8A
Other languages
Chinese (zh)
Other versions
CN111708488A (en)
Inventor
Ding Zhao (丁钊)
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010452359.8A priority Critical patent/CN111708488B/en
Publication of CN111708488A publication Critical patent/CN111708488A/en
Application granted granted Critical
Publication of CN111708488B publication Critical patent/CN111708488B/en

Classifications

    • G06F3/0607 Improving or facilitating administration, e.g. storage management by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G06F11/1456 Hardware arrangements for backup
    • G06F11/1464 Management of the backup or restore process for networked environments
    • G06F11/1469 Backup restoration techniques
    • G06F11/1471 Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • G06F3/061 Improving I/O performance
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F3/064 Management of blocks
    • G06F3/0664 Virtualisation aspects at device level, e.g. emulation of a storage device or system
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

The invention provides a distributed memory disk-based Ceph performance optimization method and device, wherein the method comprises the following steps: creating a virtual disk on a memory file system on each storage node of the Ceph distributed storage system; integrating the virtual disks on the plurality of storage nodes and creating a high-speed storage pool from the integrated virtual disks; and accelerating the performance of the Ceph distributed storage system based on the created high-speed storage pool. With the scheme of the invention, the distributed memory disk can be used as a high-speed storage pool that provides higher performance than media such as solid state disks and remains compatible with the existing scheme of accelerating with solid state disks; through redundancy rules and processing flows for special conditions, the tendency of memory to lose data is overcome, giving higher reliability.

Description

Distributed memory disk-based Ceph performance optimization method and device
Technical Field
The present invention relates to the field of computers, and more particularly, to a method and apparatus for distributed memory disk-based Ceph performance optimization.
Background
Ceph is a unified, distributed storage system designed for excellent performance, reliability, and scalability. A Ceph cluster provides three usage scenarios: block storage, object storage, and file storage. Existing performance optimizations usually use a solid state disk as the high-speed medium and improve performance through tiered storage or caching. In the distributed storage domain, the memory of a host is typically used only as a cache for the software stack on a single node. Memory is cached in units of physical pages, which differs from the sector unit used by block devices such as hard disks. Single-machine schemes that use a memory disk for acceleration exist in the prior art, but they suffer from small capacity and easy data loss.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for Ceph performance optimization based on a distributed memory disk. With the method of the present invention, the distributed memory disk can be used as a high-speed storage pool that provides higher performance than media such as solid state disks and remains compatible with the existing scheme of accelerating with solid state disks; by applying redundancy rules and processing flows for special conditions, the tendency of memory to lose data is overcome, yielding higher reliability.
In view of the above, an aspect of the embodiments of the present invention provides a method for Ceph performance optimization based on a distributed memory disk, including the following steps:
creating a virtual disk on a memory file system on each storage node of the Ceph distributed storage system;
integrating virtual disks on a plurality of storage nodes, and creating a high-speed storage pool by using the integrated virtual disks;
accelerating the performance of the Ceph distributed storage system based on the created high-speed storage pool.
According to an embodiment of the present invention, further comprising:
in response to receiving a command to restart or shut down the storage node, calling a script program to record the state and configuration information of the high-speed storage pool of the storage node, and recording the configuration of the memory disk on the storage node and the configuration information of the corresponding OSD (Object Storage Device);
in response to detecting that the storage node has restarted, reconstructing a virtual block device in the memory file system of the storage node based on the detected recorded information, and replacing the original virtual block device of the storage node with the newly created virtual block device;
and calculating missing data blocks in the restarting process and synchronizing the data blocks based on the data on other non-restarted storage nodes.
According to an embodiment of the present invention, further comprising:
in response to receiving a command indicating that a storage node has recovered from an unexpected power failure, transmitting the OSD capacity and id information of the storage node held by the management node to the storage node, reconstructing a virtual block device in the memory file system of the storage node, and replacing the original virtual block device of the storage node with the newly created virtual block device;
and calculating missing data blocks in the power-off process and synchronizing based on data on other storage nodes.
According to an embodiment of the present invention, creating a virtual disk on a memory file system on each storage node of a Ceph distributed storage system comprises:
mounting a Tmpfs (the Linux memory file system) of a specified size on each storage node to a specified path;
respectively creating virtual disk files of specified sizes under the path;
mounting each virtual disk file as a local loop device (a pseudo device that presents a regular file as a block device, so that the file can be used like a magnetic or optical disk) on each storage node.
According to an embodiment of the present invention, the integrating virtual disks on a plurality of storage nodes, and the creating a high-speed storage pool using the integrated virtual disks includes:
respectively initializing the local loop devices on all storage nodes as OSDs (Object Storage Devices) of the Ceph distributed storage system, creating a high-speed storage pool using the OSDs, and setting the fault domains of the high-speed storage pool according to the number and distribution of the storage nodes;
dividing each OSD into a plurality of placement groups (PGs), and evenly distributing the original data blocks and redundant data blocks across different PGs on different OSDs through the built-in hash algorithm of the Ceph distributed storage system.
In another aspect of the embodiments of the present invention, an apparatus for Ceph performance optimization based on a distributed memory disk is further provided, where the apparatus includes:
the creating module is configured to create a virtual disk on a memory file system on each storage node of the Ceph distributed storage system;
the integration module is configured to integrate the virtual disks on the plurality of storage nodes and create a high-speed storage pool using the integrated virtual disks;
an application module configured to perform performance acceleration on the Ceph distributed storage system based on the created high-speed storage pool.
According to an embodiment of the invention, the apparatus further comprises a recovery module configured to:
in response to receiving a command to restart or shut down the storage node, calling a script program to record the state and configuration information of the high-speed storage pool of the storage node, and recording the configuration of the memory disk on the storage node and the configuration information of the corresponding OSD (Object Storage Device);
in response to detecting that the storage node has restarted, reconstructing a virtual block device in the memory file system of the storage node based on the detected recorded information, and replacing the original virtual block device of the storage node with the newly created virtual block device;
and calculating the missing data blocks in the restarting process and synchronizing based on the data on other non-restarted storage nodes.
According to one embodiment of the invention, the apparatus further comprises a power-off module configured to:
in response to receiving a command indicating that a storage node has recovered from an unexpected power failure, transmitting the OSD capacity and id information of the storage node held by the management node to the storage node, reconstructing a virtual block device in the memory file system of the storage node, and replacing the original virtual block device of the storage node with the newly created virtual block device;
and calculating missing data blocks in the power-off process and synchronizing based on data on other storage nodes.
According to one embodiment of the invention, the creation module is further configured to:
mounting the Tmpfs with the specified size on each storage node to a specified path;
respectively creating virtual disk files with specified sizes under the paths;
and mounting the virtual disk file as a local loop device on each storage node.
According to one embodiment of the invention, the integration module is further configured to:
respectively initializing the local loop devices on all storage nodes as OSDs (Object Storage Devices) of the Ceph distributed storage system, creating a high-speed storage pool using the OSDs, and setting the fault domains of the high-speed storage pool according to the number and distribution of the storage nodes;
dividing each OSD into a plurality of placement groups (PGs), and evenly distributing the original data blocks and redundant data blocks across different PGs on different OSDs through the built-in hash algorithm of the Ceph distributed storage system.
The invention has the following beneficial technical effects: the method for optimizing Ceph performance based on a distributed memory disk provided by the embodiments of the present invention creates a virtual disk on a memory file system on each storage node of the Ceph distributed storage system, integrates the virtual disks on the plurality of storage nodes, creates a high-speed storage pool from the integrated virtual disks, and accelerates the performance of the Ceph distributed storage system based on the created high-speed storage pool. This technical scheme can use the distributed memory disk as a high-speed storage pool that provides higher performance than media such as solid state disks and remains compatible with the existing scheme of accelerating with solid state disks; through redundancy rules and processing flows for special conditions, the tendency of memory to lose data is overcome, giving higher reliability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other embodiments from these drawings without creative effort.
FIG. 1 is a schematic flow chart diagram of a method for distributed memory disk based Ceph performance optimization according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an apparatus for distributed memory disk based Ceph performance optimization according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of constructing a high-speed storage pool, according to one embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
In view of the foregoing, a first aspect of the embodiments of the present invention provides an embodiment of a method for Ceph performance optimization based on a distributed memory disk. Fig. 1 shows a schematic flow diagram of the method.
As shown in fig. 1, the method may comprise the steps of:
s1, creating a virtual disk on a memory file system on each storage node of a Ceph distributed storage system, and using a memory as storage to greatly improve the access performance;
s2, integrating the virtual disks on the plurality of storage nodes, and creating a high-speed storage pool by using the integrated virtual disks, so that a redundancy design can be added to ensure data safety in case of accidents;
s3, the performance of the Ceph distributed storage system is accelerated based on the created high-speed storage pool, and for example, the performance of the storage system can be accelerated by using a three-layer storage scheme.
The technical scheme provided by the invention overcomes, by technical means, the memory's susceptibility to data loss on power failure. The method converts memory into block devices that the distributed storage system can manage, and pools the memory resources of multiple storage nodes in the form of these block devices. A redundancy design is added to the resource pool to strengthen data security, and the created high-speed memory storage pool is used to accelerate distributed storage. Data security is a prerequisite for a storage system.
With the technical scheme of the invention, the distributed memory disk can be used as a high-speed storage pool that provides higher performance than media such as solid state disks and remains compatible with the existing scheme of accelerating with solid state disks; through redundancy rules and processing flows for special conditions, the tendency of memory to lose data is overcome, giving higher reliability.
In a preferred embodiment of the present invention, further comprising:
in response to receiving a command to restart or shut down the storage node, calling a script program to record the state and configuration information of the high-speed storage pool of the storage node, and recording the configuration of the memory disk on the storage node and the configuration information of the corresponding OSD (Object Storage Device);
in response to detecting that the storage node has restarted, reconstructing a virtual block device in the memory file system of the storage node based on the detected recorded information, and replacing the original virtual block device of the storage node with the newly created virtual block device;
and calculating missing data blocks in the restarting process and synchronizing the data blocks based on the data on other non-restarted storage nodes.
Storage nodes may be restarted or shut down; two situations are distinguished below. In the first situation, part of the storage nodes are shut down or restarted as planned and the number of restarting nodes does not exceed the number of failed nodes allowed by the redundancy rule; data recovery then uses the following flow: (1) after the restart or shutdown command is executed, a script program is called to record the state and configuration information of the high-speed storage pool, together with the configuration of the memory disk on the node to be restarted and the configuration information of the corresponding OSD. The data in the high-speed storage pool does not need to be migrated to the non-volatile storage pool, and the node operating system is shut down through the normal flow; (2) after the node system restarts, the event and configuration information recorded at shutdown are detected, and the virtual block device in the memory file system is rebuilt according to the method of the invention. The failed-disk replacement process in Ceph is then executed: the OSD recorded at the last shutdown is removed as a failed disk and replaced with the newly created virtual block device; (3) when the number and distribution of OSDs in the storage pool reach the state before the shutdown or restart, data is rebuilt automatically, and the missing data blocks are calculated from the data on the other nodes that were not restarted. Because memory is fast and the space is small, data recovery completes in a short time, less than the time taken to restart the device.
The second situation is that all storage nodes are shut down or restarted as planned, or the number of simultaneously restarting nodes exceeds the number of failed nodes allowed by the redundancy rule. The following flow is used: (1) after the restart or shutdown command is executed, a script program is called to record the state and configuration information of the high-speed storage pool, together with the configuration of the memory disk on each node to be restarted and the configuration information of the corresponding OSD; (2) the data in the high-speed storage pool is migrated to the non-volatile storage pool, a log is recorded, and the node operating systems are shut down through the normal flow; (3) after all node systems restart, the events and configuration information recorded at shutdown are detected, and the virtual block devices in the memory file systems are rebuilt as in the first situation; (4) the high-speed storage pool is rebuilt according to the method of the invention; (5) when the number and distribution of OSDs in the high-speed storage pool reach the state before the shutdown or restart, the data is restored from the non-volatile storage pool to the high-speed memory pool according to the log recorded in step (2).
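As an illustration of the shutdown hook described in step (1) above, the following is a minimal Python sketch, not the patent's actual script: it records the high-speed pool state and the node's loop-device layout to a local file before a planned reboot. The state-file path /var/lib/memdisk_state.json and the use of the ceph and losetup command-line tools are assumptions made for illustration.

#!/usr/bin/env python3
"""Hedged sketch: persist the high-speed pool state and the memory-disk/OSD
layout of this node before a planned reboot (assumed paths and tooling)."""
import json
import subprocess

STATE_FILE = "/var/lib/memdisk_state.json"  # assumed node-local persistent path

def run(cmd):
    # Run a command and return its stdout as text; raise if it fails.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def record_state():
    state = {
        # Cluster-wide OSD and pool state as reported by Ceph in JSON form.
        "osd_dump": json.loads(run(["ceph", "osd", "dump", "--format", "json"])),
        "osd_tree": json.loads(run(["ceph", "osd", "tree", "--format", "json"])),
        # Mapping of this node's loop devices to their backing files in tmpfs.
        "loop_devices": run(["losetup", "--list", "--output", "NAME,BACK-FILE"]),
    }
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)

if __name__ == "__main__":
    record_state()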
In a preferred embodiment of the present invention, further comprising:
in response to receiving a command of restoring the storage node from an unexpected power failure, transmitting OSD capacity and id information of the storage node in the management node to the storage node, reconstructing a virtual block device in a memory file system of the storage node, and replacing the virtual block device of the original storage node with the newly-created virtual block device;
and calculating missing data blocks in the power-off process and synchronizing based on data on other storage nodes.
When the number of unexpectedly powered-off nodes is lower than the number of failed nodes the redundancy rule can tolerate, the following steps are performed: (1) after the unexpectedly powered-off storage node has fully started, the management node of the distributed storage cluster senses that the node is back online; the OSDs corresponding to the node's memory disks are in a failed state because their block devices can no longer be found; (2) the capacity, id and other information of the failed OSDs held by the management node are passed as parameters to the online node, virtual block devices are created according to the method of the invention, and the failed-disk replacement process is then executed, so that the newly created empty virtual block devices of the same capacity are added to the high-speed memory pool; (3) when the number and distribution of OSDs in the storage pool reach the state before the power loss, data is rebuilt automatically, and the missing data blocks are calculated from the data on the other nodes. Because memory is fast and the space is small, data recovery completes in a short time, less than the time taken to restart the device.
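A hedged sketch of the replacement step in (2) follows, assuming the failed OSD's id and capacity have been supplied by the management node and that tmpfs is already mounted at /mnt/ramdisk. The destroy-then-recreate flow shown is the standard Ceph disk-replacement procedure rather than a transcript of the patent's own script, and the file and device names are illustrative.

#!/usr/bin/env python3
"""Hedged sketch: rebuild an empty memory-backed block device of the original
capacity and re-create the failed OSD under its old id (illustrative names)."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def rebuild_and_replace(osd_id: int, size_gb: int, loop_index: int = 0):
    img = f"/mnt/ramdisk/osd{osd_id}.img"
    loop = f"/dev/loop{loop_index}"
    # Recreate an empty backing file of the original capacity inside tmpfs.
    run(["truncate", "-s", f"{size_gb}G", img])
    # Attach it to a loop device so Ceph can treat it as a block device.
    run(["losetup", loop, img])
    # Mark the old OSD destroyed but keep its id, then recreate it on the new device.
    run(["ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it"])
    run(["ceph-volume", "lvm", "create", "--osd-id", str(osd_id), "--data", loop])

if __name__ == "__main__":
    rebuild_and_replace(osd_id=3, size_gb=63)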
In a preferred embodiment of the present invention, creating a virtual disk on the memory file system on each storage node of the Ceph distributed storage system comprises:
mounting the Tmpfs with the specified size on each storage node to a specified path;
respectively creating virtual disk files with specified sizes under the paths;
and mounting the virtual disk file as a local loop device on each storage node.
This changes the way memory is used, from physical-page caching managed by the local operating system to block devices that can span nodes and be managed by the distributed storage system. Memory is the operating system's temporary working area for data and is organized in pages; ordinarily, server A cannot directly use the memory of server B as a cache, and data in memory is not stored persistently. Mechanical hard disks and solid state disks are block devices that use sectors as the basic storage unit and store data persistently. Tmpfs, the Linux memory file system shipped with Linux distributions, is a common way of using a memory disk. For example, on 4 storage nodes each configured with 256G of memory, the following operations are performed on every node: a 128G memory file system is mounted at the /mnt/ramdisk path, two 63G virtual disk files are created under that path, and the virtual disk files are attached to the two virtual block devices /dev/loop0 and /dev/loop1 using the standard Linux loop-device tool (losetup). This yields a total of 8 memory-backed block storage devices across the 4 storage nodes.
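The per-node preparation in this example can be sketched as follows. The mount point, file names and sizes mirror the example above; the use of truncate to create the files and losetup to attach them is an assumption about how the step would typically be carried out on a stock Linux distribution, not a statement of the patented implementation.

#!/usr/bin/env python3
"""Hedged sketch of the per-node setup from the example: a 128G tmpfs mounted
at /mnt/ramdisk holding two 63G disk files attached as /dev/loop0 and /dev/loop1."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def prepare_memory_disks(tmpfs_size="128G", disk_size="63G", count=2,
                         mount_point="/mnt/ramdisk"):
    # Mount a size-limited tmpfs (Linux memory file system) at the chosen path.
    run(["mkdir", "-p", mount_point])
    run(["mount", "-t", "tmpfs", "-o", f"size={tmpfs_size}", "tmpfs", mount_point])
    for i in range(count):
        img = f"{mount_point}/disk{i}.img"
        # Create a fixed-size virtual disk file inside the memory file system.
        run(["truncate", "-s", disk_size, img])
        # Attach the file to /dev/loopN so it can be used as a block device.
        run(["losetup", f"/dev/loop{i}", img])

if __name__ == "__main__":
    prepare_memory_disks()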
In a preferred embodiment of the present invention, the integrating the virtual disks on the plurality of storage nodes, and the creating the high-speed storage pool using the integrated virtual disks includes:
respectively initializing the local loop devices on all storage nodes as OSDs (Object Storage Devices) of the Ceph distributed storage system, creating a high-speed storage pool using the OSDs, and setting the fault domains of the high-speed storage pool according to the number and distribution of the storage nodes;
dividing each OSD into a plurality of placement groups (PGs), and evenly distributing the original data blocks and redundant data blocks across different PGs on different OSDs through the built-in hash algorithm of the Ceph distributed storage system.
Memory resources on multiple storage nodes are integrated and aggregated into a large capacity space, and a redundancy design is added to keep data safe in the event of an accident. First, the local loop devices created on all storage nodes are initialized as OSDs of the Ceph distributed storage system, and a suitable failure-domain setting is chosen according to the number and distribution of the storage nodes. Each OSD is then divided into several placement groups (PGs), a redundancy policy is set on the storage pool, and the original data blocks and redundant data blocks are distributed evenly across different PGs on different OSDs by Ceph's built-in hash algorithm. For example, with a two-copy redundancy mode, each redundant data block and its original data block are stored in PGs of different failure domains, the failure of up to 50% of the nodes can be tolerated, and the remaining nodes rebuild the failed PGs; choosing three copies gives higher reliability, and choosing an erasure-code rule gives higher space utilization. For example, with 2 loop devices of 63G on each of 4 storage nodes, a storage pool mempool is created from the 8 virtual block devices, the failure domain is set at node level, and the redundancy rule is three copies, i.e., each data block generates two replica blocks and the three blocks are stored on three different storage nodes. The usable size of mempool is the total block-device capacity divided by the number of copies, i.e., 168G. During operation, if any two storage nodes fail, at least one copy of every data block remains on the other nodes; mempool continues to provide storage service normally and automatically rebuilds the missing data on the remaining healthy nodes.
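A hedged sketch of the pool-creation step follows. Initializing each loop device as an OSD runs on its own node, while the pool is created once from any node. The rule name mem_rule, the pool name mempool and the PG count of 128 are illustrative choices; the CRUSH rule command shown is the standard Ceph way of setting a host-level failure domain, and keeping the memory OSDs separate from ordinary OSDs (for example under a dedicated CRUSH root or device class) is assumed to be handled elsewhere.

#!/usr/bin/env python3
"""Hedged sketch: turn the memory-backed loop devices into OSDs and build a
replicated "mempool" with a host-level failure domain and three copies."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def create_osd(device):
    # Initialize one memory-backed loop device as a Ceph OSD (run on its node).
    run(["ceph-volume", "lvm", "create", "--data", device])

def create_memory_pool(pool="mempool", pg_num="128"):
    # CRUSH rule whose failure domain is the host, so replicas land on different nodes.
    run(["ceph", "osd", "crush", "rule", "create-replicated",
         "mem_rule", "default", "host"])
    # Replicated pool bound to that rule, with an illustrative PG count.
    run(["ceph", "osd", "pool", "create", pool, pg_num, pg_num,
         "replicated", "mem_rule"])
    # Three copies: each data block plus two replicas on different hosts.
    run(["ceph", "osd", "pool", "set", pool, "size", "3"])

if __name__ == "__main__":
    for dev in ("/dev/loop0", "/dev/loop1"):
        create_osd(dev)
    create_memory_pool()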
FIG. 3 is a schematic diagram of building the high-speed storage pool, where 01 is the memory; 02 is the memory file system; 03 is the virtual disk file created in the memory file system; 04 is the OSD corresponding to the loop device to which the virtual disk file is attached; 05 is a placement group (PG) on an OSD; 06 is a storage node, shown with two loop devices, and there are a plurality of such storage nodes; 07 is the high-speed memory pool.
The high-speed storage pool created in this way accelerates the performance of the Ceph storage system, and a high-speed storage pool built on memory disks as described above has the same attributes as one built on solid state disks. The latency of a mechanical hard disk is around 10 milliseconds and that of a solid state disk is within 1 millisecond, while a memory disk responds several times faster than a solid state disk. Existing schemes that use solid state disks to accelerate mechanical disks are therefore equally applicable to a high-speed pool based on memory disks.
For example, performance can be accelerated using a tiered storage technique. The three-tier scheme works as follows: data is first written to the high-speed memory storage pool as the first-level cache; when a threshold is reached, a migration operation is triggered and the data is moved to the solid-state-disk storage pool as the second-level cache; when the second-level cache reaches its threshold, a migration is triggered that writes the data to the mechanical-disk storage pool. The two-tier scheme works as follows: data is first written to the high-speed memory storage pool as the first-level cache, and when a threshold is reached a migration operation writes the data to the mechanical-disk storage pool.
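Ceph's built-in cache tiering gives one concrete way to realize the two-tier variant, with the memory pool placed in front of a slower backing pool; the sketch below is illustrative rather than the patent's own flow, and the pool names and capacity threshold are assumptions.

#!/usr/bin/env python3
"""Hedged sketch: use Ceph cache tiering to put the memory pool in front of a
slower backing pool (one possible two-tier realization; names and limits assumed)."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def attach_cache_tier(backing_pool="hddpool", cache_pool="mempool",
                      max_bytes=str(100 * 1024 ** 3)):
    # Attach the memory pool as a cache tier of the backing pool.
    run(["ceph", "osd", "tier", "add", backing_pool, cache_pool])
    # Writeback mode: writes land in the cache first and are flushed later.
    run(["ceph", "osd", "tier", "cache-mode", cache_pool, "writeback"])
    # Route client I/O to the cache tier transparently.
    run(["ceph", "osd", "tier", "set-overlay", backing_pool, cache_pool])
    # Hit-set tracking plus a capacity threshold that triggers flush/eviction,
    # i.e. the "migration when a threshold is reached" described above.
    run(["ceph", "osd", "pool", "set", cache_pool, "hit_set_type", "bloom"])
    run(["ceph", "osd", "pool", "set", cache_pool, "target_max_bytes", max_bytes])

if __name__ == "__main__":
    attach_cache_tier()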
As another example, in a cloud computing scenario, a template volume used to create virtual machines in batches can be migrated online from an ordinary storage pool to the high-speed memory storage pool, improving the performance of the volume and mitigating virtual machine boot storms.
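For this cloud-computing example, RBD live image migration (available in recent Ceph releases) is one hedged way to move a template volume into the memory pool while it stays accessible; the pool and image names below are assumptions.

#!/usr/bin/env python3
"""Hedged sketch: move an RBD template volume into the memory pool using RBD
live migration (Ceph Nautilus or later); pool and image names are illustrative."""
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

def migrate_template(src="hddpool/vm-template", dst="mempool/vm-template"):
    # Stage the migration; the destination image becomes usable while data
    # is still being copied from the source pool in the background.
    run(["rbd", "migration", "prepare", src, dst])
    # Copy the remaining data in the background.
    run(["rbd", "migration", "execute", dst])
    # Finalize once the copy is complete.
    run(["rbd", "migration", "commit", dst])

if __name__ == "__main__":
    migrate_template()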
With the technical scheme of the invention, the distributed memory disk can be used as a high-speed storage pool that provides higher performance than media such as solid state disks and remains compatible with the existing scheme of accelerating with solid state disks; through redundancy rules and processing flows for special conditions, the tendency of memory to lose data is overcome, giving higher reliability.
It should be noted that, as those skilled in the art will understand, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the above-described method embodiments.
Furthermore, the method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, which may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the functions defined above in the methods disclosed in the embodiments of the present invention.
In view of the above objects, a second aspect of the embodiments of the present invention provides an apparatus for optimizing Ceph performance based on a distributed memory disk. As shown in fig. 2, the apparatus 200 includes:
the creating module is configured to create a virtual disk on a memory file system on each storage node of the Ceph distributed storage system;
the integration module is configured to integrate the virtual disks on the plurality of storage nodes and create a high-speed storage pool using the integrated virtual disks;
an application module configured to perform performance acceleration on the Ceph distributed storage system based on the created high-speed storage pool.
In a preferred embodiment of the present invention, the apparatus further comprises a recovery module configured to:
in response to receiving a command to restart or shut down the storage node, calling a script program to record the state and configuration information of the high-speed storage pool of the storage node, and recording the configuration of the memory disk on the storage node and the configuration information of the corresponding OSD (Object Storage Device);
in response to detecting that the storage node has restarted, reconstructing a virtual block device in the memory file system of the storage node based on the detected recorded information, and replacing the original virtual block device of the storage node with the newly created virtual block device;
and calculating missing data blocks in the restarting process and synchronizing the data blocks based on the data on other non-restarted storage nodes.
In a preferred embodiment of the present invention, the apparatus further comprises a power-off module configured to:
in response to receiving a command indicating that a storage node has recovered from an unexpected power failure, transmitting the OSD capacity and id information of the storage node held by the management node to the storage node, reconstructing a virtual block device in the memory file system of the storage node, and replacing the original virtual block device of the storage node with the newly created virtual block device;
and calculating missing data blocks in the power-off process and synchronizing based on data on other storage nodes.
In a preferred embodiment of the present invention, the creation module is further configured to:
mounting the Tmpfs with the specified size on each storage node to a specified path;
respectively creating virtual disk files with specified sizes under the paths;
and mounting the virtual disk file as a local loop device on each storage node.
In a preferred embodiment of the invention, the integration module is further configured to:
respectively initializing the local loop devices on all storage nodes as OSDs (Object Storage Devices) of the Ceph distributed storage system, creating a high-speed storage pool using the OSDs, and setting the fault domains of the high-speed storage pool according to the number and distribution of the storage nodes;
dividing each OSD into a plurality of placement groups (PGs), and evenly distributing the original data blocks and redundant data blocks across different PGs on different OSDs through the built-in hash algorithm of the Ceph distributed storage system.
It should be particularly noted that the apparatus embodiment described above uses the method embodiment to describe the working process of each module in detail, and those skilled in the art can readily apply these modules to other embodiments of the method.
Further, the above-described method steps and system units or modules may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or units or modules.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The embodiments described above, particularly any "preferred" embodiments, are possible examples of implementations and are presented merely to clearly understand the principles of the invention. Many variations and modifications may be made to the above-described embodiments without departing from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure and protected by the following claims.

Claims (6)

1. A distributed memory disk-based Ceph performance optimization method is characterized by comprising the following steps:
creating a virtual disk on a memory file system on each storage node of the Ceph distributed storage system, wherein creating the virtual disk on the memory file system on each storage node of the Ceph distributed storage system comprises: mounting a Tmpfs of a specified size on each storage node to a specified path, respectively creating virtual disk files of specified sizes under the path, and mounting the virtual disk files as local loop devices on each storage node;
integrating the virtual disks on the plurality of storage nodes and creating a high-speed storage pool using the integrated virtual disks, wherein integrating the virtual disks on the plurality of storage nodes and creating the high-speed storage pool using the integrated virtual disks comprises: respectively initializing the local loop devices on all the storage nodes as OSDs (Object Storage Devices) of the Ceph distributed storage system, creating the high-speed storage pool using the OSDs, setting fault domains of the high-speed storage pool according to the number and distribution of the storage nodes, dividing each OSD into a plurality of placement groups (PGs), and uniformly distributing original data blocks and redundant data blocks in different PGs of different OSDs through the built-in hash algorithm of the Ceph distributed storage system;
performing performance acceleration on the Ceph distributed storage system based on the created high-speed storage pool.
2. The method of claim 1, further comprising:
in response to receiving a command of restarting or shutting down a storage node, calling a script program to record the state and configuration information of the high-speed storage pool of the storage node, and recording the configuration of the memory disk on the storage node and the configuration information of the corresponding OSD (Object Storage Device);
in response to detecting that the storage node has restarted, reconstructing a virtual block device in the memory file system of the storage node based on the detected recorded information, and replacing the virtual block device of the original storage node with the newly created virtual block device;
and calculating missing data blocks in the restarting process and synchronizing the data blocks based on the data on other non-restarted storage nodes.
3. The method of claim 1, further comprising:
in response to receiving a command of unexpected power failure recovery of a storage node, transferring OSD capacity and id information of the storage node in a management node to the storage node, reconstructing a virtual block device in a memory file system of the storage node, and replacing the virtual block device of the original storage node with the newly-created virtual block device;
and calculating and synchronizing missing data blocks in the power-off process based on data on other storage nodes.
4. An apparatus for Ceph performance optimization based on a distributed memory disk, the apparatus comprising:
the creating module is configured to create a virtual disk on a memory file system on each storage node of the Ceph distributed storage system, mount a Tmpfs of a specified size on each storage node to a specified path, respectively create virtual disk files of specified sizes under the path, and mount the virtual disk files as local loop devices on each storage node;
the integration module is configured to integrate the virtual disks on the plurality of storage nodes, create a high-speed storage pool using the integrated virtual disks, respectively initialize the local loop devices on all the storage nodes as OSDs (Object Storage Devices) of the Ceph distributed storage system, create the high-speed storage pool using the OSDs, set fault domains of the high-speed storage pool according to the number and distribution of the storage nodes, divide each OSD into a plurality of placement groups (PGs), and uniformly distribute original data blocks and redundant data blocks in different PGs of different OSDs through the built-in hash algorithm of the Ceph distributed storage system;
an application module configured to accelerate performance of the Ceph distributed storage system based on the created high-speed storage pool.
5. The device of claim 4, further comprising a recovery module configured to:
in response to receiving a command of restarting or shutting down a storage node, calling a script program to record the state and configuration information of the high-speed storage pool of the storage node, and recording the configuration of the memory disk on the storage node and the configuration information of the corresponding OSD (Object Storage Device);
in response to the fact that the storage node is restarted, reconstructing a virtual block device in a memory file system of the storage node based on the detected recorded information, and replacing the virtual block device of the original storage node with the newly-created virtual block device;
and calculating missing data blocks in the restarting process and synchronizing the data blocks based on the data on other non-restarted storage nodes.
6. The device of claim 4, further comprising a power-down module configured to:
in response to receiving a command of unexpected power failure recovery of a storage node, transferring OSD capacity and id information of the storage node in a management node to the storage node, reconstructing a virtual block device in a memory file system of the storage node, and replacing the virtual block device of the original storage node with the newly-created virtual block device;
and calculating and synchronizing missing data blocks in the power-off process based on data on other storage nodes.
CN202010452359.8A 2020-05-26 2020-05-26 Distributed memory disk-based Ceph performance optimization method and device Active CN111708488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010452359.8A CN111708488B (en) 2020-05-26 2020-05-26 Distributed memory disk-based Ceph performance optimization method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010452359.8A CN111708488B (en) 2020-05-26 2020-05-26 Distributed memory disk-based Ceph performance optimization method and device

Publications (2)

Publication Number Publication Date
CN111708488A CN111708488A (en) 2020-09-25
CN111708488B 2023-01-06

Family

ID=72537707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010452359.8A Active CN111708488B (en) 2020-05-26 2020-05-26 Distributed memory disk-based Ceph performance optimization method and device

Country Status (1)

Country Link
CN (1) CN111708488B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112905118B (en) * 2021-02-19 2023-01-20 山东英信计算机技术有限公司 Cluster storage pool creating method
CN113608674B (en) * 2021-06-25 2024-02-23 济南浪潮数据技术有限公司 Method and device for realizing reading and writing of distributed block storage system
CN113535095B (en) * 2021-09-14 2022-02-18 苏州浪潮智能科技有限公司 Data storage method, device and equipment for double storage pools and storage medium
CN115686363B (en) * 2022-10-19 2023-09-26 百硕同兴科技(北京)有限公司 Tape simulation gateway system of IBM mainframe based on Ceph distributed storage


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102291466A (en) * 2011-09-05 2011-12-21 浪潮电子信息产业股份有限公司 Method for optimizing cluster storage network resource configuration
CN103593226A (en) * 2013-11-04 2014-02-19 国云科技股份有限公司 Method for improving IO performance of disc of virtual machine
US20150160872A1 (en) * 2013-12-09 2015-06-11 Hsun-Yuan Chen Operation method of distributed memory disk cluster storage system
CN109714229A (en) * 2018-12-27 2019-05-03 山东超越数控电子股份有限公司 A kind of performance bottleneck localization method of distributed memory system

Also Published As

Publication number Publication date
CN111708488A (en) 2020-09-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant