WO2015127647A1 - Storage virtualization manager and system of Ceph-based distributed mechanism


Info

Publication number
WO2015127647A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage
virtualization manager
Ceph
storage virtualization
resource
Application number
PCT/CN2014/072707
Other languages
French (fr)
Chinese (zh)
Inventor
汤传斌
朱勤
Original Assignee
运软网络科技(上海)有限公司
Application filed by 运软网络科技(上海)有限公司
Priority to PCT/CN2014/072707
Publication of WO2015127647A1

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/455: Emulation; interpretation; software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533: Hypervisors; virtual machine monitors
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 2009/45595: Network integration; enabling network access in virtual machine instances
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/10: File systems; file servers

Definitions

  • the present invention relates to computer virtualization technologies and the delivery and deployment of physical storage resources and virtual storage resources within an enterprise data center. More specifically, it relates to a storage virtualization manager (storage hypervisor) implemented using Ceph. Background technique
  • Computer storage virtualization technology abstracts or isolates the internal functions of storage subsystems or storage services so that the management of storage and data is separated from the management of applications, servers, and network resources, enabling independent management of applications and networks.
  • Physical resources and logical resources no longer correspond one-to-one; the relationship can be one-to-many or many-to-one, and it is transparent to users.
  • Virtualized storage resources are like a huge "storage pool": users do not see specific disks or tapes, and they do not have to care which path their data takes or on which specific storage device it is stored.
  • Host-based virtualization is typically implemented through storage management software such as Logical Volume Management Software (LVM). Physical devices are mapped into a contiguous sequence of logical storage spaces. Through the management of logical views, users use virtual management software to map storage media to logical volumes.
  • the benefit of host-based virtual storage is that virtualized storage is easy to implement, with a variety of storage management supported by pure software.
  • To further improve the reliability, security, and manageability of the storage system, multiple servers can use clustering technology to achieve shared storage.
  • The downside is that the virtualization must be done in software: scalability and compatibility are limited, and the scheduling work affects the application performance of the server.
  • Storage virtualization can also be implemented in storage subsystems or storage devices.
  • Storage-based virtualization adapts to the architecture of heterogeneous SANs (storage area networks) and is more adaptive to storage-centric environments.
  • Device-level virtual storage is independent of the host, and multiple hosts can be connected to the storage device, but the storage device itself must be homogeneous.
  • Mirroring, RAID, instant snapshots, and data copies can all adopt device-level virtual storage methods. Because compatibility and interoperability of storage devices among vendors are poor, device-level virtual storage can often provide only an incomplete virtual storage solution without the support of third-party virtualization software.
  • An interconnected network-based virtualization solution moves the virtual engine to the core of the SAN system, the interconnected network.
  • The specific implementation depends on the network device: it can be the switch itself, a router, or a dedicated server. If a switch or router is used, the in-band mode is generally adopted, in which data and metadata share the same path; if a dedicated server is used, both the in-band and the out-of-band mode can be implemented.
  • In the in-band structure, the virtual control device sits between the servers and the storage devices, and the storage management software running on it manages and configures all the storage devices. In the out-of-band structure, the virtual control device is outside the system data path and does not directly participate in data transmission.
  • The virtual control device configures all the storage devices and submits the configuration information to all servers.
  • When a server accesses a storage device, it no longer passes through the virtual control device.
  • Host-based virtual storage is much worse than network-based virtual storage when considering management and maintenance costs. The reason is that network-based virtual storage can provide a virtual resource management pool, through which various resources can be centrally managed, which can greatly reduce the workload of maintenance and management, and the corresponding management personnel can be greatly reduced.
  • Host-based storage virtualization technologies such as Veritas' volume management software (Volume Manager) are server-based virtualization software that can be installed on a server or host to virtualize multiple physical disks into logical volumes for users.
  • Virsto has developed a software solution that is installed on each host server and creates a high-performance disk or solid-state storage area called "vLog".
  • the read operation will point directly to the primary store, and the write operation will be done via vLog, which vLog distributes these writes asynchronously to the primary store. Similar to how caching works, vLog improves storage performance by reducing coupling at the storage front end, reducing latency for back-end storage.
  • Device-based storage virtualization technologies such as Hitachi's Data Systems' Universal Storage Platform enable virtualized applications by consolidating and managing other storage under their own storage array systems.
  • Network-based storage virtualization technologies such as EMC's Invista enable virtualization by plugging a server or intelligent switch device into an FC SAN or iSCSI SAN to intercept I/O from the network device to the storage controller.
  • DataCore's SANsymphony is a network-based in-band virtualization software that runs on dedicated hardware devices between servers and storage devices.
  • IBM's out-of-band virtualization solution combining the SAN Volume Controller (SVC) with the SAN File System has a high market share.
  • Tivoli's SANergy products and SGI's CXFS products are virtualization software based on SAN file system.
  • a storage virtualization manager is a monitor that manages multiple "storage pools” as virtual resources. As a type of virtual engine, it treats all the storage hardware it manages as a common platform, although these hardware may be different and incompatible. To do this, a storage virtualization manager must understand the performance, capacity, and service characteristics of its "underlying storage.”
  • "underlying storage” may refer to physical hardware, such as solid-state disks or hard disks; or storage architectures such as storage area networks (SANs), network attached storage (NAS), and direct attached storage (DAS). .
  • the Storage Virtualization Manager ensures that no business disruption occurs when adding new devices (such as a new array) or replacing resources in some or all of the existing storage pools. This means that during this process, the Storage Virtualization Manager fully understands which storage features and functions it will acquire and knows which storage features and functions it will ignore. In other words, the Storage Virtualization Manager is more than just a simple combination of a storage monitor (supe "viso") and storage virtualization. It is a higher level of intelligent software.
  • the Storage Virtualization Manager can be controlled not only. Device-level storage controllers, disk arrays, and virtualization middleware, and it also enables storage provisioning, providing snapshot and backup services, and managing policy-driven service level agreements (SLAs). Storage virtualization The manager also provides a solid foundation for further building software defined storage.
  • Patent US20100153617A1 "Storage Management System for Virtual Machines".
  • The applicant for this patent is Virsto. Its purpose is to provide a better storage management solution for virtual desktop systems. It solves two problems: a) traditional server virtualization relies entirely on the virtualization manager, without storage management optimization, so performance bottlenecks appear as server virtualization in the data center grows; b) traditional volume management often does not adequately address the functional requirements of server virtualization. The patent is applicable to virtual desktop systems (VDI), and its implementation type is host-based virtualization.
  • Patent US 8504757 B1 "Method of Converting Virtual Storage Device Addresses to Physical Storage Device Addresses in a Dedicated Virtualization Manager".
  • the method is to allow third-party software or physical storage devices to add functionality to the dedicated virtualization manager's I/O handler based on each virtual storage device.
  • The proprietary virtualization manager described in that invention still refers to a server virtualization manager. It is worth noting that a server virtualization manager to which I/O processing capability is added to improve storage access efficiency belongs to a different conceptual category from the "storage virtualization manager"; a server virtualization manager and a storage virtualization manager are completely different in mechanism.
  • The present invention provides a new method for intelligently utilizing Ceph to implement a storage virtualization manager (storage hypervisor), which is a distributed solution that uses Ceph cluster files/objects to control/manage individual devices. At the same time, this method allows Ceph's traditional storage functions to continue to be used.
  • storage virtualization manager (storage hypervisor)
  • the storage virtualization manager of the present invention is implemented using Ceph's distributed mechanism.
  • the storage device object is abstracted into a Ceph file. Because this file is similar to a device file in Linux, we call it a device file.
  • a specific storage device corresponds to a storage device object, that is, corresponds to a specific device file.
  • The device file is the bridge between the upper and lower halves of the storage virtualization manager. More specifically, a device control request from the upper half of the storage virtualization manager can be considered a file write operation to the device file, and the response can be viewed, through the Ceph interface (i.e., the unified adapter), as a file read operation on the device file.
  • In the present invention, SMI-S is used to manage SAN storage and SSH is used to manage switches. If the storage user's management application controls or monitors a device through SMI-S or SSH, then in some cases after virtualization the storage virtualization manager needs to provide a corresponding SMI-S module or SSH module to match the real underlying device. So that the unified adapter is unaffected, multiple instances of these modules are distributed into Ceph's OSDs. Thus, the SMI-S module and the SSH module can be shared by different storage device objects and are also fault tolerant, because multiple instances of them are present in Ceph's OSDs.
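  • As a minimal illustration of this device-file convention (a sketch only, assuming Ceph's Python librados bindings and a hypothetical pool name, object names, and message format; the patent does not prescribe any particular API), a control request could be written into a Ceph object and the result read back as follows:

```python
# Sketch: treating a Ceph object as a "device file" for control traffic.
# Assumes librados Python bindings; pool/object names and message text are hypothetical.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('device_files')   # hypothetical pool of device files
    dev = 'fc_switch_01'                         # one storage device object

    # Upper half: a device control request is a write to the device file.
    ioctx.write_full(dev + '.request', b'disable-port 5')

    # Lower half (an SSH/SMI-S module on some OSD) would act on the request and
    # write its result back; the upper half then sees it as a file read operation.
    status = ioctx.read(dev + '.status')
    print(status.decode(errors='replace'))
    ioctx.close()
finally:
    cluster.shutdown()
```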
  • the present invention provides a storage virtualization manager based on a Ceph-based distributed mechanism, wherein the storage virtualization manager includes at least:
  • The first part of the storage virtualization manager is configured to be independent of any particular storage device and exists as an abstraction of the underlying storage devices; the second part of the storage virtualization manager is configured to be implemented using the inherent characteristics of the Ceph cluster,
  • The second part of the storage virtualization manager includes the various specific storage devices;
  • Each specific storage device corresponds to a device file, and a device control request issued by the first part of the storage virtualization manager is treated as a file write operation to the device file, while the response to the device control request is treated as a file read operation on the device file via the Ceph client.
  • the first part of the storage virtualization manager includes:
  • a virtual finite state machine, which serves as an abstraction of the underlying storage devices and provides their various capabilities to the resource decision module as the basis for its decisions; the virtual finite state machine also cooperates with the overall decision module to ensure the reliability of the entire storage virtualization manager system; a manager, which during the unified construction of heterogeneous devices collects and provides device type, capability, QoS attributes, and status information to the resource decision module, and which, when the resource decision module sends a batch of resource reservation requests, decides whether the resource reservation requests can be accepted; if at a given moment the performance of the storage resource meets the user's requirements and the topology indicates that the storage data link can be established, the resource decision module replies to the user with a message accepting the resource reservation;
  • a unified adapter, implemented using the Ceph client mechanism, configured to obtain information about all of the underlying storage devices and provide that information to the manager, the information including the topology, functions, and performance of the underlying storage devices; the unified adapter provides a unified device operation/control interface for monitoring and control tasks.
  • the device file is a Ceph file.
  • the unified adapter is the Ceph client.
  • the monitoring and control tasks include monitoring or controlling a Fibre Channel switch port and allocating a volume from a storage area network (SAN).
  • SAN storage area network
  • the first part of the storage virtualization manager further comprises: a data facility configured to provide operations on user data, the operations including disaster tolerance operations, compression and decompression, and redundancy elimination (deduplication).
  • the second part of the storage virtualization manager includes:
  • a Storage Management Initiative Specification (SMI-S) module configured to manage storage area network (SAN) storage;
  • a Secure Shell (SSH) protocol module configured to manage switches;
  • the SMI-S module and the SSH module may be shared by different storage device objects, and instances of the SMI-S module and the SSH module are distributed among the Ceph object storage devices (OSDs) and are therefore fault tolerant.
  • the present invention also provides a storage virtualization manager system based on a Ceph-based distributed mechanism, the system having a control plane, a data plane, and a data stream, wherein the storage virtualization manager system includes: a storage virtualization manager as described above;
  • a Live Resource attached storage infrastructure domain including all storage area network (SAN) devices that work for Live Resource;
  • SAN storage area network
  • Customer storage infrastructure domain including all storage area network (SAN) devices that work for customers; and
  • SAN storage area network
  • a customer computing infrastructure domain including hosts carrying all of the virtual machine clusters working for the customer, the virtual machine clusters accessing the SAN storage data within the customer storage infrastructure domain through the data plane, and
  • a Ceph client residing on the host.
  • The system further includes:
  • a resource decision module, which is part of the Live Resource domain and operates on the control plane; the resource decision module determines whether a storage resource reservation can succeed according to the actual situation reported by the storage virtualization manager.
  • The system further includes:
  • an overall decision module, which is part of the Live Resource domain and works on the control plane.
  • Operations on the data plane do not occur during the resource reservation phase; when a virtual machine is started, the storage virtualization manager needs to establish a data link on the data plane, a link the virtual machine requires in order to access the client application data stored on the SAN.
  • The data stream participates in image library management and image transfer.
  • The image includes the client application system and the operating system on which it depends; when a virtual machine on the host is started, the template image required by the virtual machine is copied from a SAN device of the Live Resource attached storage infrastructure domain to the host, the copying being assisted by a Ceph client on the host.
  • the storage virtualization manager system has three layers: a unified adapter layer, a Ceph transport layer, and a lower layer physical device, where
  • the unified adapter layer includes the unified adapter and establishes a device private protocol with the Ceph transport layer; the unified adapter is responsible for communicating with the storage devices of the lower-layer physical devices, and the unified adapter does not need to know where a storage device is;
  • the Ceph transport layer includes the Ceph cluster and is responsible for the transport links; it is not concerned with the data transmitted over them;
  • the lower-layer physical devices include each specific storage device, each controlled by an object in the Ceph cluster of the Ceph transport layer, and the mapping between storage devices and objects is dynamic.
  • the storage virtualization manager system runs on a Live Resource (LR) service delivery platform, which is a resource management system based on the autonomic computing characteristics of the ACRA architecture.
  • LR: Live Resource
  • The technical solution provided by the present invention is a distributed solution that uses Ceph cluster files/objects as the means to control/manage devices; at the same time, the solution allows Ceph's traditional storage functions to continue to be used.
  • The highlight of the solution is the Ceph cluster.
  • the storage virtualization manager of the present invention can implement storage virtualization and further implement an autonomous storage management system in combination with the ACRA architecture, thereby greatly improving the reliability and availability of the storage system.
  • FIG. 1 is an architectural diagram of an autonomous storage management system showing a storage virtualization manager and its working environment, in accordance with one embodiment of the present invention
  • FIG. 2 is a three level diagram involved in a storage virtualization manager workflow in accordance with one embodiment of the present invention
  • FIG. 3 is a block diagram of an implementation environment LR service delivery platform in accordance with one embodiment of the present invention.
  • FIG. 4 shows the ACRA architecture referenced by the implementation environment, the LR service delivery platform, in accordance with one embodiment of the present invention. Detailed description
  • the present invention cleverly utilizes Ceph to implement a storage virtualization manager, the creativity of which is to use the files/objects of the Ceph cluster as a distributed solution for controlling/managing individual devices. At the same time, Ceph's traditional storage capabilities can continue to be used.
  • The storage virtualization manager of the present invention can implement storage virtualization and, combined with the ACRA architecture, can further implement an autonomous storage management system, thereby greatly improving the reliability and availability of the storage system.
  • Ceph is a petabyte-scale distributed file system for Linux. In general, Ceph is a high-performance, highly reliable and scalable cluster of multiple PCs. It can be roughly divided into four parts (see Figure 1):
  • Client: used by data users; each client instance provides a near-POSIX interface to hosts or processes (POSIX: Portable Operating System Interface);
  • Metadata Server (MDS) Cluster: used to cache and synchronize distributed metadata, manage the namespace (file and directory names), and coordinate security, consistency, and coherence;
  • Object Storage Cluster: contains multiple object storage devices (OSDs) that store all data and metadata;
  • Monitor Cluster: monitoring servers that maintain cluster membership and state (described further below).
  • Ceph uses a near-POSIX interface to ensure the scalability and consistency of the interface, which is consistent with applications and helps improve system performance. While achieving high performance, high reliability and high availability, Ceph achieves system scalability through three basic design features: decoupled data and metadata, dynamic distributed metadata management, and automatic distributed object management.
  • the client uses the metadata server MDS for metadata operations to determine the data location.
  • The Metadata Server (MDS) not only manages data locations but also arranges where to store new data. It is worth noting that the metadata itself is stored on the storage cluster and is identified as "metadata I/O". The actual file I/O occurs between the client and the object storage (OSD) cluster. Thus, higher-level POSIX features (for example, open, close, rename) are managed through the metadata server MDS, while lower-level POSIX features (such as read and write) are managed directly by the object storage OSD cluster.
  • The intelligent control of the Ceph file system is distributed to each node, which not only simplifies the client interface but also gives Ceph large-scale dynamic expansion capability.
  • Traditional storage often uses an allocation list method, in which metadata is used to map blocks on a disk to a specified file.
  • A file is assigned an inode number (INO) by the metadata server, which serves as the file's unique identifier. The file is then cut into several objects (the number depends on the size of the file). From the file's inode number (INO) and each object's object number (ONO), every object is assigned an object identifier (OID).
  • each object is assigned to a placement group.
  • The placement group (identified by a PGID) is a conceptual container for objects.
  • The placement group is mapped to a set of object storage devices (OSDs) using a pseudo-random mapping algorithm called CRUSH (Controlled Replication Under Scalable Hashing).
  • CRUSH: Controlled Replication Under Scalable Hashing
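  • To make this mapping chain concrete, the sketch below mimics the sequence described above: (INO, ONO) to OID, OID to PGID, PGID to a set of OSDs. It is purely illustrative; the hash and the tiny CRUSH-like placement function are simplified stand-ins, not Ceph's actual algorithms, and the cluster sizes are assumed.

```python
# Illustrative sketch of Ceph's mapping chain: (INO, ONO) -> OID -> PGID -> OSDs.
# The hashing and placement below are simplified stand-ins for the real CRUSH algorithm.
import hashlib

PG_NUM = 64                                 # number of placement groups (assumed)
OSDS = [f"osd.{i}" for i in range(8)]       # assumed cluster of 8 OSDs
REPLICAS = 3

def oid(ino: int, ono: int) -> str:
    """Object identifier derived from the file's inode number and object number."""
    return f"{ino:x}.{ono:08x}"

def pgid(object_id: str) -> int:
    """Hash the OID into a placement group."""
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    return h % PG_NUM

def crush_like(pg: int) -> list:
    """Pseudo-random but deterministic choice of OSDs for a placement group."""
    start = pg % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(REPLICAS)]

# A file striped into three objects: objects 0, 1, 2 of inode 0x1234.
for ono in range(3):
    o = oid(0x1234, ono)
    pg = pgid(o)
    print(o, "-> pg", pg, "->", crush_like(pg))
```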
  • The MDS translates a file name into a file node (inode) through the file system hierarchy and obtains the inode number (INO), mode, file size, and other metadata.
  • If the file exists and the operation rights are available, the MDS returns the inode number, file length, and other file information in a hierarchical structure. The MDS also gives the client the right to operate. At present there are four kinds of operation rights, each represented by one bit: read, cached read, write, and buffered write. In the future, the operation rights will add security keywords so that the client can prove to the OSD that it may read and write data (the current policy allows all clients).
  • Metadata Server MDS manages file node (inode) space and converts file names into metadata. That is, the metadata server MDS translates the file name into an inode, file size, and striping data that the Ceph client uses for file I/O.
  • the task of the metadata server MDS is to manage the file system namespace (namespace).
  • namespace file system namespace
  • metadata is further split on a metadata server MDS cluster, and these metadata servers MDS can adaptively copy and allocate namespaces to avoid hot spots.
  • Metadata Server MDS manages individual segments of the namespace, and namespaces can overlap for reasons of redundancy and performance.
  • The mapping of metadata from the MDS to the namespace in Ceph is implemented according to a dynamic subtree partitioning method, which allows Ceph to adjust to workload changes (that is, the namespace is migrated between metadata servers MDS).
  • Ceph object storage (OSD): as a type of object store, Ceph's storage nodes include not only storage but also intelligent control, whereas traditional drives can only respond to commands.
  • the object storage device OSD is a smart device that can both request and respond, thereby enabling communication and cooperation with other object storage devices OSD.
  • the Ceph object storage device OSD implements an object-to-block mapping (this task is traditionally done in the client's file system layer). This design allows the local entity to decide on how to store an object in an optimal way.
  • the B-tree File System (BTRFS) can be applied to a storage node.
  • The MDS grants the client the capability to read and cache the file contents; armed with the inode number, layout, and file size, the client can name and locate all the objects containing the file's data.
  • When the client opens a file for writing, it gains the capability to use buffered writes, and the data is written to the appropriate objects on the appropriate OSDs.
  • Because the Ceph client uses the CRUSH algorithm, it knows nothing about the block mapping of files on physical disks; the underlying storage devices (OSDs) can therefore safely manage the object-to-block mapping. This allows the storage nodes to replicate data (especially when a device fails). Since fault detection and recovery are distributed, the Ceph storage system is highly scalable. Ceph calls this RADOS.
  • RADOS: Reliable Autonomic Distributed Object Store
  • The object store adheres to a simple principle: as part of the object store, all servers run software that manages and exports the server's local disk space; all instances of that software collaborate across the cluster to provide what, from the outside, looks like a single large data store. To implement internal storage management, the object storage software no longer uses the original format of the data, but instead saves it on the storage nodes in the form of binary objects. More importantly, the number of individual nodes that make up the large object store can be arbitrary; users can even dynamically add storage nodes at run time.
  • RADOS implements the object storage function described above. Its key technology consists of three layers, from bottom to top: 1. Object storage devices (OSDs). In RADOS, an OSD is always presented as a folder of an existing file system. There is no hierarchical nesting in the OSD folder: only files with UUID-format names, and no subfolders. The OSDs together form the object store; the binary objects stored in them are converted from the stored files by RADOS.
  • 2. Monitoring servers: they constitute the interface to the RADOS store and support access to the individual objects in it.
  • The monitoring servers work in a decentralized manner, handling communication with all external applications: there is no limit to the number of monitoring servers, and any client can contact any monitoring server.
  • The monitoring servers manage the MONmap (the list of all monitoring servers) and the OSDmap (the list of all OSDs). The information provided by these two maps allows clients to calculate which OSD they need to contact when accessing a particular file.
  • 3. MDS: provides Ceph clients with POSIX metadata for each object in the RADOS object store.
  • ACRA: Autonomic Computing Reference Architecture
  • ACRA divides the autonomic computing system into three layers: at the bottom, the system components or managed resources 4300.
  • These managed resources 4300 can be any type of resource, including hardware or software.
  • A managed element can be any of a variety of internal resources, including databases, servers, routers, application modules, Web services or virtual machines, or other autonomic elements. These resources can have some embedded self-management attributes.
  • Each managed resource 4300 generally provides a standardized interface (touchpoint). Each touchpoint corresponds to a sensor/effector pair.
  • a single autonomous element manages internal resources through autonomous managers.
  • It provides a standard interface (sensor/effector pair) through which it is managed, including accepting policies specified by IT managers and collaboration information from other autonomic elements.
  • the parent autonomous element responsible for global orchestration can manage multiple subordinate autonomous elements.
  • the middle layer contains resource managers.
  • Resource managers 4200 are often divided into four categories: self-configuring, self-healing, self-optimizing, and self-protecting.
  • Each resource may have one or more resource managers 4200, each of which implements its own control loop.
  • the top level contains a global autonomic manager 4100 that coordinates various resource managers. These global autonomous managers 4100 achieve certain system-level management objectives through a system-wide large control loop to achieve system-wide autonomous management. See Figure 4.
  • the left side shows the Human Manager 4400, which provides IT professionals with a common system management interface through an integrated console.
  • a Knowledge Base 4500 from which the Human Manager 4400 and the Autonomous Managers 4100, 4200, 4300 can acquire and share all knowledge of the system.
  • FIG. 3 illustrates an implementation environment in accordance with one embodiment of the present invention: Live Resource Service Delivery Platform (referred to as LR).
  • the Live Resource Service Delivery Platform is an automated system that supports the scheduling of logical resource reservations.
  • Roles on the platform include the project developer, the project operator, the application user, and the system operator.
  • The project developer designs the project development and test environments required by the user. He/she creates, saves, publishes, edits, previews, and deletes environment designs, and views the resources belonging to the user.
  • The project operator deploys and undeploys a project environment. He/she uses LR for environment deployment, undeployment, backup management, re-deployment, operational state management, project resource scheduling management, environment topology management, and resource consumption statistics for the environment.
  • the application user deploys and accesses a project environment. He/she quickly deploys business environments through LR, fast SSH access to the environment, fast access deployment services, and self-security management services.
  • The system administrator performs asset management and monitors the operational status of the entire environment. He/she implements resource discovery through the network management system 302 in LR, as well as management of physical server, network, and storage resources, management of virtual server, network, and storage resources, and resource alarm management.
  • the LR service delivery platform includes three levels of scheduling:
  • Project delivery scheduling: includes requirement design services for computing, storage, and network resources, system resource analysis services, virtual resource reservation, and deployment services. It is supported by the project delivery service network 300. Closely related to the present invention are system resource analysis and virtual resource reservation 301.
  • the deployment process is the process of binding logical resources in a logical delivery environment to virtual resources. Logical resources are bound to virtual resources in a one-to-one manner, which is the first binding in the entire logical delivery environment for scheduled delivery automation.
  • Virtual resource scheduling: includes the allocation, configuration, and provisioning of virtual resources. It is supported by the resource engine component 304.
  • the binding process of virtual resources to physical resources must go through the resource engine 304, which is the second binding in the automated delivery of the entire logical delivery environment.
  • the resource engine 304 provides "capability" of various virtual resources by aggregating individual virtual resources.
  • the resource engine 304 also maintains a state model for each virtual resource, thereby completing the binding from the virtual resource to the physical resource.
  • The proxies 306, 307, 308 on the physical resources accept resource commands from the resource engine 304 and implement resource multiplexing and resource space sharing; resource state information is passed back to the resource engine 304 via the proxies 306, 307, 308.
  • The above functions are carried out by multiple physical delivery environments partitioned for the project out of the data center physical resource service network 309.
  • The physical resource service network 309 supports scheduled delivery of the delivery environments while supporting sharing of physical resources by space and by time; it includes many unallocated and allocated physical resources, such as network, storage, and computing resources.
  • In addition to managing the various physical resources of the physical data center, the system administrator also carries out the partitioning of the physical delivery environments.
  • the resource engine 304 uses physical resource information provided by the NMS (Network Management System) to track physical resources to obtain the latest resource status; and map physical resources to virtual resources.
  • NMS Network Management System
  • Commercial network management systems used to manage physical resources generally provide information about state and performance, and all have the function of finding and searching for physical resources, so they are not described here.
  • Various storage resources include: storage area network, network attached storage, distributed file system, Ceph, mirroring, etc.
  • The above physical resource information is stored in the reference model 303, as shown in FIG. 3. Additionally, a "push" or "pull" schedule can be selected between the project delivery service network 300 and the resource engine 304.
  • Push: resource changes are requested regardless of the capacity of the physical delivery environment, and parallelized resource provisioning is supported.
  • Pull: resource change requests are committed only when the physical delivery environment capacity is ready, and parallelized resource provisioning is supported.
  • the resource engine 304 in the LR can perform binding of virtual resources to physical resources. Its main task is to virtualize physical resources, including the virtualization of various storage resources.
  • the "Storage Virtualization Manager" 305 is an important part of implementing storage virtualization. It is the focus of the present invention.
  • FIG. 1 is a diagram of an autonomous storage management system architecture showing a storage virtualization manager and its working environment, in accordance with one embodiment of the present invention.
  • the frame structure of the "Storage Virtualization Manager” 1000 is shown in the rounded rectangle.
  • the storage virtualization manager 1000 can be divided into two logical portions: a storage virtualization manager first portion (top half) 1001 and a storage virtualization manager second portion (lower half) 1002.
  • The upper half is independent of specific devices and exists as an abstraction layer over the underlying devices. It is part of the IaaS platform tool Live Resource Domain 1010.
  • the second half includes a variety of specific devices that are implemented using the inherent features of the Ceph Cluster 1020.
  • vFSM 1011 is the (local) virtual finite state machine. It has two important functions: one is the abstraction of the underlying storage devices, providing their various "capabilities" to the user of the storage virtualization manager 1000, the resource decision maker 1100 (Resource Decision Maker, or resource decision module), as the basis for the decisions made by the resource decision maker 1100; the other is the decision part of the Smart Storage Device, which cooperates with the overall system decision module coreVFSM 1200 to ensure the reliability of the whole system.
  • The intelligent storage device here, that is, the storage virtualization manager 1000 and all the SAN storage it controls, including the LR attached storage infrastructure domain 1300 and the customer storage infrastructure domain 1400 as a whole, provides intelligent virtualized storage capabilities to the upper-layer users.
  • Manager 1012: an important prerequisite for unified scheduling of the managed storage resources is that the unified adapter 1014 knows almost all information about the underlying storage devices, such as topology, functions, and performance. During the unified construction of heterogeneous devices, the manager 1012 collects and provides the resource decision maker 1100 with information such as device type, capability, QoS, and status.
  • The manager 1012 needs to analyze the information at hand and decide whether a resource reservation can be accepted. If the storage resource meets the user's needs at that point in time and the topology indicates that the storage data link can be established, the resource decision maker 1100 will reply to the user with a message accepting the resource reservation. It should be noted that the data link for accessing the storage has not yet been established at this time; the real data link cannot be established until at least one compute node is started.
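  • The acceptance test applied here can be pictured as a small predicate over the reported capability and topology information. The following sketch is a simplification under assumed data structures (the patent does not define any); it only illustrates the two conditions named above: sufficient capability at the moment of the request, and an establishable data link.

```python
# Sketch of the reservation-acceptance decision described above.
# The dataclass fields and the topology representation are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class ReservationRequest:
    capacity_gb: int          # requested capacity
    min_iops: int             # requested QoS
    host: str                 # compute host that will consume the storage
    device: str               # candidate storage device

@dataclass
class DeviceState:
    free_gb: int
    iops_available: int

def can_accept(req, devices, topology) -> bool:
    """Accept iff the device currently meets the request and a data link can be built."""
    dev = devices.get(req.device)
    if dev is None:
        return False
    performance_ok = dev.free_gb >= req.capacity_gb and dev.iops_available >= req.min_iops
    link_possible = (req.host, req.device) in topology   # e.g. host HBA reaches the SAN
    return performance_ok and link_possible

# Example: one SAN device reachable from host "h1".
devices = {"san1": DeviceState(free_gb=500, iops_available=20000)}
topology = {("h1", "san1")}
print(can_accept(ReservationRequest(100, 5000, "h1", "san1"), devices, topology))  # True
```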
  • Data Facility 1013 is an optional module. It operates on application (App) data, that is, user data: disaster recovery operations such as backup and restore, or optimization operations such as compression/decompression and deduplication.
  • Unified Adapter 1014 provides a unified device operation/control interface on top of the device adapter layer below, which depends on the storage devices in the Ceph cluster 1020. It completes monitoring and control tasks, such as monitoring or controlling a Fibre Channel switch (FC SW) port and allocating a volume (LUN) from the SAN.
  • the unified adapter 1014 itself is implemented using Ceph's client mechanism.
  • the storage virtualization manager 1000 of the present invention is implemented using a distributed mechanism of Ceph.
  • the storage device object is abstracted into a Ceph file. Because this file is similar to the device file in Linux, we call it device file 1030.
  • A specific storage device corresponds to a storage device object, that is, to a specific device file.
  • The device file 1030 is the bridge between the storage virtualization manager first portion 1001 (upper half) and the storage virtualization manager second portion 1002 (lower half). More specifically, a device control request from the upper portion 1001 of the storage virtualization manager can be considered a file write operation to the device file 1030, and the response can be considered, through the Ceph interface (i.e., the unified adapter 1014), a file read operation on the device file 1030.
  • SMI-S Storage Management Initiative Specification
  • WBEM Web-based Enterprise Management
  • CIM Common Information Model
  • SSH is a Secure Shell protocol (Secure Shell) that is well known to those skilled in the art, so it will not be described.
  • SMI-S is used to manage SAN (storage area network) storage; SSH is used to manage switches. If the storage user's management application controls or monitors a device through SMI-S or SSH, then in some cases after virtualization the storage virtualization manager 1000 needs to provide a corresponding SMI-S module or SSH module to match the real underlying device.
  • So that the unified adapter 1014 is unaffected, multiple instances of these modules can be distributed into Ceph's OSDs.
  • The SMI-S module 1021 and the SSH module 1022 can be shared by different storage device objects, and are also fault tolerant because multiple instances of them are present in Ceph's OSDs.
  • The storage virtualization manager does not need to be concerned with Ceph's MDS; in one embodiment, Ceph's MDS can be omitted from Figure 1.
  • The back-end storage devices include all SAN (storage area network) devices in the LR attached storage infrastructure domain 1300 working for LR, and all SAN devices in the customer storage infrastructure domain 1400 working for customers. These back-end devices are controlled by the SMI-S module 1021 and the SSH module 1022 via control paths 1604 and 1605, for example to allocate a volume (LUN) or delete a volume (LUN). We call this fabric management.
  • SAN Storage Area Network
  • Operations on the data plane do not occur during the resource reservation phase.
  • When a virtual machine is started, the storage virtualization manager 1000 needs to establish a data link, from the source HBA 1514 to a Fibre Channel switch port (omitted in Figure 1), to the destination HBA 1410 ... SAN 1401.
  • The virtual machine VM 1511 needs this link to access the data of the client APP stored in the SAN.
  • Data stream 1602 participates in image library management and image transfer.
  • The image belongs to the APP content, which contains the client's application system and the operating system it depends on. For example, when the virtual machine VM 1511 on the host 1510 starts, the template image it requires will be copied from the SAN of the LR attached storage infrastructure domain 1300 to the host. This copying is done by the Ceph client 1513 on the host 1510.
  • Ceph clusters within the LR attached storage infrastructure domain 1300 can alleviate communication congestion due to concurrent access. For example, when a large number of virtual machines are started at the same time, communication congestion often occurs.
  • In addition to the storage virtualization manager 1000, the system includes, but is not limited to, the following modules:
  • The Resource Decision Module 1100 (Resource Decision Maker) is part of LR and works on the control plane. For example, to run a new online banking system a bank needs a certain amount of computing power, memory, and storage, and these resources require a reservation. The resource decision module 1100 determines, according to the actual situation reported by the storage virtualization manager, whether the storage resources can meet the requirements of running the new online banking system. For example, if the SAN is only 1% idle, the storage resource reservation cannot succeed.
  • the Coordination Decision Module 1200 (coreVFSM) is part of the LR and works on the control plane.
  • The Customer Storage Infrastructure Domain 1400 includes all SAN devices that work for customers, such as the SAN devices the bank's applications use to store their data.
  • The Customer Computing Infrastructure Domain 1500 includes all computing resources that work for customers, for example the virtual machine clusters supporting the bank's various applications. These virtual machines (e.g., VM 1511) access the SAN storage data within the customer storage infrastructure domain 1400 through the data plane 1601.
  • each storage device corresponds to a device file 1030, just like the ioctl mechanism in the Linux device driver mode.
  • the current device files are distributed to the OSD in the Ceph cluster.
  • the Storage Virtualization Manager has a distributed device file system.
  • ioctl is the abbreviation of I/O control. Simply put, when writing a Linux device driver you will encounter some I/O operations that logically belong to neither read nor write; those operations can be considered part of ioctl. Read and write should be used to transfer data and handled as a simple means of data exchange, while ioctl is used to control certain options of reading and writing. For example, suppose a user has designed a general driver module for reading and writing an I/O port: read and write transfer data through the port, but how should an operation that changes the port's read/write configuration be handled? Obviously it is reasonable to use ioctl.
  • The reads and writes of the port can be blocking or non-blocking, and the reads and writes of the device file can be concurrent or not; these options can be designed to be configured through ioctl.
  • The general parameter format of ioctl is a command word (a constant) plus command parameters.
  • The read and write parameters are a data buffer, a data destination pointer, and a length.
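  • As a small, hedged illustration of this analogy (not part of the patent's mechanism), the snippet below shows the familiar Linux pattern of read/write for data plus ioctl for control, using Python's standard fcntl module; the device path and the request code are placeholders, not values taken from any real driver.

```python
# Illustration of the Linux read/write-vs-ioctl pattern referenced above.
# The device path and request number are placeholders, not values from the patent.
import fcntl
import os

DEVICE = "/dev/example0"           # hypothetical character device
EXAMPLE_SET_NONBLOCK = 0x4004E001  # hypothetical ioctl request code

fd = os.open(DEVICE, os.O_RDWR)
try:
    os.write(fd, b"payload")       # data exchange goes through write/read
    data = os.read(fd, 4096)

    # Control options (e.g. blocking behaviour) go through ioctl,
    # as a command word (constant) plus command parameters.
    fcntl.ioctl(fd, EXAMPLE_SET_NONBLOCK, 1)
finally:
    os.close(fd)
```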
  • the present invention creates a Ceph object as an abstraction for a Fibre Channel switch (FC SW).
  • The Fibre Channel switch can be, for example, a Brocade switch.
  • We write a Control Message to the device file from the Unified Adapter Layer 2100 (refer to the Unified Adapter 1014 in Figure 1, where the Unified Adapter Layer 2100 is bound to the Ceph Client) (see the device file in Figure 1).
  • the device file referred to herein is more specifically an object in the Ceph cluster 2210, such as object 2211.
  • This triggers an SSH process running on a certain OSD (see SSH module 1022 in Figure 1), which can disable the port of the fabric switch (i.e., one of the devices in storage 2310) via control path 1604 or 1605.
  • The SSH process then writes the resulting state of this disable operation back to the device file (e.g., object 2211).
  • This write-back triggers the Unified Adapter Layer 2100 (see Unified Adapter 1014 in Figure 1), which can obtain the result status of the operation from the device file (e.g., object 2211).
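  • A minimal sketch of what such an OSD-side SSH module might do is given below. It assumes the paramiko SSH library, a Brocade-style "portdisable" CLI command, and the hypothetical device-file naming used earlier; the patent itself does not fix any of these details.

```python
# Sketch of an OSD-resident SSH module acting on a control message and writing
# the status back to the device-file object. Library choice (paramiko), the
# switch CLI command, and all names are assumptions for illustration.
import paramiko
import rados

def disable_switch_port(switch_host: str, user: str, password: str, port_no: int) -> str:
    """Run a port-disable command on the Fibre Channel switch over SSH."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(switch_host, username=user, password=password)
    try:
        _, stdout, stderr = ssh.exec_command(f"portdisable {port_no}")  # Brocade-style CLI (assumed)
        err = stderr.read().decode()
        return "ok" if not err else f"error: {err.strip()}"
    finally:
        ssh.close()

# Write the result back into the device-file object so the unified adapter can read it.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("device_files")                 # hypothetical pool
status = disable_switch_port("10.0.0.5", "admin", "secret", 5)
ioctx.write_full("fc_switch_01.status", status.encode())
ioctx.close()
cluster.shutdown()
```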
  • The object in FIG. 2 (for example, object 2211) is an example of a file among the device files 1030 in FIG. 1, but operating on an object is different from operating on a file.
  • The former is a whole unit in Ceph and is easy for the upper portion 1001 of the storage virtualization manager to operate on; the latter is fragmented within Ceph, which is not conducive to being manipulated.
  • The workflow of the Storage Virtualization Manager involves three levels: Unified Adapter Layer 2100, Ceph Transport Layer 2200, and Lower-Layer Physical Devices 2300. Viewed more broadly, these three levels apply not only to storage virtualization, but also to network virtualization and compute virtualization.
  • the workflow for requesting a volume (LUN) on a SAN through the Storage Virtualization Manager is as follows:
  • Step 1: Write "alloc-lun" to the device file in the form of a message;
  • Step 2: The change to the device file triggers the SSH process on the OSD; the SSH process constantly listens on the device file and reads out the "alloc-lun" message;
  • Step 3: "alloc-lun" is executed via Ceph, and the SAN allocates the required volume (LUN);
  • Step 4: The LUN-allocation-success message is written back to the device file;
  • Step 5: After the unified adapter reads the device file and learns that the operation is complete, the user is further notified.
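  • The five steps can be pictured end to end as one writer and one polling listener on the same device-file object. The sketch below is a simplification under the same assumed pool and object names as before; the polling loop stands in for whatever change-notification mechanism an implementation would actually use, and allocate_lun_on_san is a hypothetical placeholder for the SAN-side action.

```python
# End-to-end sketch of the "alloc-lun" workflow (steps 1-5) over a device-file object.
# Pool name, object names, message format, and allocate_lun_on_san are assumptions.
import time
import rados

def allocate_lun_on_san(size_gb: int) -> str:
    """Placeholder for the real SAN-side allocation (e.g. via SMI-S or the array CLI)."""
    return f"lun-allocated size={size_gb}GB"

def osd_side_listener(ioctx, dev: str) -> None:
    """Steps 2-4: poll the device file, execute the request, write the result back."""
    while True:
        try:
            msg = ioctx.read(dev + ".request").decode()
        except rados.ObjectNotFound:
            msg = ""
        if msg.startswith("alloc-lun"):
            size = int(msg.split()[1])
            ioctx.write_full(dev + ".status", allocate_lun_on_san(size).encode())
            ioctx.remove_object(dev + ".request")
            return
        time.sleep(1)

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("device_files")        # hypothetical pool

# Step 1: the unified adapter writes the request message.
ioctx.write_full("san_01.request", b"alloc-lun 100")

# Steps 2-4 would run on an OSD node; shown inline here for brevity.
osd_side_listener(ioctx, "san_01")

# Step 5: the unified adapter reads the device file and notifies the user.
print(ioctx.read("san_01.status").decode())
ioctx.close()
cluster.shutdown()
```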
  • That is, in the case of dynamic heterogeneous changes of the devices in storage 2310, the original user 2000 (or, for example, the resource decision maker 1100), instead of directly accessing a specific device in storage 2310, now accesses an object in the Ceph cluster 2210, that is, a storage device object. Then, when a device in storage 2310 changes, none of the user 2000 programs that access the storage device object are affected; if the original direct-access method were used, the user 2000 would need to make corresponding adjustments.
  • FIG. 4 shows the ACRA architecture referenced by the implementation environment, the LR service delivery platform, in accordance with an embodiment of the present invention.
  • The implementation environment of the present invention shown in FIG. 4, the LR service delivery platform, is a resource management system with autonomic computing features.
  • Its implementation references IBM's ACRA architecture, including, of course, its storage resource management portion.
  • The lower half of Ceph storage is RADOS, a reliable, autonomic, distributed object store. As the name suggests, it has self-management attributes; it is one of the managed resources 4300.
  • the resource engine 304 includes a resource manager 4200 and a global autonomous manager 4100.
  • the storage virtualization manager 305 is one of the resource managers 4200, and the knowledge in the resource manager 4200 (autonomous elements) includes the vFSM.
  • The knowledge within the Global Autonomic Manager 4100 (an autonomic element) includes the coreVFSM.
  • The vFSM is the (storage virtualization manager local) virtual finite state machine. It works in conjunction with the virtual finite state machine of the entire system, coreVFSM 1200, to improve the availability and reliability of the entire storage system.
  • the storage virtualization manager of the present invention needs to support the autonomous computing feature that the resource manager 4200 has. Please refer to Figure 1.
  • The Ceph/RADOS mechanism has self-management attributes. With its support, the SAN can automatically report alarms; for example, if a port of the fabric switch (that is, one of the devices in storage 2310) is broken, the alarm is reported to the unified adapter 1014 through SSH 1022 and device file 1030.
  • vFSM 1011 is the virtual finite state machine of the storage virtualization manager 1000. It has two important functions: one is to act as the abstraction of the underlying storage devices; the other is the decision-making part of the Smart Storage Device. Since the loss of a fabric switch port is within its control range, it can send a message requesting the Fibre Channel switch to switch to a replacement port.
  • If the SAN device raises a large-scale damage alarm due to external causes, the problem is beyond the scope of the storage virtualization manager 1000.
  • In that case the state of the vFSM 1011 needs to be reported to the coreVFSM 1200, to be coordinated by the higher-level decision-making part of the LR service delivery platform, possibly even requiring manual intervention.
  • the autonomous storage management system provided by the present invention has the following features:
  • Self-configuration: it can adapt to changes in the storage system. Such changes may include the deployment of new storage devices or the removal of existing storage devices; dynamic adaptation helps ensure continuous operation of the storage devices/software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a storage virtualization manager based on a Ceph distributed mechanism, characterized in that the storage virtualization manager comprises at least a first part and a second part, wherein the first part of the storage virtualization manager is configured to be independent of specific storage devices and to exist as an abstraction of the underlying storage devices; and the second part of the storage virtualization manager is configured to be implemented using the inherent characteristics of a Ceph cluster and comprises various specific storage devices, wherein the specific storage devices correspond to device files, a device control request sent by the first part of the storage virtualization manager is regarded as a file write operation on the device file, and the response to the device control request is regarded as a file read operation on the device file via a Ceph interface.

Description

一种基于 Ceph的分布式机制的存储虚拟化管理器及系统 技术领域  Storage virtualization manager and system based on Ceph distributed mechanism
本发明涉及计算机虚拟化技术和企业数据中心内物理存储资源和虚拟存储资 源的交付和部署。 更具体地, 涉及一种利用 Ceph实现存储虚拟化管理器 (存储 hypervisor)。 背景技术  The present invention relates to computer virtualization technologies and the delivery and deployment of physical storage resources and virtual storage resources within an enterprise data center. More specifically, it relates to a storage virtualization manager (storage hypervisor) implemented using Ceph. Background technique
计算机存储虚拟化技术就是通过对存储子系统或存储服务的内部功能进行抽 象或隔离, 使存储或数据的管理与应用、 服务器、 网络资源的管理分离, 从而实现 应用和网络的独立管理。  Computer storage virtualization technology separates the management of applications or servers, network resources, and the independent management of applications and networks by abstracting or isolating the internal functions of storage subsystems or storage services.
简而言之, 就是物理资源和逻辑资源不再是一一对应的关系, 可以是一对多或 者多对一, 而这种关系对于用户是透明的。 对于用户来说, 虚拟化的存储资源就像 是一个巨大的 "存储池", 用户不会看到具体的磁盘、 磁带, 也不必关心自己的数据 经过哪一条路径存放在哪一个具体的存储设备上。  In short, physical resources and logical resources are no longer one-to-one correspondence. They can be one-to-many or many-to-one, and this relationship is transparent to users. For users, virtualized storage resources are like a huge "storage pool". Users don't see specific disks or tapes, and they don't have to care which path their own data is stored in. on.
目前, 虚拟存储技术的发展尚无统一标准。根据虚拟化软件的实现位置可以分 为三个不同的实现: 基于主机的虚拟化, 基于互连网络的虚拟化以及基于存储设备 的虚拟化。  At present, there is no uniform standard for the development of virtual storage technology. Depending on where the virtualization software is implemented, it can be divided into three different implementations: host-based virtualization, interconnect-based virtualization, and storage-based virtualization.
Host-based virtualization is generally implemented by storage management software such as a logical volume manager (LVM). Physical devices are mapped into contiguous, sequential logical storage space. By managing this logical view, the user maps storage media onto logical volumes with the virtualization management software. The advantage of host-based virtual storage is that it is easy to implement and supports a wide variety of storage management functions in pure software. To further improve the reliability, security, and manageability of the storage system, multiple servers can use clustering technology to share storage. Its drawbacks are that the virtualization must be done in software, scalability and compatibility are poor, and the scheduling work affects the application performance of the server.

Storage virtualization can also be implemented in the storage subsystem or storage device. Storage-device-based virtualization suits heterogeneous SAN (storage area network) architectures and is better adapted to storage-centric environments. Device-level virtual storage is independent of the host; multiple hosts can be connected to the storage device, but the storage devices themselves should be homogeneous. Depending on the scheme adopted, mirroring, RAID, instant snapshots, and data copying can all use device-level virtual storage methods. Because storage devices from different vendors have poor compatibility and interoperability, without the support of third-party virtualization software, device-level virtual storage usually provides only an incomplete storage virtualization solution.

An interconnect-network-based virtualization solution moves the virtualization engine into the core of the SAN system, namely the interconnect network. The specific implementation depends on the network equipment: it may be the switch itself, a router, or a dedicated server. If it is implemented in a switch or router, the in-band mode is generally used, that is, data and metadata share the same path; if it is implemented with a dedicated server, either an in-band or an out-of-band mode is possible. In the in-band structure, the virtual control device sits between the servers and the storage devices, and the storage management software running on it manages and configures all storage devices. In the out-of-band structure, the virtual control device is outside the system data path and does not directly participate in data transmission; it configures all storage devices and submits the configuration information to all servers, so that when a server accesses a storage device the access no longer passes through the virtual control device. When management and maintenance costs are considered, host-based virtual storage is much worse than network-based virtual storage, because network-based virtual storage can provide a virtual resource management pool through which all kinds of resources are centrally managed, greatly reducing the maintenance and management workload and the number of administrators required.
Many storage companies now offer the above kinds of storage virtualization products. Host-based storage virtualization technology, such as Veritas' Volume Manager, is server-based virtualization software that can be installed on a server or host to virtualize multiple physical disks into logical volumes for users. Virsto developed a software solution that is installed on each host server and creates a high-performance disk or solid-state storage area called a "vLog". Read operations go directly to primary storage, while write operations pass through the vLog, which distributes the writes asynchronously to primary storage. Similar to the way a cache works, the vLog improves storage performance by decoupling the storage front end and reduces back-end storage latency. Device-based storage virtualization technology, such as Hitachi Data Systems' Universal Storage Platform, implements virtualization by consolidating and managing other storage under its own storage array system.

Network-based storage virtualization technologies, such as EMC's Invista, implement virtualization by inserting a server or intelligent switch device into an FC SAN or iSCSI SAN to intercept the I/O from the network devices to the storage controllers. DataCore's SANsymphony is network-based in-band virtualization software that runs on dedicated hardware between the servers and the storage devices. IBM's out-of-band virtualization solution, the SVC volume controller plus SAN file system, has a high market share. In addition, Tivoli's SANergy product and SGI's CXFS product are both virtualization software based on a SAN file system.
There are also several ways to implement storage virtualization, one of which is to use a "storage virtualization manager" (also called a storage hypervisor). A storage virtualization manager is a supervisory program that manages multiple "storage pools" as virtual resources. As a kind of virtualization engine, it treats all the storage hardware it manages as a common platform, even though that hardware may be heterogeneous and incompatible. To do this, a storage virtualization manager must understand the performance, capacity, and service characteristics of its "underlying storage". Here, "underlying storage" may refer to physical hardware, such as solid-state disks or hard disks, or to a storage architecture, such as a storage area network (SAN), network attached storage (NAS), or direct attached storage (DAS).

A storage virtualization manager can ensure that adding a new device (such as a new array) or replacing some or all of the resources of an existing storage pool does not cause a service interruption. This means that during such a process the storage virtualization manager knows exactly which storage features and functions it will gain and which it will ignore. In other words, a storage virtualization manager is more than a simple combination of a storage supervisor and storage virtualization functions; it is a higher level of intelligent software. It can not only control device-level storage controllers, disk arrays, and virtualization middleware, it can also perform storage provisioning, provide services such as snapshots and backup, and manage policy-driven service level agreements (SLAs). The storage virtualization manager also provides a solid foundation for further building software defined storage.

In short, using a storage virtualization manager has many benefits, including better use of existing storage facilities, higher administrator productivity, and further improvements in storage performance and availability. At present, three vendors mainly offer storage virtualization managers: IBM, DataCore, and Virsto. None of their technical solutions is publicly disclosed. There are many patents related to the concepts of storage and virtualization, but most of them do not involve the concept of a "storage virtualization manager" (storage hypervisor). Two examples follow:
1) Patent US20100153617A1, "Storage management system for virtual machines". The applicant of this patent is Virsto. Its purpose is to provide a better storage management solution for virtual desktop systems. It solves two problems: a) traditional server virtualization relies entirely on the virtualization manager (hypervisor) without storage management optimization, so performance bottlenecks appear when server virtualization in a data center is scaled up; b) traditional volume management often does not meet the functional requirements of server virtualization well. The patent applies to virtual desktop infrastructure (VDI), and its implementation type is host-based virtualization; both aspects are quite different from the present invention.

2) Patent US 8504757 B1, "Method for translating virtual storage device addresses to physical storage device addresses in a proprietary virtualization hypervisor". Its method is to allow third-party software or physical storage devices to add functionality to the I/O handler of a proprietary virtualization manager on a per-virtual-storage-device basis. The proprietary virtualization manager in that invention still refers to a server virtualization manager. It is worth pointing out that a server virtualization manager adding I/O processing functions to improve storage access efficiency and a "storage virtualization manager" belong to different conceptual categories; a server virtualization manager and a storage virtualization manager are completely different in mechanism.
Summary of the Invention

One object of the present invention is to construct a single unified storage component within a data center. The present invention provides a new method of cleverly using Ceph to implement a storage virtualization manager (storage hypervisor). The method is a distributed solution that uses the files/objects of a Ceph cluster to control/manage individual devices, while at the same time still allowing Ceph's traditional storage functions to be used.

The storage virtualization manager of the present invention is implemented using Ceph's distributed mechanism. A storage device object is abstracted as a Ceph file. Because this file is similar to a device file in Linux, we call it a device file. In general, a specific storage device corresponds to one storage device object, that is, to one specific device file. The device file is the bridge between the upper half and the lower half of the storage virtualization manager. More specifically, a device control request issued by the upper half of the storage virtualization manager can be regarded as one file write operation on the device file, and the response can be regarded as one file read operation on the device file through the Ceph interface (i.e., the unified adapter).
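By way of a non-limiting illustration, the following sketch shows what such an exchange might look like from the upper half's point of view when the Ceph file system is mounted as an ordinary POSIX path. The mount point, the device file name, and the JSON request/response format are assumptions made only for this example and are not prescribed by the present invention.

```python
import json
import time

# Assumed CephFS mount point and device file name; the JSON message format is
# likewise only an illustrative assumption.
DEVICE_FILE = "/mnt/cephfs/devices/fc-switch-01"

def send_control_request(request, timeout=10.0):
    # The device control request is issued as one file write on the device file ...
    with open(DEVICE_FILE, "w") as f:
        f.write(json.dumps(request))
    # ... and the response is obtained as one file read on the same device file.
    deadline = time.time() + timeout
    while time.time() < deadline:
        with open(DEVICE_FILE, "r") as f:
            text = f.read().strip()
        if text:
            message = json.loads(text)
            if message.get("in_reply_to") == request["id"]:
                return message
        time.sleep(0.5)
    raise TimeoutError("no response was written to the device file")

if __name__ == "__main__":
    # Ask the device behind this device file to carve out a 100 GB volume (LUN).
    print(send_control_request({"id": 1, "op": "create_lun", "size_gb": 100}))
```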
In the present invention, SMI-S is used to manage SAN storage, while SSH is used to manage switches. If a storage user's management application controls or monitors a device through SMI-S or SSH, then in some situations after virtualization the storage virtualization manager needs to provide a corresponding SMI-S module or SSH module to match the real underlying device. So that the unified adapter is not affected, multiple instances of these modules are distributed across the OSDs of Ceph. In this way, the SMI-S module and the SSH module can be shared by different storage device objects and are fault tolerant because multiple instances of them exist in Ceph's OSDs.
Specifically, the present invention provides a storage virtualization manager based on a Ceph distributed mechanism, characterized in that the storage virtualization manager comprises at least:

a first part of the storage virtualization manager (for example, the upper half of the storage virtualization manager); and a second part of the storage virtualization manager (for example, the lower half of the storage virtualization manager); wherein the first part of the storage virtualization manager is configured to be independent of specific storage devices and to exist as an abstraction of an underlying storage device, and the second part of the storage virtualization manager is configured to be implemented using the inherent characteristics of a Ceph cluster, the second part of the storage virtualization manager comprising various specific storage devices;

wherein a specific storage device corresponds to a device file, a device control request issued by the first part of the storage virtualization manager is regarded as one file write operation on the device file, and the response to the device control request is regarded as one file read operation on the device file through a Ceph client.
In one embodiment, the first part of the storage virtualization manager comprises:

a virtual finite state machine, which, as an abstraction of the underlying storage devices, provides their various capabilities to a resource decision module as the basis on which the resource decision module makes its decisions, the virtual finite state machine also working in cooperation with an overall decision module to guarantee the reliability of the entire storage virtualization manager system; a manager, which, during the unified construction of heterogeneous devices, collects and provides to the resource decision module information on device type, capability, QoS attributes, and current status, wherein, when the resource decision module sends a batch of resource reservation requests, the manager decides whether the resource reservation requests can be accepted, and if, from some moment on, the function and performance of the storage resources satisfy the user's requirements and the topology shows that the storage data links can be established, the resource decision module replies to the user with a message accepting the resource reservation; and

a unified adapter, implemented using the Ceph client mechanism, which is configured to obtain information about all of the underlying storage devices, including their topology, functions, and performance, and to provide that information to the manager, the unified adapter providing a unified device operation/control interface to perform monitoring and control tasks.
In one embodiment, the device file is a Ceph file.

In one embodiment, the unified adapter is the Ceph client.

In one embodiment, the monitoring and control tasks include monitoring or controlling Fibre Channel switch ports and allocating a volume from a storage area network (SAN).

In one embodiment, the first part of the storage virtualization manager further comprises a data facility configured to provide operations on user data, the operations including disaster recovery operations, compression and decompression, and deduplication.
In one embodiment, the second part of the storage virtualization manager comprises:

a Storage Management Initiative Specification (SMI-S) module configured to manage the storage of a storage area network (SAN); and a Secure Shell (SSH) protocol module configured to manage switches;

wherein the SMI-S module and the SSH module can be shared by different storage device objects, and instances of the SMI-S module and of the SSH module are distributed to the object storage devices (OSDs) of Ceph so as to be fault tolerant.
The present invention also provides a storage virtualization manager system based on a Ceph distributed mechanism, the system having a control plane, a data plane, and a data stream, characterized in that the storage virtualization manager system comprises: the storage virtualization manager described above;

a Live Resource attached storage infrastructure domain, comprising all storage area network (SAN) devices that work for Live Resource;

a customer storage infrastructure domain, comprising all storage area network (SAN) devices that work for customers; and

a customer computing infrastructure domain, comprising the host machines carrying all of the virtual machine clusters that work for customers, the virtual machine clusters accessing the data stored on their SAN devices in the customer storage infrastructure domain through the data plane, the Ceph client being located on the host machines.
In one embodiment, the system further comprises:

a resource decision module, which is part of the Live Resource domain and works on the control plane, the resource decision module deciding, according to the actual conditions provided by the storage virtualization manager, whether a storage resource reservation can succeed.

In one embodiment, the system further comprises:

an overall decision module, which is part of the Live Resource domain and works on the control plane.
In one embodiment, operations on the data plane do not take place during the resource reservation phase; when a virtual machine is started, the storage virtualization manager needs to establish a data link on the data plane, and the virtual machine accesses the data of the customer applications stored on the SAN through this link.

In one embodiment, the data stream participates in image library management and image transfer, an image containing a customer's application system together with the operating system it depends on; when a virtual machine on a host machine is started, the template image required by the virtual machine is copied from the SAN devices of the Live Resource attached storage infrastructure domain to the host machine, the copy operation being assisted by the Ceph client on the host machine.
In one embodiment, the storage virtualization manager system has three layers: a unified adapter layer, a Ceph transport layer, and a lower layer of physical devices, wherein:

the unified adapter layer comprises the unified adapter, a device private protocol is established between the unified adapter layer and the Ceph transport layer, and the unified adapter is responsible for communicating with the storage devices of the lower physical device layer without needing to know where those storage devices are;

the Ceph transport layer comprises the Ceph cluster and is responsible for the transport links, without caring about the data transported over them; and

the lower layer of physical devices comprises the individual specific storage devices, which are controlled by the objects in the Ceph cluster of the Ceph transport layer, the matching relationship between storage devices and objects being dynamic.

In one embodiment, the storage virtualization manager system runs on a Live Resource (LR) service delivery platform, which is a resource management system with autonomic computing characteristics based on the ACRA architecture.
The technical solution provided by the present invention is a distributed solution that uses the files/objects of a Ceph cluster to control/manage individual devices while at the same time still allowing Ceph's storage functions in the traditional sense to be used. The highlight of the solution is the Ceph cluster. The storage virtualization manager of the present invention can implement storage virtualization and, combined with the ACRA architecture, further implement an autonomic storage management system, thereby greatly improving the reliability and availability of the storage system.

Brief Description of the Drawings
The above summary of the invention and the following detailed description will be better understood when read in conjunction with the accompanying drawings. It should be noted that the drawings serve only as examples of the claimed invention. In the drawings, the same reference numerals denote the same or similar elements.

Figure 1 is an architecture diagram of an autonomic storage management system according to one embodiment of the present invention, showing the storage virtualization manager and its working environment;

Figure 2 is a diagram of the three layers involved in the workflow of the storage virtualization manager according to one embodiment of the present invention;

Figure 3 is a structural diagram of the implementation environment, the LR service delivery platform, according to one embodiment of the present invention; and

Figure 4 shows the ACRA architecture referenced by the implementation environment, the LR service delivery platform, according to one embodiment of the present invention.

Detailed Description
The detailed features and advantages of the present invention are described in the detailed description below; its content is sufficient to enable any person skilled in the art to understand the technical content of the present invention and to implement it accordingly, and from the specification, claims, and drawings disclosed herein a person skilled in the art can readily understand the objects and advantages of the present invention.

The present invention makes clever use of Ceph to implement a storage virtualization manager. Its inventive contribution lies in using the files/objects of a Ceph cluster as a distributed solution for controlling/managing the individual devices, while Ceph's storage functions in the traditional sense can continue to be used at the same time. The storage virtualization manager of the present invention can implement storage virtualization and, combined with the ACRA architecture, further implement an autonomic storage management system, thereby greatly improving the reliability and availability of the storage system.
Ceph is a petabyte-scale distributed file system for Linux. Put simply, Ceph is a high-performance, highly reliable, and scalable cluster composed of multiple PCs. It can be roughly divided into four parts (see Figure 1):

1. Client: used by data users; each Client instance provides a host or process with a set of POSIX-like interfaces; here, POSIX stands for Portable Operating System Interface;

2. Metadata Cluster Server (MDS): used to cache and synchronize distributed metadata; it manages the namespace (file names and directory names) and coordinates security, consistency, and coherence;

3. An Object Storage Cluster, containing multiple Object Storage Devices (OSDs), which stores all data and metadata;

4. Cluster Monitors (MONs), which perform monitoring functions.
Ceph uses POSIX-like interfaces in order to guarantee the extensibility and consistency of the interface and to stay consistent with applications, which helps improve system performance. While achieving high performance, high reliability, and high availability, Ceph obtains system scalability through three basic design features: decoupled data and metadata, dynamic distributed metadata management, and autonomous distributed object management.

The client uses the metadata server (MDS) for metadata operations to determine where data is located. The MDS not only manages data locations but also decides where new data is stored. It is worth noting that the metadata itself is stored on the storage cluster (labeled "metadata I/O"), while the actual file I/O takes place between the client and the object storage (OSD) cluster. In this way, higher-level POSIX functions (for example open, close, and rename) are managed through the MDS, while low-level POSIX functions (for example read and write) are managed directly by the OSD cluster.
Ceph Client

The intelligent control of the Ceph file system is distributed across the nodes, which not only simplifies the client interface but also gives Ceph the ability to scale dynamically to a large size. Traditional storage often uses an allocation list, in which metadata maps the blocks on a disk to a specified file. In Ceph, a file is assigned an inode number (INO) by the metadata server, which serves as the file's unique identifier. The file is then cut into several objects (their number depends on the size of the file). Using the file's inode number (INO) and each object's object number (ONO), every object is assigned an object identifier (OID).

Using a simple hash based on the object identifier (OID), each object is assigned to a placement group. Here, a placement group (identified by a PGID) is a conceptual container for objects. Finally, a pseudo-random mapping algorithm called CRUSH (Controlled Replication Under Scalable Hashing) maps the placement group onto a set of object storage devices (OSDs). In this way, mapping placement groups (and their replicas) to storage devices does not depend on metadata but on a pseudo-random mapping function. This approach is very useful because it not only minimizes storage overhead but also simplifies the process of distributing and looking up data.
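Purely for illustration, the following sketch traces this mapping chain (INO and ONO to OID, OID to placement group, placement group to OSDs) with simplified stand-ins; the hash used here and the crush() placeholder do not reproduce Ceph's actual algorithms and serve only to make the chain of steps concrete.

```python
import hashlib

def object_id(ino, ono):
    # Name each object from the file's inode number and the object number;
    # the exact naming format is illustrative.
    return f"{ino:x}.{ono:08x}"

def placement_group(oid, pg_num):
    # A simple hash of the object identifier modulo the number of placement groups.
    # Ceph's real implementation uses its own hash and stable-mod logic.
    return int(hashlib.md5(oid.encode()).hexdigest(), 16) % pg_num

def crush(pgid, osds, replicas=3):
    # Placeholder for CRUSH: deterministically pick `replicas` distinct OSDs for a
    # placement group. The real CRUSH algorithm walks a weighted cluster map.
    return [osds[(pgid + i) % len(osds)] for i in range(min(replicas, len(osds)))]

# A file with inode number 0x1234 split into four objects, mapped onto eight OSDs.
osds = [f"osd.{n}" for n in range(8)]
for ono in range(4):
    oid = object_id(0x1234, ono)
    pgid = placement_group(oid, pg_num=128)
    print(oid, "-> pg", pgid, "->", crush(pgid, osds))
```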
Thus, when a user opens a file, the client sends a request to the MDS cluster. The MDS translates the file name into an inode through the file system hierarchy and obtains the inode number (INO), mode, file size, and other metadata.

If the file exists and the operation rights can be obtained, the MDS returns the inode number, file length, and other file information through the hierarchy. The MDS also grants the client capabilities. At present there are four capabilities, each represented by one bit: read, cache read, write, and buffer write. In the future, capabilities will also include security keys, used by the client to prove to the OSDs that it is allowed to read and write the data (the current policy is to allow all clients).
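As a small illustration of the capability grant just described, the following sketch encodes the four capabilities as one bit each; the concrete bit values are an assumption of this example and are not those used inside Ceph.

```python
from enum import IntFlag

class Capability(IntFlag):
    READ = 0x1          # read
    CACHE_READ = 0x2    # cache read
    WRITE = 0x4         # write
    BUFFER_WRITE = 0x8  # buffer write

granted = Capability.READ | Capability.CACHE_READ   # what the MDS returned on open
wanted = Capability.WRITE | Capability.BUFFER_WRITE

if not granted & wanted:
    print("the client must ask the MDS for write capabilities before buffering writes")
```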
Metadata Server (MDS)

The metadata server (MDS) manages the inode space and converts file names into metadata. That is, the MDS translates a file name into an inode, a file size, and the striping data that the Ceph client uses for file I/O.

In fact, the task of the MDS is to manage the namespace of the file system. Although both metadata and data are stored in the object storage cluster, the two are managed separately for reasons of scalability. The metadata is further split across a cluster of MDSs, and these MDSs can adaptively replicate and distribute the namespace in order to avoid hot spots. Each MDS manages segments of the namespace, and for redundancy and performance these segments may overlap. The mapping from MDSs to the namespace in Ceph is implemented by a method called dynamic subtree partitioning, which allows Ceph to adjust to changes in the workload (that is, namespaces migrate between MDSs).
Ceph Object Storage (OSD)

As a kind of object store, Ceph's storage nodes include not only storage functions but also intelligent control. Traditional drives can only respond to commands, whereas an object storage device (OSD) is an intelligent device that can both issue requests and respond, and can therefore communicate and cooperate with other OSDs.

From the storage point of view, the Ceph OSD implements the mapping from objects to blocks (a task traditionally done in the client's file system layer). This design lets the local entity decide how best to store an object; for example, a B-tree file system (BTRFS) can be used on the storage nodes.

Thus, when one or more clients open the same file for reading, an MDS grants them the capability to read and cache the file contents. Using the inode number, the striping layout, and the file size, the client can name and locate all of the objects containing the file's data and read them directly from the OSD cluster; any object or byte range that does not exist is defined as a hole and reads as zeros. Similarly, if a client opens a file for writing, it obtains the capability to use buffered writes, and the data is written to the appropriate objects on the appropriate OSDs. When the client closes the file, it automatically gives these capabilities up.

In this process, because the Ceph client uses the CRUSH algorithm and knows nothing about the block layout of files on the physical disks, the underlying OSDs can safely manage the object-to-block mapping themselves. This allows the storage nodes to replicate data (in particular when a device fails). Since failure detection and recovery are distributed, the Ceph storage system scales very well. Ceph calls this RADOS.
RADOS (Reliable Autonomic Distributed Object Store) is a reliable, autonomic, distributed object store. An object store follows a simple principle: as part of the object store, every server runs software that manages and exports the server's local disk space; all instances of this software cooperate with one another across the cluster, presenting what appears from the outside to be a single, large data store. For internal storage management, the object storage software does not keep the data in its original format but saves it on the individual storage nodes as binary objects. More importantly, the number of individual nodes making up a large object store can be arbitrary; users can even add storage nodes dynamically while the system is running.

RADOS implements the object storage functions described above. Its key technology comprises three layers, from bottom to top:

1. Object storage devices (OSDs). In RADOS, an OSD always appears as a folder of an existing file system. Inside an OSD folder there is no hierarchical nesting; it contains only files with UUID-format names and no subfolders. Together, the OSDs constitute the object store; the binary objects are converted by RADOS from the files being stored and are then stored.

2. Monitoring servers (MONs): the monitoring servers form the interface to the RADOS store and support access to the individual objects within it. They work in a decentralized fashion and handle all communication with external applications; that is, there is no limit on the number of monitoring servers, and any client can contact any monitoring server. The monitoring servers manage the MONmap (the list of all monitoring servers) and the OSDmap (the list of all OSDs). The information provided by these two maps lets clients compute which OSD they need to contact when accessing a particular file.

3. Metadata servers (MDSs): the MDSs provide Ceph clients with the POSIX metadata of the individual objects in the RADOS object store.
Another concept involved in the present invention is the Autonomic Computing Reference Architecture (ACRA).

As shown in Figure 4, ACRA divides an autonomic computing system into three layers. At the bottom are the system components, or managed resources 4300. These managed resources 4300 can be resources of any type, hardware or software. A managed element can be any kind of internal resource, including databases, servers, routers, application modules, Web services, or virtual machines, or it can be another autonomic element; these resources may have some embedded, self-managing attributes. Each managed resource 4300 generally provides a standardized interface (touchpoint), and each touchpoint corresponds to a sensor/effector pair. A single autonomic element manages its internal resources through an autonomic manager on the one hand and, on the other hand, exposes a standard interface (sensor/effector pair) through which it accepts management, including policies specified by IT administrators and coordination information from other autonomic elements. For example, a parent autonomic element responsible for global orchestration can manage multiple subordinate child autonomic elements.

The middle layer contains the resource managers 4200, which generally fall into four categories: self-configuring, self-healing, self-optimizing, and self-protecting. Each resource may have one or more resource managers 4200, and each manager implements its own control loop.

The top layer contains the global autonomic managers 4100 that coordinate the various resource managers. Through large, system-wide control loops, these global autonomic managers 4100 achieve certain system-level management goals and realize system-wide autonomic management. Referring to Figure 4, on the left is the manual manager 4400, which provides IT professionals with a common system management interface through an integrated console. On the right is a knowledge base 4500, from which the manual manager 4400 and the autonomic managers 4100, 4200, 4300 can obtain and share all knowledge about the system.
Figure 3 shows an implementation environment according to one embodiment of the present invention: the Live Resource service delivery platform (LR for short). The Live Resource service delivery platform is an automated system that supports the reserved delivery of logical resources. The platform has four different kinds of users: the Project Developer, the Project Operator, the Application User, and the System Operator. The Project Developer designs the project development and test environments required by users. Through LR, he or she creates, saves, publishes, edits, previews, and deletes environment designs and views the resources belonging to the user.

The Project Operator deploys and removes a project environment. Through LR, he or she performs environment deployment, removal, backup management, repeated deployment, running-state management, project resource scheduling management, environment topology management, and resource consumption statistics for the environment.

The Application User deploys and accesses a project environment. Through LR, he or she rapidly deploys the business environment, quickly accesses the environment over SSH, quickly accesses the deployed services, and uses the self-service security management services.

The System Operator performs asset management and monitors the running state of the entire environment. Through the network management system 302 inside LR, he or she carries out resource discovery; the management of physical server resources, physical network resources, and physical storage resources; the management of virtual server resources, virtual network resources, and virtual storage resources; and resource alarm management.
As shown in Figure 3, the LR service delivery platform includes three levels of scheduling:

(1) Project delivery scheduling. This includes requirement design services for computing, storage, and network resources, system resource analysis services, and virtual resource reservation and deployment services, and is supported by the project delivery service network 300. Closely related to the present invention are system resource analysis and virtual resource reservation 301. The deployment process is the process of binding the logical resources of the logical delivery environment to virtual resources. Logical resources are bound to virtual resources in a one-to-one manner; this is the first binding in the automated reservation and delivery process for the entire logical delivery environment.

(2) Virtual resource scheduling. This includes the allocation, configuration, and provisioning of virtual resources and is supported by the resource engine component 304. The binding of virtual resources to physical resources must go through the resource engine 304; this is the second binding in the automated reservation and delivery process for the entire logical delivery environment. The resource engine 304 provides the "capability" of the various virtual resources by aggregating them. In addition, the resource engine 304 maintains a state model for each virtual resource and thereby completes the binding from virtual resources to physical resources.

(3) Physical resource scheduling. Agents 306, 307, and 308 on the physical resources accept resource-setting instructions from the resource engine 304 and implement resource multiplexing and resource space sharing; resource state information is passed back to the resource engine 304 through the agents 306, 307, and 308.

The functions above are carried by the data center physical resource service network 309, which is divided up per project and contains multiple physical delivery environments. The physical resource service network 309 supports the reserved delivery of delivery environments and supports sharing physical resources both by space and by time, covering many unallocated and allocated physical resources such as network, storage, and computing resources. In addition to managing the various physical resources of the physical data center, the System Operator is also responsible for dividing up the physical delivery environments.
Referring to Figure 3, the resource engine 304 uses the physical resource information provided by the network management system (NMS) 302 to track physical resources and obtain the latest resource states, and maps physical resources to virtual resources. Commercial network management systems used to manage physical resources generally provide information about state and performance and have the ability to discover physical resources, so they are not described further here.

The various kinds of storage resources include storage area networks, network attached storage, distributed file systems, Ceph, images, and so on. The information on the physical resources described above is stored in the reference model 303, as shown in Figure 3. In addition, either "push" or "pull" scheduling can be chosen between the project delivery service network 300 and the resource engine 304. With "push" scheduling, resource change requests are committed regardless of the capacity of the physical delivery environment, and parallel resource provisioning is supported. With "pull" scheduling, resource change requests are committed only when the capacity of the physical delivery environment is ready, and parallel resource provisioning is likewise supported.

Referring to Figure 3, the resource engine 304 in LR performs the binding of virtual resources to physical resources. Its main task is to virtualize the physical resources, including all kinds of storage resources, and the "storage virtualization manager" 305 is the key component for implementing storage virtualization. It is the focus of the present invention.
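As a non-limiting sketch of the two bindings described above, the following hypothetical data model shows a first, one-to-one binding from logical to virtual resources and a second binding performed by a much simplified resource engine that also keeps a per-virtual-resource state model; all class names, fields, and identifiers are invented for illustration only.

```python
# Hypothetical sketch: first binding (logical -> virtual, 1:1) and second binding
# (virtual -> physical) carried out by a simplified resource engine.
class ResourceEngine:
    def __init__(self, physical_pool):
        self.physical_pool = list(physical_pool)   # aggregated physical capabilities
        self.virtual_state = {}                    # per-virtual-resource state model

    def bind_virtual(self, virtual_id, requirement):
        # Second binding: pick any physical resource whose remaining capacity
        # covers the requirement, then record the binding in the state model.
        for phys in self.physical_pool:
            if phys["free_gb"] >= requirement["size_gb"]:
                phys["free_gb"] -= requirement["size_gb"]
                self.virtual_state[virtual_id] = {"bound_to": phys["name"], "state": "bound"}
                return phys["name"]
        raise RuntimeError("no physical resource can satisfy the requirement")

# First binding: a logical volume of the delivery environment maps 1:1 to a virtual volume.
logical_to_virtual = {"logical-vol-app01": "virtual-vol-0001"}

engine = ResourceEngine([{"name": "san-array-a", "free_gb": 2048}])
print(engine.bind_virtual(logical_to_virtual["logical-vol-app01"], {"size_gb": 500}))
print(engine.virtual_state)
```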
Referring back to Figure 1, Figure 1 is an architecture diagram of an autonomic storage management system according to one embodiment of the present invention, showing the storage virtualization manager and its working environment.

As shown in Figure 1, the frame structure of the "storage virtualization manager" 1000 is shown inside the rounded rectangle. In one embodiment, the storage virtualization manager 1000 can be divided into two logical parts: the first part (upper half) 1001 of the storage virtualization manager and the second part (lower half) 1002 of the storage virtualization manager. The upper half is independent of specific devices and exists as an abstraction layer over the underlying devices; it is part of the Live Resource domain 1010 of the IaaS platform tool. The lower half, by contrast, includes the various specific devices and is implemented using the inherent characteristics of the Ceph cluster 1020.

The main components of the storage virtualization manager 1000 and the interactions between them are briefly described below; please refer to Figure 1:
1) vFSM 1011

vFSM 1011 is the (local) Virtual Finite State Machine. It has two important functions. One is, as an abstraction of the underlying storage devices, to provide their various "capabilities" to the user of the storage virtualization manager 1000, that is, the Resource Decision Maker 1100 (resource decision module), as the basis for the decisions the Resource Decision Maker 1100 makes. The other is to act as the decision-making part of the Smart Storage Device, working together with the system-wide overall decision module coreVFSM 1200 to guarantee the reliability of the whole system. The smart storage device here means the storage virtualization manager 1000 and all of the SAN storage it controls, including the LR attached storage infrastructure domain 1300 and the customer storage infrastructure domain 1400, viewed as a whole and providing intelligent virtualized storage functions to upper-layer users.

2) Manager 1012

An important precondition for the Manager 1012 to be able to schedule the storage resources it manages in a unified way is that the Unified Adapter 1014 knows almost everything about all of the underlying storage devices, such as their topology, functions, and performance. During the unified construction of heterogeneous devices, the Manager 1012 collects and provides to the Resource Decision Maker 1100 information on device type, capability, QoS, and other attributes, together with the current status.

For example, when the Resource Decision Maker 1100 sends a batch of resource reservation requests, the Manager 1012 needs to analyze the information it holds and decide whether the resource reservation can be accepted. If, from some moment on, the function and performance of the storage resources satisfy the user's requirements, and the topology shows that the storage data links can be established, the Resource Decision Maker 1100 replies to the user with a message accepting the resource reservation. Note that at this point the data links for accessing the storage have not yet been established; the real data links cannot be built until at least one compute node has been started.
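The decision just described can be sketched, purely for illustration, as a check that every request in a batch can be matched to a device whose capability and QoS satisfy it and to which a data link could later be established; the data structures below are assumptions of this example, not of the invention.

```python
def can_reserve(requests, devices, topology):
    # Accept the batch only if every request has at least one device whose free
    # capacity and QoS level satisfy it and for which the topology shows that a
    # data link from the requesting host could later be built.
    for req in requests:
        satisfied = any(
            dev["free_gb"] >= req["size_gb"]
            and dev["qos"] >= req["qos"]
            and topology.get((req["host"], dev["name"]), False)
            for dev in devices
        )
        if not satisfied:
            return False      # reject the whole batch of reservation requests
    return True               # reservation accepted; data links are built later

devices = [{"name": "san-a", "free_gb": 1000, "qos": 3}]
topology = {("host-01", "san-a"): True}
print(can_reserve([{"host": "host-01", "size_gb": 200, "qos": 2}], devices, topology))
```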
3) Data Facility 1013

The Data Facility 1013 is an optional module. It is used for operations on application (App) data, that is, user data, for example disaster recovery operations such as backup and restore, or optimization operations such as compression and decompression and deduplication.

4) Unified Adapter 1014

Based on the device adapter layer below it, which relies on the storage devices in the Ceph cluster 1020, the Unified Adapter 1014 provides a unified device operation/control interface and performs monitoring and control tasks, for example monitoring or controlling Fibre Channel switch (FC SW) ports or allocating a volume (LUN) from a SAN. The Unified Adapter 1014 itself is implemented using Ceph's client mechanism.
The storage virtualization manager 1000 of the present invention is implemented using Ceph's distributed mechanism. A storage device object is abstracted as a Ceph file. Because this file is similar to a device file in Linux, we call it a device file 1030. In general, a specific storage device corresponds to one storage device object, that is, to one specific device file. The device file 1030 is the bridge between the first part 1001 (upper half) and the second part 1002 (lower half) of the storage virtualization manager. More specifically, a device control request from the upper half 1001 of the storage virtualization manager can be regarded as one file write operation on the device file 1030, and the response can be regarded as one file read operation on the device file 1030 through the Ceph interface (i.e., the unified adapter 1014).
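To complement the upper-half sketch given earlier, the following non-limiting example sketches the lower half's side of the same exchange: an agent co-located with the Ceph OSDs polls the device file, dispatches the request to the matching back-end module (SMI-S for SAN arrays, SSH for switches), and writes the response back to the device file. The file path, message format, and dispatch table are assumptions made for illustration only.

```python
import json
import time

DEVICE_FILE = "/mnt/cephfs/devices/fc-switch-01"   # assumed path, as in the earlier sketch

def handle_via_ssh(request):       # placeholder for the SSH module 1022
    return {"status": "ok", "detail": f"executed {request['op']} over SSH"}

def handle_via_smis(request):      # placeholder for the SMI-S module 1021
    return {"status": "ok", "detail": f"executed {request['op']} over SMI-S"}

DISPATCH = {"fc-switch": handle_via_ssh, "san-array": handle_via_smis}

def serve(device_type, poll_interval=0.5):
    last_id = None
    while True:
        with open(DEVICE_FILE, "r") as f:
            text = f.read().strip()
        if text:
            message = json.loads(text)
            # Only unanswered requests are handled; responses carry "in_reply_to".
            if "in_reply_to" not in message and message.get("id") != last_id:
                reply = DISPATCH[device_type](message)
                reply["in_reply_to"] = message["id"]
                with open(DEVICE_FILE, "w") as f:
                    f.write(json.dumps(reply))
                last_id = message["id"]
        time.sleep(poll_interval)

# serve("fc-switch")   # would run the agent loop for a switch device file
```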
5) SMI-S模块 1021和 SSH模块 1022  5) SMI-S module 1021 and SSH module 1022
网络管理人员管理来自不同厂商的 SAN时需要多种独立的管理应用。 这些管 理应用由不同厂商开发,连接多种硬件管理 API。因此美国存储网络工业协会 (SNIA) 提出了存储管理建议规范 (SMI-S, Storage Management Initiative Specification) 。 SMI-S的主旨就是把存储网络的管理对象, 以及用来管理对象的工具统一起来, 最终让所有的存储网络部件都可以利用本地的 SMI-S接口加以部署。 这样可以使 所有的部件都采用一种通用的接口, 管理功能的实现就更方便。 SMI-S建立在基于 Web的企业管理 (WBEM)技术和公共信息模型 (CIM)的基础上, 实际上就是安装在 所管理的对象与所管理的应用之间的中间件。 例如 SMI-S代理可以询问一台设备, 如交换机、 主机或存储阵列, 从具有 CIM 功能的设备中提取相关管理数据, 并将 数据提供给请求者。  Network managers need multiple independent management applications to manage SANs from different vendors. These management applications are developed by different vendors and connect to multiple hardware management APIs. Therefore, the Storage Networking Industry Association (SNIA) has proposed the Storage Management Initiative Specification (SMI-S). The main purpose of SMI-S is to unify the management objects of the storage network and the tools used to manage the objects, and finally allow all storage network components to be deployed using the local SMI-S interface. This allows all components to adopt a common interface, making management functions more convenient. Based on Web-based Enterprise Management (WBEM) technology and Common Information Model (CIM), SMI-S is actually a middleware installed between managed objects and managed applications. For example, an SMI-S agent can query a device, such as a switch, host, or storage array, to extract relevant management data from a CIM-enabled device and provide the data to the requester.
SSH is the Secure Shell protocol, well known to those skilled in the art, and is not described further here. In one embodiment of the present invention, SMI-S is used to manage SAN (storage area network) storage, while SSH is used to manage switches. If a storage user's management application controls or monitors a device through SMI-S or SSH, then in some situations after virtualization the storage virtualization manager 1000 needs to provide a corresponding SMI-S module or SSH module to match the real underlying device. In one embodiment, so that the unified adapter 1014 is not affected, multiple instances of these modules can be distributed across Ceph's OSDs. In this way the SMI-S module 1021 and the SSH module 1022 can be shared by different storage device objects, and they are also fault-tolerant because multiple instances of them exist in Ceph's OSDs. In addition, since Ceph's MDS is of no concern to the storage virtualization manager, in one embodiment the MDS can be omitted from Figure 1.
The back-end storage devices include all SAN (storage area network) devices working for the LR within the LR-attached storage infrastructure domain 1300, as well as all SAN devices working for customers within the customer storage infrastructure domain 1400. These back-end devices are operated by the SMI-S module 1021 and the SSH module 1022 through Control paths 1604 and 1605, for example to carve out a volume (LUN) or to delete a volume (LUN). We call this Fabric Management.
6) Data Plane 1601
Operations on the data plane 1601 do not occur during the resource reservation phase. When a computing service starts, for example when virtual machine VM 1511 is launched, the storage virtualization manager 1000 needs to establish a data link from the source HBA 1514 to a Fibre Channel switch port (omitted in Figure 1), then to the destination HBA 1410, and finally to the SAN 1401. Virtual machine VM 1511 accesses the customer APP data stored on the SAN through this link.
7) Data Stream 1602
The data stream 1602 (Streaming) participates in image library management and image transfer. The images here belong to APP Content and contain the customer's application system together with the operating system it depends on. For example, when virtual machine VM 1511 on host 1510 starts, the template image it requires is copied from the SAN of the LR-attached storage infrastructure domain 1300 to the host. That copy is assisted by the Ceph client 1513 on host 1510.
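A host-side copy of a template image out of the cluster might look like the sketch below, which streams the object in fixed-size chunks with the librados Python bindings. The pool name, object name, local path, and chunk size are illustrative assumptions; the patent does not prescribe how template images are laid out in the cluster.

```python
# Minimal sketch: the Ceph client on the host copies a template image object
# to a local file in 4 MiB chunks. All names and paths are assumptions.
import rados

CHUNK = 4 * 1024 * 1024

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('template_images')
    size, _mtime = ioctx.stat('centos7-webapp.img')        # hypothetical image
    with open('/var/lib/libvirt/images/vm1511.img', 'wb') as out:
        offset = 0
        while offset < size:
            data = ioctx.read('centos7-webapp.img', CHUNK, offset)
            out.write(data)
            offset += len(data)
    ioctx.close()
finally:
    cluster.shutdown()
```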
Here, using a Ceph cluster within the LR-attached storage infrastructure domain 1300 can relieve the communication congestion caused by concurrent access. For example, congestion frequently occurs when a large number of virtual machines start at the same time.
In the storage management system shown in Figure 1, in addition to the storage virtualization manager 1000 the system also includes, but is not limited to, the following modules:
1) Resource Decision Module 1100
The Resource Decision Maker 1100 is part of the LR and works on the control plane. For example, when a bank wants to run a new online banking system it needs a certain amount of computing power, memory, and storage, and all of these resources must be reserved. Based on the actual conditions provided by the storage virtualization manager, the resource decision module 1100 decides whether the storage resources can satisfy the requirements of running the new online banking system; for example, if the SAN is only 1% free, the storage resource reservation cannot succeed.
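A deliberately simplified sketch of that admission decision is shown below. It checks only free capacity, whereas the module described here also weighs QoS attributes and whether a data link can be established; the function, its parameters, and the 5% headroom threshold are assumptions for illustration.

```python
# Simplified reservation check: accept only if enough free capacity remains.
def can_reserve(requested_gb: int, san_capacity_gb: int, san_used_gb: int,
                headroom: float = 0.05) -> bool:
    """Keep a safety margin of `headroom` of total capacity after reserving."""
    free_after = san_capacity_gb - san_used_gb - requested_gb
    return free_after >= headroom * san_capacity_gb

# Example: a SAN that is 99% full cannot accept a 500 GB reservation.
print(can_reserve(500, san_capacity_gb=100_000, san_used_gb=99_000))   # False
```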
2) coreVFSM 1200
The coordinating decision module 1200 (coreVFSM) is part of the LR and works on the control plane.
3) LR-Attached Storage Infrastructure Domain 1300
The LR-Attached Storage Infrastructure Domain 1300 includes all SAN devices working for the LR.
4) Customer Storage Infrastructure Domain 1400
The Customer Storage Infrastructure Domain 1400 includes all SAN devices working for customers, for example the SAN devices that a bank's application systems use to store their data.
5) Customer Computing Infrastructure Domain 1500
The Customer Computing Infrastructure Domain 1500 includes all computing resources working for customers, for example the virtual machine clusters that support the bank's various application systems. These virtual machines (for example VM 1511) access their SAN-stored data within the customer storage infrastructure domain 1400 through the data plane 1601.
A specific embodiment of the autonomous storage management system implemented with Ceph, as provided by the present invention, is described in further detail below.
As shown in Figure 1, viewed downward from the upper half 1001 of the storage virtualization manager, each storage device corresponds to one device file 1030, much like the ioctl mechanism in the Linux device driver model. The device files are now distributed onto the OSDs of the Ceph cluster, so under Ceph's built-in mechanisms the storage virtualization manager possesses a distributed device file system.
ioctl is short for I/O control. Simply put, when writing a Linux device driver one encounters I/O operations that logically belong neither to read nor to write; those operations can be regarded as the ioctl part. read and write are meant for writing and reading data and should be treated purely as data exchange, whereas ioctl is used to control certain options of read and write. For example, suppose a user designs a generic driver module for reading and writing I/O ports: read and write move data through the port, but how should switching the port being read or written be handled? Clearly it is more reasonable to implement that with ioctl. Likewise, whether reads and writes on the port may block, or whether reads and writes on the device file may be concurrent, can all be designed as options configured through ioctl. As for parameters, the general ioctl parameter format is a command word (a constant) plus command arguments, whereas the parameter format of read and write is a data buffer, a destination pointer, and a length.
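The "command word plus argument" pattern can be seen even from user space. The sketch below uses Python's standard fcntl module to issue the Linux TIOCGWINSZ ioctl, which reads the terminal window size; it is only an analogy for the out-of-band control described above and assumes it is run on a terminal.

```python
# Analogy only: an ioctl is a command word (TIOCGWINSZ) plus an argument buffer,
# unlike read/write, which move a plain data buffer of a given length.
import array
import fcntl
import termios

with open('/dev/tty') as tty:
    winsize = array.array('H', [0, 0, 0, 0])   # rows, cols, xpixel, ypixel
    fcntl.ioctl(tty.fileno(), termios.TIOCGWINSZ, winsize, True)
    print('terminal is', winsize[0], 'rows x', winsize[1], 'cols')
```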
Figure 2 is a diagram of the three layers involved in the workflow of the storage virtualization manager according to one embodiment of the present invention. As shown in Figure 2, the present invention creates a Ceph object as an abstraction of a Fibre Channel switch (FC SW). In one embodiment, the Fibre Channel switch may be a Brocade switch. From the unified adapter layer 2100 (Unified Adapter Layer; see the unified adapter 1014 in Figure 1, with the unified adapter layer 2100 bound to a Ceph client) we write a control message to the device file (see device file 1030 in Figure 1); the device file referred to here is, more specifically, some object in the Ceph cluster 2210, for example object 2211. This triggers an SSH process running on some OSD (see SSH module 1022 in Figure 1), which, through control path 1604 or 1605, can successfully disable a port of the Fibre Channel switch (i.e., one of the devices in storage 2310). As expected, SSH then writes the result status of this disable operation back to the device file (for example, object 2211). That write-back triggers the unified adapter layer 2100 (see unified adapter 1014 in Figure 1), which can obtain the result status of the operation from the device file (for example, object 2211). We can verify through other means that the specified port has indeed been disabled, for example by manually logging in to the Fibre Channel switch's management terminal and querying it.
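The OSD-side half of that exchange might look like the sketch below: a process co-located with the object polls the device file for control messages, applies the command to the switch over SSH, and writes the result status back for the unified adapter to read. paramiko, the Brocade-style `portdisable` command, the credentials, and the request/response object convention are all assumptions for illustration.

```python
# Hypothetical OSD-side SSH agent: read a control message from the device-file
# object, apply it to the switch over SSH, and write the result status back.
import json
import time
import paramiko
import rados

def handle_once(ioctx):
    try:
        req = json.loads(ioctx.read('fcsw-001.request').decode())
    except rados.ObjectNotFound:
        return                                   # no pending control message
    if req.get('op') != 'disable_port':
        return
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('fcsw-001.example', username='admin', password='secret')
    _stdin, stdout, stderr = ssh.exec_command(f"portdisable {req['port']}")
    status = {'op': req['op'], 'port': req['port'],
              'ok': stderr.read() == b'', 'output': stdout.read().decode()}
    ssh.close()
    ioctx.write_full('fcsw-001.response', json.dumps(status).encode())
    ioctx.remove_object('fcsw-001.request')      # consume the request

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('device_files')
while True:                                      # simple polling loop
    handle_once(ioctx)
    time.sleep(1)
```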
An object in Figure 2 (for example object 2211) is an instance of a file among the device files 1030 of Figure 1, but operating on an object differs from operating on a file: the former is a single whole in Ceph and easy for the upper half 1001 of the storage virtualization manager to operate on, whereas the latter is stored in Ceph as a series of fragments, which makes it awkward to operate on.
Referring to Figure 2, viewed vertically, the workflow of the storage virtualization manager involves three layers: the unified adapter layer 2100, the Ceph transport layer 2200, and the lower-layer physical devices 2300. Viewed horizontally, these three layers apply not only to storage virtualization but also to network virtualization and compute virtualization.
Clearly, establishing a device-private protocol between the unified adapter layer 2100 and the Ceph transport layer 2200 and communicating with its device is the responsibility of the unified adapter 2110, and the unified adapter 2110 does not need to know where the device is. The Ceph cluster in the Ceph transport layer 2200 mainly serves as the transport link; it does not care about the data carried over it. Finally, each storage device among the lower-layer physical devices 2300 is controlled by a corresponding object within the Ceph cluster of the Ceph transport layer 2200, and the mapping between storage devices and objects is dynamic.
Referring to Figure 1, the workflow for requesting a volume (LUN) on the SAN through the storage virtualization manager is as follows (a code sketch follows the steps):
Step 1: "alloc_lun" is written, as a message, into the device file;
Step 2: The change to the device file triggers the SSH process on the OSD; the SSH process continuously monitors the device file and now reads out the "alloc_lun" message;
Step 3: "alloc_lun" is carried out through Ceph, and the SAN allocates the requested volume (LUN);
Step 4: A message reporting that the LUN was allocated successfully is written back into the device file;
Step 5: After querying the device file, the unified adapter learns that the operation has completed and then notifies the user.
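On the adapter side, the five steps can be packaged into one helper that reuses the write-then-poll pattern from the earlier sketch. Object names, the message format, and the timeout are assumptions; step 3, the actual carving of the LUN out of the SAN, runs on the OSD side and is not shown.

```python
# Adapter-side sketch of the five-step LUN allocation workflow (assumed names).
import json
import time
import rados

def alloc_lun(ioctx, device: str, size_gb: int, timeout_s: int = 60) -> dict:
    # Step 1: write the "alloc_lun" request into the device file.
    msg = {'op': 'alloc_lun', 'size_gb': size_gb}
    ioctx.write_full(f'{device}.request', json.dumps(msg).encode())
    # Steps 2-4 happen on the OSD hosting the object: the monitoring process
    # reads the message, has the SAN allocate the LUN, and writes the result back.
    # Step 5: poll the device file for the completion message.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            return json.loads(ioctx.read(f'{device}.response').decode())
        except rados.ObjectNotFound:
            time.sleep(1)
    raise TimeoutError(f'no response from {device}')

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('device_files')
print(alloc_lun(ioctx, 'san-1401', size_gb=100))
```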
Referring to Figure 2, the method above embodies the most fundamental idea of the storage virtualization manager implemented by the present invention, namely: when the devices in storage 2310 are dynamic and heterogeneous, instead of the original user 2000 (or, for example, the resource decision maker 1100) directly accessing a specific device in storage 2310, access is redirected to an object in the Ceph cluster 2210, that is, a storage device object. Then, when a device in storage 2310 changes, none of the user 2000 programs that access the storage device object are affected; with the original direct-access approach, the user 2000 would have to make corresponding adjustments.
Figure 4 shows the ACRA architecture referenced by the LR service delivery platform, the implementation environment of one embodiment of the present invention. The LR service delivery platform shown in Figure 4 is a resource management system with autonomic computing characteristics; in one embodiment its implementation references IBM's ACRA architecture and, of course, also includes its storage resource management part.
Referring to Figures 3 and 4, the lower half of Ceph storage is RADOS 4300, the Reliable Autonomic Distributed Object Store. As its name implies, it has the property of self-management; it is one kind of managed resource 4300. The resource engine 304 includes the resource manager 4200 and the global autonomic manager 4100. The storage virtualization manager 305 is one kind of resource manager 4200; the knowledge inside the resource manager 4200 (an autonomic element) includes the vFSM, and the knowledge inside the global autonomic manager 4100 (an autonomic element) includes the coreVFSM. The vFSM is the virtual finite state machine local to the storage virtualization manager. It cooperates with the virtual finite state machine of the whole system, coreVFSM 1200, to improve the availability and reliability of the entire storage system.
The storage virtualization manager of the present invention needs to support the autonomic computing functions of the resource manager 4200. Referring to Figure 1, as described above, the Ceph/RADOS mechanism has self-management properties; with its support the SAN can raise alarms automatically. For example, if a port of the Fibre Channel switch (i.e., one of the devices in storage 2310) fails, the alarm is reported through SSH 1022 and device file 1030 up to the unified adapter 1014. The vFSM 1011 is the virtual finite state machine of the storage virtualization manager 1000. It has two important functions: one is to serve as the abstraction of the underlying storage devices; the other is to act as the decision-making part of the Smart Storage Device. Since a damaged Fibre Channel switch port is within its control, it can send down a message asking the switch to use a different port. As another example, if a SAN device raises a large-scale damage alarm due to external causes, such a problem exceeds what the storage virtualization manager 1000 can resolve; in that case the state of the vFSM 1011 needs to be reported up to the coreVFSM 1200 and resolved in coordination with the decision-making part of the higher-level LR service delivery platform, possibly even with manual intervention.
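A hedged sketch of that local-fix-or-escalate decision is shown below. The alarm fields, state names, and helper functions are assumptions for illustration; the text only states that port-level faults are handled locally while large-scale SAN damage is escalated to the coreVFSM.

```python
# Sketch of the vFSM's alarm handling: fix locally when possible, else escalate.
from enum import Enum, auto

class DeviceState(Enum):
    NORMAL = auto()
    DEGRADED = auto()      # handled locally, e.g. move traffic to another port
    FAILED = auto()        # beyond local control, escalate to the coreVFSM

def issue_control(device, action):
    print(f'[vFSM] local action on {device}: {action}')

def report_to_corevfsm(alarm):
    print(f'[vFSM] escalating to coreVFSM: {alarm}')

def handle_alarm(alarm: dict) -> DeviceState:
    if alarm['kind'] == 'port_down':
        issue_control(alarm['device'], f"switch traffic off port {alarm['port']}")
        return DeviceState.DEGRADED
    if alarm['kind'] == 'san_wide_failure':
        report_to_corevfsm(alarm)            # higher-level platform decides
        return DeviceState.FAILED
    return DeviceState.NORMAL

handle_alarm({'kind': 'port_down', 'device': 'fcsw-001', 'port': 4})
```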
Accordingly, the autonomous storage management system provided by the present invention has the following characteristics:
(i) Self-configuration: it can adapt to changes in the storage system. Such changes may include the deployment of new storage devices or the removal of existing storage devices; dynamic adaptation helps ensure continuous operation of the storage devices and software.
(ii) Self-optimization: it can automatically monitor and coordinate storage resources to meet end-user or enterprise requirements, thereby providing high-performance storage services.
(iii) Self-healing: it can detect storage system failures and initiate predetermined recovery actions without interrupting the rest of the storage system, giving the storage system higher reliability and availability.
All of the autonomic characteristics above can be accomplished jointly by the Live Resource service delivery platform, the implementation environment of the present invention, with the participation of the storage virtualization manager described here. The terms and expressions used here are for description only, and the present invention should not be limited to them. The use of these terms and expressions is not meant to exclude any equivalent features of what is shown and described (or parts thereof), and it should be recognized that various possible modifications are also intended to fall within the scope of the claims. Other modifications, variations, and substitutions may also exist; accordingly, the claims should be regarded as covering all such equivalents.
Likewise, it should be noted that although the present invention has been described with reference to the present specific embodiments, those of ordinary skill in the art should recognize that the above embodiments are merely intended to illustrate the present invention, and that various equivalent changes or substitutions may be made without departing from the spirit of the present invention; therefore, changes and variations of the above embodiments that remain within the essential spirit of the present invention shall fall within the scope of the claims of this application.

Claims

1. A storage virtualization manager based on a Ceph distributed mechanism, wherein the storage virtualization manager comprises at least:
a first part of the storage virtualization manager; and
a second part of the storage virtualization manager;
wherein the first part of the storage virtualization manager is configured to be independent of specific storage devices and to exist as an abstraction of specific storage devices; and the second part of the storage virtualization manager is configured to be implemented using the inherent features of a Ceph cluster, the second part of the storage virtualization manager comprising various specific storage devices;
wherein a specific storage device corresponds to a device file, a device control request issued by the first part of the storage virtualization manager is regarded as a file write operation on the device file, and a response to the device control request is regarded as a file read operation on the device file through a Ceph client.
2. The storage virtualization manager according to claim 1, wherein the first part of the storage virtualization manager comprises:
a virtual finite state machine, which, as an abstraction of the underlying storage devices, provides various functions to a resource decision module as the basis for the decisions of the resource decision module, the virtual finite state machine also cooperating with a coordinating decision module to guarantee the reliability of the entire storage virtualization manager system;
a manager, which, during the unified construction of heterogeneous devices, collects and provides to the resource decision module the device type, capability, QoS attributes, and current-status information, wherein, when the resource decision module sends batches of resource reservation requests, the manager decides whether the resource reservation requests can be accepted; if, from some moment on, the storage resources satisfy the user's requirements in function and performance, and the topology shows that the storage data link can be established, the resource decision module replies to the user with a message accepting the resource reservation; and
a unified adapter, implemented using Ceph's client mechanism, the unified adapter being configured to obtain information on all underlying storage devices and to provide the information to the manager, the information including the topology, functions, and performance of the underlying storage devices, the unified adapter providing a unified device operation/control interface to carry out monitoring and control tasks.
3. The storage virtualization manager according to claim 2, wherein the device file is a Ceph file.
4. The storage virtualization manager according to claim 2, wherein the unified adapter is the Ceph client.
5. The storage virtualization manager according to claim 2, wherein the monitoring and control tasks include monitoring or controlling Fibre Channel switch ports and allocating a volume from a storage area network (SAN).
6. The storage virtualization manager according to claim 2, wherein the first part of the storage virtualization manager further comprises:
a data facility configured to provide operations on user data, the operations including disaster recovery, compression and decompression, and removal of redundant data.
7. The storage virtualization manager according to claim 1, wherein the second part of the storage virtualization manager comprises:
a Storage Management Initiative Specification (SMI-S) module configured to manage storage of a storage area network (SAN); and
a Secure Shell (SSH) protocol module configured to manage switches;
wherein the SMI-S module and the SSH module can be shared by different storage device objects, and instances of the SMI-S module and the SSH module are distributed among the object storage devices of a Ceph cluster so as to be fault-tolerant.
8. A storage virtualization manager system based on a Ceph distributed mechanism, the system having a control plane, a data plane, and a data stream, wherein the storage virtualization manager system comprises:
the storage virtualization manager according to any one of claims 1 to 7;
a Live Resource (LR) attached storage infrastructure domain, comprising storage area network (SAN) devices working for the Live Resource;
a customer storage infrastructure domain, comprising storage area network (SAN) devices working for customers; and
a customer computing infrastructure domain, comprising hosts carrying virtual machine clusters working for customers, the virtual machine clusters accessing, through the data plane, the data stored on their SAN devices within the customer storage infrastructure domain, the hosts carrying the Ceph client.
9. The storage virtualization manager system according to claim 8, further comprising: a resource decision module, which is part of the Live Resource domain and works on the control plane, the resource decision module deciding, according to the actual conditions provided by the storage virtualization manager, whether a storage resource reservation can succeed.
10. The storage virtualization manager system according to claim 8, further comprising: a coordinating decision module, which is part of the Live Resource domain and works on the control plane.
11. The storage virtualization manager system according to claim 8, wherein operations on the data plane do not occur during the resource reservation phase; when the virtual machine starts, the storage virtualization manager needs to establish a data link on the data plane, and the virtual machine accesses the data of the customer application stored on the SAN through the data link.
12. The storage virtualization manager system according to claim 8, wherein the data stream participates in image library management and image transfer, the image containing the customer's application system and the operating system on which it depends; when a virtual machine on the host starts, the template image required by the virtual machine is copied from the SAN devices of the Live Resource attached storage infrastructure domain to the host, the copying being assisted by the Ceph client on the host.
13. The storage virtualization manager system according to claim 8, wherein the storage virtualization manager system has three layers: a unified adapter layer, a Ceph transport layer, and lower-layer physical devices; wherein the unified adapter layer comprises the unified adapter, a device-private protocol is established between the unified adapter layer and the Ceph transport layer, and the unified adapter is responsible for communicating with the storage devices of the lower-layer physical devices without needing to know where the storage devices are;
the Ceph transport layer comprises a Ceph cluster and is responsible for the transport link, without needing to care about the data transmitted over it; and
the lower-layer physical devices comprise the specific storage devices, which are controlled by the objects within the Ceph cluster of the Ceph transport layer, the matching relationship between the storage devices and the objects being dynamic.
14. The storage virtualization manager system according to claim 8, wherein the storage virtualization manager system runs on a Live Resource (LR) service delivery platform, the Live Resource (LR) service delivery platform being a resource management system with autonomic computing characteristics based on an ACRA architecture.
PCT/CN2014/072707 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism WO2015127647A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/072707 WO2015127647A1 (en) 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/072707 WO2015127647A1 (en) 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism

Publications (1)

Publication Number Publication Date
WO2015127647A1 true WO2015127647A1 (en) 2015-09-03

Family

ID=54008162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/072707 WO2015127647A1 (en) 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism

Country Status (1)

Country Link
WO (1) WO2015127647A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105450734A (en) * 2015-11-09 2016-03-30 上海爱数信息技术股份有限公司 Distributed storage CEPH data distribution optimization method
CN109327544A (en) * 2018-11-21 2019-02-12 新华三技术有限公司 A kind of determination method and apparatus of leader node
CN109783438A (en) * 2018-12-05 2019-05-21 南京华讯方舟通信设备有限公司 Distributed NFS system and its construction method based on librados
CN111209090A (en) * 2020-04-17 2020-05-29 腾讯科技(深圳)有限公司 Method and assembly for creating virtual machine in cloud platform and server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103518A (en) * 2011-02-23 2011-06-22 运软网络科技(上海)有限公司 System for managing resources in virtual environment and implementation method thereof
CN102752364A (en) * 2012-05-22 2012-10-24 华为终端有限公司 Data transmission method and device
CN102857363A (en) * 2012-05-04 2013-01-02 运软网络科技(上海)有限公司 Automatic computing system and method for virtual networking
CN103034684A (en) * 2012-11-27 2013-04-10 北京航空航天大学 Optimizing method for storing virtual machine mirror images based on CAS (content addressable storage)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103518A (en) * 2011-02-23 2011-06-22 运软网络科技(上海)有限公司 System for managing resources in virtual environment and implementation method thereof
CN102857363A (en) * 2012-05-04 2013-01-02 运软网络科技(上海)有限公司 Automatic computing system and method for virtual networking
CN102752364A (en) * 2012-05-22 2012-10-24 华为终端有限公司 Data transmission method and device
CN103034684A (en) * 2012-11-27 2013-04-10 北京航空航天大学 Optimizing method for storing virtual machine mirror images based on CAS (content addressable storage)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105450734A (en) * 2015-11-09 2016-03-30 上海爱数信息技术股份有限公司 Distributed storage CEPH data distribution optimization method
CN105450734B (en) * 2015-11-09 2019-01-25 上海爱数信息技术股份有限公司 The data distribution optimization method of distributed storage CEPH
CN109327544A (en) * 2018-11-21 2019-02-12 新华三技术有限公司 A kind of determination method and apparatus of leader node
CN109327544B (en) * 2018-11-21 2021-06-18 新华三技术有限公司 Leader node determination method and device
CN109783438A (en) * 2018-12-05 2019-05-21 南京华讯方舟通信设备有限公司 Distributed NFS system and its construction method based on librados
CN111209090A (en) * 2020-04-17 2020-05-29 腾讯科技(深圳)有限公司 Method and assembly for creating virtual machine in cloud platform and server
CN111209090B (en) * 2020-04-17 2020-07-24 腾讯科技(深圳)有限公司 Method and assembly for creating virtual machine in cloud platform and server

Similar Documents

Publication Publication Date Title
US10963289B2 (en) Storage virtual machine relocation
JP6199452B2 (en) Data storage systems that export logical volumes as storage objects
JP6208207B2 (en) A computer system that accesses an object storage system
JP6219420B2 (en) Configuring an object storage system for input / output operations
JP5985642B2 (en) Data storage system and data storage control method
JP6492226B2 (en) Dynamic resource allocation based on network flow control
JP4448719B2 (en) Storage system
US20060095705A1 (en) Systems and methods for data storage management
US10241712B1 (en) Method and apparatus for automated orchestration of long distance protection of virtualized storage
US11836115B2 (en) Gransets for managing consistency groups of dispersed storage items
US10855556B2 (en) Methods for facilitating adaptive quality of service in storage networks and devices thereof
US10852985B2 (en) Persistent hole reservation
US11907261B2 (en) Timestamp consistency for synchronous replication
WO2015127647A1 (en) Storage virtualization manager and system of ceph-based distributed mechanism
US20130166670A1 (en) Networked storage system and method including private data network
JP2013003691A (en) Computing system and disk sharing method in computing system
US10129081B1 (en) Dynamic configuration of NPIV virtual ports in a fibre channel network
US9612769B1 (en) Method and apparatus for automated multi site protection and recovery for cloud storage
US10001939B1 (en) Method and apparatus for highly available storage management using storage providers
US10798159B2 (en) Methods for managing workload throughput in a storage system and devices thereof
US10768834B2 (en) Methods for managing group objects with different service level objectives for an application and devices thereof
JP5278254B2 (en) Storage system, data storage method and program
KR20140060962A (en) System and method for providing multiple service using idle resource of network distributed file system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14883922

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14883922

Country of ref document: EP

Kind code of ref document: A1