WO2015127647A1 - Storage virtualization manager and system of Ceph-based distributed mechanism


Info

Publication number
WO2015127647A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage
virtualization manager
Ceph
storage virtualization
resource
Application number
PCT/CN2014/072707
Other languages
French (fr)
Chinese (zh)
Inventor
汤传斌
朱勤
Original Assignee
运软网络科技(上海)有限公司
Application filed by 运软网络科技(上海)有限公司
Priority to PCT/CN2014/072707
Publication of WO2015127647A1

Classifications

    • G: Physics
    • G06: Computing; Calculating or Counting
    • G06F: Electric Digital Data Processing
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/455: Emulation; interpretation; software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533: Hypervisors; virtual machine monitors
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 2009/45595: Network integration; enabling network access in virtual machine instances
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/10: File systems; file servers

Definitions

  • the present invention relates to computer virtualization technologies and the delivery and deployment of physical storage resources and virtual storage resources within an enterprise data center. More specifically, it relates to a storage virtualization manager (storage hypervisor) implemented using Ceph. Background technique
  • Computer storage virtualization technology abstracts or isolates the internal functions of storage subsystems or storage services so that the management of storage and data is separated from the management of applications, servers, and network resources, enabling independent management of applications and networks.
  • Physical resources and logical resources no longer correspond one-to-one; the relationship can be one-to-many or many-to-one, and it is transparent to users.
  • Virtualized storage resources are like a huge "storage pool": users do not see specific disks or tapes, and they do not have to care which path their data takes or on which specific storage device it is stored.
  • Host-based virtualization is typically implemented through storage management software such as Logical Volume Management Software (LVM). Physical devices are mapped into a contiguous sequence of logical storage spaces. Through the management of logical views, users use virtual management software to map storage media to logical volumes.
  • the benefit of host-based virtual storage is that virtualized storage is easy to implement, with a variety of storage management supported by pure software.
  • To further improve the reliability, security, and manageability of the storage system, multiple servers can use clustering technology to achieve shared storage.
  • The downside is that the virtualization must be done in software: scalability and compatibility are limited, and the scheduling work affects the application performance of the server.
  • Storage virtualization can also be implemented in storage subsystems or storage devices.
  • Storage-based virtualization adapts to the architecture of heterogeneous SANs (storage area networks) and is more adaptive to storage-centric environments.
  • Device-level virtual storage is independent of the host, and multiple hosts can be connected to the storage device, but the storage device itself must be homogeneous.
  • Mirroring, RAID, instant snapshots, and data copies can all adopt device-level virtual storage methods. Because compatibility and interoperability of storage devices among vendors are poor, device-level virtual storage can often provide only an incomplete virtual storage solution without the support of third-party virtualization software.
  • An interconnected network-based virtualization solution moves the virtual engine to the core of the SAN system, the interconnected network.
  • The specific implementation depends on the network device: it can be the switch itself, a router, or a dedicated server. If a switch or router is used, the in-band mode is generally adopted, in which data and metadata share the same path; if a dedicated server is used, both the in-band and the out-of-band mode can be implemented.
  • In the in-band structure, the virtual control device sits between the servers and the storage devices, and the storage management software running on it manages and configures all the storage devices. In the out-of-band structure, the virtual control device is outside the system data path and does not directly participate in data transmission.
  • The virtual control device configures all the storage devices and submits the configuration information to all servers.
  • When a server accesses a storage device, it no longer passes through the virtual control device.
  • Host-based virtual storage is much worse than network-based virtual storage when considering management and maintenance costs. The reason is that network-based virtual storage can provide a virtual resource management pool, through which various resources can be centrally managed, which can greatly reduce the workload of maintenance and management, and the corresponding management personnel can be greatly reduced.
  • Host-based storage virtualization technologies such as Veritas' volume management software (Volume Manager) are server-based virtualization software that can be installed on a server or host to virtualize multiple physical disks into logical volumes for users.
  • Virsto has developed a software solution that is installed on each host server and creates a high-performance disk or solid-state storage area called "vLog".
  • the read operation will point directly to the primary store, and the write operation will be done via vLog, which vLog distributes these writes asynchronously to the primary store. Similar to how caching works, vLog improves storage performance by reducing coupling at the storage front end, reducing latency for back-end storage.
  • Device-based storage virtualization technologies such as Hitachi's Data Systems' Universal Storage Platform enable virtualized applications by consolidating and managing other storage under their own storage array systems.
  • Network-based storage virtualization technologies such as EMC's Invista enable virtualization by plugging a server or intelligent switch device into an FC SAN or iSCSI SAN to intercept I/O from the network device to the storage controller.
  • DataCore's SANsymphony is a network-based in-band virtualization software that runs on dedicated hardware devices between servers and storage devices.
  • IBM's out-of-band virtualization solution combining the SAN Volume Controller (SVC) with the SAN File System has a high market share.
  • Tivoli's SANergy products and SGI's CXFS products are virtualization software based on SAN file system.
  • a storage virtualization manager is a monitor that manages multiple "storage pools” as virtual resources. As a type of virtual engine, it treats all the storage hardware it manages as a common platform, although these hardware may be different and incompatible. To do this, a storage virtualization manager must understand the performance, capacity, and service characteristics of its "underlying storage.”
  • "underlying storage” may refer to physical hardware, such as solid-state disks or hard disks; or storage architectures such as storage area networks (SANs), network attached storage (NAS), and direct attached storage (DAS). .
  • the Storage Virtualization Manager ensures that no business disruption occurs when adding new devices (such as a new array) or replacing resources in some or all of the existing storage pools. This means that during this process, the Storage Virtualization Manager fully understands which storage features and functions it will acquire and knows which storage features and functions it will ignore. In other words, the Storage Virtualization Manager is more than just a simple combination of a storage monitor (supe "viso") and storage virtualization. It is a higher level of intelligent software.
  • the Storage Virtualization Manager can be controlled not only. Device-level storage controllers, disk arrays, and virtualization middleware, and it also enables storage provisioning, providing snapshot and backup services, and managing policy-driven service level agreements (SLAs). Storage virtualization The manager also provides a solid foundation for further building software defined storage.
  • Patent US20100153617A1 "Storage Management System for Virtual Machines".
  • The applicant for this patent is Virsto. Its purpose is to provide a better storage management solution for virtual desktop systems. It solves two problems: a) traditional server virtualization relies entirely on the virtualization manager, without storage management optimization, so performance bottlenecks appear as server virtualization in the data center grows; b) traditional volume management often does not adequately address the functional requirements of server virtualization. The patent is applicable to virtual desktop systems (VDI), and its implementation type is host-based virtualization.
  • Patent US 8504757 B1 "Method of Converting Virtual Storage Device Addresses to Physical Storage Device Addresses in a Dedicated Virtualization Manager".
  • the method is to allow third-party software or physical storage devices to add functionality to the dedicated virtualization manager's I/O handler based on each virtual storage device.
  • The proprietary virtualization manager described in that invention still refers to a server virtualization manager. It is worth noting that a server virtualization manager to which I/O processing capability is added to improve storage access efficiency belongs to a different conceptual category from the "storage virtualization manager"; a server virtualization manager and a storage virtualization manager are completely different in mechanism.
  • The present invention provides a new method for intelligently utilizing Ceph to implement a storage virtualization manager (storage hypervisor), which is a distributed solution that uses Ceph cluster files/objects to control/manage individual devices. At the same time, this method allows Ceph's traditional storage functions to continue to be used.
  • storage virtualization manager (storage hypervisor)
  • the storage virtualization manager of the present invention is implemented using Ceph's distributed mechanism.
  • the storage device object is abstracted into a Ceph file. Because this file is similar to a device file in Linux, we call it a device file.
  • a specific storage device corresponds to a storage device object, that is, corresponds to a specific device file.
  • The device file is the bridge between the upper and lower halves of the storage virtualization manager. More specifically, a device control request from the upper half of the storage virtualization manager can be considered a file write operation to the device file, and the response can be viewed, through the Ceph interface (i.e., the unified adapter), as a file read operation on the device file.
  • In the present invention, SMI-S is used to manage SAN storage and SSH is used to manage switches. If the storage user's management application controls or monitors a device through SMI-S or SSH, then in some cases after virtualization the storage virtualization manager needs to provide a corresponding SMI-S module or SSH module to match the real underlying device. So that the unified adapter is unaffected, multiple instances of these modules are distributed into Ceph's OSDs. Thus, the SMI-S module and the SSH module can be shared by different storage device objects and are also fault tolerant, because multiple instances of them are present in Ceph's OSDs.
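  • As a minimal illustration of this device-file convention (a sketch only, assuming Ceph's Python librados bindings and a hypothetical pool name, object names, and message format; the patent does not prescribe any particular API), a control request could be written into a Ceph object and the result read back as follows:

```python
# Sketch: treating a Ceph object as a "device file" for control traffic.
# Assumes librados Python bindings; pool/object names and message text are hypothetical.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('device_files')   # hypothetical pool of device files
    dev = 'fc_switch_01'                         # one storage device object

    # Upper half: a device control request is a write to the device file.
    ioctx.write_full(dev + '.request', b'disable-port 5')

    # Lower half (an SSH/SMI-S module on some OSD) would act on the request and
    # write its result back; the upper half then sees it as a file read operation.
    status = ioctx.read(dev + '.status')
    print(status.decode(errors='replace'))
    ioctx.close()
finally:
    cluster.shutdown()
```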
  • the present invention provides a storage virtualization manager based on a Ceph-based distributed mechanism, wherein the storage virtualization manager includes at least:
  • The first part of the storage virtualization manager is configured to be independent of any particular storage device and exists as an abstraction of the underlying storage devices; the second part of the storage virtualization manager is configured to be implemented using the inherent characteristics of the Ceph cluster,
  • The second part of the storage virtualization manager includes the various specific storage devices;
  • Each specific storage device corresponds to a device file, and a device control request issued by the first part of the storage virtualization manager is treated as a file write operation to the device file, while the response to the device control request is treated as a file read operation on the device file via the Ceph client.
  • the first part of the storage virtualization manager includes:
  • a virtual finite state machine, which serves as an abstraction of the underlying storage devices and provides their various capabilities to the resource decision module as the basis for its decisions; the virtual finite state machine also cooperates with the overall decision module to ensure the reliability of the entire storage virtualization manager system; a manager, which during the unified construction of heterogeneous devices collects and provides device type, capability, QoS attributes, and status information to the resource decision module, and which, when the resource decision module sends a batch of resource reservation requests, decides whether the resource reservation requests can be accepted; if at a given moment the performance of the storage resource meets the user's requirements and the topology indicates that the storage data link can be established, the resource decision module replies to the user with a message accepting the resource reservation;
  • a unified adapter, implemented using the Ceph client mechanism, configured to obtain information about all of the underlying storage devices and provide that information to the manager, the information including the topology, functions, and performance of the underlying storage devices; the unified adapter provides a unified device operation/control interface for monitoring and control tasks.
  • the device file is a Ceph file.
  • the unified adapter is the Ceph client.
  • the monitoring and control tasks include monitoring or controlling a Fibre Channel switch port and allocating a volume from a storage area network (SAN).
  • SAN storage area network
  • the first part of the storage virtualization manager further comprises: a data facility configured to provide operations on user data, the operations including disaster tolerance operations, compression and decompression, and redundancy elimination (deduplication).
  • the second part of the storage virtualization manager includes:
  • a Storage Management Initiative Specification (SMI-S) module configured to manage storage area network (SAN) storage;
  • a Secure Shell (SSH) protocol module configured to manage switches;
  • the SMI-S module and the SSH module may be shared by different storage device objects, and instances of the SMI-S module and the SSH module are distributed among the Ceph object storage devices (OSDs) and are therefore fault tolerant.
  • the present invention also provides a storage virtualization manager system based on a Ceph-based distributed mechanism, the system having a control plane, a data plane, and a data stream, wherein the storage virtualization manager system includes: a storage virtualization manager as described above;
  • a Live Resource attached storage infrastructure domain including all storage area network (SAN) devices that work for Live Resource;
  • SAN storage area network
  • Customer storage infrastructure domain including all storage area network (SAN) devices that work for customers; and
  • SAN storage area network
  • a customer computing infrastructure domain including hosts carrying all of the virtual machine clusters working for the customer, the virtual machine clusters accessing the SAN storage data within the customer storage infrastructure domain through the data plane, and
  • a Ceph client residing on the host.
  • The system further includes:
  • a resource decision module, which is part of the Live Resource domain and operates on the control plane; the resource decision module determines whether a storage resource reservation can succeed according to the actual situation reported by the storage virtualization manager.
  • The system further includes:
  • an overall decision module, which is part of the Live Resource domain and works on the control plane.
  • Operations on the data plane do not occur during the resource reservation phase; when a virtual machine is started, the storage virtualization manager needs to establish a data link on the data plane, a link the virtual machine requires in order to access the client application data stored on the SAN.
  • The data stream participates in image library management and image transfer.
  • The image includes the client application system and the operating system on which it depends; when a virtual machine on the host is started, the template image required by the virtual machine is copied from a SAN device of the Live Resource attached storage infrastructure domain to the host, the copying being assisted by a Ceph client on the host.
  • the storage virtualization manager system has three layers: a unified adapter layer, a Ceph transport layer, and a lower layer physical device, where
  • the unified adapter layer includes the unified adapter and establishes a device private protocol with the Ceph transport layer; the unified adapter is responsible for communicating with the storage devices of the lower-layer physical devices, and the unified adapter does not need to know where a storage device is;
  • the Ceph transport layer includes the Ceph cluster and is responsible for the transport links; it is not concerned with the data transmitted over them;
  • the lower-layer physical devices include each specific storage device, each controlled by an object in the Ceph cluster of the Ceph transport layer, and the mapping between storage devices and objects is dynamic.
  • the storage virtualization manager system runs on a Live Resource (LR) service delivery platform, which is a resource management system based on the autonomic computing characteristics of the ACRA architecture.
  • LR: Live Resource
  • The technical solution provided by the present invention is a distributed solution that uses Ceph cluster files/objects as the means to control/manage devices; at the same time, the solution allows Ceph's traditional storage functions to continue to be used.
  • The highlight of the solution is the Ceph cluster.
  • the storage virtualization manager of the present invention can implement storage virtualization and further implement an autonomous storage management system in combination with the ACRA architecture, thereby greatly improving the reliability and availability of the storage system.
  • FIG. 1 is an architectural diagram of an autonomous storage management system showing a storage virtualization manager and its working environment, in accordance with one embodiment of the present invention
  • FIG. 2 is a three level diagram involved in a storage virtualization manager workflow in accordance with one embodiment of the present invention
  • FIG. 3 is a block diagram of an implementation environment LR service delivery platform in accordance with one embodiment of the present invention.
  • FIG. 4 shows the ACRA architecture referenced by the implementation environment, the LR service delivery platform, in accordance with one embodiment of the present invention. Detailed description
  • the present invention cleverly utilizes Ceph to implement a storage virtualization manager, the creativity of which is to use the files/objects of the Ceph cluster as a distributed solution for controlling/managing individual devices. At the same time, Ceph's traditional storage capabilities can continue to be used.
  • The storage virtualization manager of the present invention can implement storage virtualization and, combined with the ACRA architecture, can further implement an autonomous storage management system, thereby greatly improving the reliability and availability of the storage system.
  • Ceph is a petabyte-scale distributed file system for Linux. In general, Ceph is a high-performance, highly reliable and scalable cluster of multiple PCs. It can be roughly divided into four parts (see Figure 1):
  • Client: used by data users; each client instance provides a near-POSIX interface to hosts or processes (POSIX: Portable Operating System Interface);
  • Metadata Server (MDS) Cluster: used to cache and synchronize distributed metadata, manage the namespace (file and directory names), and coordinate security, consistency, and coherence;
  • Object Storage Cluster: contains multiple object storage devices (OSDs) that store all data and metadata;
  • Monitor Cluster: monitoring servers that maintain cluster membership and state (described further below).
  • Ceph uses a near-POSIX interface to ensure the scalability and consistency of the interface, which is consistent with applications and helps improve system performance. While achieving high performance, high reliability and high availability, Ceph achieves system scalability through three basic design features: decoupled data and metadata, dynamic distributed metadata management, and automatic distributed object management.
  • the client uses the metadata server MDS for metadata operations to determine the data location.
  • The Metadata Server (MDS) not only manages data locations but also arranges where to store new data. It is worth noting that the metadata itself is stored on the storage cluster and is identified as "metadata I/O". The actual file I/O occurs between the client and the object storage (OSD) cluster. Thus, higher-level POSIX features (for example, open, close, rename) are managed through the metadata server MDS, while lower-level POSIX features (such as read and write) are managed directly by the object storage OSD cluster.
  • The intelligent control of the Ceph file system is distributed to each node, which not only simplifies the client interface but also gives Ceph large-scale dynamic expansion capability.
  • Traditional storage often uses an allocation list method, in which metadata is used to map blocks on a disk to a specified file.
  • A file is assigned an inode number (INO) by the metadata server, which serves as the file's unique identifier. The file is then cut into several objects (the number depends on the size of the file). From the file's inode number (INO) and each object's object number (ONO), every object is assigned an object identifier (OID).
  • each object is assigned to a placement group.
  • The placement group (identified by a PGID) is a conceptual container for objects.
  • The placement group is mapped to a set of object storage devices (OSDs) using a pseudo-random mapping algorithm called CRUSH (Controlled Replication Under Scalable Hashing).
  • CRUSH: Controlled Replication Under Scalable Hashing
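  • To make this mapping chain concrete, the sketch below mimics the sequence described above: (INO, ONO) to OID, OID to PGID, PGID to a set of OSDs. It is purely illustrative; the hash and the tiny CRUSH-like placement function are simplified stand-ins, not Ceph's actual algorithms, and the cluster sizes are assumed.

```python
# Illustrative sketch of Ceph's mapping chain: (INO, ONO) -> OID -> PGID -> OSDs.
# The hashing and placement below are simplified stand-ins for the real CRUSH algorithm.
import hashlib

PG_NUM = 64                                 # number of placement groups (assumed)
OSDS = [f"osd.{i}" for i in range(8)]       # assumed cluster of 8 OSDs
REPLICAS = 3

def oid(ino: int, ono: int) -> str:
    """Object identifier derived from the file's inode number and object number."""
    return f"{ino:x}.{ono:08x}"

def pgid(object_id: str) -> int:
    """Hash the OID into a placement group."""
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    return h % PG_NUM

def crush_like(pg: int) -> list:
    """Pseudo-random but deterministic choice of OSDs for a placement group."""
    start = pg % len(OSDS)
    return [OSDS[(start + i) % len(OSDS)] for i in range(REPLICAS)]

# A file striped into three objects: objects 0, 1, 2 of inode 0x1234.
for ono in range(3):
    o = oid(0x1234, ono)
    pg = pgid(o)
    print(o, "-> pg", pg, "->", crush_like(pg))
```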
  • The MDS translates a file name into a file node (inode) through the file system hierarchy and obtains the inode number (INO), mode, file size, and other metadata.
  • If the file exists and the operation rights are available, the MDS returns the inode number, file length, and other file information in a hierarchical structure. The MDS also gives the client the right to operate. At present there are four kinds of operation rights, each represented by one bit: read, cached read, write, and buffered write. In the future, the operation rights will add security keywords so that the client can prove to the OSD that it may read and write data (the current policy allows all clients).
  • Metadata Server MDS manages file node (inode) space and converts file names into metadata. That is, the metadata server MDS translates the file name into an inode, file size, and striping data that the Ceph client uses for file I/O.
  • the task of the metadata server MDS is to manage the file system namespace (namespace).
  • namespace file system namespace
  • metadata is further split on a metadata server MDS cluster, and these metadata servers MDS can adaptively copy and allocate namespaces to avoid hot spots.
  • Metadata Server MDS manages individual segments of the namespace, and namespaces can overlap for reasons of redundancy and performance.
  • The mapping of metadata from the MDS to the namespace in Ceph is implemented according to a dynamic subtree partitioning method, which allows Ceph to adjust to workload changes (that is, the namespace is migrated between metadata servers MDS).
  • Ceph object storage (OSD): as a type of object store, Ceph's storage nodes include not only storage but also intelligent control, whereas traditional drives can only respond to commands.
  • the object storage device OSD is a smart device that can both request and respond, thereby enabling communication and cooperation with other object storage devices OSD.
  • the Ceph object storage device OSD implements an object-to-block mapping (this task is traditionally done in the client's file system layer). This design allows the local entity to decide on how to store an object in an optimal way.
  • the B-tree File System (BTRFS) can be applied to a storage node.
  • The MDS grants the client the capability to read and cache the file contents; armed with the inode number, layout, and file size, the client can name and locate all the objects containing the file's data.
  • When the client opens a file for writing, it gains the capability to use buffered writes, and the data is written to the appropriate objects on the appropriate OSDs.
  • Because the Ceph client uses the CRUSH algorithm, it knows nothing about the block mapping of files on physical disks; the underlying storage devices (OSDs) can therefore safely manage the object-to-block mapping. This allows the storage nodes to replicate data (especially when a device fails). Since fault detection and recovery are distributed, the Ceph storage system is highly scalable. Ceph calls this RADOS.
  • RADOS: Reliable Autonomic Distributed Object Store
  • The object store adheres to a simple principle: as part of the object store, all servers run software that manages and exports the server's local disk space; all instances of that software collaborate across the cluster to provide what, from the outside, looks like a single large data store. To implement internal storage management, the object storage software no longer uses the original format of the data, but instead saves it on the storage nodes in the form of binary objects. More importantly, the number of individual nodes that make up the large object store can be arbitrary; users can even dynamically add storage nodes at run time.
  • RADOS implements the object storage function described above. Its key technology consists of three layers, from bottom to top: 1. Object storage devices (OSDs). In RADOS, an OSD is always presented as a folder of an existing file system. There is no hierarchical nesting in the OSD folder: only files with UUID-format names, and no subfolders. The OSDs together form the object store; the binary objects stored in them are converted from the stored files by RADOS.
  • 2. Monitoring servers: they constitute the interface to the RADOS store and support access to the individual objects in it.
  • The monitoring servers work in a decentralized manner, handling communication with all external applications: there is no limit to the number of monitoring servers, and any client can contact any monitoring server.
  • The monitoring servers manage the MONmap (the list of all monitoring servers) and the OSDmap (the list of all OSDs). The information provided by these two maps allows clients to calculate which OSD they need to contact when accessing a particular file.
  • 3. MDS: provides Ceph clients with POSIX metadata for each object in the RADOS object store.
  • ACRA: Autonomic Computing Reference Architecture
  • ACRA divides the autonomic computing system into three layers: at the bottom, the system components or managed resources 4300.
  • These managed resources 4300 can be any type of resource, including hardware or software.
  • A managed element can be any of a variety of internal resources, including databases, servers, routers, application modules, Web services or virtual machines, or other autonomic elements. These resources can have some embedded self-management attributes.
  • Each managed resource 4300 generally provides a standardized interface (touchpoint). Each touchpoint corresponds to a sensor/effector pair.
  • a single autonomous element manages internal resources through autonomous managers.
  • It provides a standard interface (sensor/effector pair) through which it is managed, including accepting policies specified by IT managers and collaboration information from other autonomic elements.
  • the parent autonomous element responsible for global orchestration can manage multiple subordinate autonomous elements.
  • the middle layer contains resource managers.
  • Resource managers 4200 are often divided into four categories: self-configuring, self-healing, self-optimizing, and self-protecting.
  • Each resource may have one or more resource managers 4200, each of which implements its own control loop.
  • the top level contains a global autonomic manager 4100 that coordinates various resource managers. These global autonomous managers 4100 achieve certain system-level management objectives through a system-wide large control loop to achieve system-wide autonomous management. See Figure 4.
  • the left side shows the Human Manager 4400, which provides IT professionals with a common system management interface through an integrated console.
  • a Knowledge Base 4500 from which the Human Manager 4400 and the Autonomous Managers 4100, 4200, 4300 can acquire and share all knowledge of the system.
  • FIG. 3 illustrates an implementation environment in accordance with one embodiment of the present invention: Live Resource Service Delivery Platform (referred to as LR).
  • the Live Resource Service Delivery Platform is an automated system that supports the scheduling of logical resource reservations.
  • Roles on the platform include the project developer, the project operator, the application user, and the system operator.
  • The project developer designs the project development and test environments required by the user. He/she creates, saves, publishes, edits, previews, and deletes environment designs, and views the resources belonging to the user.
  • The project operator deploys and undeploys a project environment. He/she uses LR for environment deployment, undeployment, backup management, re-deployment, operational state management, project resource scheduling management, environment topology management, and resource consumption statistics for the environment.
  • the application user deploys and accesses a project environment. He/she quickly deploys business environments through LR, fast SSH access to the environment, fast access deployment services, and self-security management services.
  • The system administrator performs asset management and monitors the operational status of the entire environment. He/she implements resource discovery through the network management system 302 in LR, as well as management of physical server, network, and storage resources, management of virtual server, network, and storage resources, and resource alarm management.
  • the LR service delivery platform includes three levels of scheduling:
  • Project delivery scheduling: includes requirement design services for computing, storage, and network resources, system resource analysis services, virtual resource reservation, and deployment services. It is supported by the project delivery service network 300. Closely related to the present invention are system resource analysis and virtual resource reservation 301.
  • the deployment process is the process of binding logical resources in a logical delivery environment to virtual resources. Logical resources are bound to virtual resources in a one-to-one manner, which is the first binding in the entire logical delivery environment for scheduled delivery automation.
  • Virtual resource scheduling: includes the allocation, configuration, and provisioning of virtual resources. It is supported by the resource engine component 304.
  • the binding process of virtual resources to physical resources must go through the resource engine 304, which is the second binding in the automated delivery of the entire logical delivery environment.
  • the resource engine 304 provides "capability" of various virtual resources by aggregating individual virtual resources.
  • the resource engine 304 also maintains a state model for each virtual resource, thereby completing the binding from the virtual resource to the physical resource.
  • The proxies 306, 307, 308 on the physical resources accept resource commands from the resource engine 304 and implement resource multiplexing and resource space sharing; resource state information is passed back to the resource engine 304 via the proxies 306, 307, 308.
  • The above functions are carried out by multiple physical delivery environments partitioned for the project out of the data center physical resource service network 309.
  • The physical resource service network 309 supports scheduled delivery of the delivery environments while supporting sharing of physical resources by space and by time; it includes many unallocated and allocated physical resources, such as network, storage, and computing resources.
  • In addition to managing the various physical resources of the physical data center, the system administrator also carries out the partitioning of the physical delivery environments.
  • the resource engine 304 uses physical resource information provided by the NMS (Network Management System) to track physical resources to obtain the latest resource status; and map physical resources to virtual resources.
  • NMS Network Management System
  • Commercial network management systems used to manage physical resources generally provide information about state and performance, and all have the function of finding and searching for physical resources, so they are not described here.
  • Various storage resources include: storage area network, network attached storage, distributed file system, Ceph, mirroring, etc.
  • The above physical resource information is stored in the reference model 303, as shown in FIG. 3. Additionally, a "push" or "pull" schedule can be selected between the project delivery service network 300 and the resource engine 304.
  • Push: resource changes are requested regardless of the capacity of the physical delivery environment, and parallelized resource provisioning is supported.
  • Pull: resource change requests are committed only when the physical delivery environment capacity is ready, and parallelized resource provisioning is supported.
  • the resource engine 304 in the LR can perform binding of virtual resources to physical resources. Its main task is to virtualize physical resources, including the virtualization of various storage resources.
  • the "Storage Virtualization Manager" 305 is an important part of implementing storage virtualization. It is the focus of the present invention.
  • FIG. 1 is a diagram of an autonomous storage management system architecture showing a storage virtualization manager and its working environment, in accordance with one embodiment of the present invention.
  • the frame structure of the "Storage Virtualization Manager” 1000 is shown in the rounded rectangle.
  • the storage virtualization manager 1000 can be divided into two logical portions: a storage virtualization manager first portion (top half) 1001 and a storage virtualization manager second portion (lower half) 1002.
  • The upper half is independent of specific devices and exists as an abstraction layer over the underlying devices. It is part of the IaaS platform tool Live Resource Domain 1010.
  • the second half includes a variety of specific devices that are implemented using the inherent features of the Ceph Cluster 1020.
  • vFSM 1011 is the (local) virtual finite state machine. It has two important functions: one is the abstraction of the underlying storage devices, providing their various "capabilities" to the user of the storage virtualization manager 1000, the resource decision maker 1100 (Resource Decision Maker, or resource decision module), as the basis for the decisions made by the resource decision maker 1100; the other is the decision part of the Smart Storage Device, which cooperates with the overall system decision module coreVFSM 1200 to ensure the reliability of the whole system.
  • The intelligent storage device here, that is, the storage virtualization manager 1000 and all the SAN storage it controls, including the LR attached storage infrastructure domain 1300 and the customer storage infrastructure domain 1400 as a whole, provides intelligent virtualized storage capabilities to the upper-layer users.
  • Manager 1012: an important prerequisite for unified scheduling of the managed storage resources is that the unified adapter 1014 knows almost all information about the underlying storage devices, such as topology, functions, and performance. During the unified construction of heterogeneous devices, the manager 1012 collects and provides the resource decision maker 1100 with information such as device type, capability, QoS, and status.
  • The manager 1012 needs to analyze the information at hand and decide whether a resource reservation can be accepted. If the storage resource meets the user's needs at that point in time and the topology indicates that the storage data link can be established, the resource decision maker 1100 will reply to the user with a message accepting the resource reservation. It should be noted that the data link for accessing the storage has not yet been established at this time; the real data link cannot be established until at least one compute node is started.
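  • The acceptance test applied here can be pictured as a small predicate over the reported capability and topology information. The following sketch is a simplification under assumed data structures (the patent does not define any); it only illustrates the two conditions named above: sufficient capability at the moment of the request, and an establishable data link.

```python
# Sketch of the reservation-acceptance decision described above.
# The dataclass fields and the topology representation are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class ReservationRequest:
    capacity_gb: int          # requested capacity
    min_iops: int             # requested QoS
    host: str                 # compute host that will consume the storage
    device: str               # candidate storage device

@dataclass
class DeviceState:
    free_gb: int
    iops_available: int

def can_accept(req, devices, topology) -> bool:
    """Accept iff the device currently meets the request and a data link can be built."""
    dev = devices.get(req.device)
    if dev is None:
        return False
    performance_ok = dev.free_gb >= req.capacity_gb and dev.iops_available >= req.min_iops
    link_possible = (req.host, req.device) in topology   # e.g. host HBA reaches the SAN
    return performance_ok and link_possible

# Example: one SAN device reachable from host "h1".
devices = {"san1": DeviceState(free_gb=500, iops_available=20000)}
topology = {("h1", "san1")}
print(can_accept(ReservationRequest(100, 5000, "h1", "san1"), devices, topology))  # True
```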
  • Data Facility 1013 is an optional module. It operates on application (App) data, that is, user data: disaster recovery operations such as backup and restore, or optimization operations such as compression/decompression and deduplication.
  • Unified Adapter 1014 provides a unified device operation/control interface on top of the device adapter layer below, which depends on the storage devices in the Ceph cluster 1020. It completes monitoring and control tasks, such as monitoring or controlling a Fibre Channel switch (FC SW) port and allocating a volume (LUN) from the SAN.
  • the unified adapter 1014 itself is implemented using Ceph's client mechanism.
  • the storage virtualization manager 1000 of the present invention is implemented using a distributed mechanism of Ceph.
  • the storage device object is abstracted into a Ceph file. Because this file is similar to the device file in Linux, we call it device file 1030.
  • A specific storage device corresponds to a storage device object, that is, to a specific device file.
  • The device file 1030 is the bridge between the storage virtualization manager first portion 1001 (upper half) and the storage virtualization manager second portion 1002 (lower half). More specifically, a device control request from the upper portion 1001 of the storage virtualization manager can be considered a file write operation to the device file 1030, and the response can be considered, through the Ceph interface (i.e., the unified adapter 1014), a file read operation on the device file 1030.
  • SMI-S Storage Management Initiative Specification
  • WBEM Web-based Enterprise Management
  • CIM Common Information Model
  • SSH is a Secure Shell protocol (Secure Shell) that is well known to those skilled in the art, so it will not be described.
  • SMI-S is used to manage SAN (storage area network) storage; SSH is used to manage switches. If the storage user's management application controls or monitors a device through SMI-S or SSH, then in some cases after virtualization the storage virtualization manager 1000 needs to provide a corresponding SMI-S module or SSH module to match the real underlying device.
  • So that the unified adapter 1014 is unaffected, multiple instances of these modules can be distributed into Ceph's OSDs.
  • The SMI-S module 1021 and the SSH module 1022 can be shared by different storage device objects, and are also fault tolerant because multiple instances of them are present in Ceph's OSDs.
  • The storage virtualization manager does not need to be concerned with Ceph's MDS; in one embodiment, Ceph's MDS can be omitted from Figure 1.
  • The back-end storage devices include all SAN (storage area network) devices in the LR attached storage infrastructure domain 1300 working for LR, and all SAN devices in the customer storage infrastructure domain 1400 working for customers. These back-end devices are controlled by the SMI-S module 1021 and the SSH module 1022 via control paths 1604 and 1605, for example to allocate a volume (LUN) or delete a volume (LUN). We call this fabric management.
  • SAN Storage Area Network
  • Operations on the data plane do not occur during the resource reservation phase.
  • When a virtual machine is started, the storage virtualization manager 1000 needs to establish a data link, from the source HBA 1514 to a Fibre Channel switch port (omitted in Figure 1), to the destination HBA 1410 ... SAN 1401.
  • The virtual machine VM 1511 needs this link to access the data of the client APP stored in the SAN.
  • Data stream 1602 participates in image library management and image transfer.
  • The image belongs to the APP content, which contains the client's application system and the operating system it depends on. For example, when the virtual machine VM 1511 on the host 1510 starts, the template image it requires will be copied from the SAN of the LR attached storage infrastructure domain 1300 to the host. This copying is done by the Ceph client 1513 on the host 1510.
  • Ceph clusters within the LR attached storage infrastructure domain 1300 can alleviate communication congestion due to concurrent access. For example, when a large number of virtual machines are started at the same time, communication congestion often occurs.
  • In addition to the storage virtualization manager 1000, the system includes, but is not limited to, the following modules:
  • The Resource Decision Module 1100 (Resource Decision Maker) is part of LR and works on the control plane. For example, to run a new online banking system a bank needs a certain amount of computing power, memory, and storage, and these resources require a reservation. The resource decision module 1100 determines, according to the actual situation reported by the storage virtualization manager, whether the storage resources can meet the requirements of running the new online banking system. For example, if the SAN is only 1% idle, the storage resource reservation cannot succeed.
  • the Coordination Decision Module 1200 (coreVFSM) is part of the LR and works on the control plane.
  • The Customer Storage Infrastructure Domain 1400 includes all SAN devices that work for customers, such as the SAN devices the bank's applications use to store their data.
  • The Customer Computing Infrastructure Domain 1500 includes all computing resources that work for customers, for example the virtual machine clusters supporting the bank's various applications. These virtual machines (e.g., VM 1511) access the SAN storage data within the customer storage infrastructure domain 1400 through the data plane 1601.
  • each storage device corresponds to a device file 1030, just like the ioctl mechanism in the Linux device driver mode.
  • the current device files are distributed to the OSD in the Ceph cluster.
  • the Storage Virtualization Manager has a distributed device file system.
  • ioctl is the abbreviation of I/O control. Simply put, when writing a Linux device driver you will encounter some I/O operations that logically belong to neither read nor write; those operations can be considered part of ioctl. Read and write should be used to transfer data and handled as a simple means of data exchange, while ioctl is used to control certain options of reading and writing. For example, suppose a user has designed a general driver module for reading and writing an I/O port: read and write transfer data through the port, but how should an operation that changes the port's read/write configuration be handled? Obviously it is reasonable to use ioctl.
  • The reads and writes of the port can be blocking or non-blocking, and the reads and writes of the device file can be concurrent or not; these options can be designed to be configured through ioctl.
  • The general parameter format of ioctl is a command word (a constant) plus command parameters.
  • The read and write parameters are a data buffer, a data destination pointer, and a length.
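  • As a small, hedged illustration of this analogy (not part of the patent's mechanism), the snippet below shows the familiar Linux pattern of read/write for data plus ioctl for control, using Python's standard fcntl module; the device path and the request code are placeholders, not values taken from any real driver.

```python
# Illustration of the Linux read/write-vs-ioctl pattern referenced above.
# The device path and request number are placeholders, not values from the patent.
import fcntl
import os

DEVICE = "/dev/example0"           # hypothetical character device
EXAMPLE_SET_NONBLOCK = 0x4004E001  # hypothetical ioctl request code

fd = os.open(DEVICE, os.O_RDWR)
try:
    os.write(fd, b"payload")       # data exchange goes through write/read
    data = os.read(fd, 4096)

    # Control options (e.g. blocking behaviour) go through ioctl,
    # as a command word (constant) plus command parameters.
    fcntl.ioctl(fd, EXAMPLE_SET_NONBLOCK, 1)
finally:
    os.close(fd)
```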
  • the present invention creates a Ceph object as an abstraction for a Fibre Channel switch (FC SW).
  • The Fibre Channel switch can be, for example, a Brocade switch.
  • We write a Control Message to the device file from the Unified Adapter Layer 2100 (refer to the Unified Adapter 1014 in Figure 1, where the Unified Adapter Layer 2100 is bound to the Ceph Client) (see the device file in Figure 1).
  • the device file referred to herein is more specifically an object in the Ceph cluster 2210, such as object 2211.
  • This triggers an SSH process running on a certain OSD (see SSH module 1022 in Figure 1), which can disable the port of the fabric switch (i.e., one of the devices in storage 2310) via control path 1604 or 1605.
  • The SSH process then writes the resulting state of this disable operation back to the device file (e.g., object 2211).
  • This write-back triggers the Unified Adapter Layer 2100 (see Unified Adapter 1014 in Figure 1), which can obtain the result status of the operation from the device file (e.g., object 2211).
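  • A minimal sketch of what such an OSD-side SSH module might do is given below. It assumes the paramiko SSH library, a Brocade-style "portdisable" CLI command, and the hypothetical device-file naming used earlier; the patent itself does not fix any of these details.

```python
# Sketch of an OSD-resident SSH module acting on a control message and writing
# the status back to the device-file object. Library choice (paramiko), the
# switch CLI command, and all names are assumptions for illustration.
import paramiko
import rados

def disable_switch_port(switch_host: str, user: str, password: str, port_no: int) -> str:
    """Run a port-disable command on the Fibre Channel switch over SSH."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(switch_host, username=user, password=password)
    try:
        _, stdout, stderr = ssh.exec_command(f"portdisable {port_no}")  # Brocade-style CLI (assumed)
        err = stderr.read().decode()
        return "ok" if not err else f"error: {err.strip()}"
    finally:
        ssh.close()

# Write the result back into the device-file object so the unified adapter can read it.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("device_files")                 # hypothetical pool
status = disable_switch_port("10.0.0.5", "admin", "secret", 5)
ioctx.write_full("fc_switch_01.status", status.encode())
ioctx.close()
cluster.shutdown()
```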
  • The object in FIG. 2 (for example, object 2211) is an example of a file among the device files 1030 in FIG. 1, but operating on an object is different from operating on a file.
  • The former is a whole unit in Ceph and is easy for the upper portion 1001 of the storage virtualization manager to operate on; the latter is fragmented within Ceph, which is not conducive to being manipulated.
  • The workflow of the Storage Virtualization Manager involves three levels: Unified Adapter Layer 2100, Ceph Transport Layer 2200, and Lower-Layer Physical Devices 2300. Viewed more broadly, these three levels apply not only to storage virtualization, but also to network virtualization and compute virtualization.
  • the workflow for requesting a volume (LUN) on a SAN through the Storage Virtualization Manager is as follows:
  • Step 1: Write "alloc-lun" to the device file in the form of a message;
  • Step 2: The change to the device file triggers the SSH process on the OSD; the SSH process constantly listens on the device file and reads out the "alloc-lun" message;
  • Step 3: "alloc-lun" is executed via Ceph, and the SAN allocates the required volume (LUN);
  • Step 4: The LUN-allocation-success message is written back to the device file;
  • Step 5: After the unified adapter reads the device file and learns that the operation is complete, the user is further notified.
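  • The five steps can be pictured end to end as one writer and one polling listener on the same device-file object. The sketch below is a simplification under the same assumed pool and object names as before; the polling loop stands in for whatever change-notification mechanism an implementation would actually use, and allocate_lun_on_san is a hypothetical placeholder for the SAN-side action.

```python
# End-to-end sketch of the "alloc-lun" workflow (steps 1-5) over a device-file object.
# Pool name, object names, message format, and allocate_lun_on_san are assumptions.
import time
import rados

def allocate_lun_on_san(size_gb: int) -> str:
    """Placeholder for the real SAN-side allocation (e.g. via SMI-S or the array CLI)."""
    return f"lun-allocated size={size_gb}GB"

def osd_side_listener(ioctx, dev: str) -> None:
    """Steps 2-4: poll the device file, execute the request, write the result back."""
    while True:
        try:
            msg = ioctx.read(dev + ".request").decode()
        except rados.ObjectNotFound:
            msg = ""
        if msg.startswith("alloc-lun"):
            size = int(msg.split()[1])
            ioctx.write_full(dev + ".status", allocate_lun_on_san(size).encode())
            ioctx.remove_object(dev + ".request")
            return
        time.sleep(1)

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("device_files")        # hypothetical pool

# Step 1: the unified adapter writes the request message.
ioctx.write_full("san_01.request", b"alloc-lun 100")

# Steps 2-4 would run on an OSD node; shown inline here for brevity.
osd_side_listener(ioctx, "san_01")

# Step 5: the unified adapter reads the device file and notifies the user.
print(ioctx.read("san_01.status").decode())
ioctx.close()
cluster.shutdown()
```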
  • That is, in the case of dynamic heterogeneous changes of the devices in storage 2310, the original user 2000 (or, for example, the resource decision maker 1100), instead of directly accessing a specific device in storage 2310, now accesses an object in the Ceph cluster 2210, that is, a storage device object. Then, when a device in storage 2310 changes, none of the user 2000 programs that access the storage device object are affected; if the original direct-access method were used, the user 2000 would need to make corresponding adjustments.
  • FIG. 4 shows the ACRA architecture referenced by the implementation environment, the LR service delivery platform, in accordance with an embodiment of the present invention.
  • The implementation environment of the present invention shown in FIG. 4, the LR service delivery platform, is a resource management system with autonomic computing features.
  • Its implementation references IBM's ACRA architecture, including, of course, its storage resource management portion.
  • The lower half of Ceph storage is RADOS, a reliable, autonomic, distributed object store. As the name suggests, it has self-management attributes; it is one of the managed resources 4300.
  • the resource engine 304 includes a resource manager 4200 and a global autonomous manager 4100.
  • the storage virtualization manager 305 is one of the resource managers 4200, and the knowledge in the resource manager 4200 (autonomous elements) includes the vFSM.
  • The knowledge within the Global Autonomic Manager 4100 (an autonomic element) includes the coreVFSM.
  • The vFSM is the (storage virtualization manager local) virtual finite state machine. It works in conjunction with the virtual finite state machine of the entire system, coreVFSM 1200, to improve the availability and reliability of the entire storage system.
  • the storage virtualization manager of the present invention needs to support the autonomous computing feature that the resource manager 4200 has. Please refer to Figure 1.
  • The Ceph/RADOS mechanism has self-management attributes. With its support, the SAN can automatically report alarms; for example, if a port of the fabric switch (that is, one of the devices in storage 2310) is broken, the alarm is reported to the unified adapter 1014 through SSH 1022 and device file 1030.
  • vFSM 1011 is the virtual finite state machine of the storage virtualization manager 1000. It has two important functions: one is to act as the abstraction of the underlying storage devices; the other is the decision-making part of the Smart Storage Device. Since the loss of a fabric switch port is within its control range, it can send a message requesting the Fibre Channel switch to switch to a replacement port.
  • If the SAN device raises a large-scale damage alarm due to external causes, the problem is beyond the scope of the storage virtualization manager 1000.
  • In that case the state of the vFSM 1011 needs to be reported to the coreVFSM 1200, to be coordinated by the higher-level decision-making part of the LR service delivery platform, possibly even requiring manual intervention.
  • the autonomous storage management system provided by the present invention has the following features:
  • Self-configuration: it can adapt to changes in the storage system. Such changes may include the deployment of new storage devices or the removal of existing storage devices; dynamic adaptation helps ensure continuous operation of the storage devices/software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a storage virtualization manager based on a Ceph distributed mechanism, characterized in that the storage virtualization manager comprises at least a first part and a second part, wherein the first part of the storage virtualization manager is configured to be independent of specific storage devices and to exist as an abstraction of the underlying storage devices; and the second part of the storage virtualization manager is configured to be implemented using the inherent characteristics of a Ceph cluster and comprises various specific storage devices, wherein the specific storage devices correspond to device files, a device control request sent by the first part of the storage virtualization manager is regarded as a file write operation on the device file, and the response to the device control request is regarded as a file read operation on the device file via a Ceph interface.

Description

一种基于 Ceph的分布式机制的存储虚拟化管理器及系统 技术领域  Storage virtualization manager and system based on Ceph distributed mechanism
本发明涉及计算机虚拟化技术和企业数据中心内物理存储资源和虚拟存储资 源的交付和部署。 更具体地, 涉及一种利用 Ceph实现存储虚拟化管理器 (存储 hypervisor)。 背景技术  The present invention relates to computer virtualization technologies and the delivery and deployment of physical storage resources and virtual storage resources within an enterprise data center. More specifically, it relates to a storage virtualization manager (storage hypervisor) implemented using Ceph. Background technique
计算机存储虚拟化技术就是通过对存储子系统或存储服务的内部功能进行抽 象或隔离, 使存储或数据的管理与应用、 服务器、 网络资源的管理分离, 从而实现 应用和网络的独立管理。  Computer storage virtualization technology separates the management of applications or servers, network resources, and the independent management of applications and networks by abstracting or isolating the internal functions of storage subsystems or storage services.
简而言之, 就是物理资源和逻辑资源不再是一一对应的关系, 可以是一对多或 者多对一, 而这种关系对于用户是透明的。 对于用户来说, 虚拟化的存储资源就像 是一个巨大的 "存储池", 用户不会看到具体的磁盘、 磁带, 也不必关心自己的数据 经过哪一条路径存放在哪一个具体的存储设备上。  In short, physical resources and logical resources are no longer one-to-one correspondence. They can be one-to-many or many-to-one, and this relationship is transparent to users. For users, virtualized storage resources are like a huge "storage pool". Users don't see specific disks or tapes, and they don't have to care which path their own data is stored in. on.
目前, 虚拟存储技术的发展尚无统一标准。根据虚拟化软件的实现位置可以分 为三个不同的实现: 基于主机的虚拟化, 基于互连网络的虚拟化以及基于存储设备 的虚拟化。  At present, there is no uniform standard for the development of virtual storage technology. Depending on where the virtualization software is implemented, it can be divided into three different implementations: host-based virtualization, interconnect-based virtualization, and storage-based virtualization.
Host-based virtualization is generally implemented by storage management software such as a logical volume manager (LVM). Physical devices are mapped into contiguous, sequential logical storage space. By managing this logical view, the user maps storage media onto logical volumes with the virtualization management software. The advantage of host-based virtual storage is that it is easy to implement and supports a wide variety of storage management functions in pure software. To further improve the reliability, security, and manageability of the storage system, multiple servers can use clustering technology to share storage. Its drawbacks are that the virtualization must be done in software, scalability and compatibility are poor, and the scheduling work affects the application performance of the server.

Storage virtualization can also be implemented in the storage subsystem or storage device. Storage-device-based virtualization suits heterogeneous SAN (storage area network) architectures and is better adapted to storage-centric environments. Device-level virtual storage is independent of the host; multiple hosts can be connected to the storage device, but the storage devices themselves should be homogeneous. Depending on the scheme adopted, mirroring, RAID, instant snapshots, and data copying can all use device-level virtual storage methods. Because storage devices from different vendors have poor compatibility and interoperability, without the support of third-party virtualization software, device-level virtual storage usually provides only an incomplete storage virtualization solution.

An interconnect-network-based virtualization solution moves the virtualization engine into the core of the SAN system, namely the interconnect network. The specific implementation depends on the network equipment: it may be the switch itself, a router, or a dedicated server. If it is implemented in a switch or router, the in-band mode is generally used, that is, data and metadata share the same path; if it is implemented with a dedicated server, either an in-band or an out-of-band mode is possible. In the in-band structure, the virtual control device sits between the servers and the storage devices, and the storage management software running on it manages and configures all storage devices. In the out-of-band structure, the virtual control device is outside the system data path and does not directly participate in data transmission; it configures all storage devices and submits the configuration information to all servers, so that when a server accesses a storage device the access no longer passes through the virtual control device. When management and maintenance costs are considered, host-based virtual storage is much worse than network-based virtual storage, because network-based virtual storage can provide a virtual resource management pool through which all kinds of resources are centrally managed, greatly reducing the maintenance and management workload and the number of administrators required.
Many storage companies now offer the above kinds of storage virtualization products. Host-based storage virtualization technology, such as Veritas' Volume Manager, is server-based virtualization software that can be installed on a server or host to virtualize multiple physical disks into logical volumes for users. Virsto developed a software solution that is installed on each host server and creates a high-performance disk or solid-state storage area called a "vLog". Read operations go directly to primary storage, while write operations pass through the vLog, which distributes the writes asynchronously to primary storage. Similar to the way a cache works, the vLog improves storage performance by decoupling the storage front end and reduces back-end storage latency. Device-based storage virtualization technology, such as Hitachi Data Systems' Universal Storage Platform, implements virtualization by consolidating and managing other storage under its own storage array system.

Network-based storage virtualization technologies, such as EMC's Invista, implement virtualization by inserting a server or intelligent switch device into an FC SAN or iSCSI SAN to intercept the I/O from the network devices to the storage controllers. DataCore's SANsymphony is network-based in-band virtualization software that runs on dedicated hardware between the servers and the storage devices. IBM's out-of-band virtualization solution, the SVC volume controller plus SAN file system, has a high market share. In addition, Tivoli's SANergy product and SGI's CXFS product are both virtualization software based on a SAN file system.
There are also several ways to implement storage virtualization, one of which is to use a "storage virtualization manager" (also called a storage hypervisor). A storage virtualization manager is a supervisory program that manages multiple "storage pools" as virtual resources. As a kind of virtualization engine, it treats all the storage hardware it manages as a common platform, even though that hardware may be heterogeneous and incompatible. To do this, a storage virtualization manager must understand the performance, capacity, and service characteristics of its "underlying storage". Here, "underlying storage" may refer to physical hardware, such as solid-state disks or hard disks, or to a storage architecture, such as a storage area network (SAN), network attached storage (NAS), or direct attached storage (DAS).

A storage virtualization manager can ensure that adding a new device (such as a new array) or replacing some or all of the resources of an existing storage pool does not cause a service interruption. This means that during such a process the storage virtualization manager knows exactly which storage features and functions it will gain and which it will ignore. In other words, a storage virtualization manager is more than a simple combination of a storage supervisor and storage virtualization functions; it is a higher level of intelligent software. It can not only control device-level storage controllers, disk arrays, and virtualization middleware, it can also perform storage provisioning, provide services such as snapshots and backup, and manage policy-driven service level agreements (SLAs). The storage virtualization manager also provides a solid foundation for further building software defined storage.

In short, using a storage virtualization manager has many benefits, including better use of existing storage facilities, higher administrator productivity, and further improvements in storage performance and availability. At present, three vendors mainly offer storage virtualization managers: IBM, DataCore, and Virsto. None of their technical solutions is publicly disclosed. There are many patents related to the concepts of storage and virtualization, but most of them do not involve the concept of a "storage virtualization manager" (storage hypervisor). Two examples follow:
1) Patent US20100153617A1, "Storage management system for virtual machines". The applicant of this patent is Virsto. Its purpose is to provide a better storage management solution for virtual desktop systems. It solves two problems: a) traditional server virtualization relies entirely on the virtualization manager (hypervisor) without storage management optimization, so performance bottlenecks appear when server virtualization in a data center is scaled up; b) traditional volume management often does not meet the functional requirements of server virtualization well. The patent applies to virtual desktop infrastructure (VDI), and its implementation type is host-based virtualization; both aspects are quite different from the present invention.

2) Patent US 8504757 B1, "Method for translating virtual storage device addresses to physical storage device addresses in a proprietary virtualization hypervisor". Its method is to allow third-party software or physical storage devices to add functionality to the I/O handler of a proprietary virtualization manager on a per-virtual-storage-device basis. The proprietary virtualization manager in that invention still refers to a server virtualization manager. It is worth pointing out that a server virtualization manager adding I/O processing functions to improve storage access efficiency and a "storage virtualization manager" belong to different conceptual categories; a server virtualization manager and a storage virtualization manager are completely different in mechanism.
Summary of the Invention

One object of the present invention is to construct a single unified storage component within a data center. The present invention provides a new method of cleverly using Ceph to implement a storage virtualization manager (storage hypervisor). The method is a distributed solution that uses the files/objects of a Ceph cluster to control/manage individual devices, while at the same time still allowing Ceph's traditional storage functions to be used.

The storage virtualization manager of the present invention is implemented using Ceph's distributed mechanism. A storage device object is abstracted as a Ceph file. Because this file is similar to a device file in Linux, we call it a device file. In general, a specific storage device corresponds to one storage device object, that is, to one specific device file. The device file is the bridge between the upper half and the lower half of the storage virtualization manager. More specifically, a device control request issued by the upper half of the storage virtualization manager can be regarded as one file write operation on the device file, and the response can be regarded as one file read operation on the device file through the Ceph interface (i.e., the unified adapter).
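By way of a non-limiting illustration, the following sketch shows what such an exchange might look like from the upper half's point of view when the Ceph file system is mounted as an ordinary POSIX path. The mount point, the device file name, and the JSON request/response format are assumptions made only for this example and are not prescribed by the present invention.

```python
import json
import time

# Assumed CephFS mount point and device file name; the JSON message format is
# likewise only an illustrative assumption.
DEVICE_FILE = "/mnt/cephfs/devices/fc-switch-01"

def send_control_request(request, timeout=10.0):
    # The device control request is issued as one file write on the device file ...
    with open(DEVICE_FILE, "w") as f:
        f.write(json.dumps(request))
    # ... and the response is obtained as one file read on the same device file.
    deadline = time.time() + timeout
    while time.time() < deadline:
        with open(DEVICE_FILE, "r") as f:
            text = f.read().strip()
        if text:
            message = json.loads(text)
            if message.get("in_reply_to") == request["id"]:
                return message
        time.sleep(0.5)
    raise TimeoutError("no response was written to the device file")

if __name__ == "__main__":
    # Ask the device behind this device file to carve out a 100 GB volume (LUN).
    print(send_control_request({"id": 1, "op": "create_lun", "size_gb": 100}))
```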
In the present invention, SMI-S is used to manage SAN storage, while SSH is used to manage switches. If a storage user's management application controls or monitors a device through SMI-S or SSH, then in some situations after virtualization the storage virtualization manager needs to provide a corresponding SMI-S module or SSH module to match the real underlying device. So that the unified adapter is not affected, multiple instances of these modules are distributed across the OSDs of Ceph. In this way, the SMI-S module and the SSH module can be shared by different storage device objects and are fault tolerant because multiple instances of them exist in Ceph's OSDs.
Specifically, the present invention provides a storage virtualization manager based on a Ceph distributed mechanism, characterized in that the storage virtualization manager comprises at least:

a first part of the storage virtualization manager (for example, the upper half of the storage virtualization manager); and a second part of the storage virtualization manager (for example, the lower half of the storage virtualization manager); wherein the first part of the storage virtualization manager is configured to be independent of specific storage devices and to exist as an abstraction of an underlying storage device, and the second part of the storage virtualization manager is configured to be implemented using the inherent characteristics of a Ceph cluster, the second part of the storage virtualization manager comprising various specific storage devices;

wherein a specific storage device corresponds to a device file, a device control request issued by the first part of the storage virtualization manager is regarded as one file write operation on the device file, and the response to the device control request is regarded as one file read operation on the device file through a Ceph client.
In one embodiment, the first part of the storage virtualization manager comprises:

a virtual finite state machine, which, as an abstraction of the underlying storage devices, provides their various capabilities to a resource decision module as the basis on which the resource decision module makes its decisions, the virtual finite state machine also working in cooperation with an overall decision module to guarantee the reliability of the entire storage virtualization manager system; a manager, which, during the unified construction of heterogeneous devices, collects and provides to the resource decision module information on device type, capability, QoS attributes, and current status, wherein, when the resource decision module sends a batch of resource reservation requests, the manager decides whether the resource reservation requests can be accepted, and if, from some moment on, the function and performance of the storage resources satisfy the user's requirements and the topology shows that the storage data links can be established, the resource decision module replies to the user with a message accepting the resource reservation; and

a unified adapter, implemented using the Ceph client mechanism, which is configured to obtain information about all of the underlying storage devices, including their topology, functions, and performance, and to provide that information to the manager, the unified adapter providing a unified device operation/control interface to perform monitoring and control tasks.
In one embodiment, the device file is a Ceph file.

In one embodiment, the unified adapter is the Ceph client.

In one embodiment, the monitoring and control tasks include monitoring or controlling Fibre Channel switch ports and allocating a volume from a storage area network (SAN).

In one embodiment, the first part of the storage virtualization manager further comprises a data facility configured to provide operations on user data, the operations including disaster recovery operations, compression and decompression, and deduplication.
In one embodiment, the second part of the storage virtualization manager comprises:

a Storage Management Initiative Specification (SMI-S) module configured to manage the storage of a storage area network (SAN); and a Secure Shell (SSH) protocol module configured to manage switches;

wherein the SMI-S module and the SSH module can be shared by different storage device objects, and instances of the SMI-S module and of the SSH module are distributed to the object storage devices (OSDs) of Ceph so as to be fault tolerant.
The present invention also provides a storage virtualization manager system based on a Ceph distributed mechanism, the system having a control plane, a data plane, and a data stream, characterized in that the storage virtualization manager system comprises: the storage virtualization manager described above;

a Live Resource attached storage infrastructure domain, comprising all storage area network (SAN) devices that work for Live Resource;

a customer storage infrastructure domain, comprising all storage area network (SAN) devices that work for customers; and

a customer computing infrastructure domain, comprising the host machines carrying all of the virtual machine clusters that work for customers, the virtual machine clusters accessing the data stored on their SAN devices in the customer storage infrastructure domain through the data plane, the Ceph client being located on the host machines.
In one embodiment, the system further comprises:

a resource decision module, which is part of the Live Resource domain and works on the control plane, the resource decision module deciding, according to the actual conditions provided by the storage virtualization manager, whether a storage resource reservation can succeed.

In one embodiment, the system further comprises:

an overall decision module, which is part of the Live Resource domain and works on the control plane.
In one embodiment, operations on the data plane do not take place during the resource reservation phase; when a virtual machine is started, the storage virtualization manager needs to establish a data link on the data plane, and the virtual machine accesses the data of the customer applications stored on the SAN through this link.

In one embodiment, the data stream participates in image library management and image transfer, an image containing a customer's application system together with the operating system it depends on; when a virtual machine on a host machine is started, the template image required by the virtual machine is copied from the SAN devices of the Live Resource attached storage infrastructure domain to the host machine, the copy operation being assisted by the Ceph client on the host machine.
In one embodiment, the storage virtualization manager system has three layers: a unified adapter layer, a Ceph transport layer, and a lower layer of physical devices, wherein:

the unified adapter layer comprises the unified adapter, a device private protocol is established between the unified adapter layer and the Ceph transport layer, and the unified adapter is responsible for communicating with the storage devices of the lower physical device layer without needing to know where those storage devices are;

the Ceph transport layer comprises the Ceph cluster and is responsible for the transport links, without caring about the data transported over them; and

the lower layer of physical devices comprises the individual specific storage devices, which are controlled by the objects in the Ceph cluster of the Ceph transport layer, the matching relationship between storage devices and objects being dynamic.

In one embodiment, the storage virtualization manager system runs on a Live Resource (LR) service delivery platform, which is a resource management system with autonomic computing characteristics based on the ACRA architecture.
The technical solution provided by the present invention is a distributed solution that uses the files/objects of a Ceph cluster to control/manage individual devices while at the same time still allowing Ceph's storage functions in the traditional sense to be used. The highlight of the solution is the Ceph cluster. The storage virtualization manager of the present invention can implement storage virtualization and, combined with the ACRA architecture, further implement an autonomic storage management system, thereby greatly improving the reliability and availability of the storage system.

Brief Description of the Drawings
The above summary of the invention and the following detailed description will be better understood when read in conjunction with the accompanying drawings. It should be noted that the drawings serve only as examples of the claimed invention. In the drawings, the same reference numerals denote the same or similar elements.

Figure 1 is an architecture diagram of an autonomic storage management system according to one embodiment of the present invention, showing the storage virtualization manager and its working environment;

Figure 2 is a diagram of the three layers involved in the workflow of the storage virtualization manager according to one embodiment of the present invention;

Figure 3 is a structural diagram of the implementation environment, the LR service delivery platform, according to one embodiment of the present invention; and

Figure 4 shows the ACRA architecture referenced by the implementation environment, the LR service delivery platform, according to one embodiment of the present invention.

Detailed Description
The detailed features and advantages of the present invention are described in the detailed description below; its content is sufficient to enable any person skilled in the art to understand the technical content of the present invention and to implement it accordingly, and from the specification, claims, and drawings disclosed herein a person skilled in the art can readily understand the objects and advantages of the present invention.

The present invention makes clever use of Ceph to implement a storage virtualization manager. Its inventive contribution lies in using the files/objects of a Ceph cluster as a distributed solution for controlling/managing the individual devices, while Ceph's storage functions in the traditional sense can continue to be used at the same time. The storage virtualization manager of the present invention can implement storage virtualization and, combined with the ACRA architecture, further implement an autonomic storage management system, thereby greatly improving the reliability and availability of the storage system.
Ceph is a petabyte-scale distributed file system for Linux. Put simply, Ceph is a high-performance, highly reliable, and scalable cluster composed of multiple PCs. It can be roughly divided into four parts (see Figure 1):

1. Client: used by data users; each Client instance provides a host or process with a set of POSIX-like interfaces; here, POSIX stands for Portable Operating System Interface;

2. Metadata Cluster Server (MDS): used to cache and synchronize distributed metadata; it manages the namespace (file names and directory names) and coordinates security, consistency, and coherence;

3. An Object Storage Cluster, containing multiple Object Storage Devices (OSDs), which stores all data and metadata;

4. Cluster Monitors (MONs), which perform monitoring functions.
Ceph uses POSIX-like interfaces in order to guarantee the extensibility and consistency of the interface and to stay consistent with applications, which helps improve system performance. While achieving high performance, high reliability, and high availability, Ceph obtains system scalability through three basic design features: decoupled data and metadata, dynamic distributed metadata management, and autonomous distributed object management.

The client uses the metadata server (MDS) for metadata operations to determine where data is located. The MDS not only manages data locations but also decides where new data is stored. It is worth noting that the metadata itself is stored on the storage cluster (labeled "metadata I/O"), while the actual file I/O takes place between the client and the object storage (OSD) cluster. In this way, higher-level POSIX functions (for example open, close, and rename) are managed through the MDS, while low-level POSIX functions (for example read and write) are managed directly by the OSD cluster.
Ceph Client

The intelligent control of the Ceph file system is distributed across the nodes, which not only simplifies the client interface but also gives Ceph the ability to scale dynamically to a large size. Traditional storage often uses an allocation list, in which metadata maps the blocks on a disk to a specified file. In Ceph, a file is assigned an inode number (INO) by the metadata server, which serves as the file's unique identifier. The file is then cut into several objects (their number depends on the size of the file). Using the file's inode number (INO) and each object's object number (ONO), every object is assigned an object identifier (OID).

Using a simple hash based on the object identifier (OID), each object is assigned to a placement group. Here, a placement group (identified by a PGID) is a conceptual container for objects. Finally, a pseudo-random mapping algorithm called CRUSH (Controlled Replication Under Scalable Hashing) maps the placement group onto a set of object storage devices (OSDs). In this way, mapping placement groups (and their replicas) to storage devices does not depend on metadata but on a pseudo-random mapping function. This approach is very useful because it not only minimizes storage overhead but also simplifies the process of distributing and looking up data.
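Purely for illustration, the following sketch traces this mapping chain (INO and ONO to OID, OID to placement group, placement group to OSDs) with simplified stand-ins; the hash used here and the crush() placeholder do not reproduce Ceph's actual algorithms and serve only to make the chain of steps concrete.

```python
import hashlib

def object_id(ino, ono):
    # Name each object from the file's inode number and the object number;
    # the exact naming format is illustrative.
    return f"{ino:x}.{ono:08x}"

def placement_group(oid, pg_num):
    # A simple hash of the object identifier modulo the number of placement groups.
    # Ceph's real implementation uses its own hash and stable-mod logic.
    return int(hashlib.md5(oid.encode()).hexdigest(), 16) % pg_num

def crush(pgid, osds, replicas=3):
    # Placeholder for CRUSH: deterministically pick `replicas` distinct OSDs for a
    # placement group. The real CRUSH algorithm walks a weighted cluster map.
    return [osds[(pgid + i) % len(osds)] for i in range(min(replicas, len(osds)))]

# A file with inode number 0x1234 split into four objects, mapped onto eight OSDs.
osds = [f"osd.{n}" for n in range(8)]
for ono in range(4):
    oid = object_id(0x1234, ono)
    pgid = placement_group(oid, pg_num=128)
    print(oid, "-> pg", pgid, "->", crush(pgid, osds))
```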
Thus, when a user opens a file, the client sends a request to the MDS cluster. The MDS translates the file name into an inode through the file system hierarchy and obtains the inode number (INO), mode, file size, and other metadata.

If the file exists and the operation rights can be obtained, the MDS returns the inode number, file length, and other file information through the hierarchy. The MDS also grants the client capabilities. At present there are four capabilities, each represented by one bit: read, cache read, write, and buffer write. In the future, capabilities will also include security keys, used by the client to prove to the OSDs that it is allowed to read and write the data (the current policy is to allow all clients).
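As a small illustration of the capability grant just described, the following sketch encodes the four capabilities as one bit each; the concrete bit values are an assumption of this example and are not those used inside Ceph.

```python
from enum import IntFlag

class Capability(IntFlag):
    READ = 0x1          # read
    CACHE_READ = 0x2    # cache read
    WRITE = 0x4         # write
    BUFFER_WRITE = 0x8  # buffer write

granted = Capability.READ | Capability.CACHE_READ   # what the MDS returned on open
wanted = Capability.WRITE | Capability.BUFFER_WRITE

if not granted & wanted:
    print("the client must ask the MDS for write capabilities before buffering writes")
```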
Metadata Server (MDS)

The metadata server (MDS) manages the inode space and converts file names into metadata. That is, the MDS translates a file name into an inode, a file size, and the striping data that the Ceph client uses for file I/O.

In fact, the task of the MDS is to manage the namespace of the file system. Although both metadata and data are stored in the object storage cluster, the two are managed separately for reasons of scalability. The metadata is further split across a cluster of MDSs, and these MDSs can adaptively replicate and distribute the namespace in order to avoid hot spots. Each MDS manages segments of the namespace, and for redundancy and performance these segments may overlap. The mapping from MDSs to the namespace in Ceph is implemented by a method called dynamic subtree partitioning, which allows Ceph to adjust to changes in the workload (that is, namespaces migrate between MDSs).
Ceph Object Storage (OSD)

As a kind of object store, Ceph's storage nodes include not only storage functions but also intelligent control. Traditional drives can only respond to commands, whereas an object storage device (OSD) is an intelligent device that can both issue requests and respond, and can therefore communicate and cooperate with other OSDs.

From the storage point of view, the Ceph OSD implements the mapping from objects to blocks (a task traditionally done in the client's file system layer). This design lets the local entity decide how best to store an object; for example, a B-tree file system (BTRFS) can be used on the storage nodes.

Thus, when one or more clients open the same file for reading, an MDS grants them the capability to read and cache the file contents. Using the inode number, the striping layout, and the file size, the client can name and locate all of the objects containing the file's data and read them directly from the OSD cluster; any object or byte range that does not exist is defined as a hole and reads as zeros. Similarly, if a client opens a file for writing, it obtains the capability to use buffered writes, and the data is written to the appropriate objects on the appropriate OSDs. When the client closes the file, it automatically gives these capabilities up.

In this process, because the Ceph client uses the CRUSH algorithm and knows nothing about the block layout of files on the physical disks, the underlying OSDs can safely manage the object-to-block mapping themselves. This allows the storage nodes to replicate data (in particular when a device fails). Since failure detection and recovery are distributed, the Ceph storage system scales very well. Ceph calls this RADOS.
RADOS (Reliable Autonomic Distributed Object Store) is a reliable, autonomic, distributed object store. An object store follows a simple principle: as part of the object store, every server runs software that manages and exports the server's local disk space; all instances of this software cooperate with one another across the cluster, presenting what appears from the outside to be a single, large data store. For internal storage management, the object storage software does not keep the data in its original format but saves it on the individual storage nodes as binary objects. More importantly, the number of individual nodes making up a large object store can be arbitrary; users can even add storage nodes dynamically while the system is running.

RADOS implements the object storage functions described above. Its key technology comprises three layers, from bottom to top:

1. Object storage devices (OSDs). In RADOS, an OSD always appears as a folder of an existing file system. Inside an OSD folder there is no hierarchical nesting; it contains only files with UUID-format names and no subfolders. Together, the OSDs constitute the object store; the binary objects are converted by RADOS from the files being stored and are then stored.

2. Monitoring servers (MONs): the monitoring servers form the interface to the RADOS store and support access to the individual objects within it. They work in a decentralized fashion and handle all communication with external applications; that is, there is no limit on the number of monitoring servers, and any client can contact any monitoring server. The monitoring servers manage the MONmap (the list of all monitoring servers) and the OSDmap (the list of all OSDs). The information provided by these two maps lets clients compute which OSD they need to contact when accessing a particular file.

3. Metadata servers (MDSs): the MDSs provide Ceph clients with the POSIX metadata of the individual objects in the RADOS object store.
Another concept involved in the present invention is the Autonomic Computing Reference Architecture (ACRA).

As shown in Figure 4, ACRA divides an autonomic computing system into three layers. At the bottom are the system components, or managed resources 4300. These managed resources 4300 can be resources of any type, hardware or software. A managed element can be any kind of internal resource, including databases, servers, routers, application modules, Web services, or virtual machines, or it can be another autonomic element; these resources may have some embedded, self-managing attributes. Each managed resource 4300 generally provides a standardized interface (touchpoint), and each touchpoint corresponds to a sensor/effector pair. A single autonomic element manages its internal resources through an autonomic manager on the one hand and, on the other hand, exposes a standard interface (sensor/effector pair) through which it accepts management, including policies specified by IT administrators and coordination information from other autonomic elements. For example, a parent autonomic element responsible for global orchestration can manage multiple subordinate child autonomic elements.

The middle layer contains the resource managers 4200, which generally fall into four categories: self-configuring, self-healing, self-optimizing, and self-protecting. Each resource may have one or more resource managers 4200, and each manager implements its own control loop.

The top layer contains the global autonomic managers 4100 that coordinate the various resource managers. Through large, system-wide control loops, these global autonomic managers 4100 achieve certain system-level management goals and realize system-wide autonomic management. Referring to Figure 4, on the left is the manual manager 4400, which provides IT professionals with a common system management interface through an integrated console. On the right is a knowledge base 4500, from which the manual manager 4400 and the autonomic managers 4100, 4200, 4300 can obtain and share all knowledge about the system.
Figure 3 shows an implementation environment according to one embodiment of the present invention: the Live Resource service delivery platform (LR for short). The Live Resource service delivery platform is an automated system that supports the reserved delivery of logical resources. The platform has four different kinds of users: the Project Developer, the Project Operator, the Application User, and the System Operator. The Project Developer designs the project development and test environments required by users. Through LR, he or she creates, saves, publishes, edits, previews, and deletes environment designs and views the resources belonging to the user.

The Project Operator deploys and removes a project environment. Through LR, he or she performs environment deployment, removal, backup management, repeated deployment, running-state management, project resource scheduling management, environment topology management, and resource consumption statistics for the environment.

The Application User deploys and accesses a project environment. Through LR, he or she rapidly deploys the business environment, quickly accesses the environment over SSH, quickly accesses the deployed services, and uses the self-service security management services.

The System Operator performs asset management and monitors the running state of the entire environment. Through the network management system 302 inside LR, he or she carries out resource discovery; the management of physical server resources, physical network resources, and physical storage resources; the management of virtual server resources, virtual network resources, and virtual storage resources; and resource alarm management.
As shown in Figure 3, the LR service delivery platform includes three levels of scheduling:

(1) Project delivery scheduling. This includes requirement design services for computing, storage, and network resources, system resource analysis services, and virtual resource reservation and deployment services, and is supported by the project delivery service network 300. Closely related to the present invention are system resource analysis and virtual resource reservation 301. The deployment process is the process of binding the logical resources of the logical delivery environment to virtual resources. Logical resources are bound to virtual resources in a one-to-one manner; this is the first binding in the automated reservation and delivery process for the entire logical delivery environment.

(2) Virtual resource scheduling. This includes the allocation, configuration, and provisioning of virtual resources and is supported by the resource engine component 304. The binding of virtual resources to physical resources must go through the resource engine 304; this is the second binding in the automated reservation and delivery process for the entire logical delivery environment. The resource engine 304 provides the "capability" of the various virtual resources by aggregating them. In addition, the resource engine 304 maintains a state model for each virtual resource and thereby completes the binding from virtual resources to physical resources.

(3) Physical resource scheduling. Agents 306, 307, and 308 on the physical resources accept resource-setting instructions from the resource engine 304 and implement resource multiplexing and resource space sharing; resource state information is passed back to the resource engine 304 through the agents 306, 307, and 308.

The functions above are carried by the data center physical resource service network 309, which is divided up per project and contains multiple physical delivery environments. The physical resource service network 309 supports the reserved delivery of delivery environments and supports sharing physical resources both by space and by time, covering many unallocated and allocated physical resources such as network, storage, and computing resources. In addition to managing the various physical resources of the physical data center, the System Operator is also responsible for dividing up the physical delivery environments.
Referring to Figure 3, the resource engine 304 uses the physical resource information provided by the network management system (NMS) 302 to track physical resources and obtain the latest resource states, and maps physical resources to virtual resources. Commercial network management systems used to manage physical resources generally provide information about state and performance and have the ability to discover physical resources, so they are not described further here.

The various kinds of storage resources include storage area networks, network attached storage, distributed file systems, Ceph, images, and so on. The information on the physical resources described above is stored in the reference model 303, as shown in Figure 3. In addition, either "push" or "pull" scheduling can be chosen between the project delivery service network 300 and the resource engine 304. With "push" scheduling, resource change requests are committed regardless of the capacity of the physical delivery environment, and parallel resource provisioning is supported. With "pull" scheduling, resource change requests are committed only when the capacity of the physical delivery environment is ready, and parallel resource provisioning is likewise supported.

Referring to Figure 3, the resource engine 304 in LR performs the binding of virtual resources to physical resources. Its main task is to virtualize the physical resources, including all kinds of storage resources, and the "storage virtualization manager" 305 is the key component for implementing storage virtualization. It is the focus of the present invention.
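As a non-limiting sketch of the two bindings described above, the following hypothetical data model shows a first, one-to-one binding from logical to virtual resources and a second binding performed by a much simplified resource engine that also keeps a per-virtual-resource state model; all class names, fields, and identifiers are invented for illustration only.

```python
# Hypothetical sketch: first binding (logical -> virtual, 1:1) and second binding
# (virtual -> physical) carried out by a simplified resource engine.
class ResourceEngine:
    def __init__(self, physical_pool):
        self.physical_pool = list(physical_pool)   # aggregated physical capabilities
        self.virtual_state = {}                    # per-virtual-resource state model

    def bind_virtual(self, virtual_id, requirement):
        # Second binding: pick any physical resource whose remaining capacity
        # covers the requirement, then record the binding in the state model.
        for phys in self.physical_pool:
            if phys["free_gb"] >= requirement["size_gb"]:
                phys["free_gb"] -= requirement["size_gb"]
                self.virtual_state[virtual_id] = {"bound_to": phys["name"], "state": "bound"}
                return phys["name"]
        raise RuntimeError("no physical resource can satisfy the requirement")

# First binding: a logical volume of the delivery environment maps 1:1 to a virtual volume.
logical_to_virtual = {"logical-vol-app01": "virtual-vol-0001"}

engine = ResourceEngine([{"name": "san-array-a", "free_gb": 2048}])
print(engine.bind_virtual(logical_to_virtual["logical-vol-app01"], {"size_gb": 500}))
print(engine.virtual_state)
```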
Referring back to Figure 1, Figure 1 is an architecture diagram of an autonomic storage management system according to one embodiment of the present invention, showing the storage virtualization manager and its working environment.

As shown in Figure 1, the frame structure of the "storage virtualization manager" 1000 is shown inside the rounded rectangle. In one embodiment, the storage virtualization manager 1000 can be divided into two logical parts: the first part (upper half) 1001 of the storage virtualization manager and the second part (lower half) 1002 of the storage virtualization manager. The upper half is independent of specific devices and exists as an abstraction layer over the underlying devices; it is part of the Live Resource domain 1010 of the IaaS platform tool. The lower half, by contrast, includes the various specific devices and is implemented using the inherent characteristics of the Ceph cluster 1020.

The main components of the storage virtualization manager 1000 and the interactions between them are briefly described below; please refer to Figure 1:
1) vFSM 1011

vFSM 1011 is the (local) Virtual Finite State Machine. It has two important functions. One is, as an abstraction of the underlying storage devices, to provide their various "capabilities" to the user of the storage virtualization manager 1000, that is, the Resource Decision Maker 1100 (resource decision module), as the basis for the decisions the Resource Decision Maker 1100 makes. The other is to act as the decision-making part of the Smart Storage Device, working together with the system-wide overall decision module coreVFSM 1200 to guarantee the reliability of the whole system. The smart storage device here means the storage virtualization manager 1000 and all of the SAN storage it controls, including the LR attached storage infrastructure domain 1300 and the customer storage infrastructure domain 1400, viewed as a whole and providing intelligent virtualized storage functions to upper-layer users.

2) Manager 1012

An important precondition for the Manager 1012 to be able to schedule the storage resources it manages in a unified way is that the Unified Adapter 1014 knows almost everything about all of the underlying storage devices, such as their topology, functions, and performance. During the unified construction of heterogeneous devices, the Manager 1012 collects and provides to the Resource Decision Maker 1100 information on device type, capability, QoS, and other attributes, together with the current status.

For example, when the Resource Decision Maker 1100 sends a batch of resource reservation requests, the Manager 1012 needs to analyze the information it holds and decide whether the resource reservation can be accepted. If, from some moment on, the function and performance of the storage resources satisfy the user's requirements, and the topology shows that the storage data links can be established, the Resource Decision Maker 1100 replies to the user with a message accepting the resource reservation. Note that at this point the data links for accessing the storage have not yet been established; the real data links cannot be built until at least one compute node has been started.
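The decision just described can be sketched, purely for illustration, as a check that every request in a batch can be matched to a device whose capability and QoS satisfy it and to which a data link could later be established; the data structures below are assumptions of this example, not of the invention.

```python
def can_reserve(requests, devices, topology):
    # Accept the batch only if every request has at least one device whose free
    # capacity and QoS level satisfy it and for which the topology shows that a
    # data link from the requesting host could later be built.
    for req in requests:
        satisfied = any(
            dev["free_gb"] >= req["size_gb"]
            and dev["qos"] >= req["qos"]
            and topology.get((req["host"], dev["name"]), False)
            for dev in devices
        )
        if not satisfied:
            return False      # reject the whole batch of reservation requests
    return True               # reservation accepted; data links are built later

devices = [{"name": "san-a", "free_gb": 1000, "qos": 3}]
topology = {("host-01", "san-a"): True}
print(can_reserve([{"host": "host-01", "size_gb": 200, "qos": 2}], devices, topology))
```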
3) Data Facility 1013

The Data Facility 1013 is an optional module. It is used for operations on application (App) data, that is, user data, for example disaster recovery operations such as backup and restore, or optimization operations such as compression and decompression and deduplication.

4) Unified Adapter 1014

Based on the device adapter layer below it, which relies on the storage devices in the Ceph cluster 1020, the Unified Adapter 1014 provides a unified device operation/control interface and performs monitoring and control tasks, for example monitoring or controlling Fibre Channel switch (FC SW) ports or allocating a volume (LUN) from a SAN. The Unified Adapter 1014 itself is implemented using Ceph's client mechanism.
The storage virtualization manager 1000 of the present invention is implemented using Ceph's distributed mechanism. A storage device object is abstracted as a Ceph file. Because this file is similar to a device file in Linux, we call it a device file 1030. In general, a specific storage device corresponds to one storage device object, that is, to one specific device file. The device file 1030 is the bridge between the first part 1001 (upper half) and the second part 1002 (lower half) of the storage virtualization manager. More specifically, a device control request from the upper half 1001 of the storage virtualization manager can be regarded as one file write operation on the device file 1030, and the response can be regarded as one file read operation on the device file 1030 through the Ceph interface (i.e., the unified adapter 1014).
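To complement the upper-half sketch given earlier, the following non-limiting example sketches the lower half's side of the same exchange: an agent co-located with the Ceph OSDs polls the device file, dispatches the request to the matching back-end module (SMI-S for SAN arrays, SSH for switches), and writes the response back to the device file. The file path, message format, and dispatch table are assumptions made for illustration only.

```python
import json
import time

DEVICE_FILE = "/mnt/cephfs/devices/fc-switch-01"   # assumed path, as in the earlier sketch

def handle_via_ssh(request):       # placeholder for the SSH module 1022
    return {"status": "ok", "detail": f"executed {request['op']} over SSH"}

def handle_via_smis(request):      # placeholder for the SMI-S module 1021
    return {"status": "ok", "detail": f"executed {request['op']} over SMI-S"}

DISPATCH = {"fc-switch": handle_via_ssh, "san-array": handle_via_smis}

def serve(device_type, poll_interval=0.5):
    last_id = None
    while True:
        with open(DEVICE_FILE, "r") as f:
            text = f.read().strip()
        if text:
            message = json.loads(text)
            # Only unanswered requests are handled; responses carry "in_reply_to".
            if "in_reply_to" not in message and message.get("id") != last_id:
                reply = DISPATCH[device_type](message)
                reply["in_reply_to"] = message["id"]
                with open(DEVICE_FILE, "w") as f:
                    f.write(json.dumps(reply))
                last_id = message["id"]
        time.sleep(poll_interval)

# serve("fc-switch")   # would run the agent loop for a switch device file
```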
5) SMI-S模块 1021和 SSH模块 1022  5) SMI-S module 1021 and SSH module 1022
网络管理人员管理来自不同厂商的 SAN时需要多种独立的管理应用。 这些管 理应用由不同厂商开发,连接多种硬件管理 API。因此美国存储网络工业协会 (SNIA) 提出了存储管理建议规范 (SMI-S, Storage Management Initiative Specification) 。 SMI-S的主旨就是把存储网络的管理对象, 以及用来管理对象的工具统一起来, 最终让所有的存储网络部件都可以利用本地的 SMI-S接口加以部署。 这样可以使 所有的部件都采用一种通用的接口, 管理功能的实现就更方便。 SMI-S建立在基于 Web的企业管理 (WBEM)技术和公共信息模型 (CIM)的基础上, 实际上就是安装在 所管理的对象与所管理的应用之间的中间件。 例如 SMI-S代理可以询问一台设备, 如交换机、 主机或存储阵列, 从具有 CIM 功能的设备中提取相关管理数据, 并将 数据提供给请求者。  Network managers need multiple independent management applications to manage SANs from different vendors. These management applications are developed by different vendors and connect to multiple hardware management APIs. Therefore, the Storage Networking Industry Association (SNIA) has proposed the Storage Management Initiative Specification (SMI-S). The main purpose of SMI-S is to unify the management objects of the storage network and the tools used to manage the objects, and finally allow all storage network components to be deployed using the local SMI-S interface. This allows all components to adopt a common interface, making management functions more convenient. Based on Web-based Enterprise Management (WBEM) technology and Common Information Model (CIM), SMI-S is actually a middleware installed between managed objects and managed applications. For example, an SMI-S agent can query a device, such as a switch, host, or storage array, to extract relevant management data from a CIM-enabled device and provide the data to the requester.
SSH is the Secure Shell protocol, well known to those skilled in the art, and is not described further here. In one embodiment of the present invention, SMI-S is used to manage SAN (storage area network) storage, while SSH is used to manage switches. If a storage user's management application controls or monitors a device through SMI-S or SSH, then in some situations after virtualization the storage virtualization manager 1000 needs to provide a corresponding SMI-S module or SSH module to match the real underlying device. In one embodiment, so that the unified adapter 1014 is not affected, multiple instances of these modules can be distributed across Ceph's OSDs. In this way the SMI-S module 1021 and the SSH module 1022 can be shared by different storage device objects, and they are also fault-tolerant because multiple instances of them exist in Ceph's OSDs. In addition, since Ceph's MDS is of no concern to the storage virtualization manager, in one embodiment the MDS can be omitted from Figure 1.
The back-end storage devices include all SAN (storage area network) devices working for the LR within the LR-attached storage infrastructure domain 1300, as well as all SAN devices working for customers within the customer storage infrastructure domain 1400. These back-end devices are operated by the SMI-S module 1021 and the SSH module 1022 through Control paths 1604 and 1605, for example to carve out a volume (LUN) or to delete a volume (LUN). We call this Fabric Management.
6) Data Plane 1601
Operations on the data plane 1601 do not occur during the resource reservation phase. When a computing service starts, for example when virtual machine VM 1511 is launched, the storage virtualization manager 1000 needs to establish a data link from the source HBA 1514 to a Fibre Channel switch port (omitted in Figure 1), then to the destination HBA 1410, and finally to the SAN 1401. Virtual machine VM 1511 accesses the customer APP data stored on the SAN through this link.
7) Data Stream 1602
The data stream 1602 (Streaming) participates in image library management and image transfer. The images here belong to APP Content and contain the customer's application system together with the operating system it depends on. For example, when virtual machine VM 1511 on host 1510 starts, the template image it requires is copied from the SAN of the LR-attached storage infrastructure domain 1300 to the host. That copy is assisted by the Ceph client 1513 on host 1510.
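A host-side copy of a template image out of the cluster might look like the sketch below, which streams the object in fixed-size chunks with the librados Python bindings. The pool name, object name, local path, and chunk size are illustrative assumptions; the patent does not prescribe how template images are laid out in the cluster.

```python
# Minimal sketch: the Ceph client on the host copies a template image object
# to a local file in 4 MiB chunks. All names and paths are assumptions.
import rados

CHUNK = 4 * 1024 * 1024

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('template_images')
    size, _mtime = ioctx.stat('centos7-webapp.img')        # hypothetical image
    with open('/var/lib/libvirt/images/vm1511.img', 'wb') as out:
        offset = 0
        while offset < size:
            data = ioctx.read('centos7-webapp.img', CHUNK, offset)
            out.write(data)
            offset += len(data)
    ioctx.close()
finally:
    cluster.shutdown()
```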
Here, using a Ceph cluster within the LR-attached storage infrastructure domain 1300 can relieve the communication congestion caused by concurrent access. For example, congestion frequently occurs when a large number of virtual machines start at the same time.
In the storage management system shown in Figure 1, in addition to the storage virtualization manager 1000 the system also includes, but is not limited to, the following modules:
1) Resource Decision Module 1100
The Resource Decision Maker 1100 is part of the LR and works on the control plane. For example, when a bank wants to run a new online banking system it needs a certain amount of computing power, memory, and storage, and all of these resources must be reserved. Based on the actual conditions provided by the storage virtualization manager, the resource decision module 1100 decides whether the storage resources can satisfy the requirements of running the new online banking system; for example, if the SAN is only 1% free, the storage resource reservation cannot succeed.
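A deliberately simplified sketch of that admission decision is shown below. It checks only free capacity, whereas the module described here also weighs QoS attributes and whether a data link can be established; the function, its parameters, and the 5% headroom threshold are assumptions for illustration.

```python
# Simplified reservation check: accept only if enough free capacity remains.
def can_reserve(requested_gb: int, san_capacity_gb: int, san_used_gb: int,
                headroom: float = 0.05) -> bool:
    """Keep a safety margin of `headroom` of total capacity after reserving."""
    free_after = san_capacity_gb - san_used_gb - requested_gb
    return free_after >= headroom * san_capacity_gb

# Example: a SAN that is 99% full cannot accept a 500 GB reservation.
print(can_reserve(500, san_capacity_gb=100_000, san_used_gb=99_000))   # False
```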
2) coreVFSM 1200
The coordinating decision module 1200 (coreVFSM) is part of the LR and works on the control plane.
3) LR-Attached Storage Infrastructure Domain 1300
The LR-Attached Storage Infrastructure Domain 1300 includes all SAN devices working for the LR.
4) Customer Storage Infrastructure Domain 1400
The Customer Storage Infrastructure Domain 1400 includes all SAN devices working for customers, for example the SAN devices that a bank's application systems use to store their data.
5) Customer Computing Infrastructure Domain 1500
The Customer Computing Infrastructure Domain 1500 includes all computing resources working for customers, for example the virtual machine clusters that support the bank's various application systems. These virtual machines (for example VM 1511) access their SAN-stored data within the customer storage infrastructure domain 1400 through the data plane 1601.
A specific embodiment of the autonomous storage management system implemented with Ceph, as provided by the present invention, is described in further detail below.
As shown in Figure 1, viewed downward from the upper half 1001 of the storage virtualization manager, each storage device corresponds to one device file 1030, much like the ioctl mechanism in the Linux device driver model. The device files are now distributed onto the OSDs of the Ceph cluster, so under Ceph's built-in mechanisms the storage virtualization manager possesses a distributed device file system.
ioctl is short for I/O control. Simply put, when writing a Linux device driver one encounters I/O operations that logically belong neither to read nor to write; those operations can be regarded as the ioctl part. read and write are meant for writing and reading data and should be treated purely as data exchange, whereas ioctl is used to control certain options of read and write. For example, suppose a user designs a generic driver module for reading and writing I/O ports: read and write move data through the port, but how should switching the port being read or written be handled? Clearly it is more reasonable to implement that with ioctl. Likewise, whether reads and writes on the port may block, or whether reads and writes on the device file may be concurrent, can all be designed as options configured through ioctl. As for parameters, the general ioctl parameter format is a command word (a constant) plus command arguments, whereas the parameter format of read and write is a data buffer, a destination pointer, and a length.
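The "command word plus argument" pattern can be seen even from user space. The sketch below uses Python's standard fcntl module to issue the Linux TIOCGWINSZ ioctl, which reads the terminal window size; it is only an analogy for the out-of-band control described above and assumes it is run on a terminal.

```python
# Analogy only: an ioctl is a command word (TIOCGWINSZ) plus an argument buffer,
# unlike read/write, which move a plain data buffer of a given length.
import array
import fcntl
import termios

with open('/dev/tty') as tty:
    winsize = array.array('H', [0, 0, 0, 0])   # rows, cols, xpixel, ypixel
    fcntl.ioctl(tty.fileno(), termios.TIOCGWINSZ, winsize, True)
    print('terminal is', winsize[0], 'rows x', winsize[1], 'cols')
```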
Figure 2 is a diagram of the three layers involved in the workflow of the storage virtualization manager according to one embodiment of the present invention. As shown in Figure 2, the present invention creates a Ceph object as an abstraction of a Fibre Channel switch (FC SW). In one embodiment, the Fibre Channel switch may be a Brocade switch. From the unified adapter layer 2100 (Unified Adapter Layer; see the unified adapter 1014 in Figure 1, with the unified adapter layer 2100 bound to a Ceph client) we write a control message to the device file (see device file 1030 in Figure 1); the device file referred to here is, more specifically, some object in the Ceph cluster 2210, for example object 2211. This triggers an SSH process running on some OSD (see SSH module 1022 in Figure 1), which, through control path 1604 or 1605, can successfully disable a port of the Fibre Channel switch (i.e., one of the devices in storage 2310). As expected, SSH then writes the result status of this disable operation back to the device file (for example, object 2211). That write-back triggers the unified adapter layer 2100 (see unified adapter 1014 in Figure 1), which can obtain the result status of the operation from the device file (for example, object 2211). We can verify through other means that the specified port has indeed been disabled, for example by manually logging in to the Fibre Channel switch's management terminal and querying it.
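The OSD-side half of that exchange might look like the sketch below: a process co-located with the object polls the device file for control messages, applies the command to the switch over SSH, and writes the result status back for the unified adapter to read. paramiko, the Brocade-style `portdisable` command, the credentials, and the request/response object convention are all assumptions for illustration.

```python
# Hypothetical OSD-side SSH agent: read a control message from the device-file
# object, apply it to the switch over SSH, and write the result status back.
import json
import time
import paramiko
import rados

def handle_once(ioctx):
    try:
        req = json.loads(ioctx.read('fcsw-001.request').decode())
    except rados.ObjectNotFound:
        return                                   # no pending control message
    if req.get('op') != 'disable_port':
        return
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect('fcsw-001.example', username='admin', password='secret')
    _stdin, stdout, stderr = ssh.exec_command(f"portdisable {req['port']}")
    status = {'op': req['op'], 'port': req['port'],
              'ok': stderr.read() == b'', 'output': stdout.read().decode()}
    ssh.close()
    ioctx.write_full('fcsw-001.response', json.dumps(status).encode())
    ioctx.remove_object('fcsw-001.request')      # consume the request

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('device_files')
while True:                                      # simple polling loop
    handle_once(ioctx)
    time.sleep(1)
```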
An object in Figure 2 (for example object 2211) is an instance of a file among the device files 1030 of Figure 1, but operating on an object differs from operating on a file: the former is a single whole in Ceph and easy for the upper half 1001 of the storage virtualization manager to operate on, whereas the latter is stored in Ceph as a series of fragments, which makes it awkward to operate on.
Referring to Figure 2, viewed vertically, the workflow of the storage virtualization manager involves three layers: the unified adapter layer 2100, the Ceph transport layer 2200, and the lower-layer physical devices 2300. Viewed horizontally, these three layers apply not only to storage virtualization but also to network virtualization and compute virtualization.
Clearly, establishing a device-private protocol between the unified adapter layer 2100 and the Ceph transport layer 2200 and communicating with its device is the responsibility of the unified adapter 2110, and the unified adapter 2110 does not need to know where the device is. The Ceph cluster in the Ceph transport layer 2200 mainly serves as the transport link; it does not care about the data carried over it. Finally, each storage device among the lower-layer physical devices 2300 is controlled by a corresponding object within the Ceph cluster of the Ceph transport layer 2200, and the mapping between storage devices and objects is dynamic.
Referring to Figure 1, the workflow for requesting a volume (LUN) on the SAN through the storage virtualization manager is as follows (a code sketch follows the steps):
Step 1: "alloc_lun" is written, as a message, into the device file;
Step 2: The change to the device file triggers the SSH process on the OSD; the SSH process continuously monitors the device file and now reads out the "alloc_lun" message;
Step 3: "alloc_lun" is carried out through Ceph, and the SAN allocates the requested volume (LUN);
Step 4: A message reporting that the LUN was allocated successfully is written back into the device file;
Step 5: After querying the device file, the unified adapter learns that the operation has completed and then notifies the user.
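On the adapter side, the five steps can be packaged into one helper that reuses the write-then-poll pattern from the earlier sketch. Object names, the message format, and the timeout are assumptions; step 3, the actual carving of the LUN out of the SAN, runs on the OSD side and is not shown.

```python
# Adapter-side sketch of the five-step LUN allocation workflow (assumed names).
import json
import time
import rados

def alloc_lun(ioctx, device: str, size_gb: int, timeout_s: int = 60) -> dict:
    # Step 1: write the "alloc_lun" request into the device file.
    msg = {'op': 'alloc_lun', 'size_gb': size_gb}
    ioctx.write_full(f'{device}.request', json.dumps(msg).encode())
    # Steps 2-4 happen on the OSD hosting the object: the monitoring process
    # reads the message, has the SAN allocate the LUN, and writes the result back.
    # Step 5: poll the device file for the completion message.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            return json.loads(ioctx.read(f'{device}.response').decode())
        except rados.ObjectNotFound:
            time.sleep(1)
    raise TimeoutError(f'no response from {device}')

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('device_files')
print(alloc_lun(ioctx, 'san-1401', size_gb=100))
```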
Referring to Figure 2, the method above embodies the most fundamental idea of the storage virtualization manager implemented by the present invention, namely: when the devices in storage 2310 are dynamic and heterogeneous, instead of the original user 2000 (or, for example, the resource decision maker 1100) directly accessing a specific device in storage 2310, access is redirected to an object in the Ceph cluster 2210, that is, a storage device object. Then, when a device in storage 2310 changes, none of the user 2000 programs that access the storage device object are affected; with the original direct-access approach, the user 2000 would have to make corresponding adjustments.
Figure 4 shows the ACRA architecture referenced by the LR service delivery platform, the implementation environment of one embodiment of the present invention. The LR service delivery platform shown in Figure 4 is a resource management system with autonomic computing characteristics; in one embodiment its implementation references IBM's ACRA architecture and, of course, also includes its storage resource management part.
Referring to Figures 3 and 4, the lower half of Ceph storage is RADOS 4300, the Reliable Autonomic Distributed Object Store. As its name implies, it has the property of self-management; it is one kind of managed resource 4300. The resource engine 304 includes the resource manager 4200 and the global autonomic manager 4100. The storage virtualization manager 305 is one kind of resource manager 4200; the knowledge inside the resource manager 4200 (an autonomic element) includes the vFSM, and the knowledge inside the global autonomic manager 4100 (an autonomic element) includes the coreVFSM. The vFSM is the virtual finite state machine local to the storage virtualization manager. It cooperates with the virtual finite state machine of the whole system, coreVFSM 1200, to improve the availability and reliability of the entire storage system.
The storage virtualization manager of the present invention needs to support the autonomic computing functions of the resource manager 4200. Referring to Figure 1, as described above, the Ceph/RADOS mechanism has self-management properties; with its support the SAN can raise alarms automatically. For example, if a port of the Fibre Channel switch (i.e., one of the devices in storage 2310) fails, the alarm is reported through SSH 1022 and device file 1030 up to the unified adapter 1014. The vFSM 1011 is the virtual finite state machine of the storage virtualization manager 1000. It has two important functions: one is to serve as the abstraction of the underlying storage devices; the other is to act as the decision-making part of the Smart Storage Device. Since a damaged Fibre Channel switch port is within its control, it can send down a message asking the switch to use a different port. As another example, if a SAN device raises a large-scale damage alarm due to external causes, such a problem exceeds what the storage virtualization manager 1000 can resolve; in that case the state of the vFSM 1011 needs to be reported up to the coreVFSM 1200 and resolved in coordination with the decision-making part of the higher-level LR service delivery platform, possibly even with manual intervention.
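A hedged sketch of that local-fix-or-escalate decision is shown below. The alarm fields, state names, and helper functions are assumptions for illustration; the text only states that port-level faults are handled locally while large-scale SAN damage is escalated to the coreVFSM.

```python
# Sketch of the vFSM's alarm handling: fix locally when possible, else escalate.
from enum import Enum, auto

class DeviceState(Enum):
    NORMAL = auto()
    DEGRADED = auto()      # handled locally, e.g. move traffic to another port
    FAILED = auto()        # beyond local control, escalate to the coreVFSM

def issue_control(device, action):
    print(f'[vFSM] local action on {device}: {action}')

def report_to_corevfsm(alarm):
    print(f'[vFSM] escalating to coreVFSM: {alarm}')

def handle_alarm(alarm: dict) -> DeviceState:
    if alarm['kind'] == 'port_down':
        issue_control(alarm['device'], f"switch traffic off port {alarm['port']}")
        return DeviceState.DEGRADED
    if alarm['kind'] == 'san_wide_failure':
        report_to_corevfsm(alarm)            # higher-level platform decides
        return DeviceState.FAILED
    return DeviceState.NORMAL

handle_alarm({'kind': 'port_down', 'device': 'fcsw-001', 'port': 4})
```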
Accordingly, the autonomous storage management system provided by the present invention has the following characteristics:
(i) Self-configuration: it can adapt to changes in the storage system. Such changes may include the deployment of new storage devices or the removal of existing storage devices; dynamic adaptation helps ensure continuous operation of the storage devices and software.
(ii) Self-optimization: it can automatically monitor and coordinate storage resources to meet end-user or enterprise requirements, thereby providing high-performance storage services.
(iii) Self-healing: it can detect storage system failures and initiate predetermined recovery actions without interrupting the rest of the storage system, giving the storage system higher reliability and availability.
All of the autonomic characteristics above can be accomplished jointly by the Live Resource service delivery platform, the implementation environment of the present invention, with the participation of the storage virtualization manager described here. The terms and expressions used here are for description only, and the present invention should not be limited to them. The use of these terms and expressions is not meant to exclude any equivalent features of what is shown and described (or parts thereof), and it should be recognized that various possible modifications are also intended to fall within the scope of the claims. Other modifications, variations, and substitutions may also exist; accordingly, the claims should be regarded as covering all such equivalents.
Likewise, it should be noted that although the present invention has been described with reference to the present specific embodiments, those of ordinary skill in the art should recognize that the above embodiments are merely intended to illustrate the present invention, and that various equivalent changes or substitutions may be made without departing from the spirit of the present invention; therefore, changes and variations of the above embodiments that remain within the essential spirit of the present invention shall fall within the scope of the claims of this application.

Claims

1. A storage virtualization manager based on a Ceph distributed mechanism, wherein the storage virtualization manager comprises at least:
a first part of the storage virtualization manager; and
a second part of the storage virtualization manager;
wherein the first part of the storage virtualization manager is configured to be independent of specific storage devices and to exist as an abstraction of specific storage devices; and the second part of the storage virtualization manager is configured to be implemented using the inherent features of a Ceph cluster, the second part of the storage virtualization manager comprising various specific storage devices;
wherein a specific storage device corresponds to a device file, a device control request issued by the first part of the storage virtualization manager is regarded as a file write operation on the device file, and a response to the device control request is regarded as a file read operation on the device file through a Ceph client.
2. The storage virtualization manager according to claim 1, wherein the first part of the storage virtualization manager comprises:
a virtual finite state machine, which, as an abstraction of the underlying storage devices, provides various functions to a resource decision module as the basis for the decisions of the resource decision module, the virtual finite state machine also cooperating with a coordinating decision module to guarantee the reliability of the entire storage virtualization manager system;
a manager, which, during the unified construction of heterogeneous devices, collects and provides to the resource decision module the device type, capability, QoS attributes, and current-status information, wherein, when the resource decision module sends batches of resource reservation requests, the manager decides whether the resource reservation requests can be accepted; if, from some moment on, the storage resources satisfy the user's requirements in function and performance, and the topology shows that the storage data link can be established, the resource decision module replies to the user with a message accepting the resource reservation; and
a unified adapter, implemented using Ceph's client mechanism, the unified adapter being configured to obtain information on all underlying storage devices and to provide the information to the manager, the information including the topology, functions, and performance of the underlying storage devices, the unified adapter providing a unified device operation/control interface to carry out monitoring and control tasks.
3. The storage virtualization manager according to claim 2, wherein the device file is a Ceph file.
4. The storage virtualization manager according to claim 2, wherein the unified adapter is the Ceph client.
5. The storage virtualization manager according to claim 2, wherein the monitoring and control tasks include monitoring or controlling Fibre Channel switch ports and allocating a volume from a storage area network (SAN).
6. The storage virtualization manager according to claim 2, wherein the first part of the storage virtualization manager further comprises:
a data facility configured to provide operations on user data, the operations including disaster recovery, compression and decompression, and removal of redundant data.
7. The storage virtualization manager according to claim 1, wherein the second part of the storage virtualization manager comprises:
a Storage Management Initiative Specification (SMI-S) module configured to manage storage of a storage area network (SAN); and
a Secure Shell (SSH) protocol module configured to manage switches;
wherein the SMI-S module and the SSH module can be shared by different storage device objects, and instances of the SMI-S module and the SSH module are distributed among the object storage devices of a Ceph cluster so as to be fault-tolerant.
8. A storage virtualization manager system based on a Ceph distributed mechanism, the system having a control plane, a data plane, and a data stream, wherein the storage virtualization manager system comprises:
the storage virtualization manager according to any one of claims 1 to 7;
a Live Resource (LR) attached storage infrastructure domain, comprising storage area network (SAN) devices working for the Live Resource;
a customer storage infrastructure domain, comprising storage area network (SAN) devices working for customers; and
a customer computing infrastructure domain, comprising hosts carrying virtual machine clusters working for customers, the virtual machine clusters accessing, through the data plane, the data stored on their SAN devices within the customer storage infrastructure domain, the hosts carrying the Ceph client.
9. The storage virtualization manager system according to claim 8, further comprising: a resource decision module, which is part of the Live Resource domain and works on the control plane, the resource decision module deciding, according to the actual conditions provided by the storage virtualization manager, whether a storage resource reservation can succeed.
10. The storage virtualization manager system according to claim 8, further comprising: a coordinating decision module, which is part of the Live Resource domain and works on the control plane.
11. The storage virtualization manager system according to claim 8, wherein operations on the data plane do not occur during the resource reservation phase; when the virtual machine starts, the storage virtualization manager needs to establish a data link on the data plane, and the virtual machine accesses the data of the customer application stored on the SAN through the data link.
12. The storage virtualization manager system according to claim 8, wherein the data stream participates in image library management and image transfer, the image containing the customer's application system and the operating system on which it depends; when a virtual machine on the host starts, the template image required by the virtual machine is copied from the SAN devices of the Live Resource attached storage infrastructure domain to the host, the copying being assisted by the Ceph client on the host.
13. The storage virtualization manager system according to claim 8, wherein the storage virtualization manager system has three layers: a unified adapter layer, a Ceph transport layer, and lower-layer physical devices; wherein the unified adapter layer comprises the unified adapter, a device-private protocol is established between the unified adapter layer and the Ceph transport layer, and the unified adapter is responsible for communicating with the storage devices of the lower-layer physical devices without needing to know where the storage devices are;
the Ceph transport layer comprises a Ceph cluster and is responsible for the transport link, without needing to care about the data transmitted over it; and
the lower-layer physical devices comprise the specific storage devices, which are controlled by the objects within the Ceph cluster of the Ceph transport layer, the matching relationship between the storage devices and the objects being dynamic.
14. The storage virtualization manager system according to claim 8, wherein the storage virtualization manager system runs on a Live Resource (LR) service delivery platform, the Live Resource (LR) service delivery platform being a resource management system with autonomic computing characteristics based on an ACRA architecture.
PCT/CN2014/072707 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism WO2015127647A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/072707 WO2015127647A1 (en) 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/072707 WO2015127647A1 (en) 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism

Publications (1)

Publication Number Publication Date
WO2015127647A1 true WO2015127647A1 (en) 2015-09-03

Family

ID=54008162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/072707 WO2015127647A1 (en) 2014-02-28 2014-02-28 Storage virtualization manager and system of ceph-based distributed mechanism

Country Status (1)

Country Link
WO (1) WO2015127647A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105450734A (en) * 2015-11-09 2016-03-30 上海爱数信息技术股份有限公司 Distributed storage CEPH data distribution optimization method
CN109327544A (en) * 2018-11-21 2019-02-12 新华三技术有限公司 A kind of determination method and apparatus of leader node
CN109783438A (en) * 2018-12-05 2019-05-21 南京华讯方舟通信设备有限公司 Distributed NFS system and its construction method based on librados
CN111209090A (en) * 2020-04-17 2020-05-29 腾讯科技(深圳)有限公司 Method and assembly for creating virtual machine in cloud platform and server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103518A (en) * 2011-02-23 2011-06-22 运软网络科技(上海)有限公司 System for managing resources in virtual environment and implementation method thereof
CN102752364A (en) * 2012-05-22 2012-10-24 华为终端有限公司 Data transmission method and device
CN102857363A (en) * 2012-05-04 2013-01-02 运软网络科技(上海)有限公司 Automatic computing system and method for virtual networking
CN103034684A (en) * 2012-11-27 2013-04-10 北京航空航天大学 Optimizing method for storing virtual machine mirror images based on CAS (content addressable storage)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102103518A (en) * 2011-02-23 2011-06-22 运软网络科技(上海)有限公司 System for managing resources in virtual environment and implementation method thereof
CN102857363A (en) * 2012-05-04 2013-01-02 运软网络科技(上海)有限公司 Automatic computing system and method for virtual networking
CN102752364A (en) * 2012-05-22 2012-10-24 华为终端有限公司 Data transmission method and device
CN103034684A (en) * 2012-11-27 2013-04-10 北京航空航天大学 Optimizing method for storing virtual machine mirror images based on CAS (content addressable storage)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105450734A (en) * 2015-11-09 2016-03-30 上海爱数信息技术股份有限公司 Distributed storage CEPH data distribution optimization method
CN105450734B (en) * 2015-11-09 2019-01-25 上海爱数信息技术股份有限公司 The data distribution optimization method of distributed storage CEPH
CN109327544A (en) * 2018-11-21 2019-02-12 新华三技术有限公司 A kind of determination method and apparatus of leader node
CN109327544B (en) * 2018-11-21 2021-06-18 新华三技术有限公司 Leader node determination method and device
CN109783438A (en) * 2018-12-05 2019-05-21 南京华讯方舟通信设备有限公司 Distributed NFS system and its construction method based on librados
CN111209090A (en) * 2020-04-17 2020-05-29 腾讯科技(深圳)有限公司 Method and assembly for creating virtual machine in cloud platform and server
CN111209090B (en) * 2020-04-17 2020-07-24 腾讯科技(深圳)有限公司 Method and assembly for creating virtual machine in cloud platform and server

Similar Documents

Publication Publication Date Title
US10963289B2 (en) Storage virtual machine relocation
JP6199452B2 (en) Data storage systems that export logical volumes as storage objects
JP6208207B2 (en) A computer system that accesses an object storage system
JP6219420B2 (en) Configuring an object storage system for input / output operations
JP5985642B2 (en) Data storage system and data storage control method
JP6492226B2 (en) Dynamic resource allocation based on network flow control
JP4448719B2 (en) Storage system
US20060095705A1 (en) Systems and methods for data storage management
US10241712B1 (en) Method and apparatus for automated orchestration of long distance protection of virtualized storage
US11836115B2 (en) Gransets for managing consistency groups of dispersed storage items
US10855556B2 (en) Methods for facilitating adaptive quality of service in storage networks and devices thereof
US10852985B2 (en) Persistent hole reservation
US11907261B2 (en) Timestamp consistency for synchronous replication
WO2015127647A1 (en) Storage virtualization manager and system of ceph-based distributed mechanism
US20130166670A1 (en) Networked storage system and method including private data network
JP2013003691A (en) Computing system and disk sharing method in computing system
US10129081B1 (en) Dynamic configuration of NPIV virtual ports in a fibre channel network
US9612769B1 (en) Method and apparatus for automated multi site protection and recovery for cloud storage
US10001939B1 (en) Method and apparatus for highly available storage management using storage providers
US10798159B2 (en) Methods for managing workload throughput in a storage system and devices thereof
US10768834B2 (en) Methods for managing group objects with different service level objectives for an application and devices thereof
JP5278254B2 (en) Storage system, data storage method and program
KR20140060962A (en) System and method for providing multiple service using idle resource of network distributed file system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14883922

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14883922

Country of ref document: EP

Kind code of ref document: A1