WO2024001299A1 - Cloud-technology-based fault handling method, cloud management platform, and related device - Google Patents

Cloud-technology-based fault handling method, cloud management platform, and related device

Info

Publication number
WO2024001299A1
Authority
WO
WIPO (PCT)
Prior art keywords
service
cloud
grid
tenant
management platform
Prior art date
Application number
PCT/CN2023/081036
Other languages
English (en)
French (fr)
Inventor
温嘉佳
岳宇
陈龙飞
江涛
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202211214539.8A external-priority patent/CN117376103A/zh
Application filed by 华为云计算技术有限公司
Publication of WO2024001299A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0668: Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure

Definitions

  • This application relates to the field of cloud computing, and in particular to fault handling methods, cloud management platforms and related equipment based on cloud technology.
  • the area is divided into several units (cells), and each unit is a self-contained service system that can operate independently.
  • the tenants are divided so that each unit corresponds to some tenants. If a unit fails, the services corresponding to the unit will be unavailable and cannot provide services to the tenants corresponding to the unit.
  • the architecture of the cloud service includes a first service layer.
  • the first service layer includes multiple service grids, and the same cloud service data is stored in multiple service grids of the same service layer.
  • the cloud management platform receives the first call request issued by the tenant and determines that the first call request corresponds to the first service layer. If the first service grid in the first service layer fails, the call request can be forwarded for processing to a second service grid in the same service layer that is in a normal working state, which reduces the blast radius of the fault, thereby ensuring the normal operation of the cloud service.
  • the first aspect of this application provides a fault handling method based on cloud technology, including:
  • the cloud service architecture provided by the cloud management platform includes at least a first service layer, and the first service layer at least includes a first service grid and a second service grid.
  • the same cloud service data is stored in the first service grid and the second service grid.
  • the so-called cloud service data refers to data related to the cloud services provided by the cloud management platform, including tenant information (such as account and password), authentication information (such as verification codes) and other information that needs to be used in the process of calling cloud services. There are no specific limitations here.
  • the tenant will initiate a call request when using the cloud service provided by the cloud management platform. That is to say, the cloud management platform will receive the first call request from the tenant and route the first call request to the first service layer.
  • the cloud management platform will forward the first call request to the second service grid in the first service layer that is in a normal working state.
  • the same cloud service data is stored in multiple service grids of the same service layer.
  • the cloud management platform receives the first call request issued by the tenant and determines that the first call request corresponds to the first service layer. If the first service grid in the first service layer fails, the call request can be forwarded for processing to a second service grid in the first service layer that is in a normal working state, which reduces the blast radius of the fault, thereby ensuring the normal operation of the cloud service.
  • the cloud management platform can also store and update status information corresponding to each service layer, where the status information indicates whether each service grid in the service layer is faulty.
  • the cloud management platform will update the status information corresponding to the first service layer where the first service grid is located.
  • the updated status information indicates that the first service grid has failed.
  • the cloud management platform will forward the first call request to the second service grid, which is in a normal working state, based on the updated status information corresponding to the first service layer.
  • when a service grid fails, the cloud management platform will update the status information of the service layer corresponding to that service grid, and forward the call request based on the updated status information, so as to avoid forwarding the call request to the failed service grid and causing the service to be unavailable. That is to say, this provides an implementation basis for reducing the impact scope of the fault and improves the realizability of the technical solution of this application.
  • the first call request issued by the tenant carries an identifier of the tenant, and the identifier is used to uniquely indicate the tenant.
  • before the cloud management platform forwards the first call request to the second service grid, it can determine that the tenant corresponds to the first service layer based on the tenant's identifier carried in the first call request and the mapping relationship between identifiers and service layers. That is to say, it is determined that the first call request issued by the tenant needs to be forwarded to a service grid in the first service layer.
  • the call request initiated by the tenant carries the tenant's identity, and there is a mapping relationship between the tenant's identity and the service layer, so that the cloud management platform can map the call requests initiated by different tenants to their respective service layers.
  • the cloud management platform can shard tenants and determine, through the tenant identifier, the service layer corresponding to the call request initiated by each tenant, so as to isolate faults between service layers and reduce the blast radius of faults to a certain extent.
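As an illustrative sketch (not taken from the patent), the mapping between tenant identifiers and service layers can be modeled as a lookup table; the tenant IDs and deck names below are hypothetical:

```python
# Hypothetical mapping from tenant identifier to service layer (deck);
# the IDs and deck names are illustrative, not from the patent.
TENANT_TO_DECK = {
    "tenant-a": "deck0",
    "tenant-b": "deck0",
    "tenant-c": "deck1",
}

def route_call_request(tenant_id: str) -> str:
    """Return the service layer (deck) whose grids should handle this tenant."""
    try:
        return TENANT_TO_DECK[tenant_id]
    except KeyError:
        # an unmapped tenant cannot be sharded to any service layer
        raise ValueError(f"no service layer mapped for tenant {tenant_id}")
```

Because all call requests from one tenant resolve to one deck, a fault in another deck never affects that tenant's requests.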
  • after the cloud management platform determines that the first call request corresponds to the first service layer, if each service grid in the first service layer is in a normal working state, the cloud management platform will send the first call request to the first service grid.
  • the first service grid here may be the service grid in the first service layer that is closest to the address of the tenant or has the smallest network delay with the tenant. There is no specific limitation here.
  • the cloud management platform can select the service grid closest to the tenant's address to forward the call request, to improve the reliability of the cloud service; or the cloud management platform can select the service grid with the lowest network latency to forward the call request, to improve the efficiency and response speed of the cloud service.
  • the cloud management platform can provide a configuration interface to the tenant, and the configuration interface is used to obtain the cloud service input or selected by the tenant.
  • the cloud services provided by the cloud management platform can match the actual needs of tenants and are suitable for the conditions or choices input by tenants.
  • the cloud services provided by the cloud management platform can flexibly adapt to the needs of tenants, enrich the application scenarios of the technical solution of this application, and improve the practicality.
  • the cloud management platform can confirm whether the service grid is faulty through multiple methods.
  • the cloud management platform will continuously send detection information to the first service grid multiple times, and the detection information is used to detect the status of the first service grid.
  • the status of the first service grid refers to whether the first service grid is faulty. If the cloud management platform receives abnormal response information from the first service grid multiple times in a row, it determines that the first service grid is faulty; or, if the cloud management platform does not receive response information from the first service grid within a preset time period multiple times in a row, it determines that the first service grid is faulty.
  • the cloud management platform determines whether the first service grid is faulty through various methods, which enriches the implementation methods of the technical solution of this application.
  • the cloud management platform requires multiple consecutive tests to confirm a service grid failure, which can avoid identifying occasional situations (such as temporary network instability, etc.) as service grid failures, improving the accuracy of fault detection.
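The multiple-consecutive-detection rule can be sketched as follows; the retry count and the probe interface are assumptions, since the patent does not fix them:

```python
def is_grid_faulty(probe, retries: int = 3, timeout: float = 2.0) -> bool:
    """Declare a grid faulty only after `retries` consecutive failed probes.

    `probe(timeout)` is assumed to return True on a normal response and to
    raise TimeoutError when no response arrives within `timeout` seconds;
    a False return stands for an abnormal response.
    """
    for _ in range(retries):
        try:
            if probe(timeout):
                return False  # one healthy reply clears the grid
        except TimeoutError:
            pass  # no response within the window counts as a failed probe
    return True  # every consecutive probe failed: the grid is faulty
```

A transient blip (such as momentary network instability) fails one probe but succeeds on a retry, so the grid is not misclassified as faulty.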
  • the cloud service includes an elastic cloud service, a cloud hard disk service, a virtual private cloud service, a cloud database service or a distributed cache service.
  • the cloud management platform can provide multiple types of cloud services to better meet the diversity of tenants' business development.
  • the cloud management platform may also receive a second calling request sent by the tenant, where the second calling request corresponds to the first service layer.
  • the cloud service corresponding to the second call request may be the same as the cloud service corresponding to the first call request, or may be different, and the details are not limited here.
  • the cloud management platform will determine the current status information of the first service layer. When the current status information of the first service layer indicates that multiple service grids are in a normal working state, it will determine a target service grid from these multiple service grids and forward the second call request to the target service grid.
  • the target service grid is the service grid that is closest to the tenant location or has the lowest delay among multiple service grids. There is no specific limit here.
  • the cloud management platform can select the service grid closest to the tenant's location to forward the call request, to improve the reliability of the cloud service; or the cloud management platform can select the service grid with the lowest latency to forward the call request, to improve the efficiency and response speed of cloud services.
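A minimal sketch of target-grid selection among grids in a normal working state, assuming per-grid distance and latency measurements are available (the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Grid:
    name: str
    distance_km: float  # distance from the tenant's location (assumed known)
    latency_ms: float   # measured network delay to the tenant (assumed known)
    healthy: bool       # True = normal working state

def pick_target_grid(grids, by: str = "latency") -> Grid:
    """Among healthy grids, pick the lowest-latency one, or the closest
    one when by="distance"."""
    healthy = [g for g in grids if g.healthy]
    if not healthy:
        raise RuntimeError("no service grid in the layer is in a normal working state")
    key = (lambda g: g.latency_ms) if by == "latency" else (lambda g: g.distance_km)
    return min(healthy, key=key)
```

Note that faulty grids are filtered out before either criterion is applied, so a nearby but failed grid is never selected.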
  • the second aspect of this application provides a cloud management platform.
  • the cloud service architecture provided by the cloud management platform at least includes a first service layer.
  • the first service layer at least includes a first service grid and a second service grid.
  • the first service grid and the second service grid store the same cloud service data.
  • the cloud management platform includes: a transceiver module, configured to receive the first call request sent by the tenant; a processing module, configured to determine that the first call request corresponds to the first service layer; the transceiver module is further configured to, in the case of a failure of the first service grid, forward the first call request to the second service grid, where the second service grid is in a normal working state.
  • the second aspect or any implementation of the second aspect is the device implementation corresponding to the first aspect or any implementation of the first aspect.
  • the description in the first aspect or any implementation of the first aspect is applicable to the second aspect or any implementation of the second aspect, and will not be described again here.
  • a third aspect of the present application provides a computing device cluster, including at least one computing device, each computing device including a processor and a memory; the processor of the at least one computing device is used to execute instructions stored in the memory of the at least one computing device, so that the computing device cluster implements the method disclosed in the first aspect or any possible implementation of the first aspect.
  • the fourth aspect of this application provides a computer program product containing instructions. When the instructions are run by a computing device cluster, the computing device cluster implements the method disclosed in the first aspect or any possible implementation of the first aspect.
  • a fifth aspect of the present application provides a computer-readable storage medium, including computer program instructions. When the computer program instructions are executed by a computing device cluster, the computing device cluster performs the method disclosed in the first aspect or any possible implementation of the first aspect.
  • Figure 1 is an architectural schematic diagram of a fault handling method based on cloud technology provided by an embodiment of the present application
  • Figure 2 is a schematic diagram of the service layer and cloud service grid provided by the embodiment of this application.
  • Figure 3 is a schematic flow chart of a fault handling method based on cloud technology provided by the embodiment of the present application.
  • Figure 4 is a schematic diagram of data synchronization provided by the embodiment of the present application.
  • Figure 5 is another schematic diagram of data synchronization provided by the embodiment of the present application.
  • Figure 6 is a schematic interface diagram of the cloud management platform provided by the embodiment of the present application.
  • Figure 7 is another schematic interface diagram of the cloud management platform provided by the embodiment of the present application.
  • Figure 8 is a schematic diagram of the cloud service architecture provided by the embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of the cloud management platform provided by the embodiment of the present application.
  • Figure 11 is another schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present application.
  • the cloud service architecture includes multiple service layers, each service layer includes multiple service grids, and the same cloud service data will be stored in multiple service grids of the same service layer.
  • the cloud management platform receives the first call request issued by the tenant and determines that the first call request corresponds to the first service layer. If the first service grid in the first service layer fails, the first call request can be forwarded for processing to a second service grid in the same service layer, which reduces the blast radius of faults, thereby ensuring the normal operation of cloud services.
  • At least one of a, b, or c can mean: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c can be singular or plural.
  • Region: divided along the dimensions of geographical location and network latency. The same resource pool is used within a region (which can be understood as sharing public services such as elastic computing, block storage, object storage, virtual private cloud (VPC) networking, elastic public Internet protocol (IP) addresses, and images).
  • An AZ is a collection of one or more data centers.
  • An AZ has independent power, cooling, and other facilities.
  • Within an AZ, computing, network, storage and other resources can be logically divided into multiple clusters. Multiple AZs in a region can be connected through high-speed optical fiber to meet tenants' needs for building high-availability systems across AZs.
  • the cloud data center is a cloud computing architecture based on the coupling of computing, storage and network resources. The cloud data center is equipped with multiple servers; using the computing power, network and storage resources provided by these servers, virtualization technology provides computing, network and storage resources to different tenants in a mutually isolated manner.
  • the cloud management platform can provide access interfaces (such as user interfaces or application programming interfaces (APIs)). A tenant can operate a client to remotely access the access interface to register a cloud account and password on the cloud management platform and log in to the cloud management platform. After the cloud management platform successfully authenticates the cloud account and password, the tenant can further select and purchase a virtual machine with specific specifications (processor, memory, disk) on the cloud management platform. After a successful purchase, the cloud management platform provides the remote login account and password of the purchased virtual machine, and the client can remotely log in to the virtual machine and install and run tenant applications in it.
  • the cloud management platform can be divided based on logical functions: tenant console, computing management service, network management service, storage management service, authentication service, and image management service.
  • the tenant console provides an interface or API to interact with tenants.
  • the computing management service is used to manage servers running virtual machines and containers and bare metal servers.
  • the network management service is used to manage network services (such as gateways, firewalls, etc.).
  • the storage management service is used to manage storage services (such as data bucket services), the authentication service is used to manage tenant accounts and passwords, and the image management service is used to manage virtual machine images.
  • the cloud management platform's client can implement the following functions: receive control plane commands sent by the cloud management platform, create virtual machines on the server according to the control plane commands, and perform full life cycle management of virtual machines. Therefore, tenants or administrators can create, manage, log in to and operate virtual machines in the cloud data center through the cloud management platform.
  • virtual machines can also be called cloud servers (elastic compute service, ECS), elastic instances, etc. There are no specific limitations here.
  • Figure 1 is a schematic architectural diagram of fault processing based on cloud technology provided by an embodiment of the present application.
  • the cloud service architecture provided by the cloud management platform includes multiple service layers (deck), and each service layer includes multiple service grids (grid).
  • the service layer can be understood as a logical collection based on data dimension division.
  • the tenant dimension can be selected, and the tenants can be divided into different service layer slices according to certain combination attributes. That is, a service layer corresponds to a group of tenants with the same attributes.
  • the number of service grids included in different service layers can be the same or different, and there is no specific limit here.
  • each service grid in a service layer has complete business capabilities and can independently serve the tenant slice of that service layer.
  • the tenant purchases cloud services on the cloud management platform through the client, and the tenant sends a call request to the cloud management platform.
  • the call request is used to request cloud services from the cloud management platform.
  • the cloud management platform determines the service layer corresponding to the call request based on the tenant identifier carried in the call request and the mapping relationship between tenant identifiers and service layers. Then, based on the status information corresponding to the service layer, the call request is forwarded to a service grid in that service layer that is in a normal working state.
  • the status information corresponding to the service layer indicates whether each service grid corresponding to the service layer is faulty.
  • the cloud management platform will periodically detect the status of each service grid, and update the status information of the corresponding service layer when the status of the service grid changes.
  • the state change of a service grid includes the service grid changing from faulty to a normal working state, or from a normal working state to faulty. Since the cloud management platform will detect and update the status of each service layer, if a service grid that provides cloud services fails, the cloud management platform will switch the cloud service to a service grid in the same service layer that is in a normal working state, thereby achieving fault isolation and reducing the blast radius of the fault (that is, the impact scope of the failure), ensuring the smooth progress of cloud services.
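The per-layer status table and failover behavior described above can be sketched as follows; the class and method names are hypothetical, not the patent's:

```python
class ServiceLayer:
    """Tracks per-grid health for one service layer (deck) and routes
    call requests around faulty grids. A toy sketch, not the patent's design."""

    def __init__(self, grid_names):
        # status information: True = normal working state, False = faulty
        self.status = {name: True for name in grid_names}

    def update_status(self, grid_name: str, healthy: bool) -> None:
        """Called when periodic detection observes a state change."""
        self.status[grid_name] = healthy

    def forward(self, preferred_grid: str) -> str:
        """Return the grid that should process a call request."""
        if self.status.get(preferred_grid):
            return preferred_grid
        # failover: any other grid in the same layer in a normal working state
        for name, healthy in self.status.items():
            if healthy:
                return name
        raise RuntimeError("all grids in this service layer have failed")
```

The key property is that failover never leaves the layer: a fault in one deck's grid is absorbed by that deck's remaining grids and is invisible to tenants of other decks.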
  • the cloud management platform can be deployed on computing devices, which can have a variety of device forms, including: switches, routers, or chips, etc. There are no specific limitations here. Among them, the switch or router can be either a physical network element or a virtual network element (that is, a combination of one or more functional modules implemented by pure software), and the details are not limited here.
  • the division of service grids is in principle based on computer clusters where faults do not affect each other.
  • the clusters here include computer rooms, AZs, cabinets, fault-isolated virtual machine clusters, etc. There are no specific limitations here.
  • the intersection of the service layer and the AZ can be used as a service grid.
  • Figure 2 is a schematic diagram of the service layer and service grid provided by the embodiment of the present application.
  • in the vertical dimension, there are multiple AZs in a region with independent power, cooling and other facilities, and division is based on AZ; in the horizontal dimension, decks can be divided according to certain data dimensions. In this way, the intersection of a deck and an AZ can be determined as a grid. For example, the intersection of deck0 and AZ1 shown in Figure 2 is grid0-1. Based on the division method shown in Figure 2, fault isolation between grids is achieved.
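Following the naming in Figure 2 (the intersection of deck0 and AZ1 is grid0-1), a grid identifier can be derived from its deck and AZ indices; this helper is purely illustrative:

```python
def grid_name(deck: int, az: int) -> str:
    """A service grid is the intersection of a service layer (deck) and an AZ."""
    return f"grid{deck}-{az}"

# matches the example in Figure 2: the intersection of deck0 and AZ1 is grid0-1
assert grid_name(0, 1) == "grid0-1"
```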
  • the architecture shown in Figure 2 is called a service grid architecture (grid architecture). It is a cloud computing infrastructure architecture that is widely applicable to public cloud, private cloud and industry cloud scenarios to reduce the impact scope of service failures (i.e., the failure blast radius).
  • Figure 2 is only an example of dividing service grids. In actual applications, the cloud management platform can also provide more or fewer service layers, and each service layer can also include a greater number of service grids; there are no specific restrictions here.
  • Figure 3 is a schematic flowchart of a fault handling method based on cloud technology provided by an embodiment of the present application, including the following steps:
  • the cloud management platform receives the first call request from the client.
  • the tenant sends a first calling request to the cloud management platform through the client, and the first calling request is used to request cloud services provided by the cloud management platform.
  • the cloud management platform determines that the service layer corresponding to the first call request is the first service layer.
  • the first call request carries the tenant's identifier, which is used to uniquely indicate the tenant.
  • the cloud management platform can determine that the tenant corresponds to the first service layer based on the tenant's identifier carried in the first call request and the mapping relationship between the identifier and the service layer. That is to say, it is determined that the first call request issued by the tenant needs to be forwarded to the service grid in the first service layer.
  • the tenant identifier can be an Internet Protocol (IP) address or a Media Access Control (MAC) address.
  • it can also be other identifiers that serve as unique indicators, such as ID number, mobile phone number, a combination of enterprise code and employee number, etc. There are no specific restrictions here.
  • the cloud management platform can store the mapping relationship between tenant identifiers and service layers, so that the cloud management platform can map the call requests initiated by different tenants to their respective service layers, shard the tenants, and determine through the tenant identifier the service layer corresponding to the call request initiated by each tenant, so that fault isolation between service layers can also reduce the blast radius of faults to a certain extent.
  • the cloud management platform updates the status information corresponding to the first service layer.
  • the cloud management platform can also store and update the status information corresponding to each service layer.
  • the status information indicates whether each service grid in the service layer is faulty.
  • the cloud management platform will update the status information corresponding to the first service layer where the first service grid is located, and the updated status information indicates that the first service grid is faulty.
  • the cloud management platform can confirm whether the service grid is faulty through multiple methods.
  • the cloud management platform will continuously send detection information to the first service grid multiple times, and the detection information is used to detect the status of the first service grid.
  • the status of the first service grid refers to whether the first service grid is faulty. If the cloud management platform receives abnormal response information from the first service grid multiple times in a row, it determines that the first service grid is faulty; or, if the cloud management platform fails to receive response information from the first service grid within a preset time period multiple times in a row, it determines that the first service grid is faulty.
  • the scenarios in which the cloud management platform sends detection information to the first service grid multiple times include: the cloud management platform sends detection information to the first service grid, and the first service grid replies with an abnormal response, or the cloud management platform does not receive response information from the first service grid within a preset time period. In this case, the cloud management platform will re-send the detection information to the first service grid. If the result of multiple retries is that the first service grid replies with an abnormal response message, or response information from the first service grid is not received within a preset time period, it can be determined that the first service grid is faulty (that is, unhealthy).
  • the cloud management platform determines whether the first service grid is faulty through various methods, which enriches the implementation methods of the technical solution of this application.
  • the cloud management platform requires multiple consecutive tests to confirm a service grid failure, which can avoid identifying occasional situations (such as temporary network instability, etc.) as service grid failures, improving the accuracy of fault detection.
  • the cloud management platform will update the status information of the first service layer corresponding to the first service grid. The updated status information indicates the failure of the first service grid, which avoids forwarding the first call request to a faulty service grid and causing the service to be unavailable. This provides an implementation basis for reducing the impact scope of the fault and improves the feasibility of the technical solution of this application.
  • the cloud management platform sends the first calling request to the second service grid.
  • when the updated status information indicates that the first service grid is faulty, the cloud management platform will forward the first call request to the second service grid, which is in a normal working state, based on the updated status information corresponding to the first service layer.
  • when the second service grid is in a normal working state, the second service grid may also be said to be healthy.
  • the reason the cloud management platform can forward the first call request to the second service grid is not only that the second service grid is in a normal working state, but also that in the technical solution of this application, each service grid in the same service layer stores the same cloud service data.
  • the so-called cloud service data refers to data related to the cloud services provided by the cloud management platform, including tenant information (such as account and password), authentication information (such as verification codes) and other information that needs to be used in the process of calling cloud services. There are no specific limitations here.
  • each service grid in the same service layer can store the same cloud service data through data synchronization.
  • the synchronization method includes real-time synchronization or quasi-real-time synchronization, which is not limited here.
  • Figure 4 and Figure 5 are schematic diagrams of data synchronization provided by embodiments of the present application.
  • a transaction log queue (transaction log queue) can be maintained between the service grids of each service layer.
  • when the database (DB), cache or message queue of the first service grid writes data, the transaction logs of these components (database/cache/message queue) are synchronized to one or more other service grids through the transaction log queue to complete data synchronization.
  • Figures 4 and 5 only take a service layer including two service grids as an example. In actual applications, there can be more service grids in a service layer. The synchronization method is similar and will not be described again here.
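A toy sketch of transaction-log-based synchronization between the grids of one service layer; `GridStore` stands in for a grid's database/cache/message queue, and all names are illustrative:

```python
from collections import deque

class GridStore:
    """Stand-in for one grid's database/cache/message queue."""
    def __init__(self):
        self.data = {}

    def write(self, key, value, log_queue=None):
        self.data[key] = value
        if log_queue is not None:
            log_queue.publish((key, value))  # emit a transaction log entry

    def apply(self, entry):
        key, value = entry
        self.data[key] = value  # replay a replicated log entry

class TransactionLogQueue:
    """Ships transaction log entries to the other grids of the layer."""
    def __init__(self, subscribers):
        self.queue = deque()
        self.subscribers = subscribers  # peer GridStores in the same layer

    def publish(self, entry):
        self.queue.append(entry)
        while self.queue:  # near-real-time drain; a real system would batch
            e = self.queue.popleft()
            for store in self.subscribers:
                store.apply(e)
```

After `primary.write("token", "abc", log_queue)`, the replica grid holds the same cloud service data, which is what allows a call request to be redirected to it when the primary grid fails.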
  • the same cloud service data is stored in multiple service grids of the same service layer.
  • the cloud management platform receives the first call request, corresponding to the first service layer, issued by the tenant. If the first service grid in the first service layer fails, the call request can be forwarded for processing to a second service grid in the same service layer that is in a normal working state, which reduces the blast radius of faults, thereby ensuring the normal operation of cloud services.
  • after the cloud management platform determines that the first call request corresponds to the first service layer (that is, after step 302), if each service grid in the first service layer is in a normal working state, the cloud management platform will send the first call request to the first service grid.
  • the first service grid here may be the service grid in the first service layer that is closest to the address of the tenant or has the smallest network delay with the tenant. There is no specific limitation here.
  • the cloud management platform can select the service grid closest to the tenant's address to forward the call request, to improve the reliability of the cloud service; or it can select the service grid with the lowest network latency to forward the call request, to improve the efficiency and response speed of the cloud service.
  • the cloud management platform can obtain the status information corresponding to the service layer by detecting the status of each service grid. That is to say, the cloud management platform can also perform the following operations:
  • the cloud management platform sends the first detection information to the first service grid.
  • the cloud management platform can carry the first detection information in a hypertext transfer protocol (HTTP) request, and the first detection information is used to detect whether the first service grid fails.
  • the cloud management platform can also carry the first detection information in other requests, such as user datagram protocol (UDP) requests, which are not limited here.
  • the cloud management platform receives the first response information from the first service grid.
  • after receiving the first detection information, the first service grid will send first response information to the cloud management platform, and the first response information indicates that the first service grid is in a normal working state.
  • the cloud management platform sends the second detection information to the second service grid.
  • the cloud management platform receives the second response information from the second service grid.
  • Steps 307 to 308 are similar to steps 305 to 306, and will not be described again here.
  • the cloud management platform stores status information corresponding to the first service layer.
  • after receiving the first response information and the second response information, the cloud management platform will store the status information corresponding to the first service layer. It can be understood that, based on the foregoing description of steps 305 to 308, in step 309 the status information corresponding to the first service layer stored by the cloud management platform indicates that both the first service grid and the second service grid of the first service layer are in a normal working state.
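The probe/response cycle of steps 305 to 309 can be sketched as a small health registry. This is a hedged illustration under assumed names: `send_probe` stands in for the HTTP/UDP detection request, and the per-layer dictionary stands in for the stored status information.

```python
class HealthRegistry:
    """Stores per-layer status info: grid name -> True (healthy) / False (faulty)."""
    def __init__(self):
        self.status = {}

    def probe(self, layer, grids, send_probe):
        # send_probe(grid) models the detection request of steps 305/307; it
        # returns True on a normal response, False on an abnormal one or a timeout.
        self.status[layer] = {g: bool(send_probe(g)) for g in grids}

    def healthy_grids(self, layer):
        # Grids currently recorded as being in a normal working state.
        return [g for g, ok in self.status.get(layer, {}).items() if ok]

reg = HealthRegistry()
reg.probe("layer-1", ["grid-1", "grid-2"], send_probe=lambda g: True)
assert reg.healthy_grids("layer-1") == ["grid-1", "grid-2"]
```

A later probe round simply overwrites the stored status, which is the "update status information" behavior the method relies on when a grid fails.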
  • the cloud management platform can select a service grid that provides cloud services based on a certain policy.
  • the cloud management platform may also receive a second calling request sent by the tenant, where the second calling request corresponds to the first service layer.
  • the cloud service corresponding to the second call request may be the same as the cloud service corresponding to the first call request, or may be different, and the details are not limited here.
  • the cloud management platform will determine the current status information of the first service layer. When the current status information of the first service layer indicates that multiple service grids are in a normal working state, it determines a target service grid from these multiple service grids and forwards the second call request to the target service grid.
  • the target service grid is the service grid among multiple service grids that is closest to the tenant address or has the lowest network delay with the tenant. There is no specific limit here.
  • the cloud management platform can select the service grid closest to the tenant's address to forward the call request, to improve the reliability of the cloud service; or the cloud management platform can select the service grid with the lowest network latency to the tenant to forward the call request, to improve the efficiency and response speed of the cloud service.
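The target-grid selection can be reduced to choosing the healthy grid that minimizes some metric (distance to the tenant's address, or measured network latency). A minimal sketch, with the metric values assumed for illustration:

```python
def pick_target_grid(healthy_grids, metric):
    """Select the grid minimizing the given metric (e.g. distance to the tenant's
    address, or network latency). Returns None if no grid is healthy."""
    if not healthy_grids:
        return None
    return min(healthy_grids, key=metric)

# Illustrative latency measurements; in practice these would come from probes.
latency_ms = {"grid-1": 42.0, "grid-2": 8.5}
target = pick_target_grid(["grid-1", "grid-2"], metric=latency_ms.__getitem__)
assert target == "grid-2"  # lowest-latency healthy grid wins
```

Swapping in a geographic-distance function as `metric` yields the "closest to the tenant's address" policy instead.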
  • the cloud management platform also provides a configuration interface to the tenant, and the configuration interface is used to obtain the cloud service input or selected by the tenant.
  • the cloud services provided by the cloud management platform can match the actual needs of tenants and are suitable for the conditions or choices input by tenants.
  • Figures 6 and 7 are both schematic interface diagrams of the cloud management platform provided by embodiments of the present application.
  • the interface of the cloud management platform includes an input box 601.
  • the tenant can enter the setting conditions of the cloud service architecture in the input box 601, and the cloud management platform will respond to the operation instructions for the input box 601 to configure the cloud service architecture that meets the setting conditions.
  • tenant A wants to deploy a relatively high-reliability service on the cloud. Tenant A first uses an automated tool to create a service stack (including all resources and applications of the service), and then makes three copies of the stack, obtaining 4 identical service stacks.
  • tenant A wants to place the four identical service stacks evenly in two AZs and synchronize service stack data across the AZs. Tenant A can then enter the configuration conditions in the input box 601 as follows: define the 4 identical service stacks as 4 service grids, place them evenly in the two AZs, divide the 2 service grids located in the same AZ into one service layer, and name the two resulting service layers service layer 0 and service layer 1.
  • the cloud management platform will, by default, set the stateful middleware in the two different service grid stacks belonging to the same service layer into a mutual backup relationship.
  • the tenant can also configure the routing scenario of the service. For example, for tenant A's service, tenant IDs are divided by 2: those with remainder 0 are assigned to service layer 0, and those with remainder 1 are assigned to service layer 1. Based on this configuration, the service will automatically route requests from different tenants using the cloud service to service layer 0 or service layer 1.
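The modulo routing rule in this configuration example is a one-liner. The function name is illustrative; the rule itself is exactly the one described above:

```python
def route_to_layer(tenant_id: int, num_layers: int = 2) -> int:
    """Route a tenant to a service layer by tenant_id modulo the layer count:
    remainder 0 -> service layer 0, remainder 1 -> service layer 1."""
    return tenant_id % num_layers

assert route_to_layer(1000) == 0  # even tenant IDs go to service layer 0
assert route_to_layer(1001) == 1  # odd tenant IDs go to service layer 1
```

Because the remainder is deterministic, every request from the same tenant always lands in the same service layer, which is what makes per-layer fault isolation possible.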
  • whether in a public cloud or a private cloud, this application can provide a certain service in the form of a cloud service. This service allows customers to define logical collections divided by data dimension (that is, decks) and cluster deployment collections (that is, grids); configurations can also be defined for customers to route messages to different decks and grids.
  • the interface of the cloud management platform also includes a preview control 602.
  • the cloud management platform can display a schematic diagram of the cloud service architecture that meets the setting conditions input by the tenant.
  • the schematic diagram can be similar to the schematic diagram of the cloud service architecture shown in Figure 1, allowing tenants to intuitively perceive the cloud service architecture.
  • the interface of the cloud management platform also includes a confirmation control 603. If the tenant clicks the confirmation control 603, the cloud management platform can actually deploy the cloud service architecture that meets the setting conditions input by the tenant.
  • the configuration information input by the tenant is also used to indicate the mapping relationship between tenant identification, service layer, and service grid. Specifically, it includes the mapping relationship between the tenant identifier and the service layer, and the mapping relationship between the service layer and the service grid.
  • the mapping relationship between the tenant's identity and the service layer can be used to offload tenants; the mapping relationship between the service layer and the service grid includes information such as the number of service grids included in a service layer.
  • the cloud management platform will make each service layer correspond to at least two tenant identifiers, and will deploy at least two service grids for each service layer.
  • the configuration information may be a certain rule or a customized algorithm, so that the computing device divides tenant identifiers into different service layers as evenly as possible. For example, for a ticket-purchasing cloud service, tenants can be shunted based on the last digit of the tenant account: tenants whose accounts end in an odd digit can be mapped to one service layer, and tenants whose accounts end in an even digit can be mapped to another service layer.
  • the number of service grids included in each service layer is not limited, as long as it is greater than one.
  • the interface of the cloud management platform includes optional boxes 701, each of which indicates a cloud service architecture.
  • the cloud management platform will respond to the touch command on the details control 702 and display the detailed information of the corresponding cloud service architecture (such as the number of service layers included in the cloud service architecture, the number of service grids included in each service layer, etc.).
  • the interface of the cloud management platform also includes a confirmation control 703.
  • the cloud management platform can actually deploy the cloud service architecture that satisfies the tenant's selection.
  • Figures 6 and 7 are only examples of the interface of the cloud management platform and do not limit the specific content of the interface. In actual applications, it can be set flexibly, and there are no specific limitations here.
  • tenants do not need to be aware of the data offloading process when calling cloud services.
  • the cloud service call request initiated by the tenant is a ticket purchase request.
  • the tenant logs in to the ticket purchase software, clicks to purchase tickets, and the terminal device directly displays the ticket purchase interface.
  • what is actually done is to send the cloud service call request corresponding to the ticket purchase to the computing device.
  • the computing device determines the corresponding service layer based on the tenant identifier, and then forwards the request to a service grid in that service layer that is in a normal working state.
  • each service grid in the service layer can provide cloud services for multiple tenants. The following is explained based on specific application scenarios.
  • the first cloud service requested by the first tenant is a ticket buying service
  • the first tenant corresponds to the first service layer.
  • the first service grid included in the first service layer is in a normal working state
  • the first service grid provides the ticket purchasing service.
  • the terminal logged in with the first tenant's account can display a refresh interface, allowing the first tenant to re-initiate the ticket purchase service.
  • the cloud management platform will switch the service grid that provides the ticket-purchasing service to the second service grid, so that the first tenant can successfully use the ticket-purchasing service. If the terminal logged in with the second tenant's account also initiates the ticket-purchasing service, after the failover the second tenant can also complete the ticket purchase through the second service grid.
  • a service grid can provide different cloud services.
  • the first tenant can request the ticket buying service and the music listening service successively.
  • the first tenant corresponds to the target service layer.
  • when the first service grid is in a normal working state, the first service grid provides both the ticket-purchasing service and the music-listening service.
  • the cloud services provided by the cloud management platform include elastic cloud services, cloud hard disk services, virtual private cloud services, cloud database services or distributed cache services, which are not limited here.
  • the cloud management platform can provide multiple types of cloud services to better meet the diversity of tenants' business development.
  • FIG. 8 is a schematic diagram of the cloud service architecture provided by an embodiment of the present application.
  • Figure 8 is an example of splitting a certain cloud service architecture into two service layers, and each service layer has two service grids.
  • cloud services can correspond to a larger number of service layers.
  • Each service layer can correspond to a larger number of service grids, and there is no specific limit here.
  • a fully functional business cluster is deployed in each service grid, including all business microservices and related middleware, data storage, etc.
  • data storage can be implemented in a variety of ways, such as a database (DB), a cache, or a message queue.
  • the proxy is used for data distribution: after receiving a cloud service call request, it distributes the request to the corresponding microservice for processing.
  • Different service grids deployed in a service layer can synchronize data using "messages" to form a one-write-multiple-read or master-standby relationship.
  • the data synchronized between different service grids is cloud service data, including tenant information (such as account number and password), authentication information (such as verification codes), permission information, and other information that may be used when a tenant initiates a cloud service call request; there is no specific limit here.
  • the computing device running the cloud management platform is deployed outside the service layer and service grid as a general component for routing requests for cloud services.
  • Computing devices can route cloud service call requests to any service grid in any service layer.
  • Figure 8 is just an example. In actual applications, it may also include a greater or lesser number of computing devices, and there is no specific limitation here.
  • the computing devices may be logically divided.
  • the following description is based on a schematic diagram. Please refer to FIG. 9 , which is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the computing device 900 includes a routing module (router) 901, a metadata service module 902, a detector module 903, and a naming service module 904.
  • the routing module 901 can be considered as the core module of the computing device 900, and is used to allocate cloud service invocation requests to different service layers/service grids according to logical data partitions.
  • the routing module 901 can be a layer 7 proxy that can route HTTP/HTTPS (1.1/2.0) requests to different backends.
  • the routing module 901 can also route other protocols to different backends, as long as the message routing function can be completed, and the details are not limited here.
  • the metadata service module 902 is used to save the logical data partition relationship, that is, to save the mapping relationship between the service layer and the tenant identifier. For example, in a typical tenant-based data partitioning implementation, the metadata service module 902 can save the mapping relationship between the tenant ID and the service layer, and the mapping relationship between the service layer and the service grid.
  • rules/configurations can be used to determine the data partition relationship (that is, the mapping relationship between tenant identifiers and service layers).
  • the mapping relationship between each tenant identifier and the service layer can also be stored in the metadata service module 902 as a key-value pair.
  • the metadata service module 902 can maintain a protocol that pushes updates to the routing module 901 when the data of the metadata service module 902 is updated (that is, when the mapping relationship between tenant identifiers and service layers is updated), so that the routing module 901 also maintains a cache of the data in the metadata service module 902. In other words, the routing module 901 can also store and update the mapping relationship between tenant identifiers and service layers. This ensures that a crash of the metadata service module 902 does not affect the normal operation of the computing device.
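The push-update protocol between the metadata service module and the routing module can be sketched as a publish/subscribe pair: the router seeds a local cache and then receives every mapping change, so it can keep routing even if the metadata service crashes. Class and method names here are illustrative, not from the patent.

```python
class MetadataService:
    """Holds tenant-id -> service-layer mappings and pushes updates to subscribers."""
    def __init__(self):
        self.mapping = {}
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def update(self, tenant_id, layer):
        self.mapping[tenant_id] = layer
        for cb in self.subscribers:
            cb(tenant_id, layer)  # push the change so router caches stay current

class Router:
    """Keeps a local cache of the mapping, seeded at startup, updated by pushes."""
    def __init__(self, metadata):
        self.cache = dict(metadata.mapping)  # seed from current state
        metadata.subscribe(self.on_update)

    def on_update(self, tenant_id, layer):
        self.cache[tenant_id] = layer

    def layer_for(self, tenant_id):
        # Served purely from the local cache; no metadata-service call needed.
        return self.cache.get(tenant_id)

meta = MetadataService()
router = Router(meta)
meta.update(1001, "layer-0")
assert router.layer_for(1001) == "layer-0"
```

After the `update` call, the router can answer `layer_for` without touching `meta` at all, which is the crash-tolerance property the paragraph describes.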
  • the detection module 903 is used to dynamically detect the health of each service grid, and update the data in the naming service module 904 after the service grid fails, that is, update the status information corresponding to the service layer where the failed service grid is located.
  • the detection module 903 can continuously poll each service grid by setting certain fixed test requests (such as HTTPS requests). When it is determined that a certain service grid is faulty, update the naming service module 904 data.
  • the detection module 903 can also be notified through manual input of commands or other means, and the detection module 903 can update the data in the naming service module 904.
  • the naming service module 904 is used to save the health status of each service grid in the service layer, that is, to store the status information corresponding to each service layer.
  • when the routing module 901 wishes to forward a cloud service invocation request, it first obtains the service layer to which the cloud service invocation request should be sent. It then obtains the health status of each service grid in that service layer from the naming service module 904, thereby selecting a healthy service grid and completing the mapping of the cloud service invocation request to the service grid.
  • a healthy service grid refers to a service grid that is in normal working condition.
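The two-step routing decision just described (tenant -> service layer, then layer -> healthy grid) can be sketched end to end. This is a simplified illustration: the dictionaries stand in for the metadata service and the naming service, and picking the first healthy grid stands in for whatever selection policy (closest, lowest-latency) is configured.

```python
def route_request(tenant_id, tenant_to_layer, layer_status):
    """Map the tenant to its service layer, then pick a healthy grid in that
    layer (the first healthy one here, for simplicity)."""
    layer = tenant_to_layer.get(tenant_id)
    if layer is None:
        raise KeyError(f"no service layer mapped for tenant {tenant_id}")
    healthy = [g for g, ok in layer_status[layer].items() if ok]
    if not healthy:
        raise RuntimeError(f"no healthy service grid in {layer}")
    return healthy[0]

# grid-1 of layer-0 has failed; the request fails over to grid-2.
status = {"layer-0": {"grid-1": False, "grid-2": True}}
assert route_request(7, {7: "layer-0"}, status) == "grid-2"
```

Note that the failover is invisible to the tenant: the same call yields a working grid as long as at least one grid in the layer is healthy.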
  • in addition to the public cloud or private cloud implementations mentioned above, the technical solution of this application can provide certain services in the form of cloud services. Such a service allows tenants to define logical collections divided by data dimension (that is, decks) and cluster deployment collections (that is, grids); configurations can also be defined for tenants to route messages to different decks and grids.
  • the technical solution of this application can also provide a certain architecture implementation based on software.
  • This architecture implementation allows tenants to define logical collections of data dimension division (decks) and cluster deployment collections (grids), where the individual elements (clusters) of a cluster deployment collection can synchronize state with each other in some way.
  • general software or hardware can also be provided, which can define configurations for customers to route messages to logical collections (decks) and cluster deployment collections (grids) divided by different data dimensions; there is no specific limit here.
  • embodiments of the present invention further disclose the internal structure of the cloud management platform. For details, see below:
  • Figure 10 is a schematic structural diagram of a cloud management platform provided by an embodiment of the present application.
  • the cloud service architecture provided by the cloud management platform includes at least the first service layer, and the first service layer includes at least the first service grid and the second service grid.
  • the first service grid and the second service grid store the same cloud service data.
  • the cloud management platform 1000 includes: a processing module 1001 and a transceiver module 1002.
  • the transceiver module 1002 is used to receive the first call request sent by the tenant.
  • the processing module 1001 is used to determine that the first call request corresponds to the first service layer.
  • the transceiver module 1002 is also used to forward the first call request to the second service grid when the first service grid fails, and the second service grid is in a normal working state.
  • the processing module 1001 is also configured to update status information of each service grid in the first service layer, where the status information indicates whether each service grid in the first service layer is faulty.
  • the first call request carries the tenant's identity.
  • the processing module 1001 is also configured to determine that the tenant corresponds to the first service layer based on the tenant's identifier and the mapping relationship between the identifier and the service layer.
  • the processing module 1001 is also configured to send the first call request to the first service grid when each service grid in the first service layer is in a normal working state.
  • optionally, the first service grid is the service grid closest to the tenant's address, or the network delay between the first service grid and the tenant is the smallest.
  • the processing module 1001 is also used to provide a configuration interface to the tenant, and the configuration interface is used to obtain the cloud service input or selected by the tenant.
  • the transceiving module 1002 is specifically configured to send detection information to the first service grid multiple times in a row, and the detection information is used to detect the status of the first service grid.
  • the processing module 1001 is specifically configured to determine that the first service grid is faulty if abnormal response information from the first service grid is received multiple times in succession, or if no response information from the first service grid is received within a preset time period multiple times in succession.
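The "multiple consecutive abnormal responses or timeouts" rule amounts to a consecutive-failure counter with a threshold. A minimal sketch with an assumed threshold of 3 (the patent does not fix a number):

```python
class FailureDetector:
    """Marks a grid faulty only after N consecutive abnormal responses or
    timeouts, so a single dropped probe does not trigger a failover."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, probe_ok):
        # probe_ok: True for a normal response, False for an abnormal
        # response or a response missing within the preset time period.
        if probe_ok:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold  # True => faulty

det = FailureDetector(threshold=3)
assert det.record(False) is False
assert det.record(False) is False
assert det.record(False) is True   # third consecutive failure -> faulty
assert det.record(True) is False   # a normal response resets the counter
```

Requiring consecutive failures (rather than any single one) trades a slightly slower failover for far fewer spurious ones.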
  • the cloud service includes an elastic cloud service, a cloud hard disk service, a virtual private cloud service, a cloud database service or a distributed cache service.
  • the transceiver module 1002 is also configured to receive a second call request sent by the tenant, where the second call request corresponds to the first service layer.
  • the cloud service corresponding to the second call request may be the same as the cloud service corresponding to the first call request, or may be different, and the details are not limited here.
  • the processing module 1001 is also used to determine the current status information of the first service layer; when the current status information of the first service layer indicates that multiple service grids are in a normal working state, a target service grid is determined from the multiple service grids. Optionally, the target service grid is the service grid among the multiple service grids that is closest to the tenant's location or has the lowest delay; there is no specific limit here.
  • the transceiver module 1002 is also used to forward the second call request to the target service grid.
  • both the processing module 1001 and the transceiver module 1002 can be implemented by software, or can be implemented by hardware. Illustratively, the following takes the processing module 1001 as an example to introduce the implementation of the processing module 1001. Similarly, the implementation of the transceiver module 1002 may refer to the implementation of the processing module 1001.
  • the processing module 1001 may be an application program or code block running on a computer device.
  • the computer device may be at least one of a physical host, a virtual machine, a container, and other computing devices. Further, the above computer equipment may be one or more.
  • the processing module 1001 may be an application running on multiple hosts/virtual machines/containers. It should be noted that multiple hosts/virtual machines/containers used to run the application can be distributed in the same AZ or in different AZs. Multiple hosts/virtual machines/containers used to run the application can be distributed in the same region or in different regions. There is no specific limit here.
  • multiple hosts/VMs/containers used to run the application can be distributed in the same virtual private cloud (VPC) or across multiple VPCs.
  • VPC virtual private cloud
  • the processing module 1001 may include at least one computing device, such as a server.
  • the processing module 1001 may also be a device implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the above-mentioned PLD can be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • Multiple computing devices included in the processing module 1001 may be distributed in the same AZ or in different AZs. Multiple computing devices included in the processing module 1001 may be distributed in the same region or in different regions. Similarly, multiple computing devices included in the processing module 1001 may be distributed in the same VPC or in multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • the cloud management platform 1000 can perform the operations performed by the cloud management platform in the embodiments shown in FIGS. 1 to 9 , which will not be described again here.
  • Figure 11 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • computing device 1000 includes: bus 1003, processor 1005, memory 1004, and communication interface 1006.
  • the processor 1005, the memory 1004 and the communication interface 1006 communicate through the bus 1003.
  • Computing device 1000 may be a server or a terminal device. It should be understood that the present invention does not limit the number of processors and memories in the computing device 1000.
  • the bus 1003 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one line is used in Figure 11, but it does not mean that there is only one bus or one type of bus.
  • Bus 1003 may include a path that carries information between various components of computing device 1000 (eg, memory 1004, processor 1005, communications interface 1006).
  • the processor 1005 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
  • Memory 1004 may include volatile memory, such as random access memory (RAM).
  • the memory 1004 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • the memory 1004 stores executable program code, and the processor 1005 executes the executable program code to implement the functions of the aforementioned processing module 1001 and transceiver module 1002 respectively, thereby implementing a fault handling method based on cloud technology. That is, the memory 1004 stores instructions for the cloud management platform to execute the fault handling method based on cloud technology.
  • the communication interface 1006 uses transceiver units such as, but not limited to, network interface cards and transceivers to implement communication between the computing device 1000 and other devices or communication networks.
  • An embodiment of the present invention also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device may be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device may also be a terminal device such as a desktop computer, a laptop computer, or a smartphone.
  • Figure 12 is a schematic structural diagram of a computing device cluster provided by an embodiment of the present invention.
  • a computing device cluster includes at least one computing device 1000.
  • the memory 1004 of one or more computing devices 1000 in the computing device cluster may store instructions of the same cloud management platform for executing the fault handling method based on cloud technology.
  • the memory 1004 in different computing devices 1000 in the computing device cluster can store different instructions for executing some of the functions of the cloud management platform. That is, the instructions stored in the memory 1004 of different computing devices 1000 can implement the functions of one or more of the processing module 1001 and the transceiver module 1002.
  • An embodiment of the present invention also provides a computer program product containing instructions.
  • the computer program product may be a software or program product containing instructions capable of running on a computing device or stored in any available medium.
  • when the computer program product is run on at least one computing device, the at least one computing device is caused to execute the above-mentioned cloud technology-based fault handling method applied to the cloud management platform.
  • An embodiment of the present invention also provides a computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that a computing device can store, or a data storage device, such as a data center, that contains one or more available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, tape), optical media (eg, DVD), or semiconductor media (eg, solid state drive), etc.
  • the computer-readable storage medium includes instructions that instruct the computing device to execute the above-mentioned cloud technology-based fault handling method applied to the cloud management platform.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiments of this application disclose a fault handling method based on cloud technology, a cloud management platform, and related devices, which are used to reduce the blast radius of faults and ensure the normal operation of cloud services. The method of the embodiments of this application is applied to a cloud service whose architecture includes at least a first service layer; the first service layer includes at least a first service grid and a second service grid, and the first service grid and the second service grid store the same cloud service data. The method includes: receiving a first call request sent by a tenant, and determining that the call request corresponds to the first service layer; and, when the first service grid fails, forwarding the first call request to the second service grid, which is in a normal working state.

Description

Fault handling method based on cloud technology, cloud management platform, and related devices
This application claims priority to the Chinese patent application No. CN202210762448.1, entitled "A cloud computing general component and cluster division method supporting blast radius control", filed with the China National Intellectual Property Administration on June 30, 2022, and to the Chinese patent application No. CN202211214539.8, entitled "Fault handling method based on cloud technology, cloud management platform, and related devices", filed with the China National Intellectual Property Administration on September 30, 2022, the entire contents of which are incorporated into this application by reference.
技术领域
本申请涉及云计算领域,尤其涉及基于云技术的故障处理方法、云管理平台和相关设备。
背景技术
随着云计算领域的持续发展,单一云计算服务区域(region)的设备数量不断增加,甚至会超过百万台设备。在这种情况下,如果单一云服务软件崩溃,会导致整个区域中的某个云服务不可用。
在一种云服务的处理方法中,将区域划分为若干个单元(cell),每个单元是一个自包容的服务体系,能够独立运行。同时,将租户进行划分,使得每个单元对应部分租户。如果某个单元发生故障,那么该单元所对应的服务将不可用,无法为该单元对应的租户提供服务。
发明内容
本申请提供了基于云技术的故障处理方法、云管理平台和相关设备。在基于云技术的故障处理方法中,云服务的架构包括第一服务层,第一服务层中包括多个服务格,同一个服务层的多个服务格中存储相同的云服务数据。云管理平台接收租户发出的第一调用请求,并确定第一调用请求对应于第一服务层,如果第一服务层中的第一服务格发生故障,可以将该调用请求转发给同一个服务层中且处于正常工作状态的第二服务格处理,降低了故障的爆炸半径,从而保证云服务的正常运行。
本申请第一方面提供了一种基于云技术的故障处理方法,包括:
云管理平台提供的云服务架构中至少包括第一服务层，第一服务层至少包括第一服务格和第二服务格。第一服务格和第二服务格中存储了相同的云服务数据。所谓云服务数据，是指与云管理平台提供的云服务相关的数据，包括租户信息（例如账号、密码）、鉴权信息（验证码）以及其他在调用云服务的过程中需要使用的信息，具体此处不做限定。租户在使用云管理平台提供的云服务的过程中，会发起调用请求，也就是说，云管理平台会接收来自于租户的第一调用请求，并确定第一调用请求对应于第一服务层。如果第一服务层中第一服务格故障，意味着第一服务格无法提供云服务，也就不能处理第一调用请求。因此，在确定第一服务层中的第一服务格故障的情况下，云管理平台会将第一调用请求转发至第一服务层中处于正常工作状态的第二服务格。
从以上技术方案可以看出，本申请具有以下优点：同一个服务层的多个服务格中存储相同的云服务数据。云管理平台接收租户发出的第一调用请求，并确定第一调用请求对应于第一服务层，如果第一服务层中的第一服务格发生故障，可以将该调用请求转发给同一个服务层中且处于正常工作状态的第二服务格处理，降低了故障的爆炸半径，从而保证云服务的正常运行。
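为便于理解上述故障切换的转发逻辑，下面给出一个极简的Python示意（其中deck、grid的数据结构以及grid0-1等名称均为本示例的假设，并非本申请的实际实现）：

```python
# 示意：同一服务层（deck）内的服务格（grid）故障切换
def forward_request(deck, request):
    """将调用请求转发至服务层中一个处于正常工作状态的服务格。"""
    for grid in deck["grids"]:
        if grid["healthy"]:
            # 同一服务层的各服务格存储相同的云服务数据，
            # 因此任一健康的服务格都能处理该请求
            return grid["name"]
    raise RuntimeError("该服务层内无处于正常工作状态的服务格")

deck0 = {
    "grids": [
        {"name": "grid0-1", "healthy": False},  # 第一服务格故障
        {"name": "grid0-2", "healthy": True},   # 第二服务格正常
    ]
}

target = forward_request(deck0, {"tenant_id": "tenant-1"})
print(target)  # grid0-2
```

实际系统中，各服务格的健康状态由云管理平台持续检测并记录在服务层对应的状态信息中。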
在第一方面的一种可能的实现方式中，云管理平台还可以存储并更新每个服务层所对应的状态信息，该状态信息指示的是服务层中每个服务格是否故障。在确定第一服务格故障的情况下，云管理平台会更新第一服务格所在的第一服务层对应的状态信息。更新后的状态信息指示第一服务格故障。云管理平台会根据更新后的第一服务层对应的状态信息，向处于正常工作状态的第二服务格转发第一调用请求。
基于上述方法,云管理平台在服务格故障的情况下,会更新该服务格对应的服务层的状态信息,并根据更新后的状态信息转发调用请求,避免将调用请求转发至故障的服务格而导致的服务不可用。也即为降低故障的影响范围提供了实现基础,提升了本申请技术方案的可实现性。
在第一方面的一种可能的实现方式中,租户发出的第一调用请求中携带租户的标识,该标识用于唯一指示租户。在云管理平台将第一调用请求转发至第二服务格之前,能够根据第一调用请求携带的租户的标识,以及标识与服务层之间的映射关系,确定该租户对应于第一服务层,也就确定了第一调用请求对应于第一服务层。也就是说,确定该租户发出的第一调用请求,需要转发至第一服务层中的服务格。换言之,租户发起的调用请求中携带租户的标识,租户的标识与服务层之间存在映射关系,使得云管理平台能够把不同租户发起的调用请求对应到各自的服务层。
基于上述方法,云管理平台能够对租户进行分流,通过租户标识确定每个租户发起的调用请求所对应的服务层,使得服务层之间的故障隔离,也能在一定程度上降低故障的爆炸半径。
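上述“租户标识到服务层”的分流可以用一个稳定的哈希映射来示意（以下Python片段中的服务层数量、哈希取模规则均为示例性假设，实际映射关系由云管理平台存储的配置决定）：

```python
import hashlib

NUM_DECKS = 2  # 示例：假设共有两个服务层

def deck_of(tenant_id: str) -> int:
    """把租户标识稳定地映射到某个服务层编号。"""
    digest = hashlib.md5(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_DECKS

# 同一租户的所有调用请求总是对应到同一个服务层，
# 不同服务层之间由此实现故障隔离
assert deck_of("tenant-42") == deck_of("tenant-42")
assert 0 <= deck_of("tenant-7") < NUM_DECKS
```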
在第一方面的一种可能的实现方式中,云管理平台在确定第一调用请求对应第一服务层之后,如果第一服务层中的每个服务格都处于正常工作状态,云管理平台会将第一调用请求发送至第一服务格。可选的,此处的第一服务格可以是第一服务层中与租户的地址位置最近或者与租户之间的网络时延最小的服务格,具体此处不做限定。
基于上述方法,在第一服务层中的多个服务格均处于正常工作状态的情况下,云管理平台可以选择距离租户地址位置最近的服务格转发调用请求,以提高云服务的可靠性;或者云管理平台选择网络时延最低的服务格转发调用请求,以提高云服务的效率和响应速度。
在第一方面的一种可能的实现方式中,云管理平台可以向租户提供配置接口,配置接口用于获取租户输入或选择的云服务。也就是说,云管理平台提供的云服务可以匹配于租户的实际需求,适用于租户输入的条件或者选择。
基于上述方法,云管理平台所提供的云服务能够灵活适应租户的需求,丰富了本申请技术方案的应用场景,提升了实用性。
在第一方面的一种可能的实现方式中，云管理平台能够通过多种方式确认服务格是否故障。云管理平台会连续多次向第一服务格发送检测信息，检测信息用于检测第一服务格的状态。其中，第一服务格的状态是指第一服务格是否故障。如果云管理平台连续多次接收来自于第一服务格的异常响应信息，则确定第一服务格故障；或者，云管理平台连续多次未在预设时间段内接收来自于第一服务格的响应信息，则确定第一服务格故障。
基于上述方法,云管理平台通过多种方式确定第一服务格是否故障,丰富了本申请技术方案的实现方式。另外,云管理平台确认服务格故障需要连续多次检测,能够避免将偶发情况(例如网络暂时不稳定等)认定为服务格故障,提升了故障检测的准确度。
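上述“连续多次检测失败才判定故障”的逻辑可以示意如下（probe为示例性的探测函数，返回True表示收到正常响应，False表示异常响应或超时，重试次数为假设值）：

```python
def is_faulty(probe, retries: int = 3) -> bool:
    """仅当连续 retries 次探测均失败时，才判定服务格故障。"""
    for _ in range(retries):
        if probe():
            return False  # 任意一次探测成功，即认为处于正常工作状态
    return True

# 偶发一次异常（例如网络暂时抖动）后恢复：不判定为故障
responses = iter([False, True, True])
assert is_faulty(lambda: next(responses)) is False

# 连续多次异常响应或超时：判定为故障
assert is_faulty(lambda: False) is True
```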
在第一方面的一种可能的实现方式中,云服务包括弹性云服务、云硬盘服务、虚拟私有云服务、云数据库服务或分布式缓存服务。
基于上述方法,云管理平台能够提供多种类型的云服务,更好地满足了租户业务开展的多样性。
在第一方面的一种可能的实现方式中,云管理平台还可以接收租户发出的第二调用请求,第二调用请求对应于第一服务层。其中,第二调用请求对应的云服务可以与第一调用请求对应的云服务相同,也可以不同,具体此处不做限定。云管理平台会确定第一服务层的当前状态信息,在第一服务层的当前状态信息指示多个服务格均处于正常工作状态的情况下,会从这多个服务格中确定目标服务格,并向目标服务格转发第二调用请求。可选的,目标服务格为多个服务格中距离租户位置最近或者时延最低的服务格,具体此处不做限定。
基于上述方法,在第一服务层中的多个服务格均处于正常工作状态的情况下,云管理平台可以选择距离租户位置最近的服务格转发调用请求,以提高云服务的可靠性;或者云管理平台选择时延最低的服务格转发调用请求,以提高云服务的效率和响应速度。
本申请第二方面提供了一种云管理平台,云管理平台提供的云服务架构至少包括第一服务层,第一服务层至少包括第一服务格和第二服务格,第一服务格和第二服务格存储相同的云服务数据。云管理平台包括:收发模块,用于接收租户发出的第一调用请求;处理模块,用于确定第一调用请求对应于第一服务层;收发模块,还用于在第一服务格故障的情况下,将第一调用请求转发至第二服务格,第二服务格处于正常工作状态。
第二方面或第二方面任意一种实现方式是第一方面或第一方面任意一种实现方式对应的装置实现,第一方面或第一方面任意一种实现方式中的描述适用于第二方面或第二方面任意一种实现方式,在此不再赘述。
本申请第三方面提供一种计算设备集群,包括至少一个计算设备,每个计算设备包括处理器和存储器;至少一个计算设备的处理器用于执行至少一个计算设备的存储器中存储的指令,以使得计算设备集群实现第一方面及第一方面任一种可能的实现方式所揭示的方法。
本申请第四方面提供一种包含指令的计算机程序产品,当指令被计算机设备集群运行时,使得计算机设备集群实现第一方面及第一方面任一种可能的实现方式所揭示的方法。
本申请第五方面提供一种计算机可读存储介质,包括计算机程序指令,当计算机程序指令由计算设备集群执行时,使得计算设备集群执行第一方面及第一方面任一种可能的实现方式所揭示的方法。
本申请第二方面至第五方面所示的有益效果与第一方面以及第一方面任一种可能的实现方式类似,此处不再赘述。
附图说明
图1为本申请实施例提供的基于云技术的故障处理方法的一个架构示意图;
图2为本申请实施例提供的服务层和云服务格一个示意图;
图3为本申请实施例提供的基于云技术的故障处理方法的一个流程示意图；
图4为本申请实施例提供的数据同步的一个示意图;
图5为本申请实施例提供的数据同步的另一个示意图;
图6为本申请实施例提供的云管理平台的一个界面示意图;
图7为本申请实施例提供的云管理平台的另一个界面示意图;
图8为本申请实施例提供的云服务架构的示意图;
图9为本申请实施例提供的计算设备的一个结构示意图;
图10为本申请实施例提供的云管理平台的一个结构示意图;
图11为本申请实施例提供的计算设备的另一个结构示意图;
图12为本申请实施例提供的计算设备集群的一个结构示意图。
具体实施方式
本申请提供了基于云技术的故障处理方法、云管理平台和相关设备。在基于云技术的故障处理方法中,云服务的架构包括多个服务层,每个服务层中包括多个服务格,同一个服务层的多个服务格中会存储相同的云服务数据。云管理平台接收租户发出的第一调用请求,并确定第一调用请求对应于第一服务层,如果第一服务层中的第一服务格发生故障,那么可以将第一调用请求转发给同一个服务层中的第二服务格处理,降低了故障的爆炸半径,从而保证云服务的正常运行。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,其目的在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。另外,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
为了便于理解本发明实施例,首先,对本发明涉及的部分术语进行解释说明。
区域(region):从地理位置和网络时延维度划分,同一个region内使用同一个资源池(可以理解为共享弹性计算、块存储、对象存储、虚拟私有云(virtual private cloud,VPC)网络、弹性公网网际互连协议(internet protocol,IP)和镜像等公共服务)。
可用性区域（availability zone，AZ）：一个AZ是一个或者多个数据中心的集合，一个AZ有独立的水电。在AZ中，可以从逻辑上将计算、网络、存储等资源划分为多个集群。一个region中的多个AZ之间可以通过高速光纤相连，以满足租户跨AZ构建高可用性系统的需求。其中，云数据中心是一种基于云计算架构、计算存储和网络资源耦合的数据中心，设置有多个服务器，利用虚拟化技术将多个服务器提供的运算能力以及网络、存储资源以相互隔离的方式提供给不同租户使用。
云管理平台,能够提供访问接口(如界面或应用程序编程接口(application programming interface,API)),租户可操作客户端远程接入访问接口在云管理平台注册云账号和密码,并登录云管理平台,云管理平台对云账号和密码鉴权成功后,租户可进一步在云管理平台付费选择并购买特定规格(处理器、内存、磁盘)的虚拟机,付费购买成功后,云管理平台提供所购买的虚拟机的远程登录账号密码,客户端可远程登录该虚拟机,在该虚拟机中安装并运行租户的应用。可以将云管理平台基于逻辑功能划分:租户控制台、计算管理服务、网络管理服务、存储管理服务、鉴权服务、镜像管理服务。租户控制台提供界面或API与租户交互,计算管理服务用于管理运行虚拟机和容器的服务器以及裸金属服务器,网络管理服务用于管理网络服务(如网关、防火墙等),存储管理服务用于管理存储服务(如数据桶服务),鉴权服务用于管理租户的账号密码,镜像管理服务用于管理虚拟机镜像。云管理平台客户端能够实现以下功能:接收云管理平台发送的控制面命令,根据控制面控制命令在服务器上创建并对虚拟机进行全生命周期管理。因此,租户或者管理者可通过云管理平台在云数据中心中创建、管理、登录和操作虚拟机。其中,虚拟机也可称为云服务器(elastic compute service,ECS)、弹性实例等,具体此处不做限定。
下面,请参阅图1,图1为本申请实施例提供的基于云技术的故障处理的架构示意图。
如图1所示,云管理平台提供的云服务架构包括多个服务层(deck),每个服务层中包括多个服务格(grid)。其中,服务层可以理解为基于数据维度划分的逻辑集合,示例性的,在云计算中,可以选择租户维度,按照一定的组合属性,将租户划分至不同的服务层切片中。也就是说,一个服务层对应于一批具有相同属性的租户。需要注意的是,不同服务层中包括的服务格数量可以相同,也可以不同,具体此处不做限定。作为软件变更和部署的单元,一个服务层中的每个服务格都具备完成该服务层切片的完整业务能力。
如图1所示,租户通过客户端在云管理平台购买云服务,租户向云管理平台发送调用请求,该调用请求用于向云管理平台请求云服务。云管理平台根据调用请求中携带的租户标识,以及租户标识与服务层之间的映射关系,确定该调用请求对应的服务层。然后根据该服务层对应的状态信息,向该服务格中处于正常工作状态的服务格转发该调用请求。
服务层对应的状态信息，指示了服务层对应的各个服务格是否故障。云管理平台会周期性地检测各个服务格的状态，在服务格的状态改变的情况下，更新对应服务层的状态信息。其中，服务格的状态改变包括，服务格由故障变为处于正常工作状态，或者，由处于正常工作状态变为故障。由于云管理平台会检测并更新各个服务层的状态，因此，在提供云服务的某个服务格故障的情况下，云管理平台会将提供云服务的服务格切换为同一个服务层中，处于正常工作状态的服务格，由此实现故障隔离，降低故障的爆炸半径（也即故障的影响范围），保证了云服务的顺利进行。
需要注意的是,云管理平台可以部署在计算设备上,计算设备可以有多种设备形态,包括:交换机、路由器、或者芯片等,具体此处不做限定。其中,交换机或者路由器既可以是物理网元,也可以是虚拟网元(即纯软件实现的一个或者多个功能模块的组合),具体此处不做限定。
本申请中,服务格的划分原则上以故障互不影响的计算机集群作为依据,这里的集群包括机房、AZ、机柜、故障隔离的虚拟机集群等,具体此处不做限定。
示例性的,可以将服务层与AZ的交集作为一个服务格。请参阅图2,图2为本申请实施例提供的服务层和服务格的一个示意图。
如图2所示,在纵向维度,一个region中有多个风火水电独立的AZ,基于AZ进行划分;在横向维度,可以按照一定的数据维度划分deck。这样便可以确定deck与AZ的交集为grid。例如,图2所示的deck0和AZ1的交集为grid0-1。基于图2所示的划分方式,实现了grid之间的故障隔离。
在本申请实施例中,如图2所示的架构称为服务格架构(grid architecture),是一种广泛适用于公有云、私有云、行业云场景,降低服务故障的影响面(也即故障爆炸半径)的云计算基础设施架构。
需要注意的是,图2只是对划分服务格的一个示例,在实际应用中,云管理平台还可以提供更多或者更少数量的服务层,每个服务层中还可以包括更多数量的服务格,具体此处不做限定。
下面,以云管理平台提供的云服务架构中包括第一服务层,第一服务层包括第一服务格和第二服务格为例,对本申请实施例提供的基于云技术的故障处理方法的流程进行说明。
请参阅图3,图3为本申请实施例提供的基于云技术的故障处理方法的流程示意图,包括以下步骤:
301.云管理平台接收来自于客户端的第一调用请求。
租户通过客户端向云管理平台发送第一调用请求,第一调用请求用于请求云管理平台提供的云服务。
302.云管理平台确定第一调用请求对应的服务层为第一服务层。
第一调用请求中携带租户的标识,该标识用于唯一指示租户。云管理平台能够根据第一调用请求携带的租户的标识,以及标识与服务层之间的映射关系,确定该租户对应于第一服务层。也就是说,确定该租户发出的第一调用请求,需要转发至第一服务层中的服务格。
需要注意的是,在实际应用中,用于唯一指示租户的租户标识有多种可能,可以是网际互连协议(internet protocol,IP)地址、媒体存取控制(media access control,MAC)地址。除此之外,还可以是其他起到唯一指示作用的标识,例如:身份证号码、手机号码、企业代码与员工号的组合等,具体此处不做限定。
基于上述方法，云管理平台能够存储租户的标识与服务层之间的映射关系，使得云管理平台能够把不同租户发起的调用请求对应到各自的服务层，对租户进行分流，通过租户标识确定每个租户发起的调用请求所对应的服务层，使得服务层之间的故障隔离，也能在一定程度上降低故障的爆炸半径。
303.若确定第一服务格故障,云管理平台更新第一服务层对应的状态信息。
云管理平台还可以存储并更新每个服务层所对应的状态信息,该状态信息指示的是服务层中每个服务格是否故障。在确定第一服务格故障的情况下,云管理平台会更新第一服务格所在的第一服务层对应的状态信息,更新后的状态信息指示第一服务格故障。
在一些可选的实施方式中,云管理平台能够通过多种方式确认服务格是否故障。云管理平台会连续多次向第一服务格发送检测信息,检测信息用于检测第一服务格的状态。其中,第一服务格的状态是指第一服务格是否故障。如果云管理平台连续多次接收来自于第一服务格的异常响应信息,则确定第一服务格故障;或者,云管理平台连续多次未在预设时间段内接收来自于第一服务格的响应信息,则确定第一服务格故障。
可以理解的是，云管理平台连续多次向第一服务格发送检测信息包括的场景为：云管理平台向第一服务格发送检测信息，第一服务格回复异常响应或者在预设时间段内云管理平台未接收来自于第一服务格的响应信息，为了避免将偶发情况视为第一服务格故障，云管理平台会重新向第一服务格发送检测信息。多次重试的结果都是第一服务格回复异常响应消息或者未在预设时间段内接收来自于第一服务格的响应信息，则可以判定第一服务格故障（也即不健康）。
基于上述方法,云管理平台通过多种方式确定第一服务格是否故障,丰富了本申请技术方案的实现方式。另外,云管理平台确认服务格故障需要连续多次检测,能够避免将偶发情况(例如网络暂时不稳定等)认定为服务格故障,提升了故障检测的准确度。同时,云管理平台在第一服务格故障的情况下,会更新第一服务格对应的第一服务层的状态信息,更新后的状态信息指示第一服务格故障,避免将第一调用请求转发至故障的服务格而导致的服务不可用,也即为降低故障的影响范围提供了实现基础,提升了本申请技术方案的可实现性。
304.云管理平台向第二服务格发送第一调用请求。
在更新后的状态信息指示第一服务格故障的情况下，云管理平台会根据更新后的第一服务层对应的状态信息，向处于正常工作状态的第二服务格转发第一调用请求。其中，第二服务格处于正常工作状态也可以称为第二服务格健康。
云管理平台之所以能够将第一调用请求转发至第二服务格,不仅是因为第二服务格处于正常工作状态,还因为在本申请技术方案中,同一个服务层中的各个服务格中会存储相同的云服务数据。所谓云服务数据,是指与云管理平台提供的云服务相关的数据,包括租户信息(例如账号、密码)、鉴权信息(验证码)以及其他在调用云服务的过程中需要使用的信息,具体此处不做限定。
在一些可选的实施方式中,同一个服务层中的各个服务格之间可以通过数据同步的方式存储相同的云服务数据。其中,同步方式包括实时同步或者准实时同步,具体此处不做限定。
下面结合示意图，对服务格之间的数据同步进行说明，请参阅图4和图5，图4和图5均为本申请实施例提供的数据同步的示意图。
如图4所示,可以在每个服务层之间维持一个事务日志队列(transaction log queue),当第一个服务格的数据库(data base,DB),缓存(cache)或者消息队列(message queue)写入数据时,将这些软件(数据库/缓存/消息队列)的事务日志(transaction log)通过transaction log queue同步到另外一个或者多个服务格中,完成数据同步。
如图5所示，可以要求写状态的微服务在写服务格1-1的状态数据到数据库/缓存/消息队列时，同时将数据写入到服务格1-2。对于由于服务格故障等原因无法实时写入的数据，可以在每个服务格后台启动一个定时作业，定期扫描数据库/缓存/消息队列来补齐这些数据。
需要注意的是,图4和图5只是以一个服务层包括两个服务格为例,在实际应用中,一个服务层中还可以有更多服务格,同步方法类似,此处不再赘述。
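图4所示的事务日志队列同步方式，可以用内存数据结构粗略示意如下（以字典模拟服务格的数据库、以collections.deque模拟日志队列，均为示例性假设）：

```python
from collections import deque

log_queue = deque()          # 模拟事务日志队列（transaction log queue）
grid_1_1, grid_1_2 = {}, {}  # 以字典模拟同一服务层中两个服务格的数据库

def write(grid, key, value):
    """向某个服务格写入数据，同时把事务日志放入队列。"""
    grid[key] = value
    log_queue.append((key, value))

def replay(target):
    """把队列中的事务日志回放到同一服务层的另一个服务格，完成数据同步。"""
    while log_queue:
        key, value = log_queue.popleft()
        target[key] = value

write(grid_1_1, "tenant-1", {"token": "abc"})
replay(grid_1_2)
assert grid_1_1 == grid_1_2  # 两个服务格存储相同的云服务数据
```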
从以上技术方案可以看出,本申请具有以下优点:同一个服务层的多个服务格中存储相同的云服务数据。云管理平台接收租户发出的对应于第一服务层的第一调用请求,如果第一服务层中的第一服务格发生故障,可以将该调用请求转发给同一个服务层中且处于正常工作状态的第二服务格处理,降低了故障的爆炸半径,从而保证云服务的正常运行。
在一些可选的实施方式中,云管理平台在确定第一调用请求对应第一服务层之后(也即在步骤302之后),如果第一服务层中的每个服务格都处于正常工作状态,云管理平台会将第一调用请求发送至第一服务格。可选的,此处的第一服务格可以是第一服务层中与租户的地址位置最近或者与租户之间的网络时延最小的服务格,具体此处不做限定。
基于上述方法,在第一服务层中的多个服务格均处于正常工作状态的情况下,云管理平台可以选择距离租户地址位置最近的服务格转发调用请求,以提高云服务的可靠性;或者云管理平台选择网络时延最低的服务格转发调用请求,以提高云服务的效率和响应速度。
在一些可选的实施方式中,云管理平台可以通过检测各个服务格的状态,获取服务层对应的状态信息,也就是说,云管理平台还可以执行如下操作:
305.云管理平台向第一服务格发送第一检测信息。
云管理平台可以在超文本传输协议(hyper text transfer protocol,HTTP)请求中承载第一检测信息,第一检测信息用于检测第一服务格是否发生故障。可选的,云管理平台还可以在其他请求中承载第一检测信息,例如租户数据报协议(user datagram protocol,UDP)请求,具体此处不做限定。
306.云管理平台接收来自于第一服务格的第一响应信息。
在一些可选的实施方式中,第一服务格在收到第一检测信息之后,会向云管理平台发送第一响应信息,第一响应信息指示第一服务格处于正常工作状态。
307.云管理平台向第二服务格发送第二检测信息。
308.云管理平台接收来自于第二服务格的第二响应信息。
步骤307至步骤308,与步骤305至步骤306类似,此处不再赘述。
309.云管理平台存储第一服务层对应的状态信息。
云管理平台在收到第一响应信息和第二响应信息之后,会存储第一服务层对应的状态信息。可以理解的是,基于前文对步骤305至步骤308的简单说明,在步骤309中,云管理平台存储的第一服务层对应的状态信息指示第一服务层对应的第一服务格和第二服务格均处于正常工作状态。
在一些可选的实施方式中,在第一服务层中存在多个处于正常工作状态的服务格的情况下,云管理平台可以基于一定的策略选择提供云服务的服务格。下面举例说明这种情况:
云管理平台还可以接收租户发出的第二调用请求,第二调用请求对应于第一服务层。其中,第二调用请求对应的云服务可以与第一调用请求对应的云服务相同,也可以不同,具体此处不做限定。云管理平台会确定第一服务层的当前状态信息,在第一服务层的当前状态信息指示多个服务格均处于正常工作状态的情况下,会从这多个服务格中确定目标服务格,并向目标服务格转发第二调用请求。可选的,目标服务格为多个服务格中距离租户地址位置最近或者与租户之间网络时延最低的服务格,具体此处不做限定。
基于上述方法,在第一服务层中的多个服务格均处于正常工作状态的情况下,云管理平台可以选择距离租户地址位置最近的服务格转发调用请求,以提高云服务的可靠性;或者云管理平台选择与租户之间网络时延最低的服务格转发调用请求,以提高云服务的效率和响应速度。
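在多个服务格均处于正常工作状态时选择目标服务格的策略，可以示意如下（其中的时延数值为虚构示例）：

```python
grids = [
    {"name": "grid0-1", "healthy": True,  "latency_ms": 12},
    {"name": "grid0-2", "healthy": True,  "latency_ms": 5},
    {"name": "grid0-3", "healthy": False, "latency_ms": 1},  # 故障，不参与选择
]

def pick_target(grids):
    """从处于正常工作状态的服务格中，选择网络时延最低的目标服务格。"""
    candidates = [g for g in grids if g["healthy"]]
    return min(candidates, key=lambda g: g["latency_ms"])

print(pick_target(grids)["name"])  # grid0-2
```

若按“距离租户地址位置最近”选择，只需把排序键换成距离即可，思路相同。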
在一些可选的实施方式中,云管理平台还会向租户提供配置接口,配置接口用于获取所述租户输入或选择的云服务。也就是说,云管理平台提供的云服务可以匹配于租户的实际需求,适用于租户输入的条件或者选择。
示例性的,下面结合场景图,对租户输入或者选择云服务的简单过程进行说明。请参阅图6和图7,图6和图7均为本申请实施例提供的云管理平台的界面示意图。
如图6所示,云管理平台的界面包括输入框601。租户可以输入框601中输入云服务的架构的设置条件,云管理平台会响应针对于输入框601的操作指令,配置符合设置条件的云服务架构。
示例性的，假设某租户A要在云上部署某个可靠性相对比较高的服务，先用自动化工具创建一个服务堆栈（包括服务的所有资源和应用），然后复制3个堆栈的副本，得到4个同样的服务堆栈。如果租户A想要将4个同样的服务堆栈平均放置到两个AZ中，并且使得不同AZ中的服务堆栈数据同步。那么，租户A可以在输入框601中输入配置条件为：将4个同样的服务堆栈定义为4个服务格，将4个同样的服务堆栈平均放置到两个AZ中，然后把任意2个不在同一个AZ的服务格划分到一个服务层，并将形成的两个服务层命名为服务层0和服务层1。
基于本申请提供的技术方案，云管理平台的该服务会默认将属于同一个服务层的两个不同服务格堆栈中的有状态中间件设置成互相备份关系。
而后，租户还可以配置该服务的路由规则，如依据使用租户A的服务的租户ID除以2的余数进行划分，余数为0的划分到服务层0，余数为1的划分到服务层1。那么在这个配置的基础上，该服务将自动把使用该云服务的不同租户的请求路由到服务层0和服务层1。
通过上述说明,可以得知本申请技术方案不论是在公有云还是私有云,都能以云服务的形式提供某种服务,该服务可以让客户界定数据维度划分的逻辑集合(也即deck)和集群部署的集合(也即grid);也可以给客户定义配置,从而将消息路由到不同的deck和grid中。
在一些可选的实施方式中,云管理平台的界面还包括预览控件602,租户点击该控件,云管理平台可以显示符合租户输入的设置条件的云服务架构的示意图,该示意图可以类似于图1所示的云服务架构示意图,可以使得租户直观地感受到云服务架构。
在一些可选的实施方式中,云管理平台的界面还包括确定控件603,租户点击确定控件603,云管理平台就可以实际部署满足租户输入的设置条件的云服务架构。
在一些可选的实施方式中,租户输入的配置信息还用于指示租户标识、服务层、服务格之间的映射关系。具体来说,包括租户标识与服务层之间的映射关系,以及服务层与服务格之间的映射关系。租户的标识与服务层之间的映射关系,可以用于对租户分流;服务层与服务格之间的映射关系包括,一个服务层中包括的服务格数量等信息。基于配置信息,云管理平台会部署每个服务层对应的至少两个租户标识和至少两个服务格。
在一些可选的实施方式中,配置信息可以是某种规则或者自定义的算法,使得计算设备将租户标识尽可能均匀地划分至不同的服务层中。例如,假设对于购票的云服务,可以基于租户账号的尾号对租户进行分流,比如将尾号为单数的租户映射到同一个服务层,尾号为双数的租户映射到另一个服务层。每个服务层中所包括的服务格的数量也不进行限定,只要大于一个即可。
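正文举例的“按租户账号尾号奇偶分流”规则可以示意如下（仅为自定义配置规则的一个假设性例子）：

```python
def deck_by_tail(account: str) -> int:
    """尾号为双数的租户映射到服务层0，尾号为单数的租户映射到服务层1。"""
    return int(account[-1]) % 2

assert deck_by_tail("13800000008") == 0  # 尾号双数 → 服务层0
assert deck_by_tail("13800000007") == 1  # 尾号单数 → 服务层1
```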
如图7所示,云管理平台的界面包括备选框701,每个备选框指示一种云服务架构。租户点击详情控件702,云管理平台会响应针对于详情控件702的触控指令,显示对应的云服务架构的详细信息(例如云服务架构中包括的服务层数量,每个服务层包括的服务格数量等)。云管理平台的界面还包括确定控件703,租户点击确定控件703,云管理平台就可以实际部署满足租户所选择的云服务架构。
需要注意的是,图6和图7只是对云管理平台的界面的示例,并不限定界面的具体内容,在实际应用中,可以灵活设定,具体此处不做限定。
需要注意的是,租户在调用云服务的过程中,可以不感知到数据分流的过程。示例性的,假设租户发起的云服务调用请求为购票请求,租户登录购票软件,点击购票,终端设备直接显示的是购票界面。而在后台,实际上进行的是将购票对应的云服务调用请求发送至计算设备,计算设备根据租户标识确定对应的服务层,再将请求转发至该服务层中处于正常工作状态的服务格中。
在一些可选的实施方式中,服务层中的每个服务格都能为多个租户提供云服务。下面结合具体的应用场景,进行说明。
示例性的，假设第一租户请求的第一云服务为买票业务，第一租户对应第一服务层。在第一服务层包括的第一服务格处于正常工作的状态下，由第一服务格提供买票服务。如果第一服务格故障，登录了第一租户账号的终端可以显示刷新界面，使得第一租户重新发起买票业务。此时，云管理平台会将提供买票业务的服务格切换为第二服务格，使得第一租户能够顺利使用买票业务。如果登录了第二租户账号的终端也发起了买票业务，在故障切换之后，第二租户也可以通过第二服务格完成买票。可选的，一个服务格能够提供不同的云服务。第一租户可以先后请求买票业务和听歌业务，第一租户对应目标服务层，在第一服务格处于正常工作的状态下，由第一服务格提供买票服务和听歌服务。
在一些可选的实施方式中,云管理平台提供的云服务包括弹性云服务、云硬盘服务、虚拟私有云服务、云数据库服务或分布式缓存服务,具体此处不做限定。
基于上述方法,云管理平台能够提供多种类型的云服务,更好地满足了租户业务开展的多样性。
下面,结合示意图,对云服务架构进一步说明。请参阅图8,图8为本申请实施例提供的云服务架构的示意图。
示例性的,图8是以将某云服务架构拆分成两个服务层,每个服务层又有两个服务格为例,在实际应用中,云服务可以对应更多数量的服务层,每个服务层可以对应更多数量的服务格,具体此处不做限定。
每个服务格中部署一个具有完整功能的业务集群，包含业务所有的微服务和相关中间件、数据存储等。其中，数据存储可以通过多种方式实现，可以是数据库（data base，DB）或者消息队列（message queue），除此之外，还可以是其他的方式，例如：缓存（cache）实现，具体此处不做限定。代理（proxy）用于进行数据分发，在收到云服务调用请求之后，将该云服务调用请求分发至对应的微服务进行处理。
部署在一个服务层的不同服务格,可以用“消息”的方式进行数据同步,形成一读多写或者主备关系。不同服务格同步的数据为云服务数据,包括租户信息(例如账号、密码等)、鉴权信息(例如验证码)、权限信息等,或者其他在租户发起云服务调用请求中可能用到的信息,具体此处不做限定。
如图8所示,运行云管理平台的计算设备部署在服务层和服务格之外,作为一个通用组件,用于路由云服务的请求。计算设备可以将云服务调用请求路由至任何一个服务层中的任何一个服务格中。
需要注意的是,图8只是一个示例,在实际应用中,还可以包括更多或者更少数量的计算设备,具体此处不做限定。
在本申请实施例中，可以从逻辑上对计算设备进行划分。下面结合示意图进行说明，请参阅图9，图9为本申请实施例提供的计算设备的结构示意图。
如图9所示,计算设备900包括路由模块(router)901、元数据服务(meta data service)模块902、探测(detector)模块903和命名服务(naming service)模块904。
路由模块901,可以认为是计算设备900的核心模块,用于将云服务调用请求按照逻辑数据分区分配到不同的服务层/服务格中。常用情况下,如果云服务暴露的是HTTP/HTTPS接口,路由模块901可以是一个7层的proxy,可以将HTTP/HTTPS(1.1/2.0)请求路由到不同的后端。如果云服务暴露了其他协议(例如UDP协议)的接口,则路由模块901也可以将其他协议路由到不同后端,只要能完成消息路由功能即可,具体此处不做限定。
元数据服务模块902,用于保存逻辑数据分区关系,也即保存服务层与租户标识的映射关系。例如,在典型的以租户为基础的数据分区实现中,元数据服务模块902可以保存租户ID到服务层的映射关系,以及服务层与服务格的映射关系。
可选的,可以用规则/配置(例如自定义算法)确定数据分区关系(也即租户标识与服务层的映射关系)。可选的,也可以将每个租户标识与服务层的映射关系作为一个键值对存储在元数据服务模块902中。
在一些可选的实施方式中,元数据服务模块902可以维护一个协议,在元数据服务模块902的数据更新(也即租户标识与服务层的映射关系更新)时,将更新推送至路由模块901中,使得路由模块901也维护元数据服务模块902中数据的缓存。换句话说,路由模块901中也可以存储并更新租户标识与服务层的映射关系。这样做能够保证元数据服务模块902崩溃不影响计算设备的正常运行。
探测模块903,用于动态检测各个服务格的健康情况,并在服务格故障之后更新命名服务模块904中的数据,也即更新故障服务格所在的服务层对应的状态信息。在一些可选的实施方式中,探测模块903可以通过设定某些固定的测试请求(如HTTPS请求)不断轮询各个服务格,当确定某个服务格故障时,更新命名服务模块904中的数据。除此之外,在服务格主动关闭情况下,也可以通过人工输入命令或者通过其他途径通知探测模块903,并使得探测模块903更新命名服务模块904中的数据。
命名服务模块904,用于保存服务层中每个服务格的健康情况,也即存储各个服务层对应的状态信息。
当路由模块901希望转发云服务调用请求时,会先得到该云服务调用请求所应该发送的服务层。然后从命名服务模块904中得到该服务层中的各个服务格的健康情况,从而选择健康的服务格,完成云服务调用请求到服务格的映射。其中,健康的服务格是指处于正常工作状态的服务格。
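路由模块901结合元数据服务模块902（租户标识到服务层的映射）与命名服务模块904（服务格健康情况）完成路由的过程，可以示意如下（各数据结构与取值均为示例性假设）：

```python
metadata = {"tenant-1": "deck0"}  # 元数据服务：租户标识 → 服务层
naming = {"deck0": {"grid0-1": False, "grid0-2": True}}  # 命名服务：服务格是否健康

def route(tenant_id: str) -> str:
    """先确定请求应发往的服务层，再从中选择健康的服务格。"""
    deck = metadata[tenant_id]
    for grid, healthy in naming[deck].items():
        if healthy:
            return grid
    raise RuntimeError("服务层 " + deck + " 内无健康的服务格")

print(route("tenant-1"))  # grid0-2
```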
需要注意的是，本申请技术方案的实现方式，除了上文提到的不论在公有云还是私有云都能以云服务的形式提供某种服务之外，还有其他形式。该云服务可以让租户界定数据维度划分的逻辑集合（也即deck）和集群部署的集合（grid）；也可以给租户定义配置，从而将消息路由到不同的deck和grid中。除此之外，本申请技术方案还可以基于软件方式提供某种架构实现，该架构实现可以让租户界定数据维度划分的逻辑集合（decks）和集群部署的集合（grids），其中集群部署的集合的各个元素（集群）可以通过某种方式互相同步状态。还可以通过软硬件方式提供某种通用软件或者硬件，该通用软件或者硬件可以给客户定义配置，从而将消息路由到不同的数据维度划分的逻辑集合（decks）和集群部署的集合（grids），具体此处不做限定。
根据以上基于云技术的故障处理方法,本发明实施例进一步公开了云管理平台的内部结构,具体参见下文:
请参阅图10,图10为本申请实施例提供的云管理平台的结构示意图。
云管理平台提供的云服务架构至少包括第一服务层，第一服务层至少包括第一服务格和第二服务格，第一服务格和第二服务格存储相同的云服务数据。
如图10所示,云管理平台1000包括:处理模块1001和收发模块1002。
收发模块1002,用于接收租户发出的第一调用请求。
处理模块1001，用于确定第一调用请求对应于第一服务层。
收发模块1002,还用于在第一服务格故障的情况下,将第一调用请求转发至第二服务格,第二服务格处于正常工作状态。
在一些可选的实施方式中,处理模块1001,还用于更新第一服务层中每个服务格的状态信息,状态信息指示第一服务层中的每个服务格是否故障。
在一些可选的实施方式中,第一调用请求中携带租户的标识。处理模块1001,还用于根据租户的标识,以及标识与服务层之间的映射关系,确定租户对应于第一服务层。
在一些可选的实施方式中,处理模块1001,还用于在第一服务层中的每个服务格均处于正常工作状态的情况下,将第一调用请求发送至第一服务格,第一服务格距离租户的地址位置最近或第一服务格与租户之间的网络时延最小。
在一些可选的实施方式中,处理模块1001,还用于向租户提供配置接口,配置接口用于获取租户输入或选择的云服务。
在一些可选的实施方式中,收发模块1002,具体用于连续多次向第一服务格发送检测信息,检测信息用于检测第一服务格的状态。处理模块1001,具体用于若连续多次接收来自于第一服务格的异常响应信息,则确定第一服务格故障。或者,若连续多次未在预设时间段内接收来自于第一服务格的响应信息,则确定第一服务格故障。
在一些可选的实施方式中,云服务包括弹性云服务、云硬盘服务、虚拟私有云服务、云数据库服务或分布式缓存服务。
在一些可选的实施方式中,收发模块1002,还用于接收租户发出的第二调用请求,第二调用请求对应于第一服务层。其中,第二调用请求对应的云服务可以与第一调用请求对应的云服务相同,也可以不同,具体此处不做限定。处理模块1001,还用于确定第一服务层的当前状态信息,在第一服务层的当前状态信息指示多个服务格均处于正常工作状态的情况下,会从这多个服务格中确定目标服务格。可选的,目标服务格为多个服务格中距离租户位置最近或者时延最低的服务格,具体此处不做限定。收发模块1002,还用于向目标服务格转发第二调用请求。
需要说明的是,处理模块1001和收发模块1002均可以通过软件实现,或者可以通过硬件实现。示例性的,接下来以处理模块1001为例,介绍处理模块1001的实现方式。类似的,收发模块1002的实现方式可以参考处理模块1001的实现方式。
当通过软件实现时,处理模块1001可以是运行在计算机设备上的应用程序或代码块。其中,计算机设备可以是物理主机、虚拟机、容器等计算设备中的至少一种。进一步地,上述计算机设备可以是一台或者多台。例如,处理模块1001可以是运行在多个主机/虚拟机/容器上的应用程序。需要说明的是,用于运行该应用程序的多个主机/虚拟机/容器可以分布在相同的AZ中,也可以分布在不同的AZ中。用于运行该应用程序的多个主机/虚拟机/容器可以分布在相同的region中,也可以分布在不同的region中,具体此处不做限定。
同样,用于运行该应用程序的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个region可以包括多个VPC,而一个VPC中可以包括多个AZ。
当通过硬件实现时,处理模块1001中可以包括至少一个计算设备,如服务器等。或者,处理模块1001也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。
处理模块1001包括的多个计算设备可以分布在相同的AZ中,也可以分布在不同的AZ中。处理模块1001包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。同样,处理模块1001包括的多个计算设备可以分布在同一个VPC中,也可以分布在多个VPC中。其中,所述多个计算设备可以是服务器、ASIC、PLD、CPLD、FPGA和GAL等计算设备的任意组合。
云管理平台1000可以执行前述图1至图9所示实施例中云管理平台所执行的操作,此处不再赘述。
请参阅图11,图11为本申请实施例提供的计算设备结构示意图。
如图11所示,计算设备1000包括:总线1003、处理器1005、存储器1004和通信接口1006。处理器1005、存储器1004和通信接口1006之间通过总线1003通信。计算设备1000可以是服务器或终端设备。应理解,本发明不限定计算设备1000中的处理器、存储器的个数。
总线1003可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图11中仅用一条线表示,但并不表示仅有一根总线或一种类型的总线。总线1003可包括在计算设备1000各个部件(例如,存储器1004、处理器1005、通信接口1006)之间传送信息的通路。
处理器1005可以包括中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。
存储器1004可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM)。处理器1005还可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM),快闪存储器,机械硬盘(hard disk drive,HDD)或固态硬盘(solid state drive,SSD)。
存储器1004中存储有可执行的程序代码,处理器1005执行该可执行的程序代码以分别实现前述处理模块1001和收发模块1002的功能,从而实现基于云技术的故障处理方法。也即,存储器1004上存有云管理平台用于执行基于云技术的故障处理方法的指令。
通信接口1006使用例如但不限于网络接口卡、收发器一类的收发单元，来实现计算设备1000与其他设备或通信网络之间的通信。
本发明实施例还提供了一种计算设备集群。该计算设备集群包括至少一台计算设备。该计算设备可以是服务器,例如是中心服务器、边缘服务器,或者是本地数据中心中的本地服务器。在一些实施例中,计算设备也可以是台式机、笔记本电脑或者智能手机等终端设备。
请参阅图12,图12是本发明实施例提供的计算设备集群一种结构示意图。
如图12所示,计算设备集群包括至少一个计算设备1000。计算设备集群中的一个或多个计算设备1000中的存储器1004中可以存有相同的云管理平台用于执行基于云技术的故障处理方法的指令。
需要说明的是，计算设备集群中的不同的计算设备1000中的存储器1004可以存储不同的指令，用于执行云管理平台的部分功能。也即，不同的计算设备1000中的存储器1004存储的指令可以实现处理模块1001和收发模块1002中的一个或多个模块的功能。
本发明实施例还提供了一种包含指令的计算机程序产品。所述计算机程序产品可以是包含指令的，能够运行在计算设备上或被储存在任何可用介质中的软件或程序产品。当所述计算机程序产品在至少一个计算机设备上运行时，使得至少一个计算机设备执行上述应用于云管理平台的基于云技术的故障处理方法。
本发明实施例还提供了一种计算机可读存储介质。所述计算机可读存储介质可以是计算设备能够存储的任何可用介质或者是包含一个或多个可用介质的数据中心等数据存储设备。所述可用介质可以是磁性介质（例如，软盘、硬盘、磁带）、光介质（例如，DVD）、或者半导体介质（例如固态硬盘）等。该计算机可读存储介质包括指令，所述指令指示计算设备执行上述应用于云管理平台的基于云技术的故障处理方法。
最后应说明的是:以上实施例仅用以说明本发明的技术方案,而非对其限制;尽管参照前述实施例对本发明进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本发明各实施例技术方案的保护范围。

Claims (17)

  1. 一种基于云技术的故障处理方法,其特征在于,所述方法应用于云服务,所述云服务的架构中至少包括第一服务层,所述第一服务层至少包括第一服务格和第二服务格,所述第一服务格和所述第二服务格存储相同的云服务数据,所述方法包括:
    接收租户发出的第一调用请求,确定所述第一调用请求对应于所述第一服务层;
    在所述第一服务格故障的情况下,将所述第一调用请求转发至所述第二服务格,所述第二服务格处于正常工作状态。
  2. 根据权利要求1所述的方法,其特征在于,所述在所述第一服务格故障的情况下之后,所述方法还包括:
    更新所述第一服务层中每个服务格的状态信息,所述状态信息指示所述第一服务层中的每个服务格是否故障。
  3. 根据权利要求1或2所述的方法,其特征在于,所述第一调用请求中携带所述租户的标识,所述确定所述第一调用请求对应于所述第一服务层具体包括:
    根据所述租户的标识,以及所述标识与服务层之间的映射关系,确定所述第一调用请求对应于所述第一服务层。
  4. 根据权利要求1或3任一项所述的方法,其特征在于,确定所述第一调用请求对应于所述第一服务层之后,所述方法还包括:
    在所述第一服务层中的每个服务格均处于正常工作状态的情况下,将所述第一调用请求发送至所述第一服务格,所述第一服务格距离所述租户的地址位置最近或所述第一服务格与所述租户之间的网络时延最小。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述方法还包括:
    向所述租户提供配置接口,所述配置接口用于获取所述租户输入或选择的所述云服务。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述确定所述第一服务格故障,具体包括:
    连续多次向所述第一服务格发送检测信息,所述检测信息用于检测所述第一服务格的状态;
    若连续多次接收来自于所述第一服务格的异常响应信息,则确定所述第一服务格故障;或者,若连续多次未在预设时间段内接收来自于所述第一服务格的响应信息,则确定所述第一服务格故障。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述云服务包括弹性云服务、云硬盘服务、虚拟私有云服务、云数据库服务或分布式缓存服务。
  8. 一种云管理平台,其特征在于,所述云管理平台提供的云服务架构至少包括第一服务层,所述第一服务层至少包括第一服务格和第二服务格,所述第一服务格和第二服务格存储相同的云服务数据,所述云管理平台包括:
    收发模块,用于接收租户发出的第一调用请求;
    处理模块,用于确定所述第一调用请求对应于所述第一服务层;
    所述收发模块，还用于在所述第一服务格故障的情况下，将所述第一调用请求转发至所述第二服务格，所述第二服务格处于正常工作状态。
  9. 根据权利要求8所述的云管理平台,其特征在于,所述处理模块,还用于更新所述第一服务层中每个服务格的状态信息,所述状态信息指示所述第一服务层中的每个服务格是否故障。
  10. 根据权利要求8或9所述的云管理平台,其特征在于,所述第一调用请求中携带所述租户的标识;
    所述处理模块,还用于根据所述租户的标识,以及标识与服务层之间的映射关系,确定所述租户对应于所述第一服务层。
  11. 根据权利要求8至10中任一项所述的云管理平台,其特征在于,所述处理模块,还用于在所述第一服务层中的每个服务格均处于正常工作状态的情况下,将所述第一调用请求发送至所述第一服务格,所述第一服务格距离所述租户的地址位置最近或所述第一服务格与所述租户之间的网络时延最小。
  12. 根据权利要求8至11中任一项所述的云管理平台,其特征在于,所述处理模块,还用于向所述租户提供配置接口,所述配置接口用于获取所述租户输入或选择的所述云服务。
  13. 根据权利要求8至12中任一项所述的云管理平台,其特征在于,所述收发模块,具体用于连续多次向所述第一服务格发送检测信息,所述检测信息用于检测所述第一服务格的状态;
    所述处理模块,具体用于若连续多次接收来自于所述第一服务格的异常响应信息,则确定所述第一服务格故障;或者,若连续多次未在预设时间段内接收来自于所述第一服务格的响应信息,则确定所述第一服务格故障。
  14. 根据权利要求8至13中任一项所述的云管理平台,其特征在于,所述云服务包括弹性云服务、云硬盘服务、虚拟私有云服务、云数据库服务或分布式缓存服务。
  15. 一种计算设备集群,其特征在于,包括至少一个计算设备,每个计算设备包括处理器和存储器;
    所述至少一个计算设备的处理器用于执行所述至少一个计算设备的存储器中存储的指令,以使得所述计算设备集群执行如权利要求1至7中任一项所述的方法。
  16. 一种包含指令的计算机程序产品，其特征在于，当所述指令被计算机设备集群运行时，使得所述计算机设备集群执行如权利要求1至7中任一项所述的方法。
  17. 一种计算机可读存储介质,其特征在于,包括计算机程序指令,当所述计算机程序指令由计算设备集群执行时,所述计算设备集群执行如权利要求1至7中任一项所述的方法。
PCT/CN2023/081036 2022-06-30 2023-03-13 基于云技术的故障处理方法、云管理平台和相关设备 WO2024001299A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210762448 2022-06-30
CN202210762448.1 2022-06-30
CN202211214539.8 2022-09-30
CN202211214539.8A CN117376103A (zh) 2022-06-30 2022-09-30 基于云技术的故障处理方法、云管理平台和相关设备

Publications (1)

Publication Number Publication Date
WO2024001299A1 true WO2024001299A1 (zh) 2024-01-04

Family

ID=89382615

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081036 WO2024001299A1 (zh) 2022-06-30 2023-03-13 基于云技术的故障处理方法、云管理平台和相关设备

Country Status (1)

Country Link
WO (1) WO2024001299A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102158612A (zh) * 2010-02-11 2011-08-17 青牛(北京)技术有限公司 基于云计算技术的虚拟呼叫中心系统及其操作方法
US20130238788A1 (en) * 2012-02-24 2013-09-12 Accenture Global Services Limited Cloud services system
CN103778031A (zh) * 2014-01-15 2014-05-07 华中科技大学 一种云环境下的分布式系统多级故障容错方法
US20170331812A1 (en) * 2016-05-11 2017-11-16 Oracle International Corporation Microservices based multi-tenant identity and data security management cloud service


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829490

Country of ref document: EP

Kind code of ref document: A1