CN108632057A - Fault recovery method, device and management system for a cloud computing server

Publication number
CN108632057A
Authority
CN
China
Prior art keywords
failure
application
virtual machine
operating system
cloud computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710160761.7A
Other languages
Chinese (zh)
Inventor
欧亚聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201710160761.7A
Publication of CN108632057A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0654: Management of faults, events, alarms or notifications using network fault recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Embodiments of the invention disclose a fault recovery method, device and management system for a cloud computing server. The method includes: obtaining hardware resource fault information sent by an IaaS management platform, obtaining operating system fault information of the cloud computing server, and obtaining application fault information of the cloud computing server; determining the root cause of the fault of the cloud computing server from the obtained hardware resource fault information, operating system fault information and application fault information; determining a fault handling strategy according to the root cause; and performing fault recovery by executing the operation indicated by the fault handling strategy. Implementing the embodiments provides a high-reliability guarantee for an enterprise's legacy systems on a cloud computing platform and helps keep those legacy systems running reliably and stably.

Description

Fault recovery method, device and management system for a cloud computing server
Technical field
The present invention relates to the field of cloud computing technology, and in particular to a fault recovery method, device and management system for a cloud computing server.
Background
Cloud computing (Cloud Computing) is an emerging commercial computing model. It distributes computing tasks over a resource pool made up of a large number of computers, so that application systems can obtain computing power, storage space and software services on demand. To obtain the benefits of cloud computing, including lower operation and maintenance (O&M) complexity and lower hardware cost, more and more enterprises choose to migrate their traditional IT systems onto cloud resource pools, so that the whole IT system can be operated and maintained in a unified way using cloud services. The running environment of these IT systems changes greatly as a result. Because a cloud computing platform is less reliable than dedicated servers, a system on a cloud computing platform must fully consider how to keep running reliably when part of the computing resources fail. On a cloud computing platform, computing resources are allocated from the resource pool on demand; when a computing resource fails, the system has to wait for the cloud to reschedule and allocate computing resources, for example triggered by elastic scaling. In the prior art, to fit the cloud architecture and to ensure that a traditional IT system still obtains a high availability (High Availability, HA) guarantee after it is migrated to a cloud resource pool, the IT system is usually required to be a cloud-ready (Cloud-Ready) system. A Cloud-Ready system should, first, be a distributed system with a high degree of cohesion and transparency; second, it should be redundant, able to handle server failures, and free of single points of failure.
However, an enterprise's IT systems often also include legacy systems that lack these characteristics. These legacy systems are built as stovepipe-style vertical systems whose architecture does not fully account for dynamic resource allocation, resource failure and similar conditions in a cloud environment, so they are non-Cloud-Ready systems. From the standpoint of architectural compatibility, vertical systems and distributed systems do not fit together, and current cloud computing is generally designed for distributed systems, so the general HA schemes of current cloud computing platforms do not apply to enterprise legacy systems. When an enterprise migrates its entire IT system (including these legacy systems) onto cloud resource pools, the legacy systems are merely redeployed on computing resources allocated by the cloud. They cannot obtain the cloud's reliability guarantees, such as elastic scaling and on-demand resource allocation, and therefore face a major challenge in terms of reliability.
Summary of the invention
Embodiments of the present invention provide a fault recovery method, device and management system for a cloud computing server, to solve the reliability problem of legacy systems after they are migrated to the cloud.
In a first aspect, an embodiment of the present invention provides a fault recovery method for a cloud computing server, applied to the cloud computing server, including:
A Platform as a Service (PaaS) management platform obtains hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform manages the hardware resources of the cloud computing server and detects hardware resource fault information of those hardware resources, and the IaaS management platform is independent of the cloud computing server; obtains operating system fault information of the cloud computing server, which indicates a fault of the operating system installed on the cloud computing server; and obtains application fault information of the cloud computing server, which indicates a fault of an application installed on the operating system;
determines the root cause of the fault of the cloud computing server from the obtained hardware resource fault information, operating system fault information and application fault information; determines a fault handling strategy according to the root cause; and performs fault recovery by executing the operation indicated by the fault handling strategy.
The first aspect describes the fault recovery method from the side of the PaaS management platform. By implementing this method, the PaaS management platform can detect faults at the hardware resource layer, the operating system layer and the application layer of the cloud computing server, analyse these faults together to determine the root cause, and recover using the corresponding fault handling strategy. In the embodiments, after an enterprise's legacy system is migrated to the cloud computing server, the PaaS management platform provides an HA scheme for the legacy system. When the cloud computing server fails, the PaaS management platform can accurately determine whether the fault occurred at the hardware resource layer, the operating system layer or the application layer, and perform the recovery appropriate to that layer. The HA scheme provided by the embodiments is therefore comprehensive.
With reference to the first aspect, in some possible embodiments, a first agent application is also installed on the operating system;
obtaining the operating system fault information of the cloud computing server includes: determining the operating system fault information by detecting heartbeat messages of the first agent application, where the heartbeat messages indicate whether the operating system has failed.
That is, the PaaS management platform installs a first agent application (Agent) on the operating system of every virtual machine of the cloud computing server on which an application (including an application program, application system, enterprise IT system, etc.) is deployed, and the first agent application exchanges heartbeats with the PaaS management platform. The PaaS management platform monitors the heartbeat of each first agent application; when the heartbeat of a first agent application disappears, the corresponding virtual machine (operating system) has suffered a disconnection fault, and the PaaS management platform obtains the corresponding operating system fault information.
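The heartbeat check described above can be sketched roughly as follows. This is only an illustration, not the patent's implementation: the class name `HeartbeatMonitor`, the 30-second timeout and the injectable clock are assumptions, since the text does not specify a heartbeat period or data structure.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from the agent on each virtual machine."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        # timeout_seconds is a hypothetical threshold; the source gives no value.
        self.timeout_seconds = timeout_seconds
        self.clock = clock
        self.last_seen = {}

    def record_heartbeat(self, vm_id):
        """Called whenever a heartbeat arrives from the agent on vm_id."""
        self.last_seen[vm_id] = self.clock()

    def failed_vms(self):
        """Return the VMs whose agent has been silent for longer than the timeout."""
        now = self.clock()
        return sorted(
            vm_id
            for vm_id, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A PaaS-side loop could call `failed_vms()` periodically and emit one piece of operating system fault information per returned virtual machine.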
With reference to the first aspect, in some possible embodiments, a second agent application is also installed on the operating system;
obtaining the application fault information of the cloud computing server includes: calling the state-detection script of the application through the second agent application, and determining the application fault information according to the return value of the state-detection script.
The second agent application and the first agent application may be the same agent application or different agent applications.
The second agent application is likewise deployed at the application layer. It manages the applications on the cloud computing server and periodically monitors the running state of the applications on the virtual machine, for example through a state-detection script provided by the application (application system). In a specific application scenario, the application (application system) provides a state-detection script at run time; the second agent application periodically calls status.sh in the installation directory and obtains its return value. The second agent application judges the running state of the application (application system) from the script's return value and sends that running state to PaaS. When it determines that the application has failed, the second agent application generates application fault information and sends it to PaaS.
It should be understood that when the second agent application and the first agent application are the same agent application, the PaaS management platform can use that one agent application to monitor both the running state of the virtual machine (operating system) and the running state of the application, so the HA scheme provided by the embodiments can be deployed quickly and conveniently.
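As a minimal sketch of the status-script check performed by the second agent application, the function below runs `status.sh` from an installation directory and interprets its exit code. The function name `check_application`, the timeout, and the convention that exit code 0 means "healthy" are assumptions; the text only says the return value of the script is used.

```python
import subprocess
from pathlib import Path


def check_application(install_dir, timeout_seconds=10.0):
    """Run status.sh from the application's install directory and map its exit code."""
    script = Path(install_dir) / "status.sh"
    try:
        result = subprocess.run(
            ["sh", str(script)],
            capture_output=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # Assumption: a script that hangs is treated as an application fault.
        return {"healthy": False, "return_code": None}
    # Assumption: exit code 0 means the application is running normally.
    return {"healthy": result.returncode == 0, "return_code": result.returncode}
```

An agent could call this on a timer and forward any unhealthy result to PaaS as application fault information.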
With reference to the first aspect, in some possible embodiments, determining the root cause of the fault of the cloud computing server from the obtained hardware resource fault information, operating system fault information and application fault information includes at least:
when both the hardware resource fault information and the operating system fault information are detected within a preset time, determining that the root cause is a hardware resource fault; or, when the operating system fault information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, determining that the root cause is an operating system fault; or, when only the application fault information is detected within the preset time, determining that the root cause is an application fault.
It can be seen that when the cloud computing server fails, the fault is no longer handled directly within the layer where it appears but is reported centrally to the PaaS management platform. Because the PaaS management platform holds status information for every layer of the cloud computing server, it has a global view. It therefore analyses all fault information obtained within the preset time (for example, 3 minutes) together and accurately determines the root cause behind the faults. This prevents false alarms and missed alarms, so the HA scheme provided by the embodiments is accurate.
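The three root-cause rules above can be written down as a small decision function. This is a sketch under stated assumptions: the `Layer` enum and the function name are illustrative, and returning `None` for combinations the text does not cover (for example, an operating system fault alone) is a choice made here, not something the source specifies.

```python
from enum import Enum


class Layer(Enum):
    HARDWARE = "hardware"
    OS = "operating_system"
    APPLICATION = "application"


def determine_root_cause(observed):
    """Map the layers that reported faults within the preset time to a root cause.

    Returns None for combinations the text does not cover (an assumption).
    """
    observed = set(observed)
    # Hardware and OS fault information both present: hardware is the root cause.
    if {Layer.HARDWARE, Layer.OS} <= observed:
        return Layer.HARDWARE
    # OS and application fault information present, no hardware fault information.
    if {Layer.OS, Layer.APPLICATION} <= observed and Layer.HARDWARE not in observed:
        return Layer.OS
    # Only application fault information present.
    if observed == {Layer.APPLICATION}:
        return Layer.APPLICATION
    return None
```

Checking the hardware rule first reflects the idea that a lower-layer fault explains the faults reported by the layers above it.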
With reference to the first aspect, in some possible embodiments, determining the fault handling strategy according to the root cause includes:
when the root cause is a hardware resource fault, the fault handling strategy includes restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine; or, when the root cause is an operating system fault, the fault handling strategy includes at least restarting the virtual machine; or, when the root cause is an application fault, the fault handling strategy includes at least restarting the virtual machine and restarting the application.
In one specific application scenario, the PaaS management platform determines the specific fault handling strategy based on the type of fault. For example, a fault diagnosis database can be preset in PaaS. The database stores many kinds of fault information and assigns different fault levels to fault information belonging to the same layer, such as fault level one, fault level two and fault level three. In the preset fault handling strategies for hardware resource layer faults, for example, fault level one corresponds to restarting the virtual machine, fault level two corresponds to rebuilding the virtual machine locally, fault level three corresponds to migrating the virtual machine, and so on. After determining the root cause, PaaS analyses the hardware resource layer fault actually obtained, determines its fault level, and determines the fault handling strategy from that level.
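A fault diagnosis database of the kind just described can be approximated with two lookup tables. The level-to-strategy table follows the text; the fault codes in `FAULT_LEVELS` (`nic_link_flap`, `disk_io_error`, `host_power_loss`) are invented purely for illustration and do not come from the source.

```python
# Level-to-strategy mapping for hardware resource layer faults, as in the text.
HARDWARE_STRATEGY_BY_LEVEL = {
    1: "restart_vm",
    2: "rebuild_vm_locally",
    3: "migrate_vm",
}

# Hypothetical fault codes and their levels; not taken from the source.
FAULT_LEVELS = {
    "nic_link_flap": 1,
    "disk_io_error": 2,
    "host_power_loss": 3,
}


def strategy_for_hardware_fault(fault_code):
    """Look up the fault level of a hardware fault and return the matching strategy."""
    level = FAULT_LEVELS.get(fault_code)
    if level is None:
        raise KeyError(f"fault code not in diagnosis database: {fault_code}")
    return HARDWARE_STRATEGY_BY_LEVEL[level]
```

In a real deployment the tables would presumably be loaded from persistent storage rather than hard-coded.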
In another specific application scenario, PaaS assigns different priorities in advance to the different fault handling strategies of the same layer. After determining from the fault information first received that the root cause is a fault of a certain layer, PaaS automatically selects that layer's highest-priority fault handling strategy as the strategy to execute. If the higher-priority strategy cannot achieve fault recovery, PaaS selects the next lower-priority strategy and repeats the above steps.
For example, when the root cause is a hardware resource fault, the fault handling strategy is to restart the virtual machine; if restarting the virtual machine does not recover the hardware resource fault, the fault handling strategy is to rebuild the virtual machine locally; if neither restarting nor locally rebuilding the virtual machine recovers the hardware resource fault, the fault handling strategy is to migrate the virtual machine.
As another example, when the root cause is an application fault, the fault handling strategy is to restart the application; if restarting the application does not recover the application fault, the fault handling strategy is to restart the virtual machine.
It can be seen that the embodiments provide several fault recovery means for each root cause. When one fault recovery means cannot achieve recovery, another means is tried in turn. This ensures that the cloud computing server recovers from a fault as soon as possible and thus ensures the high availability of the cloud computing platform.
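The priority-based escalation described above amounts to trying strategies in order until one succeeds. The sketch below assumes a caller-supplied `execute(strategy)` callback that returns `True` on successful recovery; the strategy names and the `recover` function are illustrative, and the operating system entry lists only the restart that the text names.

```python
# Strategies per root cause, highest priority first, following the examples in the text.
STRATEGIES_BY_PRIORITY = {
    "hardware": ["restart_vm", "rebuild_vm_locally", "migrate_vm"],
    "operating_system": ["restart_vm"],
    "application": ["restart_application", "restart_vm"],
}


def recover(root_cause, execute):
    """Try each strategy in priority order until one succeeds.

    `execute(strategy)` is a hypothetical callback returning True on recovery.
    Returns (successful_strategy_or_None, list_of_attempted_strategies).
    """
    attempted = []
    for strategy in STRATEGIES_BY_PRIORITY[root_cause]:
        attempted.append(strategy)
        if execute(strategy):
            return strategy, attempted
    # Every strategy failed; the caller would raise an alarm at this point.
    return None, attempted
```

Returning the list of attempted strategies makes it easy to record the recovery history in a fault log.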
With reference to the first aspect, in some possible embodiments, executing the operation indicated by the fault handling strategy includes:
when the root cause is a hardware resource fault, executing the operation indicated by the fault handling strategy includes at least: calling the interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or, when the root cause is an operating system fault, executing the operation indicated by the fault handling strategy includes: calling the interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or, when the root cause is an application fault, executing the operation indicated by the fault handling strategy includes: calling the second agent application to execute the operation indicated by the corresponding fault handling strategy.
It can be seen that when fault recovery is needed, the PaaS management platform initiates it according to the fault handling strategy, and PaaS guarantees the recovery capability. The PaaS management platform can call the IaaS management platform or the agent application to carry out the corresponding recovery, using different recovery means for different root causes, so the capability and efficiency of fault recovery are guaranteed by PaaS. In other words, the HA scheme provided by the embodiments is not limited by the HA capability of IaaS: whatever the HA capability of IaaS, the reliability of the applications running on the cloud computing server can be ensured. The HA scheme provided by the embodiments is therefore general.
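The routing rule above (IaaS interface for hardware and operating system faults, agent application for application faults) can be sketched as a small dispatcher. The callbacks `call_iaas` and `call_agent` are hypothetical stand-ins for the IaaS management platform interface and the second agent application; no real API is implied.

```python
def execute_strategy(root_cause, strategy, call_iaas, call_agent):
    """Route a fault handling strategy to the IaaS interface or to the agent.

    `call_iaas` and `call_agent` are hypothetical callables supplied by the caller.
    """
    if root_cause in ("hardware", "operating_system"):
        # Hardware and OS faults are handled through the IaaS management platform.
        return call_iaas(strategy)
    if root_cause == "application":
        # Application faults are handled by the second agent application.
        return call_agent(strategy)
    raise ValueError(f"unknown root cause: {root_cause}")
```

Keeping the dispatcher independent of any concrete IaaS client mirrors the point made in the text that the scheme does not rely on the HA capability of a particular IaaS.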
With reference to the first aspect, in some possible embodiments, executing the operation indicated by the fault handling strategy further includes:
generating a fault log based on the fault information, archiving the fault log, and reporting the fault log to a network management system, where the fault information includes the hardware resource fault information, the operating system fault information and the application fault information. The fault log records information such as the time and location of the fault, the fault type and the fault recovery history.
When none of the fault handling strategies of the PaaS management platform and the corresponding recovery operations can achieve fault recovery, the PaaS management platform raises an alarm to the network management system and reports the fault log, so that O&M personnel can find the fault through the network management system and carry out manual maintenance in time. This prevents the cloud computing server from staying down because recovery failed, and ensures the high availability of the cloud computing platform.
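One possible shape for the fault log and the alarm sent to the network management system is sketched below. The field names follow the items listed in the text (time, location, fault type, recovery history); the `FaultLog` class, the JSON encoding and the `alarm` label are assumptions, since the source does not define a log or alarm format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FaultLog:
    occurred_at: str          # time the fault occurred
    location: str             # e.g. an identifier of the affected virtual machine
    fault_type: str           # hardware, operating_system or application
    recovery_history: list = field(default_factory=list)


def build_alarm(log):
    """Serialize a fault log into a hypothetical alarm message for the network manager."""
    return json.dumps(
        {"alarm": "fault_recovery_failed", "fault_log": asdict(log)},
        sort_keys=True,
    )
```

The recovery history could simply be the ordered list of strategies that were attempted before the alarm was raised.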
In a second aspect, an embodiment of the present invention provides a device for fault recovery of a cloud computing server, including a fault detection module, a fault analysis module, a fault strategy module and a fault recovery module, to execute the fault recovery method for a cloud computing server provided in the first aspect, where:
the fault detection module is used to obtain the hardware resource fault information sent by the Infrastructure as a Service (IaaS) management platform, where the IaaS management platform manages the hardware resources of the cloud computing server and detects hardware resource fault information of those hardware resources, and the IaaS management platform is independent of the cloud computing server; it is also used to obtain the operating system fault information of the cloud computing server, which indicates a fault of the operating system installed on the cloud computing server; and it is also used to obtain the application fault information of the cloud computing server, which indicates a fault of an application installed on the operating system;
the fault analysis module is used to determine the root cause of the fault of the cloud computing server from the obtained hardware resource fault information, operating system fault information and application fault information;
the fault strategy module is used to determine the fault handling strategy according to the root cause;
the fault recovery module is used to perform fault recovery by executing the operation indicated by the fault handling strategy.
In a third aspect, an embodiment of the present invention provides another device (server) for fault recovery of a cloud computing server, including a memory and, coupled to the memory, a processor, a transmitter and a receiver, where: the transmitter is used to send instructions and data to the outside, the receiver is used to receive data sent from the outside, the memory is used to store program code and related data, and the processor is used to execute the program code stored in the memory to perform a fault recovery method for a cloud computing server, where the method is the method described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a management system that includes an IaaS management platform, a PaaS management platform and a SaaS service platform. The PaaS management platform includes a fault detection module, a fault analysis module, a fault strategy module and a fault recovery module, and the SaaS service platform includes an agent application. The modules of the PaaS management platform are connected to the IaaS management platform through a first communication interface (IF) and to the SaaS service platform through a second IF. The management system is used to implement the fault recovery method for a cloud computing server described in the first aspect.
In a fifth aspect, an embodiment of the present invention provides a computer-readable storage medium that stores instructions (implementation code) which, when run on a computer, cause the computer to execute the method described in the first aspect.
In a sixth aspect, an embodiment of the present invention provides a computer program product including instructions which, when run on a computer, cause the computer to execute the method described in the first aspect.
It can be seen that, by implementing the embodiments, after an enterprise migrates a legacy system to a cloud computing server of a cloud computing platform, PaaS can monitor faults of the hardware resource layer through IaaS and can monitor the running state of the operating system and of the legacy system through the agent application. When PaaS obtains a piece of fault information, it continues to collect other fault information for a preset time (for example, 2 minutes). After the preset time, it analyses all the collected fault information together, determines the root cause of the fault, determines the specific fault handling strategy from the root cause, and then calls IaaS or the agent application to carry out the corresponding fault recovery and fault alarm. This ensures the high availability of the legacy system on the cloud computing platform. The HA scheme of the embodiments is comprehensive, accurate and general.
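The collect-then-analyse behaviour summarized above can be sketched as a time window that opens on the first fault and closes after the preset time. The `FaultWindow` class, its method names and the use of plain numeric timestamps are assumptions; the 120-second default mirrors the 2-minute example in the text.

```python
class FaultWindow:
    """Collects fault reports for a preset time after the first fault arrives."""

    def __init__(self, window_seconds=120.0):
        self.window_seconds = window_seconds
        self.opened_at = None
        self.faults = []

    def add(self, layer, timestamp):
        """Record a fault; the first fault opens the window.

        Returns False if the fault arrives after the window has closed.
        """
        if self.opened_at is None:
            self.opened_at = timestamp
        if timestamp - self.opened_at >= self.window_seconds:
            return False
        self.faults.append(layer)
        return True

    def is_closed(self, now):
        """True once the preset time has elapsed since the first fault."""
        return self.opened_at is not None and now - self.opened_at >= self.window_seconds

    def layers(self):
        """The set of layers that reported faults within the window."""
        return set(self.faults)
```

Once `is_closed` returns true, the set from `layers()` is what a root-cause analysis step would take as input.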
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the background more clearly, the drawings used in the embodiments or the background are described below.
Fig. 1 is a schematic diagram of a cloud computing platform architecture provided by the prior art;
Fig. 2 is a flow diagram of a fault recovery method for a cloud computing server provided by an embodiment of the present invention;
Fig. 3 is a flow diagram of another fault recovery method for a cloud computing server provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of PaaS comprehensively detecting cloud computing server faults, provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of PaaS determining whether the cloud computing server has failed, provided by an embodiment of the present invention;
Fig. 6 is a flow diagram of PaaS selecting a fault handling strategy based on priority, provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of a device for fault recovery of a cloud computing server provided by an embodiment of the present invention;
Fig. 8 is a schematic diagram of another device for fault recovery of a cloud computing server provided by an embodiment of the present invention;
Fig. 9 is a management system provided by an embodiment of the present invention;
Fig. 10 is another management system provided by an embodiment of the present invention.
Detailed description
The embodiments of the present invention are described below with reference to the drawings in the embodiments.
In the current era of rapidly developing internet, cloud and big data technology, cloud computing has become the mainstream computing paradigm for new information systems. Cloud computing is the product of the fusion of a series of network and computing technologies such as parallel computing, distributed computing, utility computing and virtualization. Referring to Fig. 1, which is a schematic diagram of a cloud computing platform architecture provided by the prior art, a cloud computing platform is commonly divided by service level into three service models: Software as a Service (SaaS), Platform as a Service (PaaS) and Infrastructure as a Service (IaaS). PaaS and IaaS can serve platform users directly through a Service-Oriented Architecture (SOA) or a network server, and can also serve end users indirectly as the supporting platform of the SaaS model. Specifically:
For the standalone IaaS (layer I) service model, layer I provides virtual machines and the operating system (OS) on them, together with server virtualization, virtual storage and virtual network resources. The user is generally concerned with the type and configuration of the virtual machine (CPU, memory, disk, network, etc.); the middleware, runtime and applications above the virtual machine's operating system are all deployed by the user. The service IaaS provides to the consumer is the use of all facilities, including processing, storage, network and other basic computing resources, on which the user can deploy and run arbitrary software. The consumer does not manage or control the cloud computing infrastructure, but controls the choice of operating system, storage space and deployed applications, and may obtain limited control over networking components.
For the standalone PaaS (layer P) service model: the service layer P provides is to deploy applications that the user has developed with development languages and development tools onto the cloud computing infrastructure, providing the user with a runtime environment for application software, middleware services, lifecycle management and so on. The customer does not need to manage or control the underlying cloud infrastructure, including the network, servers, storage, operating system, middleware and runtime, but can control the deployed applications (applications, application systems, etc.) and may also control the configuration of the hosting environment in which the applications run. The user usually only needs to focus on developing application software and deploying the related data and applications on PaaS.
For the standalone SaaS (layer S) service model: layer S provides the user with application (application, application system) services running on the cloud computing infrastructure. The user can access them from various devices through a client interface such as a browser, and thus directly use the application services provided by layer S. The user does not need to manage or control any cloud computing infrastructure, including the network, servers, operating system, storage, development environment and applications.
In the prior art, computing resources in a cloud computing platform are quite likely to fail. Cloud computing therefore needs to ensure that the application systems it runs have high availability (High Availability, HA). High availability means that a system is highly reliable, that is, it seldom fails or can be restored quickly after a failure. In other words, when computing resources fail or an application system suffers some other fault on a cloud computing platform, a corresponding HA mechanism must restore the application system as soon as possible and shorten the downtime caused by planned routine maintenance or unplanned system crashes, to avoid interrupting the business and to improve the availability of the application system. A cloud management platform is generally used to provide the HA scheme for a cloud computing platform. At present the most popular cloud management platform is OpenStack, an open source project that aims to provide software for building and managing public and private clouds. Organizations and individuals in the OpenStack community use OpenStack as a common front end for Infrastructure as a Service (IaaS) resources. OpenStack has become the de facto deployment standard for IaaS cloud platforms in industry and academia and is widely used across industries. The primary task of OpenStack is to simplify the deployment of clouds and give them good scalability. From the viewpoint of OpenStack, IaaS is the supporting infrastructure of cloud computing on a cloud computing platform: IaaS provides elastic, scalable infrastructure services and can provide upper-layer applications with large-scale, on-demand computing, storage and network services. The network service of an IaaS cloud platform, as its most central service, is the key factor affecting the quality of all kinds of cloud application services. OpenStack is therefore deployed on the physical computing, storage and network resources at the bottom of the cloud platform, to manage computing, storage and network resources in a unified way and provide a unified IaaS-layer cloud infrastructure service.
OpenStack proposes a virtual machine (Virtual Machine, VM) HA solution. The OpenStack HA scheme is dedicated to failure monitoring and fault recovery at the infrastructure layer, and mainly includes: (1) monitoring (Monitoring): detecting virtualization-layer failures and monitoring compute node failures; (2) isolation (Fencing): isolating failed compute nodes; (3) recovery (Recovery): recovering the faulty virtual machines.
As can be seen from the above description, the existing OpenStack architecture is designed to guard only against IaaS-layer failures, and it realizes the HA of the cloud computing platform by ensuring the HA of the IaaS layer. For the upper-layer operating system or application, OpenStack considers that the application should solve the problem itself, so the existing OpenStack HA scheme cannot detect failures of the upper-layer operating system or application. In fact, up to the present, the OpenStack community still has no complete virtual machine HA solution.
Moreover, for a funnel-shaped legacy system, after an enterprise migrates the legacy system to the cloud computing platform, the legacy system is hard to make compatible with the architectural features of the cloud computing platform, so failures related to the legacy system may occur at the infrastructure layer, the OS layer, and the application layer, and the OpenStack HA scheme cannot fully solve the HA problem of the application. Also, the OpenStack HA scheme obtains only infrastructure-layer fault information and does not combine it with the state of the application for a comprehensive analysis, so it is prone to misjudgment, which can produce new failures. In addition, the OpenStack HA scheme requires the IaaS to be capable of automatic fault recovery. However, the VM HA capabilities of different IaaS management platforms are not consistent, and some IaaS management platforms lack automatic fault recovery altogether. Therefore, the OpenStack HA scheme is not general across all IaaS management platforms.
To overcome the shortcomings of the prior art, embodiments of the present invention provide a fault recovery method for a cloud computing server, a related device, and a management system, which establish a multi-level, comprehensive fault detection and self-handling mechanism across the IaaS layer, the OS layer, and the application layer. This solves the problem of how to keep an enterprise's old legacy system running reliably after it is migrated to a cloud computing platform, and ensures the reliability of the application (the legacy system) to the greatest extent.
Referring to Fig. 9, Fig. 9 shows a management system provided in an embodiment of the present invention. The management system is formed by connecting a SaaS service platform (hereinafter referred to as SaaS), a PaaS management platform (hereinafter referred to as PaaS), and an IaaS management platform (hereinafter referred to as IaaS). The management system can provide corresponding management services for the different levels (I layer, P layer, and S layer). In a specific implementation, the IaaS management platform, the PaaS management platform, and the SaaS service platform may run on separate servers, or they may run on the same server.
Specifically, the IaaS management platform may be a cloud computing infrastructure platform for private, public, or hybrid IaaS cloud users. It can centrally manage very large numbers of servers, storage, and network resources, forming a cloud computing resource pool that can be managed and scheduled in a unified way, and it provides users with computing capacity that is used on demand and scheduled flexibly. The IaaS management platform can integrate multiple virtualization technologies at the same time and provides unified resource management, scheduling, and monitoring for I-layer hardware resources. It manages the full life cycle of virtual machines, storage resources, and network resources from creation and detection to destruction, and it gives I-layer virtual machines a series of high-availability guarantees such as fast creation, elastic expansion, local rebuild, and dynamic migration. It does not, however, provide operating system support for virtual machines.
Specifically, the PaaS management platform may be built on top of the IaaS management platform; that is, the IaaS management platform manages the local hardware resources directly. When the PaaS management platform needs to obtain or use information about the local hardware resources, it can request it from, or call, the IaaS management platform directly. In addition, the PaaS management platform provides end-to-end monitoring and management of applications and related resources, routes requests to valid application instances, and relies on components such as the agent application, the cloud controller, and the health manager to manage and monitor information such as the state and operating parameters of the operating system, applications, and related services.
Specifically, in embodiments of the present invention, the SaaS service platform may be built on the architecture formed by the PaaS management platform and the IaaS management platform. The SaaS service platform focuses only on providing application services to cloud service operators or enterprises; the applications and application systems in the SaaS service platform are managed, monitored, and controlled by the PaaS management platform.
The management system provides an HA scheme for the cloud computing platform (cloud computing server). PaaS is the core of the HA scheme: it has a global information view and coordinates the detection, analysis, strategy, and recovery of failures.
First, when a failure occurs at some level of the cloud computing server, the failure is not handled directly at that level but is uniformly converged to PaaS for comprehensive analysis and judgment. The fault detection range covers the hardware resource level, the operating system level, and the application level of the cloud computing server, so the HA scheme provided in the embodiment of the present invention is comprehensive.
Second, failures of the I layer, the P layer, and the S layer are uniformly converged to PaaS for processing. PaaS combines the states of the hardware resources, the operating system, and the application to perform root cause analysis and accurately determine the root cause of a failure, which prevents false alarms and missed alarms. The HA scheme provided in the embodiment of the present invention is therefore accurate.
Third, PaaS initiates fault recovery based on a fault handling strategy and applies different recovery measures to different root causes, so the capability and efficiency of fault recovery are ensured by PaaS. In other words, the HA scheme provided in the embodiment of the present invention is not constrained by the HA capability of the IaaS: whatever the HA capability of the IaaS, the reliability of the applications running on the cloud computing server can be ensured. The HA scheme provided in the embodiment of the present invention is therefore general.
An embodiment of the present invention also provides a fault recovery method for a cloud computing server. Referring to Fig. 2, the fault recovery method includes:
Step S101: Obtain hardware resource fault information, operating system fault information, and application fault information of the cloud computing server.
In embodiments of the present invention, PaaS is designed as the fault management core of the entire cloud computing platform (cloud computing server). PaaS sits between IaaS and SaaS; it can collect the cloud service data that PaaS itself manages and can also collect the data submitted by IaaS and SaaS. The PaaS is independent of the cloud computing server.
In embodiments of the present invention, PaaS obtains the fault information of the cloud computing server. The fault information specifically includes hardware resource fault information, operating system fault information, and application fault information.
Hardware resource fault information indicates failures at the hardware resource level, such as insufficient storage resources, network abnormalities, and virtual machine operation faults. Operating system fault information indicates failures at the operating system (OS) level, such as abnormal operating system login and system hangs. Application fault information indicates failures of applications, such as an application stopping or an application system becoming abnormal.
Specifically, PaaS obtains the fault information of the cloud computing server by performing the following steps S201-S203:
Step S201: Obtain the hardware resource fault information sent by the Infrastructure as a Service (IaaS) system.
In embodiments of the present invention, IaaS manages hardware resources, including computing resources, storage resources, and network resources. IaaS also detects failures arising in the hardware resources of the cloud computing server. The IaaS is independent of the cloud computing server.
In embodiments of the present invention, IaaS can monitor local hardware resources in real time and can dynamically display the running status of computing resources, storage resources, network resources, and the associated virtual machines. Specifically, IaaS can perform resource capacity queries, resource usage control, VM running-state monitoring, fault alarming, and so on, and report the relevant information to PaaS.
In a specific embodiment, when a virtual machine (VM) running on the cloud computing server fails, or when related hardware (CPU, memory, disk, network, etc.) in the cloud computing server fails, IaaS detects the failure in real time, generates the corresponding hardware resource fault information, and sends it to PaaS. PaaS thereby obtains the hardware resource fault information.
Step S202: Obtain the operating system fault information of the cloud computing server.
PaaS manages the running environment and middleware services of application software and provides life cycle management for the applications that need to run. PaaS can obtain status information about the operating system on which middleware, applications, and the like depend. For an operating system running on a virtual machine, when a failure such as loss of connection or a system crash occurs in the operating system, PaaS can obtain the relevant operating system fault information. In a concrete implementation, PaaS can deploy an agent (Agent) in the operating system to be monitored; PaaS communicates with the Agent and judges the running state of the OS where the Agent resides by checking the communication quality.
Step S203: Obtain the application fault information of the cloud computing server.
In embodiments of the present invention, SaaS focuses only on providing application services (application software, applications, application systems, etc.) and does not directly manage or monitor them; the role of managing and monitoring the applications is actually played by PaaS. When an application in the SaaS fails, PaaS detects the failure in real time and obtains the corresponding application fault information.
It should be noted that steps S201, S202, and S203 need not be performed in any particular order. In a specific embodiment, any two of steps S201, S202, and S203 may be performed at the same time, or all three may be performed at the same time. The description of the above embodiment should not be construed as limiting the present invention.
Step S102: Determine the root cause of the failure of the cloud server according to the hardware resource fault information, the operating system fault information, and the application fault information.
After PaaS obtains a piece of fault information, it determines the source of the fault information, sets a timer for it, and continues to check whether other fault information arrives within a preset time (for example, 3 minutes). When the preset time of the timer expires, PaaS performs a comprehensive analysis of all the fault information obtained within the preset time to determine the root cause of the failure of the cloud server, that is, the specific reason for the failure and the specific location where it occurred.
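The collection window described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `FaultEvent` type, the source labels `"hardware"`, `"os"`, and `"application"`, and the function name `collect_window` are all assumptions introduced here, and the window is computed from recorded timestamps instead of a live timer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultEvent:
    source: str       # assumed labels: "hardware", "os", "application"
    timestamp: float  # seconds on any consistent clock


def collect_window(events, window_seconds=180.0):
    """Return the fault sources seen within the window opened by the first event.

    The default of 180 seconds mirrors the 3-minute example in the text.
    """
    if not events:
        return set()
    ordered = sorted(events, key=lambda e: e.timestamp)
    # The window starts when the first fault information arrives.
    deadline = ordered[0].timestamp + window_seconds
    return {e.source for e in ordered if e.timestamp <= deadline}
```

For example, an application fault at t=10 s followed by an OS fault at t=100 s falls into one window, while a hardware fault at t=500 s would be left for a later window.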
Referring to Table 1, Table 1 shows, for a specific embodiment of the present invention, the correspondence between the fault information that PaaS obtains within the preset time and the root cause that PaaS determines.
Table 1
Note: √ indicates a healthy state, × indicates that a failure was detected, and NA indicates no detection information.
It can be seen that a detection result at PaaS falls into one of 3 cases: healthy state, failure detected, or no detection information reported. The root cause analysis covers the following situations:
When PaaS detects hardware resource fault information and operating system fault information within the preset time, PaaS determines that the root cause is a failure of the hardware resource layer of the cloud computing server;
When PaaS detects hardware resource fault information and application fault information within the preset time, PaaS determines that the root cause is a failure of the hardware resource layer of the cloud computing server;
When PaaS detects operating system fault information and application fault information within the preset time, and does not detect hardware resource fault information within the preset time, PaaS determines that the root cause is a failure of the operating system (OS) of the cloud computing server; or
When PaaS detects only application fault information within the preset time, without detecting hardware resource fault information or operating system fault information, PaaS determines that the root cause is a failure of the application layer of the cloud computing server.
When PaaS detects only operating system fault information, or only hardware resource fault information, or only hardware resource fault information and application fault information, and so on, PaaS judges that the fault information occurring within the preset time is a false alarm of the cloud computing platform (cloud computing server). In these cases, PaaS ignores the fault information.
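The case analysis above amounts to a lookup from the set of fault sources seen in the window to a root cause. The sketch below is one possible encoding under stated assumptions: the labels and the `root_cause` function are hypothetical, only a subset of the combinations listed above is encoded (hardware plus OS, OS plus application, application only, hardware only, OS only), and every other combination returns `None` so that the sketch takes no position on it.

```python
from typing import Iterable, Optional

# Labels are assumptions of this sketch, not names used by the patent.
HARDWARE, OS, APPLICATION = "hardware", "os", "application"
FALSE_ALARM = "false_alarm"

# Detected fault sources -> root cause, for the combinations encoded here.
_RULES = {
    frozenset({HARDWARE, OS}): HARDWARE,   # hardware resource layer failure
    frozenset({OS, APPLICATION}): OS,      # operating system failure
    frozenset({APPLICATION}): APPLICATION, # application layer failure
    frozenset({HARDWARE}): FALSE_ALARM,    # ignored by PaaS
    frozenset({OS}): FALSE_ALARM,          # ignored by PaaS
}


def root_cause(detected: Iterable[str]) -> Optional[str]:
    """Return the root cause for the detected sources, or None if not encoded."""
    return _RULES.get(frozenset(detected))
```

Keeping the rules in a table rather than nested conditionals makes it straightforward to align the mapping with Table 1 once its full contents are available.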
Step S103: Determine a fault handling strategy according to the root cause.
After the root cause has been determined in step S102, PaaS determines the corresponding fault handling strategy according to the root cause.
Referring to Table 2, Table 2 shows some correspondences between root causes and fault handling strategies in the embodiment of the present invention.
Table 2
It can be seen that, in a concrete application scenario, when the root cause is a hardware resource failure, the fault handling strategy includes at least restarting (reboot) the virtual machine, locally rebuilding (rebuild) the virtual machine, and migrating (migration) the virtual machine.
It can be seen that, in a concrete application scenario, when the root cause is an operating system failure, the fault handling strategy includes at least restarting the virtual machine. In that case, when the virtual machine restarts it loads the corresponding operating system, which completes the restart of the operating system. In the special case where the operating system can be restarted without restarting the virtual machine, the fault handling strategy is to restart the operating system directly.
It can be seen that, in a concrete application scenario, when the root cause is an application-layer failure, the fault handling strategy includes at least restarting the application and restarting the virtual machine. Restarting the application means restarting the relevant application directly. Restarting the virtual machine means first restarting the virtual machine where the application resides; the virtual machine then loads the corresponding operating system, and the application is run again in that operating system. In the special case where the operating system can be restarted without restarting the virtual machine, the fault handling strategy is to restart the operating system directly and then run the application again in the operating system.
Step S104: Perform fault recovery according to the operation indicated by the fault handling strategy.
It should be understood that, after the fault handling strategy is determined, PaaS can recover from the failure based on the operation indicated by the fault handling strategy.
In embodiments of the present invention, IaaS is responsible for managing and controlling hardware resources, including adjusting virtual machine CPU and memory, expanding disks, and restarting, locally rebuilding, and dynamically migrating virtual machines. This ensures the continuity of virtual machine services to the greatest extent, so as to reduce or even eliminate the service impact of virtual machine failures.
Therefore, when the operation indicated by the fault handling strategy is an operation on the hardware resource layer, PaaS issues an instruction to IaaS. The instruction includes the operation indicated by the fault handling strategy, and IaaS executes the instruction to recover from the failure.
For example, when the fault handling strategy is to restart the virtual machine, PaaS calls the IaaS interface to restart the virtual machine and then checks the virtual machine state to judge whether the failure has been recovered.
When the fault handling strategy is to locally rebuild the virtual machine, and PaaS determines that the system disk of the virtual machine on the cloud computing server is a shared disk, PaaS calls the IaaS interface to rebuild the virtual machine locally on the shared disk and then checks the task status of the virtual machine to judge whether the failure has been recovered.
When the fault handling strategy is to migrate the virtual machine, PaaS calls the IaaS interface to migrate the virtual machine on the failed cloud computing server to another host.
When the fault handling strategy is to restart the operating system, PaaS calls the IaaS interface to restart the virtual machine, and the virtual machine loads the corresponding operating system after restarting.
In the special case where the operating system can be restarted without restarting the virtual machine, PaaS calls the IaaS interface to restart the operating system directly.
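The call-then-verify pattern in the examples above (PaaS calls an IaaS interface, then checks the virtual machine state) can be sketched as below. The text does not name any IaaS interface, so the method names `reboot_vm`, `rebuild_vm`, `migrate_vm`, and `vm_state`, the `"ACTIVE"` state string, and the `FakeIaaS` stand-in are all hypothetical and exist only to make the sketch runnable.

```python
class FakeIaaS:
    """In-memory stand-in for an IaaS interface (hypothetical method names)."""

    def __init__(self, working_operations):
        self.working_operations = set(working_operations)
        self.states = {}
        self.calls = []

    def _apply(self, operation, vm_id):
        # Record the call and pretend the VM recovers only for "working" operations.
        self.calls.append((operation, vm_id))
        recovered = operation in self.working_operations
        self.states[vm_id] = "ACTIVE" if recovered else "ERROR"

    def reboot_vm(self, vm_id):
        self._apply("reboot_vm", vm_id)

    def rebuild_vm(self, vm_id):
        self._apply("rebuild_vm", vm_id)

    def migrate_vm(self, vm_id):
        self._apply("migrate_vm", vm_id)

    def vm_state(self, vm_id):
        return self.states.get(vm_id, "ERROR")


def execute_strategy(iaas, strategy, vm_id):
    """Issue the operation to IaaS, then check the VM state to judge recovery."""
    operations = {
        "reboot_vm": iaas.reboot_vm,
        "rebuild_vm": iaas.rebuild_vm,
        "migrate_vm": iaas.migrate_vm,
    }
    operations[strategy](vm_id)
    return iaas.vm_state(vm_id) == "ACTIVE"
```

A real deployment would replace `FakeIaaS` with a client for the IaaS management platform in use; the separation keeps the PaaS-side logic independent of any particular IaaS.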
As can be seen, by implementing the embodiment of the present invention, after an enterprise migrates an application (such as an application program, application system, IT system, or legacy system) to a cloud computing server of the cloud computing platform, PaaS can monitor hardware-resource-layer failures through IaaS and can monitor the running state of the operating system and of the application through the agent application. When PaaS obtains fault information, it continues to collect other fault information within the preset time. After the preset time, it performs a comprehensive analysis of all the collected fault information, determines the root cause of the failure, determines a specific fault handling strategy based on the root cause, and then calls IaaS or the agent application to perform the corresponding fault recovery. This ensures the high availability of the application on the cloud computing platform. The HA scheme of the embodiment of the present invention has the characteristics of comprehensiveness, accuracy, and generality.
Referring to Fig. 3 to Fig. 6 together, Fig. 3 shows another fault recovery method for a cloud computing server provided in an embodiment of the present invention. The method includes but is not limited to the following steps:
Step S301: IaaS detects hardware resource fault information and sends the hardware resource fault information to PaaS.
In a specific embodiment, referring to Fig. 4, IaaS monitors the hardware resources of the cloud computing server to determine whether the I layer of the cloud computing server has failed. When a hardware resource fails, IaaS reports hardware resource fault information to PaaS. For example, after an application (application program, application system, enterprise IT system, etc.) is deployed on virtual machines allocated by IaaS, administrators register information such as the tenant, user account, and virtual machine addresses of the IaaS with PaaS as needed, so that PaaS can provide a high-availability arrangement for the IaaS. For example, when the application is an enterprise's funnel-shaped legacy system, administrators register the relevant information of the enterprise's legacy system on the IaaS with PaaS so that the legacy system obtains high availability. After PaaS identifies the relevant information, it automatically subscribes to the IaaS for fault alarms of the cloud computing server (host) and the virtual machines where the legacy system resides. IaaS monitors the running state of the virtual machines and hosts in real time. After detecting a failure, IaaS generates the corresponding hardware resource fault information (such as VM running-state abnormality information) and reports it to PaaS so that PaaS can carry out subsequent fault handling.
Step S302: PaaS obtains operating system fault information by detecting the heartbeat message of a first agent application; the heartbeat message is used to indicate the operating system fault information.
The agent application in step S302 is called the first agent application only to distinguish it from the second agent application in step S303.
As shown in Fig. 4, PaaS installs an agent application (Agent Application) on all virtual machines on which applications are deployed; that is, the first agent application is deployed in the OS layer of the cloud computing server. The agent application maintains a heartbeat with PaaS, and PaaS periodically checks the heartbeat of the first agent application to judge whether the OS layer has failed. When the heartbeat of a first agent application disappears, PaaS sends a heartbeat request. If the first agent application still cannot return a response to the heartbeat request in time, PaaS has lost contact with the operating system (or virtual machine) where that first agent application resides, so PaaS generates the corresponding operating system fault information.
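The two-stage check described above (a missing heartbeat followed by an explicit heartbeat request) can be sketched as a small function. The function name, the `probe` callback that stands in for the heartbeat request, the timeout parameter, and the shape of the returned fault record are assumptions made for illustration only.

```python
def check_heartbeat(last_heartbeat, now, timeout, probe):
    """Return OS fault information if the agent is unreachable, otherwise None.

    last_heartbeat / now: timestamps in seconds on the same clock.
    timeout: how long PaaS tolerates silence before probing (assumed parameter).
    probe: callable that sends a heartbeat request and returns True if the
           agent answered in time (hypothetical hook).
    """
    if now - last_heartbeat <= timeout:
        return None  # the periodic heartbeat is still fresh
    if probe():
        return None  # the agent answered the explicit heartbeat request
    # No heartbeat and no response: PaaS has lost contact with the OS (or VM).
    return {"level": "os", "reason": "heartbeat_lost", "detected_at": now}
```

Probing before raising a fault avoids reporting an operating system failure for a single delayed or dropped heartbeat.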
Step S303: PaaS detects the running state of the application through a second agent application to obtain the application fault information.
Specifically, the application is deployed in the application layer of the cloud computing server and is integrated by SaaS into a cloud service, so that the cloud service can be provided to cloud service operators or enterprises.
The agent application in step S303 is called the second agent application only to distinguish it from the agent application in step S302. The second agent application is likewise deployed on all the virtual machines, and the applications on each virtual machine node are managed by the agent application. In a specific embodiment, the second agent application and the first agent application may be the same application or different applications; the embodiment of the present invention does not limit this.
As shown in Fig. 4, the second agent application is likewise deployed in the application layer. It can manage the applications in the cloud computing server and periodically monitor the running state of the applications on the virtual machine; for example, the second agent application monitors the running state through a state detection script provided by the application. In a specific application scenario, the application dynamically provides a state detection script while it runs, for example status.sh. status.sh defines return value 1 to indicate that the application is running, return value 2 to indicate that the application has stopped, and return value 3 to indicate that the application is in an abnormal condition. status.sh is placed in the installation directory of the application (application system). The second agent application periodically calls status.sh in the installation directory and obtains the corresponding return value. It can be understood that the second agent application judges the running state of the application according to the return value of the script and sends the corresponding running state to SaaS. When the return value is 2 or 3, the second agent application generates application fault information and sends it to PaaS. PaaS thereby obtains the application fault information.
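How the second agent application might interpret status.sh can be sketched as follows. The return values 1, 2, and 3 come from the description above; everything else is an assumption of this sketch, in particular that the script's return value is delivered as its process exit status, that the script is run with `sh`, and the function names.

```python
import subprocess

# Return values defined for status.sh in the text.
STATUS_BY_RETURN_VALUE = {1: "running", 2: "stopped", 3: "abnormal"}


def interpret_status(return_value):
    """Map a status.sh return value to a running state ("unknown" if undefined)."""
    return STATUS_BY_RETURN_VALUE.get(return_value, "unknown")


def is_application_fault(return_value):
    """The agent generates application fault information for return values 2 and 3."""
    return return_value in (2, 3)


def run_status_script(script_path):
    """Run the script and return its exit status.

    Assumption: the "return value" in the text is the script's exit status.
    """
    return subprocess.run(["sh", script_path], check=False).returncode
```

An agent loop would call `run_status_script` on a timer, pass the result to `interpret_status`, and report to PaaS whenever `is_application_fault` is true.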
Step S304: PaaS determines the root cause of the failure of the cloud server according to the hardware resource fault information, the operating system fault information, and the application fault information.
In embodiments of the present invention, PaaS obtains the fault information of all levels, so PaaS can combine the fault information of the hardware resource layer, the OS layer, and the application layer in a comprehensive analysis to obtain the root cause accurately.
Referring to Fig. 5, PaaS defines the initial working state of the cloud computing server as the normal state. When PaaS receives fault information, it sets a preset time T (for example, 2 minutes) starting from the time point at which the fault information was received and continues to collect other fault information within the preset time T. After the preset time, PaaS makes a comprehensive judgment based on all the fault information obtained.
If the obtained fault information satisfies a preset condition, PaaS defines the working state of the cloud computing server as the fault state and further determines the root cause. If the obtained fault information does not satisfy the preset condition, PaaS continues to define the working state of the cloud computing server as the normal state.
As shown in Fig. 5, if, after the preset time T, the fault information obtained by PaaS contains both a VM health abnormality and a heartbeat loss, PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a failure of the hardware resource layer. If, after the preset time T, the fault information obtained by PaaS contains only a VM health abnormality or only a heartbeat loss, PaaS discards the fault information and defines the working state of the cloud computing server as the normal state.
Likewise, if, after the preset time T, the fault information obtained by PaaS contains both an application abnormality and a heartbeat loss, PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a failure of the OS layer. If, after the preset time T, the fault information obtained by PaaS contains only an application abnormality or only a heartbeat loss, PaaS discards the fault information and defines the working state of the cloud computing server as the normal state.
In addition, if, after the preset time T, the fault information obtained by PaaS contains only an application abnormality and no other fault information, PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a failure of the application layer.
Step S305: PaaS determines a fault handling strategy according to the root cause.
In a specific application scenario, PaaS determines the specific fault handling strategy based on the type of failure. For example, a fault diagnosis database can be preset in PaaS. The database stores many kinds of fault information and assigns different fault levels to fault information belonging to the same layer, such as fault level one, fault level two, and fault level three. For example, in the preset fault handling strategies for hardware-resource-layer failures, the strategy corresponding to fault level one is to restart the virtual machine, the strategy corresponding to fault level two is to locally rebuild the virtual machine, the strategy corresponding to fault level three is to migrate the virtual machine, and so on. After the root cause is determined, PaaS analyzes the hardware-resource-layer failure actually obtained, determines its fault level, and determines the corresponding fault handling strategy based on that fault level.
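The fault-level lookup described above can be sketched with two small tables. The level-to-strategy mapping follows the text; the entries of the diagnosis database (`vm_hang`, `system_disk_corrupted`, `host_down`) are invented placeholders, since the text does not list which failures belong to which level, and the function name is hypothetical.

```python
# Fault level -> strategy for hardware-resource-layer failures (from the text).
HARDWARE_STRATEGY_BY_LEVEL = {
    1: "reboot_vm",   # fault level one: restart the virtual machine
    2: "rebuild_vm",  # fault level two: locally rebuild the virtual machine
    3: "migrate_vm",  # fault level three: migrate the virtual machine
}

# Stand-in for the preset fault diagnosis database.
# The fault types and their levels are illustrative assumptions only.
FAULT_DIAGNOSIS_DB = {
    "vm_hang": 1,
    "system_disk_corrupted": 2,
    "host_down": 3,
}


def hardware_strategy_for(fault_type):
    """Look up the fault level of a hardware-layer failure, then its strategy."""
    level = FAULT_DIAGNOSIS_DB[fault_type]
    return HARDWARE_STRATEGY_BY_LEVEL[level]
```

Separating the level assignment from the level-to-strategy mapping lets operators reclassify a failure without touching the recovery logic.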
In another specific application scenario, PaaS assigns different priorities in advance to the different fault handling strategies of the same layer. After determining, based on the fault information first received, that the root cause is a failure of a certain layer, PaaS automatically selects the highest-priority fault handling strategy of that layer as the strategy to execute. If the higher-priority fault handling strategy fails to achieve fault recovery, PaaS selects the next lower-priority fault handling strategy, and repeats the above steps.
For example, in a concrete application scenario, referring to Fig. 6, the fault handling strategies for the hardware resource layer are, in descending order of priority: restart the virtual machine, locally rebuild the virtual machine, migrate the virtual machine, and report to the network management system. After determining that the root cause is a hardware resource failure, PaaS selects restarting the virtual machine (the highest priority in this layer) as the fault handling strategy. If restarting the virtual machine fails to achieve fault recovery, PaaS reselects locally rebuilding the virtual machine as the fault handling strategy. If locally rebuilding the virtual machine fails to achieve fault recovery, PaaS reselects migrating the virtual machine as the fault handling strategy. If migrating the virtual machine fails to achieve fault recovery, PaaS reselects reporting to the network management system as the fault handling strategy, and ends the fault recovery flow after performing the report.
Reporting to the network management system specifically includes the following: the PaaS generates a fault log based on the fault information and archives the fault log, where the fault log indicates information such as the time and location at which the fault occurred, the fault type, and the fault recovery history. The PaaS reports the fault log to the network management system so that operation and maintenance personnel can promptly discover the fault through the network management system and carry out manual maintenance.
As another example, in another specific application scenario, the troubleshooting strategies corresponding to the operating system include at least restarting the virtual machine and reporting to the network management system. When the root cause of the failure is an operating system failure, the PaaS selects restarting the virtual machine (the highest priority in this layer) as the troubleshooting strategy. If restarting the virtual machine cannot achieve fault recovery, the PaaS reselects the strategy as reporting to the network management system, and terminates the above fault recovery flow after performing the report operation.
As a further example, in another specific application scenario, the troubleshooting strategies corresponding to the application are, in descending order of priority: restart the application, restart the virtual machine, and report to the network management system. When the root cause of the failure is an application layer failure, the PaaS selects restarting the application (the highest priority in this layer) as the troubleshooting strategy. If restarting the application cannot achieve fault recovery, the PaaS reselects the strategy as restarting the virtual machine. If restarting the virtual machine cannot achieve fault recovery, the PaaS reselects the strategy as reporting to the network management system, and terminates the above fault recovery flow after performing the report operation.
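The three priority ladders above share one control flow: try each strategy in descending priority, stop at the first one that recovers the fault, and report to the network management system if none does. The following is a minimal sketch of that loop. The names `LADDERS`, `recover`, and the strategy identifiers are hypothetical, and the `execute` and `report` callables stand in for whatever the PaaS actually invokes.

```python
from typing import Callable, Dict, List

# Strategies per layer in descending priority, as listed in the text.
# Reporting to the network management system is the common last resort.
LADDERS: Dict[str, List[str]] = {
    "hardware": ["restart_vm", "rebuild_vm_locally", "migrate_vm"],
    "os": ["restart_vm"],
    "application": ["restart_app", "restart_vm"],
}


def recover(
    layer: str,
    execute: Callable[[str], bool],
    report: Callable[[str], None],
) -> str:
    """Try each strategy for the layer in priority order.

    `execute` returns True when the fault is judged recovered.
    Returns the strategy that succeeded, or "report_to_nms" when every
    strategy failed and the fault was reported instead.
    """
    for strategy in LADDERS[layer]:
        if execute(strategy):
            return strategy
    report(layer)
    return "report_to_nms"
```

Because the ladders are data rather than code, adding a layer or reordering priorities would not change the loop itself.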
Step S306: the PaaS sends a fault recovery instruction to the IaaS, where the fault recovery instruction includes the determined troubleshooting strategy, and correspondingly the IaaS performs the operation indicated by the troubleshooting strategy to carry out fault recovery at the I layer; or the PaaS sends a fault recovery instruction to the second agent application of the SaaS, where the fault recovery instruction includes the determined troubleshooting strategy, and correspondingly the second agent application performs the operation indicated by the troubleshooting strategy to carry out fault recovery at the S layer.
Specifically, the PaaS sends a fault recovery instruction to the IaaS, where the fault recovery instruction includes the determined I-layer troubleshooting strategy. The operations that the IaaS performs according to the troubleshooting strategy to carry out I-layer fault recovery include restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine, as shown in Fig. 6. After the IaaS performs the above operations, if the PaaS determines that the fault has been recovered, the above operation flow is terminated. If, after the IaaS performs the above operations, the PaaS determines that the fault has not been recovered, the PaaS performs the operation of reporting to the network management system.
Specifically, the PaaS sends a fault recovery instruction to the second agent application, where the fault recovery instruction includes the determined S-layer troubleshooting strategy. The operation that the second agent application performs according to the troubleshooting strategy to carry out S-layer fault recovery includes restarting the application. After the second agent application performs this operation, if the PaaS determines that the fault has been recovered, the above operation flow is terminated. If the PaaS determines that the fault has not been recovered, the PaaS instructs the second agent application to restart the application again after the virtual machine has been restarted. If the fault still has not been recovered, the PaaS performs the operation of reporting to the network management system.
It can be seen that, by implementing the embodiments of the present invention, after an enterprise application is migrated to a cloud computing server of a cloud computing platform, the PaaS can monitor failures of the hardware resource layer through the IaaS, and can monitor the running state of the operating system and the running state of the legacy system through the agent applications. When the PaaS obtains a piece of fault information, it continues to obtain other fault information within a preset time. After the preset time elapses, the PaaS performs a comprehensive analysis based on all the collected fault information, determines the root cause of the failure, determines a specific troubleshooting strategy based on the root cause, and then calls the IaaS or an agent application to carry out the corresponding fault recovery. If none of the troubleshooting strategies achieves fault recovery, the PaaS raises a fault alarm to the network management system so that further maintenance can be carried out. This ensures the high availability of the legacy system on the cloud computing platform. The HA scheme of the embodiments of the present invention is comprehensive, accurate, and general-purpose.
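The "preset time" behaviour summarized above, where the first fault message opens a window and later messages are pooled before analysis, can be sketched as follows. This is an assumption-laden illustration: the function name `collect_window`, the `(timestamp, layer)` event shape, and the layer labels are hypothetical, and a real implementation would read from a live message source rather than a list.

```python
from typing import Iterable, Set, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, layer that reported a fault)


def collect_window(events: Iterable[Event], window_seconds: float) -> Set[str]:
    """Return the layers whose fault information arrived inside the window.

    The window starts at the first event and lasts `window_seconds`.
    Events are assumed to be sorted by timestamp; anything arriving
    after the window closes is left for a later analysis round.
    """
    ordered = list(events)
    if not ordered:
        return set()
    start = ordered[0][0]
    return {layer for ts, layer in ordered if ts - start <= window_seconds}
```

The resulting set of layers is the input that a root cause analysis step would then work on.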
Based on the same inventive concept, an embodiment of the present invention provides a device 70 for implementing fault recovery of a cloud computing server. Referring to Fig. 7, the control node 70 includes a transmitter 703, a receiver 704, a memory 702, and a processor 701 coupled to the memory 702. The transmitter 703, the receiver 704, the memory 702, and the processor 701 may be connected by a bus or in another manner (Fig. 7 takes connection by a bus as an example). Wherein:
The processor 701 may be one or more central processing units (CPUs). Fig. 7 takes one processor as an example. When the processor 701 is a single CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory 702 includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM). The memory 702 is used to store related instructions and data, and is also used to store program code, where the program code is specifically used to implement the functions of the control node in the embodiments of Fig. 5 or Fig. 8;
The transmitter 703 is used to send instructions and data to the outside;
The receiver 704 is used to receive data from the outside;
Specifically, the processor 701 is used to call the program code stored in the memory 702 and perform the following steps:
Obtain, using the receiver 704, hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, where the IaaS management platform is used to manage the hardware resources of the cloud computing server and is also used to detect the hardware resource fault information of the hardware resources, and the IaaS management platform is independent of the cloud computing server;
Obtain, using the receiver 704, operating system failure information of the cloud computing server, where the operating system failure information indicates a failure of the operating system installed on the cloud computing server;
Obtain, using the receiver 704, application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
The processor 701 determines the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system failure information, and application fault information;
The processor 701 determines a troubleshooting strategy according to the root cause of the failure;
Use the transmitter 703 to have the operation indicated by the troubleshooting strategy performed so as to carry out fault recovery.
Specifically, the operating system further has a first agent application;
Obtaining the operating system failure information of the cloud computing server using the receiver 704 includes: using the receiver 704 to detect heartbeat messages of the first agent application and thereby determine the operating system failure information, where the heartbeat message indicates whether the operating system has failed.
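One common way to turn heartbeat messages into a failure signal is a timeout: if no heartbeat has arrived for longer than some threshold, the sender is presumed failed. The text does not say how the heartbeat is evaluated, so the timeout rule, the class name `HeartbeatMonitor`, and its methods are all assumptions made for this sketch.

```python
from typing import Optional


class HeartbeatMonitor:
    """Tracks heartbeats from the first agent application (hypothetical API)."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen: Optional[float] = None

    def beat(self, now: float) -> None:
        """Record that a heartbeat message was received at time `now`."""
        self.last_seen = now

    def os_failed(self, now: float) -> bool:
        """Presume an operating system failure once heartbeats stop.

        Assumption: before the first heartbeat is ever seen, no failure
        is reported, since the agent may still be starting up.
        """
        if self.last_seen is None:
            return False
        return now - self.last_seen > self.timeout_seconds
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the check deterministic and easy to exercise in isolation.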
Specifically, the operating system further has a second agent application;
Obtaining the application fault information of the cloud computing server using the receiver 704 includes: using the receiver 704 to call, through the second agent application, the state detection script of the application, and determining the application fault information according to the return value of the state detection script.
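A state detection script whose return value signals application health could be invoked as in the sketch below. The convention that exit code 0 means healthy, the timeout handling, the function name `check_application`, and the stand-in `EXAMPLE_COMMAND` are assumptions for illustration; the specification does not define the script or its return values.

```python
import subprocess
import sys
from typing import Optional, Sequence

# Hypothetical stand-in for a real state detection script: a tiny Python
# process that exits with code 0, which this sketch treats as "healthy".
EXAMPLE_COMMAND = [sys.executable, "-c", "raise SystemExit(0)"]


def check_application(
    command: Sequence[str], timeout_seconds: float = 10.0
) -> Optional[str]:
    """Run the state detection script and interpret its return value.

    Returns None when the application looks healthy, otherwise a short
    description that could serve as application fault information.
    """
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return "state detection script timed out"
    if result.returncode == 0:
        return None
    return f"state detection script returned {result.returncode}"
```

Treating a hung script as a fault, rather than waiting indefinitely, is a design choice of this sketch and not something the text prescribes.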
The processor 701 determining the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system failure information, and application fault information includes at least the following: when both the hardware resource fault information and the operating system failure information are detected within a preset time, the processor 701 determines that the root cause is a failure of the hardware resource; or when the operating system failure information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, the processor 701 determines that the root cause is a failure of the operating system; or when only the application fault information is detected within the preset time, the processor 701 determines that the root cause is a failure of the application.
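The three root cause rules above translate directly into a small decision function. In this sketch the layer labels and the function name `determine_root_cause` are hypothetical, and combinations that the text does not cover (for example, hardware fault information alone) return `None` rather than guessing.

```python
from typing import Optional, Set


def determine_root_cause(detected: Set[str]) -> Optional[str]:
    """Map the fault information seen within the preset time to a root cause.

    `detected` holds the layers that reported a fault:
    "hardware", "os", and/or "application".
    """
    hardware = "hardware" in detected
    operating_system = "os" in detected
    application = "application" in detected

    # Rule 1: hardware and operating system fault information both detected.
    if hardware and operating_system:
        return "hardware"
    # Rule 2: operating system and application detected, hardware not detected.
    if operating_system and application and not hardware:
        return "os"
    # Rule 3: only application fault information detected.
    if application and not hardware and not operating_system:
        return "application"
    # Combination not described in the text.
    return None
```

The rules reflect the layering: a fault in a lower layer tends to surface in the layers above it, so the lowest layer that reported is taken as the cause.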
The processor 701 determining the troubleshooting strategy according to the root cause of the failure includes:
When the root cause is a failure of the hardware resource, the troubleshooting strategy includes restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine; or when the root cause is a failure of the operating system, the troubleshooting strategy includes at least restarting the virtual machine; or when the root cause is a failure of the application, the troubleshooting strategy includes at least restarting the virtual machine and restarting the application.
Specifically, when the root cause is a failure of the hardware resource, the troubleshooting strategy includes restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine, as follows: when the root cause is a failure of the hardware resource, the troubleshooting strategy is restarting the virtual machine; when restarting the virtual machine cannot achieve recovery from the hardware resource failure, the troubleshooting strategy is locally rebuilding the virtual machine; when neither restarting the virtual machine nor locally rebuilding the virtual machine can achieve recovery from the hardware resource failure, the troubleshooting strategy is migrating the virtual machine.
Specifically, when the root cause is a failure of the application, the troubleshooting strategy includes at least restarting the virtual machine and restarting the application, as follows: when the root cause is a failure of the application, the troubleshooting strategy is restarting the application; when restarting the application cannot achieve recovery from the application failure, the troubleshooting strategy is restarting the virtual machine.
Specifically, performing the operation indicated by the troubleshooting strategy includes: when the root cause is a failure of the hardware resource, performing the operation indicated by the troubleshooting strategy includes at least calling an interface of the IaaS management platform to perform the operation indicated by the corresponding troubleshooting strategy; or when the root cause is a failure of the operating system, performing the operation indicated by the troubleshooting strategy includes calling an interface of the IaaS management platform to perform the operation indicated by the corresponding troubleshooting strategy; or when the root cause is a failure of the application, performing the operation indicated by the troubleshooting strategy includes calling the second agent application to perform the operation indicated by the corresponding troubleshooting strategy.
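The routing rule above, where hardware and operating system failures go to the IaaS management platform interface and application failures go to the second agent application, can be sketched as a small dispatcher. The function name `dispatch`, the layer labels, and the callable signature are hypothetical; the two callables stand in for the real interface calls, which the text does not specify.

```python
from typing import Callable, Dict

# An executor takes a strategy name and reports whether recovery succeeded.
Executor = Callable[[str], bool]


def dispatch(
    root_cause: str,
    strategy: str,
    iaas_api: Executor,
    second_agent: Executor,
) -> bool:
    """Send the strategy to whichever component handles this root cause."""
    routes: Dict[str, Executor] = {
        "hardware": iaas_api,
        "os": iaas_api,
        "application": second_agent,
    }
    try:
        executor = routes[root_cause]
    except KeyError:
        raise ValueError(f"unknown root cause: {root_cause!r}") from None
    return executor(strategy)
```

Passing the two executors in as parameters keeps the routing decision independent of how the IaaS platform or the agent is actually reached.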
The processor 701 performing the operation indicated by the troubleshooting strategy further includes:
The processor 701 generates a fault log based on the fault information, archives the fault log, and reports the fault log to the network management system using the transmitter 703, where the fault information includes the hardware resource fault information, the operating system failure information, and the application fault information.
It should be noted that, from the foregoing detailed description of the embodiments of Fig. 2 to Fig. 6, those skilled in the art can clearly understand how each functional unit included in the device 70 is implemented. For brevity of the specification, details are not repeated here.
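The description states that the fault log indicates the time and location of the fault, the fault type, and the fault recovery history. A minimal sketch of such a record, serialized as JSON for archiving and reporting, follows. The field names, the `FaultLog` class, the function `build_fault_log`, and the choice of JSON are assumptions; the specification does not define a log format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Iterable, List


@dataclass
class FaultLog:
    time: str                 # when the fault occurred
    location: str             # where it occurred, e.g. a host or VM identifier
    fault_type: str           # e.g. "hardware", "os", or "application"
    recovery_history: List[str] = field(default_factory=list)


def build_fault_log(
    time: str, location: str, fault_type: str, attempted: Iterable[str]
) -> str:
    """Build a fault log entry and serialize it for archiving and reporting."""
    log = FaultLog(time, location, fault_type, list(attempted))
    return json.dumps(asdict(log), sort_keys=True)
```

The same serialized string can be written to the archive and sent to the network management system, so both see an identical record.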
Based on the same inventive concept, an embodiment of the present invention provides a device 80 for implementing fault recovery of a cloud computing server. Referring to Fig. 8, the device includes multiple functional modules, each of which is described in detail below.
A fault detection module 801, used to obtain hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, where the IaaS management platform is used to manage the hardware resources of the cloud computing server and is also used to detect the hardware resource fault information of the hardware resources, and the IaaS management platform is independent of the cloud computing server; also used to obtain operating system failure information of the cloud computing server, where the operating system failure information indicates a failure of the operating system installed on the cloud computing server; and also used to obtain application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
A failure analysis module 802, used to determine the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system failure information, and application fault information;
A failure strategy module 803, used to determine a troubleshooting strategy according to the root cause of the failure;
A failure recovery module 804, used to perform the operation indicated by the troubleshooting strategy so as to carry out fault recovery.
In a specific embodiment, the operating system has a first agent application. The fault detection module 801 being further used to obtain the operating system failure information of the cloud computing server includes: the fault detection module 801 is further used to determine the operating system failure information by detecting heartbeat messages of the first agent application, where the heartbeat message indicates whether the operating system has failed.
In a specific embodiment, a second agent application is installed on the operating system. The fault detection module 801 being further used to obtain the application fault information of the cloud computing server includes: the fault detection module 801 is further used to call, through the second agent application, the state detection script of the application, and to determine the application fault information according to the return value of the state detection script.
In a specific embodiment, the failure analysis module 802 being used to determine the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system failure information, and application fault information includes at least the following:
The failure analysis module 802 is used to determine that the root cause is a failure of the hardware resource when both the hardware resource fault information and the operating system failure information are detected within a preset time; or the failure analysis module 802 is used to determine that the root cause is a failure of the operating system when the operating system failure information and the application fault information are detected within the preset time and the hardware resource fault information is not detected; or the failure analysis module 802 is used to determine that the root cause is a failure of the application when only the application fault information is detected within the preset time.
In a specific embodiment, the failure strategy module 803 being used to determine the troubleshooting strategy according to the root cause of the failure includes:
When the root cause is a failure of the hardware resource, the troubleshooting strategy includes restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine; or when the root cause is a failure of the operating system, the troubleshooting strategy includes at least restarting the virtual machine; or when the root cause is a failure of the application, the troubleshooting strategy includes at least restarting the virtual machine and restarting the application.
When the root cause is a failure of the hardware resource, the troubleshooting strategy includes restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine, as follows:
When the root cause is a failure of the hardware resource, the troubleshooting strategy is restarting the virtual machine; when restarting the virtual machine cannot achieve recovery from the hardware resource failure, the troubleshooting strategy is locally rebuilding the virtual machine; when neither restarting the virtual machine nor locally rebuilding the virtual machine can achieve recovery from the hardware resource failure, the troubleshooting strategy is migrating the virtual machine.
When the root cause is a failure of the application, the troubleshooting strategy includes at least restarting the virtual machine and restarting the application, as follows:
When the root cause is a failure of the application, the troubleshooting strategy is restarting the application; when restarting the application cannot achieve recovery from the application failure, the troubleshooting strategy is restarting the virtual machine.
In a specific embodiment, the failure recovery module 804 being used to perform the operation indicated by the troubleshooting strategy so as to carry out fault recovery includes:
When the root cause is a failure of the hardware resource, performing the operation indicated by the troubleshooting strategy includes at least calling an interface of the IaaS management platform to perform the operation indicated by the corresponding troubleshooting strategy; or when the root cause is a failure of the operating system, performing the operation indicated by the troubleshooting strategy includes calling an interface of the IaaS management platform to perform the operation indicated by the corresponding troubleshooting strategy; or when the root cause is a failure of the application, performing the operation indicated by the troubleshooting strategy includes calling the second agent application to perform the operation indicated by the corresponding troubleshooting strategy.
In a specific embodiment, the device 80 further includes a fault warning module 805, which is used to generate a fault log based on the fault information, archive the fault log, and report the fault log to the network management system, where the fault information includes the hardware resource fault information, the operating system failure information, and the application fault information.
It should be noted that, from the foregoing detailed description of the embodiments of Fig. 2 to Fig. 6, those skilled in the art can clearly understand how each functional unit included in the device 80 is implemented. For brevity of the specification, details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention further provides another management system. Referring to Figure 10, the management system includes an IaaS management platform 901, a PaaS management platform 902, and a SaaS service platform 903, where the PaaS management platform 902 includes a fault detection module 801, a failure analysis module 802, a failure strategy module 803, and a failure recovery module 804, and the SaaS service platform 903 includes an agent application 806. The modules of the PaaS management platform 902 are connected to the IaaS management platform 901 through periodic communication interfaces (IF), and the modules of the PaaS management platform 902 are also connected to the SaaS service platform 903 through IFs. The interfaces are described as follows:
Interface name    Interface connection relation
IF1               Connects fault detection module 801 and failure analysis module 802
IF2               Connects failure strategy module 803 and failure analysis module 802
IF3               Connects failure recovery module 804 and failure analysis module 802
IF4               Connects fault detection module 801 and IaaS management platform 901
IF5               Connects fault detection module 801 and agent application 806
IF6               Connects failure recovery module 804 and IaaS management platform 901
IF7               Connects failure recovery module 804 and agent application 806
It should be noted that the functions of each management platform, module, and interface in the management system have been described in the foregoing embodiments. For details, refer to the related descriptions of Fig. 2 to Fig. 9, which are not repeated here.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions described in the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line) or a wireless manner (such as infrared or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (such as a floppy disk, hard disk, or magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk).
In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

Claims (19)

1. A fault recovery method for a cloud computing server, characterized in that the method is applied to a cloud computing server and includes:
obtaining hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, where the IaaS management platform is used to detect the hardware resource fault information of the hardware resource;
obtaining operating system failure information of the cloud computing server, where the operating system failure information indicates a failure of the operating system installed on the cloud computing server;
obtaining application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
determining the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system failure information, and application fault information;
determining a troubleshooting strategy according to the root cause of the failure;
performing the operation indicated by the troubleshooting strategy to carry out fault recovery.
2. The method according to claim 1, characterized in that the operating system further has a first agent application;
obtaining the operating system failure information of the cloud computing server includes:
determining the operating system failure information by detecting heartbeat messages of the first agent application, where the heartbeat message indicates whether the operating system has failed.
3. The method according to claim 1 or 2, characterized in that the operating system further has a second agent application;
obtaining the application fault information of the cloud computing server includes:
calling, through the second agent application, the state detection script of the application, and determining the application fault information according to the return value of the state detection script.
4. The method according to any one of claims 1 to 3, characterized in that determining the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system failure information, and application fault information includes at least:
when both the hardware resource fault information and the operating system failure information are detected within a preset time, determining that the root cause is a failure of the hardware resource; or
when the operating system failure information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, determining that the root cause is a failure of the operating system; or
when only the application fault information is detected within the preset time, determining that the root cause is a failure of the application.
5. The method according to claim 4, characterized in that determining the troubleshooting strategy according to the root cause of the failure includes:
when the root cause is a failure of the hardware resource, the troubleshooting strategy includes one or more of restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine; or
when the root cause is a failure of the operating system, the troubleshooting strategy includes restarting the virtual machine; or when the root cause is a failure of the application, the troubleshooting strategy includes one or both of restarting the application and restarting the virtual machine.
6. The method according to claim 5, characterized in that, when the root cause is a failure of the hardware resource, the troubleshooting strategy includes one or more of restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine, specifically:
when the root cause is a failure of the hardware resource, the troubleshooting strategy is restarting the virtual machine;
when restarting the virtual machine cannot achieve recovery from the hardware resource failure, the troubleshooting strategy is locally rebuilding the virtual machine;
when neither restarting the virtual machine nor locally rebuilding the virtual machine can achieve recovery from the hardware resource failure, the troubleshooting strategy is migrating the virtual machine.
7. The method according to claim 5, characterized in that, when the root cause is a failure of the application, the troubleshooting strategy includes one or both of restarting the application and restarting the virtual machine, specifically:
when the root cause is a failure of the application, the troubleshooting strategy is restarting the application;
when restarting the application cannot achieve recovery from the application failure, the troubleshooting strategy is restarting the virtual machine.
8. The method according to claim 5, characterized in that performing the operation indicated by the troubleshooting strategy includes:
when the root cause is a failure of the hardware resource, performing the operation indicated by the troubleshooting strategy includes at least: calling an interface of the IaaS management platform to perform the operation indicated by the corresponding troubleshooting strategy; or
when the root cause is a failure of the operating system, performing the operation indicated by the troubleshooting strategy includes: calling an interface of the IaaS management platform to perform the operation indicated by the corresponding troubleshooting strategy; or
when the root cause is a failure of the application, performing the operation indicated by the troubleshooting strategy includes: calling the second agent application to perform the operation indicated by the corresponding troubleshooting strategy.
9. A device for implementing fault recovery of a cloud computing server, characterized by including:
a fault detection module, used to obtain hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, where the IaaS management platform is used to detect the hardware resource fault information of the hardware resource; also used to obtain operating system failure information of the cloud computing server, where the operating system failure information indicates a failure of the operating system installed on the cloud computing server; and also used to obtain application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
a failure analysis module, used to determine the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system failure information, and application fault information;
a failure strategy module, used to determine a troubleshooting strategy according to the root cause of the failure;
a failure recovery module, used to perform the operation indicated by the troubleshooting strategy so as to carry out fault recovery.
10. The device according to claim 9, characterized in that a first agent application is installed in the operating system;
The fault detection module being further configured to obtain the operating system failure information of the cloud computing server includes:
The fault detection module is further configured to determine the operating system failure information by detecting a heartbeat message of the first agent application, the heartbeat message indicating whether the operating system has failed.
11. The device according to claim 9 or 10, characterized in that a second agent application is installed in the operating system;
The fault detection module being further configured to obtain the application fault information of the cloud computing server includes:
The fault detection module is further configured to call a state-detection script of the application through the second agent application, and to determine the application fault information according to the return value of the state-detection script.
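As an illustration of deriving application fault information from a state-detection script's return value, here is a minimal Python sketch. The function name `check_application`, the timeout, and the convention that exit code 0 means "healthy" are assumptions made for this example; the claim does not specify how return values are interpreted.

```python
import subprocess
from typing import Optional, Sequence


def check_application(command: Sequence[str], timeout_s: float = 10.0) -> Optional[str]:
    """Run a state-detection command and map its return value to fault information.

    Returns None when the command exits with 0 (assumed here to mean healthy),
    otherwise a short description of the fault.
    """
    try:
        result = subprocess.run(command, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # A script that never returns is treated as a fault in this sketch.
        return "status check timed out"
    except OSError as exc:
        # The script could not be launched at all (missing file, no permission).
        return f"status check could not be started: {exc}"
    if result.returncode == 0:
        return None
    return f"status check exited with code {result.returncode}"
```

In a deployment following the claim, the second agent application would run the script inside the virtual machine and report the result to the fault detection module; the sketch only covers the mapping from return value to fault information.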
12. The device according to any one of claims 9 to 11, characterized in that the failure analysis module 802 being configured to determine the failure root cause of the cloud computing server according to the obtained hardware resource fault information, operating system failure information and application fault information at least includes:
The failure analysis module is configured to determine, where both the hardware resource fault information and the operating system failure information are detected within a preset time, that the failure root cause is that the hardware resource has failed; or
The failure analysis module is configured to determine, where the operating system failure information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, that the failure root cause is that the operating system has failed; or
The failure analysis module is configured to determine, where only the application fault information is detected within the preset time, that the failure root cause is that the application has failed.
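The three root-cause rules of this claim can be sketched as a small decision function. This is a hedged illustration, not the patented implementation: the names `RootCause` and `determine_root_cause` are hypothetical, and each boolean is assumed to mean "this kind of fault information was detected within the preset time window". Combinations the claim does not cover return `None`.

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    HARDWARE = "hardware"
    OPERATING_SYSTEM = "operating_system"
    APPLICATION = "application"


def determine_root_cause(
    hardware_fault: bool, os_fault: bool, app_fault: bool
) -> Optional[RootCause]:
    """Apply the three rules from the claim to the faults seen in one time window."""
    # Hardware and operating system faults seen together: blame the hardware.
    if hardware_fault and os_fault:
        return RootCause.HARDWARE
    # Operating system and application faults without a hardware fault: blame the OS.
    if os_fault and app_fault and not hardware_fault:
        return RootCause.OPERATING_SYSTEM
    # Only an application fault: blame the application.
    if app_fault and not hardware_fault and not os_fault:
        return RootCause.APPLICATION
    # The claim does not state a rule for any other combination.
    return None
```

The ordering reflects the layering in the claim: a fault in a lower layer (hardware) is taken to explain faults observed in the layers above it.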
13. The device according to claim 12, characterized in that the failure strategy module 803 being configured to determine the troubleshooting strategy according to the failure root cause includes:
Where the failure root cause is that the hardware resource has failed, the troubleshooting strategy includes one or more of restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine; or
Where the failure root cause is that the operating system has failed, the troubleshooting strategy at least includes restarting the virtual machine; or
Where the failure root cause is that the application has failed, the troubleshooting strategy at least includes one or both of restarting the application and restarting the virtual machine.
14. The device according to claim 13, characterized in that, where the failure root cause is that the hardware resource has failed, the troubleshooting strategy including one or more of restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine is specifically:
Where the failure root cause is that the hardware resource has failed, the troubleshooting strategy is to restart the virtual machine;
Where restarting the virtual machine fails to recover from the hardware resource fault, the troubleshooting strategy is to rebuild the virtual machine locally;
Where neither restarting the virtual machine nor rebuilding the virtual machine locally recovers from the hardware resource fault, the troubleshooting strategy is to migrate the virtual machine.
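The escalation described in this claim (restart, then local rebuild, then migration) can be sketched as an ordered list of recovery actions that are tried until one succeeds. The function `recover_with_escalation` and the action names are hypothetical; each action is assumed to be a callable that returns `True` when the fault has been recovered.

```python
from typing import Callable, List, Optional, Tuple

# A recovery action is a (name, callable) pair; the callable reports success.
Action = Tuple[str, Callable[[], bool]]


def recover_with_escalation(actions: List[Action]) -> Optional[str]:
    """Try each action in order and return the name of the first that succeeds.

    Returns None when every action has been tried without recovering the fault.
    """
    for name, run in actions:
        if run():
            return name
    return None
```

For a hardware resource fault the list would be ordered as restart, local rebuild, migrate; the same helper would fit an application fault with the order restart application, restart virtual machine.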
15. The device according to claim 13 or 14, characterized in that, where the failure root cause is that the application has failed, the troubleshooting strategy at least including one or both of restarting the application and restarting the virtual machine is specifically:
Where the failure root cause is that the application has failed, the troubleshooting strategy is to restart the application;
Where restarting the application fails to recover from the application fault, the troubleshooting strategy is to restart the virtual machine.
16. The device according to claim 15, characterized in that the failure recovery module 804 being configured to perform fault recovery according to the operation indicated by the troubleshooting strategy includes:
Where the failure root cause is that the hardware resource has failed, executing the operation indicated by the troubleshooting strategy at least includes: calling an interface of the IaaS management platform to execute the operation indicated by the corresponding troubleshooting strategy; or
Where the failure root cause is that the operating system has failed, executing the operation indicated by the troubleshooting strategy includes: calling the interface of the IaaS management platform to execute the operation indicated by the corresponding troubleshooting strategy; or
Where the failure root cause is that the application has failed, executing the operation indicated by the troubleshooting strategy includes: calling the second agent application to execute the operation indicated by the corresponding troubleshooting strategy.
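The routing in this claim, where hardware and operating system faults are handled through the IaaS management platform interface and application faults through the second agent application, might be sketched as below. The `Executor` protocol, the `execute` method, and the string root-cause labels are assumptions for illustration; the patent does not define a concrete interface.

```python
from typing import Protocol


class Executor(Protocol):
    """Anything that can carry out a recovery operation on a virtual machine."""

    def execute(self, operation: str, vm_id: str) -> bool: ...


def execute_strategy(
    root_cause: str, operation: str, vm_id: str, iaas: Executor, agent: Executor
) -> bool:
    """Route a recovery operation to the channel named in the claim."""
    if root_cause in ("hardware", "operating_system"):
        # Hardware and OS faults go through the IaaS management platform interface.
        return iaas.execute(operation, vm_id)
    if root_cause == "application":
        # Application faults are handled by the agent inside the operating system.
        return agent.execute(operation, vm_id)
    raise ValueError(f"unknown root cause: {root_cause!r}")
```

Keeping the two channels behind one small protocol lets the recovery module stay unaware of how the IaaS platform or the agent actually performs the operation.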
17. A device for implementing fault recovery of a cloud computing server, characterized by comprising: a memory, and a processor, a transmitter and a receiver coupled to the memory, wherein: the transmitter is configured to send instruction data to the outside, the receiver is configured to receive data sent from the outside, the memory is configured to store program code and related data, and the processor is configured to execute the program code stored in the memory so as to perform a fault recovery method of a cloud computing server, wherein the method is the method according to any one of claims 1 to 8.
18. A management system, comprising an IaaS management platform, a PaaS management platform and a SaaS service platform, wherein the PaaS management platform includes a fault detection module, a failure analysis module, a failure strategy module and a failure recovery module, the SaaS service platform includes an agent application, and the PaaS management platform is connected to the IaaS management platform and the SaaS service platform through a periodic communication interface. The management system is configured to implement the method according to any one of claims 1 to 8.
19. A computer-readable storage medium, characterized by comprising instructions which, when run on a computer, cause the computer to execute the method according to any one of claims 1 to 8.
CN201710160761.7A 2017-03-17 2017-03-17 A kind of fault recovery method of cloud computing server, device and management system Pending CN108632057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710160761.7A CN108632057A (en) 2017-03-17 2017-03-17 A kind of fault recovery method of cloud computing server, device and management system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710160761.7A CN108632057A (en) 2017-03-17 2017-03-17 A kind of fault recovery method of cloud computing server, device and management system

Publications (1)

Publication Number Publication Date
CN108632057A true CN108632057A (en) 2018-10-09

Family

ID=63687046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710160761.7A Pending CN108632057A (en) 2017-03-17 2017-03-17 A kind of fault recovery method of cloud computing server, device and management system

Country Status (1)

Country Link
CN (1) CN108632057A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111092855A (en) * 2019-11-14 2020-05-01 山东中创软件商用中间件股份有限公司 Server operation and maintenance system, method and device and computer readable storage medium
CN111309515A (en) * 2018-12-11 2020-06-19 华为技术有限公司 Disaster recovery control method, device and system
CN111355605A (en) * 2019-10-18 2020-06-30 烽火通信科技股份有限公司 Virtual machine fault recovery method and server of cloud platform
CN111786827A (en) * 2020-06-29 2020-10-16 中国工商银行股份有限公司 Fault association positioning alarm method and device for distributed cloud computing environment
CN111970147A (en) * 2020-07-29 2020-11-20 苏州浪潮智能科技有限公司 Method for processing large-scale host faults of cloud platform
US10887382B2 (en) 2018-12-18 2021-01-05 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
CN112256498A (en) * 2020-11-17 2021-01-22 珠海大横琴科技发展有限公司 Fault processing method and device
CN112350862A (en) * 2020-10-30 2021-02-09 广州市汇聚支付电子科技有限公司 Monitoring alarm and fault self-healing system
CN112398668A (en) * 2019-08-14 2021-02-23 北京东土科技股份有限公司 IaaS cluster-based cloud platform and node switching method
US10958720B2 (en) 2018-12-18 2021-03-23 Storage Engine, Inc. Methods, apparatuses and systems for cloud based disaster recovery
CN112543126A (en) * 2020-12-22 2021-03-23 武汉联影医疗科技有限公司 Cloud platform monitoring method and device, computer equipment and storage medium
US10983886B2 (en) 2018-12-18 2021-04-20 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
CN112799910A (en) * 2021-01-26 2021-05-14 中国工商银行股份有限公司 Hierarchical monitoring method and device
CN113438122A (en) * 2021-05-14 2021-09-24 济南浪潮数据技术有限公司 Heartbeat management method and device for server, computer equipment and medium
US11178221B2 (en) 2018-12-18 2021-11-16 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
US11176002B2 (en) 2018-12-18 2021-11-16 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
CN113890903A (en) * 2021-09-27 2022-01-04 中信科移动通信技术股份有限公司 Alarm information management system and method
US11252019B2 (en) 2018-12-18 2022-02-15 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
CN114095964A (en) * 2021-11-19 2022-02-25 中国联合网络通信集团有限公司 Fault recovery method and device and computer readable storage medium
US11489730B2 (en) 2018-12-18 2022-11-01 Storage Engine, Inc. Methods, apparatuses and systems for configuring a network environment for a server
CN115665036A (en) * 2022-10-14 2023-01-31 郑州浪潮数据技术有限公司 Routing strategy fault processing method, device and medium
EP4254191A1 (en) * 2022-03-28 2023-10-04 Nuctech Company Limited Method and apparatus of implementing high availability of cluster virtual machine

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167004A (en) * 2011-12-15 2013-06-19 中国移动通信集团上海有限公司 Cloud platform host system fault correcting method and cloud platform front control server
CN104394194A (en) * 2014-10-31 2015-03-04 北京思特奇信息技术股份有限公司 Cloud system operation and maintenance monitoring method and system based on platform-as-a-service (PaaS) platform
CN104486406A (en) * 2014-12-15 2015-04-01 浪潮电子信息产业股份有限公司 Layered resource monitoring method based on cloud data center
CN106130809A (en) * 2016-09-07 2016-11-16 东南大学 A kind of IaaS cloud platform network failure locating method based on log analysis and system
US9516112B1 (en) * 2012-06-29 2016-12-06 EMC IP Holding Company LLC Sending alerts from cloud computing systems

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103167004A (en) * 2011-12-15 2013-06-19 中国移动通信集团上海有限公司 Cloud platform host system fault correcting method and cloud platform front control server
US9516112B1 (en) * 2012-06-29 2016-12-06 EMC IP Holding Company LLC Sending alerts from cloud computing systems
CN104394194A (en) * 2014-10-31 2015-03-04 北京思特奇信息技术股份有限公司 Cloud system operation and maintenance monitoring method and system based on platform-as-a-service (PaaS) platform
CN104486406A (en) * 2014-12-15 2015-04-01 浪潮电子信息产业股份有限公司 Layered resource monitoring method based on cloud data center
CN106130809A (en) * 2016-09-07 2016-11-16 东南大学 A kind of IaaS cloud platform network failure locating method based on log analysis and system

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309515A (en) * 2018-12-11 2020-06-19 华为技术有限公司 Disaster recovery control method, device and system
CN111309515B (en) * 2018-12-11 2023-11-28 华为技术有限公司 Disaster recovery control method, device and system
US10958720B2 (en) 2018-12-18 2021-03-23 Storage Engine, Inc. Methods, apparatuses and systems for cloud based disaster recovery
US11489730B2 (en) 2018-12-18 2022-11-01 Storage Engine, Inc. Methods, apparatuses and systems for configuring a network environment for a server
US11252019B2 (en) 2018-12-18 2022-02-15 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
US10887382B2 (en) 2018-12-18 2021-01-05 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
US11176002B2 (en) 2018-12-18 2021-11-16 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
US11178221B2 (en) 2018-12-18 2021-11-16 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
US10983886B2 (en) 2018-12-18 2021-04-20 Storage Engine, Inc. Methods, apparatuses and systems for cloud-based disaster recovery
CN112398668B (en) * 2019-08-14 2022-08-23 北京东土科技股份有限公司 IaaS cluster-based cloud platform and node switching method
CN112398668A (en) * 2019-08-14 2021-02-23 北京东土科技股份有限公司 IaaS cluster-based cloud platform and node switching method
CN111355605A (en) * 2019-10-18 2020-06-30 烽火通信科技股份有限公司 Virtual machine fault recovery method and server of cloud platform
CN111092855A (en) * 2019-11-14 2020-05-01 山东中创软件商用中间件股份有限公司 Server operation and maintenance system, method and device and computer readable storage medium
CN111786827A (en) * 2020-06-29 2020-10-16 中国工商银行股份有限公司 Fault association positioning alarm method and device for distributed cloud computing environment
CN111970147A (en) * 2020-07-29 2020-11-20 苏州浪潮智能科技有限公司 Method for processing large-scale host faults of cloud platform
US11881984B2 (en) 2020-07-29 2024-01-23 Inspur Suzhou Intelligent Technology Co., Ltd. Method for handling large-scale host failures on cloud platform
CN111970147B (en) * 2020-07-29 2022-05-06 苏州浪潮智能科技有限公司 Method for processing large-scale host faults of cloud platform
CN112350862A (en) * 2020-10-30 2021-02-09 广州市汇聚支付电子科技有限公司 Monitoring alarm and fault self-healing system
CN112256498A (en) * 2020-11-17 2021-01-22 珠海大横琴科技发展有限公司 Fault processing method and device
CN112543126A (en) * 2020-12-22 2021-03-23 武汉联影医疗科技有限公司 Cloud platform monitoring method and device, computer equipment and storage medium
CN112799910A (en) * 2021-01-26 2021-05-14 中国工商银行股份有限公司 Hierarchical monitoring method and device
CN113438122B (en) * 2021-05-14 2022-05-17 济南浪潮数据技术有限公司 Heartbeat management method and device for server, computer equipment and medium
CN113438122A (en) * 2021-05-14 2021-09-24 济南浪潮数据技术有限公司 Heartbeat management method and device for server, computer equipment and medium
CN113890903A (en) * 2021-09-27 2022-01-04 中信科移动通信技术股份有限公司 Alarm information management system and method
CN114095964A (en) * 2021-11-19 2022-02-25 中国联合网络通信集团有限公司 Fault recovery method and device and computer readable storage medium
CN114095964B (en) * 2021-11-19 2023-05-26 中国联合网络通信集团有限公司 Fault recovery method and device and computer readable storage medium
EP4254191A1 (en) * 2022-03-28 2023-10-04 Nuctech Company Limited Method and apparatus of implementing high availability of cluster virtual machine
CN115665036A (en) * 2022-10-14 2023-01-31 郑州浪潮数据技术有限公司 Routing strategy fault processing method, device and medium

Similar Documents

Publication Publication Date Title
CN108632057A (en) A kind of fault recovery method of cloud computing server, device and management system
US9740546B2 (en) Coordinating fault recovery in a distributed system
EP2710484B1 (en) Cross-cloud management and troubleshooting
CN102346460B (en) Transaction-based service control system and method
CN102231681B (en) High availability cluster computer system and fault treatment method thereof
CN105659562B (en) It is a kind of for hold barrier method and data processing system and include for holds hinder computer usable code storage equipment
CN105095001B (en) Virtual machine abnormal restoring method under distributed environment
CN108270726B (en) Application instance deployment method and device
US20080307258A1 (en) Distributed Job Manager Recovery
CN104408071A (en) Distributive database high-availability method and system based on cluster manager
AU2012259086A1 (en) Cross-cloud management and troubleshooting
CN106559441B (en) Virtual machine monitoring method, device and system based on cloud computing service
CN104516789A (en) Method and system for failover detection and treatment in checkpoint systems
CN112948063B (en) Cloud platform creation method and device, cloud platform and cloud platform implementation system
CN110445662A (en) OpenStack control node is adaptively switched to the method and device of calculate node
Melo et al. Comparative analysis of migration-based rejuvenation schedules on cloud availability
CN101442437A (en) Method, system and equipment for implementing high availability
JP2014048933A (en) Plant monitoring system, plant monitoring method, and plant monitoring program
CN116192885A (en) High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system
Mathews et al. Service resilience framework for enhanced end-to-end service quality
CN111966469B (en) Cluster virtual machine high availability method and system
CN114691304A (en) Method, device, equipment and medium for realizing high availability of cluster virtual machine
CN107122228A (en) The dispositions method and device of the management platform of super emerging system
US10985985B2 (en) Cloud service system
CN107783855B (en) Fault self-healing control device and method for virtual network element

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181009