CN108632057A - Fault recovery method, device and management system for a cloud computing server
- Publication number
- CN108632057A CN201710160761.7A
- Authority
- CN
- China
- Prior art keywords
- failure
- application
- virtual machine
- operating system
- cloud computing
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
Abstract
Embodiments of the invention disclose a fault recovery method, device and management system for a cloud computing server. The method includes: obtaining hardware resource fault information sent by an IaaS management platform, obtaining operating system fault information of the cloud computing server, and obtaining application fault information of the cloud computing server; determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information; determining a fault handling strategy according to the root cause; and performing fault recovery by executing the operation indicated by the fault handling strategy. Implementing the embodiments provides a high-reliability guarantee for an enterprise's legacy systems on a cloud computing platform and helps keep those legacy systems running reliably and stably.
Description
Technical field
The present invention relates to the field of cloud computing technology, and in particular to a fault recovery method, device and management system for a cloud computing server.
Background art
Cloud computing is an emerging commercial computing model that distributes computing tasks across a resource pool made up of a large number of computers, so that application systems can obtain computing power, storage space and software services on demand. To obtain the benefits of cloud computing, including lower operation and maintenance complexity and lower hardware cost, more and more enterprises choose to migrate their traditional IT systems onto cloud resource pools, so that the whole IT system can be operated and maintained uniformly through cloud services. The running environment of these IT systems changes greatly as a result. Because a cloud computing platform is less reliable than dedicated servers, a system running on the cloud must fully consider how to keep running reliably when part of the computing resources fail. On a cloud computing platform, computing resources are allocated on demand from a resource pool; when a computing resource fails, the system has to wait for the cloud to reschedule and allocate computing resources, for example through a trigger of elastic scaling. In the prior art, for a traditional IT system to still obtain a high availability (HA) guarantee after it is migrated to a cloud resource pool, the IT system is usually required to be a cloud-ready system. A cloud-ready system should, first, be a distributed system with a high degree of cohesion and transparency; second, it should be redundant, able to handle server failure, and free of single points of failure.
However, enterprise IT systems often still include legacy systems that lack these characteristics. These legacy systems are built as stovepipe-style vertical systems whose architecture does not fully consider dynamic resource allocation, resource failure and similar conditions in a cloud environment, so they are not cloud-ready. From the perspective of architectural compatibility, vertical systems and distributed systems are not compatible with each other, and cloud computing is currently designed for distributed systems in general, so the general HA schemes of current cloud computing platforms do not apply to enterprise legacy systems. When an enterprise migrates its entire IT system (including these legacy systems) to a cloud resource pool, the legacy systems are merely redeployed on computing resources allocated by the cloud and cannot obtain the reliability guarantees of cloud computing, such as elastic scaling and on-demand resource allocation. They therefore face great challenges in terms of reliability.
Summary of the invention
Embodiments of the invention provide a fault recovery method, device and management system for a cloud computing server, to solve the reliability problem that arises after a legacy system is migrated to the cloud.
In a first aspect, an embodiment of the invention provides a fault recovery method for a cloud computing server, applied to the cloud computing server, including:
a PaaS management platform obtains hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, where the IaaS management platform manages the hardware resources of the cloud computing server and also detects the hardware resource fault information of those hardware resources, and the IaaS management platform is independent of the cloud computing server; obtains operating system fault information of the cloud computing server, where the operating system fault information indicates a fault of the operating system installed on the cloud computing server; and obtains application fault information of the cloud computing server, where the application fault information indicates a fault of an application installed on the operating system;
determines the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information; determines a fault handling strategy according to the root cause; and performs fault recovery by executing the operation indicated by the fault handling strategy.
The first aspect describes, from the side of the PaaS management platform, the fault recovery method for a cloud computing server provided by the embodiments. By implementing this method, the PaaS management platform can detect faults at the hardware resource layer, the operating system layer and the application layer of the cloud computing server, analyze these faults together to determine the root cause, and perform fault recovery with the corresponding fault handling strategy. In the embodiments, after an enterprise's legacy system is migrated to the cloud computing server, the PaaS management platform provides an HA scheme for the legacy system: when the cloud computing server fails, the PaaS management platform can accurately determine whether the fault occurred at the hardware resource layer, the operating system layer or the application layer and perform the recovery that corresponds to that layer. The HA scheme provided by the embodiments is therefore comprehensive.
With reference to the first aspect, in some possible embodiments, a first agent application also runs on the operating system;
obtaining the operating system fault information of the cloud computing server includes: determining the operating system fault information by detecting heartbeat messages of the first agent application, where the heartbeat messages indicate whether the operating system has failed.
That is, the PaaS management platform installs a first agent application (Agent) on the operating system of every virtual machine on the cloud computing server where an application (including an application program, an application system, an enterprise IT system and so on) is deployed, and the first agent application exchanges heartbeats with the PaaS management platform. The PaaS management platform monitors the heartbeat of each first agent application; when the heartbeat of a first agent application disappears, the virtual machine (operating system) has suffered a disconnection fault, and the PaaS management platform obtains the corresponding operating system fault information.
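As an illustration only, the heartbeat check described above could be sketched as follows. The class name, the 30-second timeout and the shape of the fault record are assumptions made for this sketch; the text does not specify a heartbeat interval or a message format.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat received from the agent on each virtual machine."""

    timeout_s: float = 30.0  # assumed threshold; the text gives no value
    last_seen: dict = field(default_factory=dict)

    def record(self, vm_id, now=None):
        # Called whenever a heartbeat arrives from the agent on vm_id.
        self.last_seen[vm_id] = time.monotonic() if now is None else now

    def os_faults(self, now=None):
        # A virtual machine whose heartbeat is older than the timeout is
        # treated as having an operating system (disconnection) fault.
        now = time.monotonic() if now is None else now
        return [
            {"layer": "os", "vm": vm_id, "reason": "heartbeat lost"}
            for vm_id, seen in sorted(self.last_seen.items())
            if now - seen > self.timeout_s
        ]
```

The `now` parameter exists only so the check can be exercised with fixed timestamps; in normal use both methods fall back to the monotonic clock.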
With reference to the first aspect, in some possible embodiments, a second agent application also runs on the operating system;
obtaining the application fault information of the cloud computing server includes: calling the state-detection script of the application through the second agent application, and determining the application fault information according to the return value of the state-detection script.
The second agent application and the first agent application may be the same agent application or different agent applications.
The second agent application is likewise deployed at the application layer. It can manage the applications on the cloud computing server and periodically monitor the running state of the applications on the virtual machine; for example, the second agent application monitors the running state through a state-detection script provided by the application (application system). In a specific application scenario, the application (application system) provides a state-detection script dynamically while it runs; the second agent application periodically calls status.sh in the installation directory and obtains its return value. The second agent application judges the running state of the application (application system) from the return value of the script and sends that running state to the PaaS. When it determines that the application has failed, the second agent application generates application fault information and sends it to the PaaS.
It should be understood that when the second agent application and the first agent application are the same agent application, the PaaS management platform can monitor both the running state of the virtual machine (operating system) and the running state of the application through that one agent application, so that the HA scheme provided by the embodiments can be deployed quickly and conveniently.
With reference to the first aspect, in some possible embodiments, determining the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information includes at least:
when both the hardware resource fault information and the operating system fault information are detected within a preset time, determining that the root cause is a fault of the hardware resource; or, when the operating system fault information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, determining that the root cause is a fault of the operating system; or, when only the application fault information is detected within the preset time, determining that the root cause is a fault of the application.
It can be seen that when the cloud computing server fails, the fault is no longer handled directly at the layer where it appears but is reported to the PaaS management platform. Because the PaaS management platform holds the state information of every layer of the cloud computing server and has a global view, it can analyze all fault information obtained within the preset time (for example, 3 minutes) together and accurately determine the root cause of the faults on the cloud computing server. This helps prevent false alarms and missed alarms, so the HA scheme provided by the embodiments is accurate.
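The three rules above can be written as a small decision function. This is a sketch only: the layer names are labels chosen for illustration, and combinations the text does not cover (for example, operating system fault information alone) are returned as `None` rather than guessed.

```python
def determine_root_cause(layers_seen):
    """Apply the three root-cause rules to the layers reported in one window.

    `layers_seen` is the set of layers ("hardware", "os", "application")
    for which fault information arrived within the preset time.
    """
    if "hardware" in layers_seen and "os" in layers_seen:
        return "hardware"
    if (
        "os" in layers_seen
        and "application" in layers_seen
        and "hardware" not in layers_seen
    ):
        return "os"
    if layers_seen == {"application"}:
        return "application"
    return None  # combination not covered by the rules in the text
```

For example, `determine_root_cause({"os", "application"})` returns `"os"`, because the hardware layer reported nothing in the same window.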
With reference to the first aspect, in some possible embodiments, determining the fault handling strategy according to the root cause includes:
when the root cause is a fault of the hardware resource, the fault handling strategy includes restarting the virtual machine, rebuilding the virtual machine locally and migrating the virtual machine; or, when the root cause is a fault of the operating system, the fault handling strategy includes at least restarting the virtual machine; or, when the root cause is a fault of the application, the fault handling strategy includes at least restarting the virtual machine and restarting the application.
In one specific application scenario, the PaaS management platform determines the specific fault handling strategy from the type of fault. For example, a fault diagnosis database can be preset in the PaaS. The fault diagnosis database stores many kinds of fault information and assigns different fault levels to fault information belonging to the same layer, such as fault level one, fault level two and fault level three. For example, among the preset fault handling strategies for hardware resource layer faults, the strategy for fault level one is to restart the virtual machine, the strategy for fault level two is to rebuild the virtual machine locally, the strategy for fault level three is to migrate the virtual machine, and so on. After the root cause is determined, the PaaS analyzes the hardware resource layer fault actually obtained, determines its fault level, and determines the fault handling strategy from that fault level.
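The fault-level lookup for the hardware resource layer might look like the sketch below. The level-to-strategy mapping follows the example in the text; the fault codes in the diagnosis table are hypothetical placeholders, since the text does not list the contents of the fault diagnosis database.

```python
# Level-to-strategy mapping for hardware resource layer faults, as in the text.
HARDWARE_STRATEGY_BY_LEVEL = {
    1: "restart_vm",
    2: "rebuild_vm_locally",
    3: "migrate_vm",
}

# Hypothetical fault diagnosis entries: fault code -> fault level.
FAULT_DIAGNOSIS_DB = {
    "vm_process_hung": 1,
    "vm_disk_corrupted": 2,
    "host_power_failure": 3,
}


def strategy_for_hardware_fault(fault_code):
    """Return the preset strategy for a hardware fault, or None if unknown."""
    level = FAULT_DIAGNOSIS_DB.get(fault_code)
    if level is None:
        return None
    return HARDWARE_STRATEGY_BY_LEVEL[level]
```

Keeping the fault codes and the level mapping in separate tables lets an operator reclassify a fault without touching the strategies themselves.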
In another specific application scenario, the PaaS assigns different priorities in advance to the different fault handling strategies of the same layer. After determining from the fault information that the root cause is a fault of a certain layer, the PaaS first selects the highest-priority fault handling strategy of that layer as the strategy to execute. If the higher-priority strategy cannot achieve fault recovery, the PaaS selects the next lower-priority strategy and repeats the above steps.
For example, when the root cause is a fault of the hardware resource, the fault handling strategy is to restart the virtual machine; if restarting the virtual machine does not recover the hardware resource fault, the strategy is to rebuild the virtual machine locally; if neither restarting nor locally rebuilding the virtual machine recovers the hardware resource fault, the strategy is to migrate the virtual machine.
As another example, when the root cause is a fault of the application, the fault handling strategy is to restart the application; if restarting the application does not recover the application fault, the strategy is to restart the virtual machine.
It can be seen that the embodiments provide several fault recovery means for each root cause; when one means cannot achieve fault recovery, the PaaS continues with the other means. This ensures that the cloud computing server recovers from a fault as soon as possible and thus ensures the high availability of the cloud computing platform.
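The priority-based fallback can be sketched as a loop over an ordered list of strategies. The strategy names are illustrative labels, and `execute` is a caller-supplied function (an assumption of this sketch) that performs one strategy and reports whether the fault was recovered.

```python
# Strategies per root cause, highest priority first, following the examples
# in the text.
STRATEGIES_BY_PRIORITY = {
    "hardware": ["restart_vm", "rebuild_vm_locally", "migrate_vm"],
    "os": ["restart_vm"],
    "application": ["restart_application", "restart_vm"],
}


def recover(root_cause, execute):
    """Try each strategy in priority order until one recovers the fault.

    Returns (strategy_that_worked_or_None, list_of_strategies_attempted).
    """
    attempted = []
    for strategy in STRATEGIES_BY_PRIORITY[root_cause]:
        attempted.append(strategy)
        if execute(strategy):
            return strategy, attempted
    return None, attempted
```

The list of attempted strategies is returned so that it can be recorded as the fault recovery history when every strategy fails.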
With reference to the first aspect, in some possible embodiments, executing the operation indicated by the fault handling strategy includes:
when the root cause is a fault of the hardware resource, executing the operation indicated by the fault handling strategy includes at least: calling the interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or, when the root cause is a fault of the operating system, executing the operation includes: calling the interface of the IaaS management platform to execute the operation indicated by the corresponding fault handling strategy; or, when the root cause is a fault of the application, executing the operation includes: calling the second agent application to execute the operation indicated by the corresponding fault handling strategy.
It can be seen that when fault recovery is needed, the PaaS management platform initiates it based on the fault handling strategy. The PaaS management platform can call the IaaS management platform or the agent application to perform the corresponding recovery and takes different recovery means for different root causes, so the capability and efficiency of fault recovery are guaranteed by the PaaS. In other words, the HA scheme provided by the embodiments is not constrained by the HA capability of the IaaS: whatever the HA capability of the IaaS, the reliability of the applications running on the cloud computing server can be guaranteed. The HA scheme provided by the embodiments is therefore general.
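The routing of a recovery operation to either the IaaS management platform interface or the agent application could be sketched as below. `RecordingClient` is a stand-in invented for this sketch; the real IaaS interface and agent protocol are not specified in the text.

```python
class RecordingClient:
    """Stand-in for an IaaS interface or an agent; records what it is asked to do."""

    def __init__(self):
        self.calls = []

    def run(self, strategy, vm_id):
        self.calls.append((strategy, vm_id))
        return True  # pretend the operation succeeded


def execute_strategy(root_cause, strategy, vm_id, iaas, agent):
    """Route the operation by root cause, as described in the text."""
    if root_cause in ("hardware", "os"):
        # Hardware and operating system faults go through the IaaS interface.
        return iaas.run(strategy, vm_id)
    if root_cause == "application":
        # Application faults are handled by the second agent application.
        return agent.run(strategy, vm_id)
    raise ValueError(f"unknown root cause: {root_cause!r}")
```

Because the routing lives in the PaaS layer, the same function works regardless of which IaaS platform sits behind the `iaas` object.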
With reference to the first aspect, in some possible embodiments, executing the operation indicated by the fault handling strategy further includes:
generating a fault log based on the fault information, archiving the fault log, and reporting the fault log to a network management system, where the fault information includes the hardware resource fault information, the operating system fault information and the application fault information. The fault log indicates information such as the time and location of the fault, the fault type and the fault recovery history.
When none of the fault handling strategies of the PaaS management platform and their corresponding recovery operations achieves fault recovery, the PaaS management platform raises an alarm to the network management system and reports the fault log, so that operation and maintenance personnel can find the fault through the network management system and repair it manually in time. This avoids the cloud computing server going down because fault recovery could not be achieved and ensures the high availability of the cloud computing platform.
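A fault log carrying the fields named above (time, location, fault type, recovery history) might be built and reported as in the following sketch. The field names, the in-memory archive and the `notify_nms` callback are assumptions for illustration; the text does not define a log format or a network management interface.

```python
from datetime import datetime, timezone


def build_fault_log(faults, root_cause, attempted, recovered, now=None):
    """Assemble one fault log entry from the fault information in a window."""
    now = now or datetime.now(timezone.utc)
    return {
        "time": now.isoformat(),
        "location": sorted({f["vm"] for f in faults}),
        "fault_types": sorted({f["layer"] for f in faults}),
        "root_cause": root_cause,
        "recovery_history": list(attempted),
        "recovered": recovered,
    }


def report(log, archive, notify_nms):
    """Archive the log and report it; raise an alarm if recovery failed."""
    archive.append(log)
    notify_nms(log, alarm=not log["recovered"])
```

The `alarm` flag separates routine reporting from the case where every strategy has failed and manual maintenance is needed.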
In a second aspect, an embodiment of the invention provides a device for fault recovery of a cloud computing server, including a fault detection module, a fault analysis module, a fault strategy module and a fault recovery module, to execute the fault recovery method for a cloud computing server provided in the first aspect, where:
the fault detection module is configured to obtain hardware resource fault information sent by an infrastructure as a service (IaaS) management platform, where the IaaS management platform manages the hardware resources of the cloud computing server and also detects the hardware resource fault information of those hardware resources, and the IaaS management platform is independent of the cloud computing server; is further configured to obtain operating system fault information of the cloud computing server, where the operating system fault information indicates a fault of the operating system installed on the cloud computing server; and is further configured to obtain application fault information of the cloud computing server, where the application fault information indicates a fault of an application installed on the operating system;
the fault analysis module is configured to determine the root cause of the fault of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;
the fault strategy module is configured to determine a fault handling strategy according to the root cause;
the fault recovery module is configured to perform fault recovery by executing the operation indicated by the fault handling strategy.
In a third aspect, an embodiment of the invention provides another device (server) for fault recovery of a cloud computing server, including a memory and, coupled to the memory, a processor, a transmitter and a receiver, where: the transmitter is configured to send instructions or data to the outside, the receiver is configured to receive data sent from the outside, the memory is configured to store program code and related data, and the processor is configured to execute the program code stored in the memory to perform a fault recovery method for a cloud computing server, the method being the method described in the first aspect.
In a fourth aspect, an embodiment of the invention provides a management system including an IaaS management platform, a PaaS management platform and a SaaS service platform, where the PaaS management platform includes a fault detection module, a fault analysis module, a fault strategy module and a fault recovery module, and the SaaS service platform includes an agent application. The modules of the PaaS management platform are connected to the IaaS management platform through a first communication interface (IF), and the modules of the PaaS management platform are connected to the SaaS service platform through a second IF. The management system is used to implement the fault recovery method for a cloud computing server described in the first aspect.
In a fifth aspect, an embodiment of the invention provides a computer-readable storage medium that stores instructions (implementation code) which, when run on a computer, cause the computer to execute the method described in the first aspect.
In a sixth aspect, an embodiment of the invention provides a computer program product including instructions which, when run on a computer, cause the computer to execute the method described in the first aspect.
It can be seen that, by implementing the embodiments, after an enterprise migrates a legacy system to a cloud computing server of a cloud computing platform, the PaaS can monitor faults of the hardware resource layer through the IaaS and can monitor the running state of the operating system and of the legacy system through the agent application. When the PaaS obtains fault information, it continues to collect other fault information for a preset time (for example, 2 minutes). After the preset time, it analyzes all collected fault information together to determine the root cause of the fault, determines a specific fault handling strategy from the root cause, and then calls the IaaS or the agent application to perform the corresponding fault recovery and fault alarm. This ensures the high availability of the legacy system on the cloud computing platform. The HA scheme of the embodiments is comprehensive, accurate and general.
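The collection step, in which the PaaS keeps gathering fault information for a preset time after the first fault arrives, could be sketched as below. The 120-second default mirrors the "2 minutes" example in the text; the event representation as `(timestamp, layer)` pairs is an assumption of this sketch.

```python
def collect_window(events, window_s=120.0):
    """Return the layers that reported faults within the window.

    `events` is an iterable of (timestamp_seconds, layer) pairs. The window
    opens at the earliest event and lasts `window_s` seconds.
    """
    ordered = sorted(events)
    if not ordered:
        return set()
    start = ordered[0][0]
    return {layer for ts, layer in ordered if ts - start <= window_s}
```

The resulting set of layers is the input a root-cause decision needs: faults that arrive after the window closes start a new analysis rather than changing the current one.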
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention or in the background art more clearly, the drawings used in the embodiments or the background art are described below.
Fig. 1 is a schematic diagram of a cloud computing platform architecture provided by the prior art;
Fig. 2 is a schematic flowchart of a fault recovery method for a cloud computing server provided by an embodiment of the invention;
Fig. 3 is a schematic flowchart of another fault recovery method for a cloud computing server provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of a PaaS comprehensively detecting faults of a cloud computing server, provided by an embodiment of the invention;
Fig. 5 is a schematic diagram of a PaaS judging whether a cloud computing server has failed, provided by an embodiment of the invention;
Fig. 6 is a schematic flowchart of a PaaS selecting a fault handling strategy based on priority, provided by an embodiment of the invention;
Fig. 7 is a schematic diagram of a device for fault recovery of a cloud computing server provided by an embodiment of the invention;
Fig. 8 is a schematic diagram of another device for fault recovery of a cloud computing server provided by an embodiment of the invention;
Fig. 9 shows a management system provided by an embodiment of the invention;
Fig. 10 shows another management system provided by an embodiment of the invention.
Detailed description
The embodiments of the invention are described below with reference to the drawings in the embodiments.
In the current cloud era of rapidly developing Internet and big data technology, cloud computing has evolved into the mainstream general form of computing for new information systems. Cloud computing is the product of the fusion of a series of network and computing technologies such as parallel computing, distributed computing, utility computing and virtualization. Referring to Fig. 1, Fig. 1 is a schematic diagram of a cloud computing platform architecture provided by the prior art. According to the level of service provided, a cloud computing platform is commonly divided into three service models: software as a service (SaaS), platform as a service (PaaS) and infrastructure as a service (IaaS). PaaS and IaaS can provide services to platform users directly through a service-oriented architecture (SOA) or a network server, and can also serve end users indirectly as the supporting platform of the SaaS model. Specifically:
For the standalone IaaS (I layer) service model, the I layer provides virtual machines and the operating system (OS) on them, together with server virtualization computing, virtual storage and virtual network resources. Users generally care about the virtual machine type and its configuration (CPU, memory, disk, network and so on); the middleware, runtime and applications above the virtual machine's operating system are all deployed by the users themselves. The service IaaS provides to the consumer is the use of all facilities, including processing, storage, networks and other basic computing resources, on which the user can deploy and run arbitrary software. The consumer does not manage or control any cloud computing infrastructure, but can control the choice of operating system, storage space and deployed applications, and may also obtain limited control of networking components.
For the standalone PaaS (P layer) service model, the service the P layer provides to the user is to deploy applications, built with the development languages and tools the user requires, onto the cloud computing infrastructure, and to provide the user with the running environment of the application software, middleware services, life cycle management and so on. The customer does not need to manage or control the underlying cloud infrastructure, including the network, servers, operating system, storage, middleware and runtime, but can monitor the deployed applications (application programs, application systems and so on) and may also control the configuration of the hosting environment that runs them. Users often focus only on developing application software and on deploying the related data and applications in the PaaS.
For the standalone SaaS (S layer) service model, the S layer provides the user with application (application program, application system) services that run on the cloud computing infrastructure. The user can access them from various devices through a client interface, such as a browser, and thus use the application services provided by the S layer directly. The user does not need to manage or control any cloud computing infrastructure, including the network, servers, operating system, storage, development environment and applications.
In the prior art, computing resources on a cloud computing platform are quite likely to fail. This requires that application systems running on the cloud have high availability (HA). High availability means that a system is highly reliable, that is, it seldom fails or can be restored quickly after it fails. In other words, when a computing resource fails or an application system suffers another fault on the cloud computing platform, a corresponding HA mechanism must ensure that the application system recovers as soon as possible and shorten the downtime caused by planned routine maintenance or unplanned system crashes, so as to avoid service interruption and improve the availability of the application system. A cloud management platform is generally used to provide the HA scheme for a cloud computing platform. Currently the most popular cloud management platform is OpenStack, an open source project that aims to provide software for building and managing public and private clouds. Organizations and individuals in the OpenStack community use OpenStack as a general front end for infrastructure as a service (IaaS) resources. OpenStack has become the de facto deployment standard for IaaS cloud platforms in industry and academia and is widely used across industries. The primary task of OpenStack is to simplify cloud deployment and give it good scalability. From the point of view of OpenStack, IaaS is the supporting infrastructure of cloud computing on a cloud computing platform: IaaS provides elastic, scalable infrastructure services and can provide large-scale, on-demand computing, storage and network services to upper-layer applications. The network service of an IaaS cloud platform, as its core service, is the key factor affecting the service quality of all kinds of cloud applications. OpenStack is therefore deployed on the physical computing, storage and network resources at the bottom of the cloud platform to manage those resources uniformly and provide unified IaaS-layer cloud infrastructure services.
OpenStack proposes a virtual machine (Virtual Machine, VM) HA solution. The OpenStack HA scheme is dedicated to failure monitoring and fault recovery at the infrastructure layer, and mainly includes: (1) Monitoring: detect virtualization-layer failures and monitor compute node failures; (2) Fencing: isolate failed compute nodes; (3) Recovery: recover the failed virtual machines.
As can be seen from the above description, the existing OpenStack architecture is designed to handle only IaaS-layer failures, and realizes HA for the cloud computing platform by ensuring HA at the IaaS layer. As for the upper-layer operating system or application, OpenStack considers that their failures should be solved by the application itself, so the existing OpenStack HA scheme cannot detect failures of the upper-layer operating system or application. In fact, up to now the OpenStack community has not had a complete virtual machine HA solution.
Moreover, after an enterprise migrates a stovepipe-style legacy system to a cloud computing platform, the architectural characteristics of the legacy system make it difficult to adapt to the cloud computing platform, and failures related to the legacy system may occur at the infrastructure layer, the OS layer, and the application layer, so the OpenStack HA scheme cannot fully solve the HA problem of the application. In addition, the OpenStack HA scheme obtains fault information only from the infrastructure layer and does not combine it with the state of the application for comprehensive analysis, so it is prone to misjudgment, which can introduce new failures. Furthermore, the OpenStack HA scheme requires the IaaS to be capable of automatic fault recovery. However, the VM HA capabilities of different IaaS management platforms are not consistent, and some IaaS management platforms lack automatic fault recovery altogether. The OpenStack HA scheme is therefore not universal across IaaS management platforms.
To overcome these shortcomings of the prior art, embodiments of the present invention provide a fault recovery method for a cloud computing server, a related device, and a management system. They establish a multi-level, comprehensive fault detection and self-handling mechanism covering the IaaS layer, the OS layer, and the application layer, solve the problem of how an enterprise's legacy system can run reliably after being migrated to a cloud computing platform, and ensure the reliability of the application (the legacy system) to the greatest possible extent.
Referring to Fig. 9, Fig. 9 shows a management system provided in an embodiment of the present invention. The management system is formed by connecting a SaaS service platform (hereinafter SaaS), a PaaS management platform (hereinafter PaaS), and an IaaS management platform (hereinafter IaaS). It can provide corresponding management services for the different layers (the I layer, P layer, and S layer). In a specific implementation, the IaaS management platform, PaaS management platform, and SaaS service platform may run on separate servers or on the same server.
Specifically, the IaaS management platform may be a cloud computing infrastructure platform for private, public, or hybrid IaaS cloud users. It can centrally manage very large numbers of servers, storage, and network resources, forming a cloud computing resource pool that can be managed and scheduled in a unified way, and it provides users with computing capacity that is used on demand and scheduled flexibly. The IaaS management platform can integrate multiple virtualization technologies at the same time and provides unified resource management, scheduling, and monitoring of I-layer hardware resources. It manages the life cycle of virtual machines, storage resources, and network resources from creation through monitoring to destruction, and offers I-layer virtual machines a series of high-availability guarantees such as rapid creation, elastic scaling, local rebuild, and live migration, but it does not provide operating-system support for the virtual machines.
Specifically, the PaaS management platform can be built on top of the IaaS management platform; that is, the IaaS management platform manages the local hardware resources directly. When the PaaS management platform needs to obtain or use information about local hardware resources, it can request or call the IaaS management platform directly. In addition, the PaaS management platform provides end-to-end monitoring and management of applications and related resources, routes request instructions to valid application instances, and relies on components such as the agent application, cloud controller, and health manager to manage and monitor information such as the state and operating parameters of operating systems, applications, and related services.
Specifically, in embodiments of the present invention, the SaaS service platform can be built on the infrastructure formed by the PaaS management platform and the IaaS management platform. The SaaS service platform focuses only on providing application services to cloud service operators or enterprises; the applications and application systems in the SaaS service platform are managed, monitored, and controlled by the PaaS management platform.
The management system provides an HA scheme for the cloud computing platform (cloud computing server). PaaS is the core of the HA scheme: it has a global view of information and coordinates fault detection, analysis, strategy selection, and recovery.
First, when a failure occurs at some layer of the cloud computing server, the failure is not handled directly at that layer but is uniformly aggregated to PaaS for comprehensive analysis and judgment. Fault detection covers the hardware resource level, the operating system level, and the application level of the cloud computing server, so the HA scheme provided in embodiments of the present invention is comprehensive.
Second, failures of the I, P, and S layers are uniformly aggregated to PaaS for handling. PaaS combines the states of the hardware resources, operating system, and application to perform root cause analysis and accurately determine the root cause of the failure, thereby preventing false alarms and missed alarms, so the HA scheme provided in embodiments of the present invention is accurate.
Third, PaaS initiates fault recovery based on a fault handling strategy and applies different recovery measures to different root causes. The capability and efficiency of fault recovery are therefore guaranteed by PaaS; that is, the HA scheme provided in embodiments of the present invention is not constrained by the HA capability of the IaaS. Whatever the HA capability of the IaaS, the reliability of the applications running on the cloud computing server can be ensured, so the HA scheme provided in embodiments of the present invention is universal.
An embodiment of the present invention also provides a fault recovery method for a cloud computing server. Referring to Fig. 2, the method includes:
Step S101: obtain the hardware resource fault information, operating system fault information, and application fault information of the cloud computing server.
In embodiments of the present invention, PaaS is designed as the fault management core of the entire cloud computing platform (cloud computing server). PaaS sits between IaaS and SaaS; it can collect the cloud service data that PaaS itself manages as well as the data submitted by IaaS and SaaS. The PaaS is independent of the cloud computing server.
In embodiments of the present invention, PaaS obtains the fault information of the cloud computing server. The fault information specifically includes hardware resource fault information, operating system fault information, and application fault information. Hardware resource fault information indicates a failure at the hardware resource level, such as insufficient storage resources, a network anomaly, or a virtual machine running fault. Operating system fault information indicates a failure at the operating system (OS) level, such as an abnormal operating system login or a system hang. Application fault information indicates a failure of an application, such as a stopped application or an abnormal application system.
Specifically, PaaS obtains the fault information of the cloud computing server by performing the following steps S201-S203:
Step S201: obtain the hardware resource fault information sent by the Infrastructure as a Service (IaaS) system.
In embodiments of the present invention, IaaS manages hardware resources, including computing, storage, and network resources, and also detects failures arising in the hardware resources of the cloud computing server. The IaaS is independent of the cloud computing server.
In embodiments of the present invention, IaaS can monitor local hardware resources in real time and dynamically display the running state of computing resources, storage resources, network resources, and the associated virtual machines. Specifically, IaaS can perform resource capacity queries, resource usage control, VM running-state monitoring, fault alarming, and so on, and report the relevant information to PaaS.
In a specific embodiment, when a virtual machine (VM) running on the cloud computing server fails, or when a related hardware component (CPU, memory, disk, network, etc.) of the cloud computing server fails, IaaS detects the failure in real time, generates the corresponding hardware resource fault information, and sends it to PaaS. PaaS thereby obtains the hardware resource fault information.
Step S202: obtain the operating system fault information of the cloud computing server.
PaaS manages the running environment and middleware services of application software and provides life cycle management for the applications that need to run, so PaaS can obtain state information about the operating systems on which middleware and applications depend. For an operating system running on a virtual machine, when a failure such as a system disconnection or system crash occurs, PaaS can obtain the relevant operating system fault information. In a specific implementation, PaaS can deploy an agent (Agent) in the operating system (OS) to be monitored; PaaS communicates with the Agent and judges the running state of the OS where the Agent resides by checking the communication quality.
Step S203: obtain the application fault information of the cloud computing server.
In embodiments of the present invention, SaaS focuses only on providing application services (application software, applications, application systems, etc.) and does not directly manage or monitor them; the role of managing and monitoring the applications is actually played by PaaS. When an application in the SaaS fails, PaaS detects the failure in real time and obtains the corresponding application fault information.
It should be noted that steps S201, S202, and S203 need not be performed in any particular order. In a specific embodiment, any two of steps S201, S202, and S203 may be performed simultaneously, or all three may be performed simultaneously. The description of the above embodiment should not be construed as limiting the present invention.
Step S102: determine the root cause of the failure of the cloud server according to the hardware resource fault information, the operating system fault information, and the application fault information.
After PaaS obtains a piece of fault information, it identifies the source of the fault information, sets a timer for it, and continues to check whether other fault information arrives within a preset time (for example, 3 minutes). When the preset time of the timer expires, PaaS performs a comprehensive analysis of all the fault information obtained within the preset time to determine the root cause of the cloud server failure, that is, the specific reason for the failure and its specific location.
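The timer behaviour described above can be sketched as a small collection window. This is only an illustration under stated assumptions: the class name `FaultWindow`, its methods, and the source labels are hypothetical and do not come from the embodiment; timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

WINDOW_SECONDS = 180.0  # the "3 minutes" example given in the text


@dataclass
class FaultWindow:
    """Hypothetical helper: gathers fault sources reported within one preset time."""
    window: float = WINDOW_SECONDS
    opened_at: Optional[float] = None
    sources: Set[str] = field(default_factory=set)

    def report(self, source: str, now: float) -> None:
        # The first fault message starts the timer; later ones only join the window.
        if self.opened_at is None:
            self.opened_at = now
        self.sources.add(source)

    def close_if_expired(self, now: float) -> Optional[Set[str]]:
        # Returns everything collected once the preset time has elapsed, else None.
        if self.opened_at is None or now - self.opened_at < self.window:
            return None
        collected, self.sources, self.opened_at = self.sources, set(), None
        return collected
```

A caller would poll `close_if_expired` and hand the returned set to whatever root cause analysis it uses.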
Referring to Table 1, Table 1 shows, for a specific embodiment of the present invention, the correspondence between the fault information PaaS obtains within the preset time and the root cause determined by PaaS.
Table 1
Note: √ indicates a healthy state, × indicates that a failure was detected, and NA indicates that no detection information is available.
It can be seen that the result of PaaS fault detection falls into three cases: healthy state, failure detected, and no detection information reported. The root cause analysis covers the following situations:
When PaaS detects hardware resource fault information and operating system fault information within the preset time, PaaS determines that the root cause is a failure of the hardware resource layer of the cloud computing server;
When PaaS detects hardware resource fault information and application fault information within the preset time, PaaS determines that the root cause is a failure of the hardware resource layer of the cloud computing server;
When PaaS detects operating system fault information and application fault information within the preset time but does not detect hardware resource fault information, PaaS determines that the root cause is a failure of the operating system (OS) of the cloud computing server; or
When PaaS detects only application fault information within the preset time, without hardware resource fault information or operating system fault information, PaaS determines that the root cause is a failure of the application layer of the cloud computing server.
When PaaS detects only operating system fault information, or only hardware resource fault information, or only hardware resource fault information and application fault information, and so on, PaaS judges that the fault information occurring within the preset time is a false alarm of the cloud computing platform (cloud computing server), and in these cases PaaS ignores the fault information.
Step S103: determine the fault handling strategy according to the root cause.
After the root cause has been determined in step S102, PaaS determines the corresponding fault handling strategy according to the root cause.
Referring to Table 2, Table 2 shows some correspondences between root causes and fault handling strategies in embodiments of the present invention.
Table 2
It can be seen that, in one specific application scenario, when the root cause is a hardware resource failure, the fault handling strategy includes at least restarting (reboot) the virtual machine, locally rebuilding (rebuild) the virtual machine, and migrating (migration) the virtual machine.
In one specific application scenario, when the root cause is an operating system failure, the fault handling strategy includes at least restarting the virtual machine. In that case, when the virtual machine restarts it loads the corresponding operating system, which completes the restart of the operating system. As a special case, if the operating system can be restarted without restarting the virtual machine, the fault handling strategy is to restart the operating system directly.
In one specific application scenario, when the root cause is an application layer failure, the fault handling strategy includes at least restarting the application and restarting the virtual machine. Restarting the application means directly restarting the relevant application. Restarting the virtual machine means first restarting the virtual machine where the application resides; the virtual machine loads the corresponding operating system, and the application is then run again in that operating system. As a special case, if the operating system can be restarted without restarting the virtual machine, the fault handling strategy is to restart the operating system directly and then run the application again in it.
Step S104: perform fault recovery according to the operation indicated by the fault handling strategy.
It should be understood that once the fault handling strategy has been determined, PaaS can recover from the failure based on the operation the strategy indicates.
In embodiments of the present invention, IaaS is responsible for managing and controlling hardware resources, including adjusting virtual machine CPU and memory, expanding disks, and restarting, locally rebuilding, and live-migrating virtual machines, so as to guarantee virtual machine service continuity to the greatest extent and reduce or even eliminate the service impact of virtual machine failures.
Therefore, when the operation indicated by the fault handling strategy targets the hardware resource layer, PaaS sends an instruction to IaaS that contains the operation indicated by the strategy, and IaaS executes the instruction to recover from the failure.
For example, when the fault handling strategy is to restart the virtual machine, PaaS calls the IaaS interface to restart the virtual machine and then checks the virtual machine state to judge whether the failure has been recovered.
When the fault handling strategy is to rebuild the virtual machine locally, and PaaS determines that the system disk of the virtual machine on the cloud computing server is a shared disk, PaaS calls the IaaS interface to rebuild the virtual machine locally on the shared disk and then checks the task status of the virtual machine to judge whether the failure has been recovered.
When the fault handling strategy is to migrate the virtual machine, PaaS calls the IaaS interface to migrate the virtual machine on the failed cloud computing server to another host.
When the fault handling strategy is to restart the operating system, PaaS calls the IaaS interface to restart the virtual machine, and the virtual machine loads the corresponding operating system after restarting.
In special cases, if the operating system can be restarted without restarting the virtual machine, PaaS calls the IaaS interface to restart the operating system directly.
As can be seen, by implementing embodiments of the present invention, after an enterprise migrates an application (such as an application program, application system, IT system, or legacy system) to a cloud computing server of a cloud computing platform, PaaS can monitor hardware resource layer failures through IaaS and can monitor the running state of the operating system and of the application through agent applications. When PaaS obtains fault information, it continues to collect other fault information within the preset time; after the preset time it performs a comprehensive analysis of all the collected fault information, determines the root cause of the failure, determines a specific fault handling strategy based on the root cause, and then calls IaaS or the agent application to perform the corresponding fault recovery. This ensures the high availability of the application on the cloud computing platform, and the HA scheme of embodiments of the present invention has the characteristics of comprehensiveness, accuracy, and universality.
Referring to Figs. 3 to 6 together, Fig. 3 shows another fault recovery method for a cloud computing server provided in an embodiment of the present invention. The method includes but is not limited to the following steps:
Step S301: IaaS detects hardware resource fault information and sends the hardware resource fault information to PaaS.
In a specific embodiment, referring to Fig. 4, IaaS monitors the hardware resources of the cloud computing server to determine whether the I layer of the cloud computing server has failed; when a hardware resource fails, IaaS reports hardware resource fault information to PaaS. For example, after an application (application program, application system, enterprise IT system, etc.) is deployed on a virtual machine allocated by IaaS, an administrator registers information such as the tenant, user account, and virtual machine address of the IaaS with PaaS as needed, so that PaaS can provide a high-availability configuration for the IaaS. For example, when the application is an enterprise's stovepipe-style legacy system, the administrator registers the information about the enterprise's legacy system on the IaaS with PaaS so that the legacy system obtains high availability. After PaaS recognizes this information, it automatically subscribes with IaaS to fault alarms for the cloud computing server (host) and virtual machine where the legacy system resides. IaaS monitors the running state of the virtual machine and host in real time; after detecting a failure, IaaS generates the corresponding hardware resource fault information (such as VM running-state anomaly information) and reports it to PaaS for subsequent fault handling.
Step S302: PaaS obtains operating system fault information by detecting the heartbeat of a first agent application; the heartbeat is used to indicate the operating system fault information.
The term "first agent application" is used in step S302 only to distinguish it from the second agent application in step S303.
As shown in Fig. 4, PaaS installs an agent application (Agent Application) on every virtual machine on which an application is deployed; that is, the first agent application is deployed in the OS layer of the cloud computing server. The agent application maintains a heartbeat with PaaS, and PaaS periodically checks the heartbeat of the first agent application to judge whether the OS layer has failed. When the heartbeat of a first agent application disappears, PaaS sends a heartbeat request; if the first agent application still fails to return a response to the heartbeat request in time, this indicates that PaaS has lost contact with the operating system (or virtual machine) where the first agent application resides, and PaaS generates the corresponding operating system fault information.
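As a rough illustration of the heartbeat check in step S302, the sketch below tracks the last heartbeat per virtual machine and sends one explicit heartbeat request before declaring an operating system fault. The class `HeartbeatMonitor`, the `probe` callback standing in for the heartbeat request, and the fault record fields are all hypothetical names chosen for this example, not part of the embodiment.

```python
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical PaaS-side tracker for first-agent heartbeats."""

    def __init__(self, timeout: float, probe: Callable[[str], bool]):
        self.timeout = timeout      # seconds of silence tolerated before probing
        self.probe = probe          # stands in for the explicit heartbeat request
        self.last_seen: Dict[str, float] = {}

    def beat(self, vm_id: str, now: float) -> None:
        # Called whenever a heartbeat arrives from the agent on vm_id.
        self.last_seen[vm_id] = now

    def check(self, now: float) -> List[dict]:
        faults = []
        for vm_id, seen in list(self.last_seen.items()):
            if now - seen <= self.timeout:
                continue                      # heartbeat is still fresh
            if self.probe(vm_id):
                self.last_seen[vm_id] = now   # agent answered the request in time
                continue
            # No heartbeat and no reply: report an operating system fault.
            faults.append({"vm": vm_id, "type": "os_fault", "reason": "heartbeat lost"})
        return faults
```

Passing the probe in as a callback keeps the sketch independent of any particular transport between PaaS and the agent.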
Step S303: PaaS detects the running state of the application through a second agent application to obtain the application fault information.
Specifically, the application is deployed in the application layer of the cloud computing server and is integrated by SaaS into a cloud service of the application, so that the cloud service can be provided to cloud service operators or enterprises.
The term "second agent application" is used in step S303 only to distinguish it from the agent application in step S302. The second agent application is likewise deployed on all virtual machines, and the applications on a virtual machine node are managed through the agent application. In a specific embodiment, the second agent application and the first agent application may be the same application or different applications; embodiments of the present invention impose no limitation here.
As shown in Fig. 4, the second agent application is likewise deployed in the application layer; it can manage the applications in the cloud computing server and periodically monitor the running state of the applications on the virtual machine. For example, the second agent application monitors the running state through a state-detection script provided by the application. In a specific application scenario, the application dynamically provides a state-detection script while it runs, for example status.sh, which defines return value 1 to indicate that the application is running, return value 2 to indicate that the application has stopped, and return value 3 to indicate that the application is in an abnormal state. status.sh is placed in the installation directory of the application (application system). The second agent application periodically calls status.sh in the installation directory and obtains the return value; it judges the running state of the application from the return value of the script and sends the running state to SaaS. When the return value is 2 or 3, the second agent application generates application fault information and sends it to PaaS, and PaaS thereby obtains the application fault information.
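The status.sh convention above could be consumed by an agent roughly as follows. This sketch assumes that the "return value" is the exit code of the script and that the script is run with `sh`; the function names and the shape of the fault record are hypothetical.

```python
import subprocess
from typing import Optional

# Return values defined for status.sh in the text: 1 running, 2 stopped, 3 abnormal.
STATUS_MEANING = {1: "running", 2: "stopped", 3: "abnormal"}


def interpret_status(return_value: int) -> str:
    """Map a status.sh return value to a state name; other values are 'unknown'."""
    return STATUS_MEANING.get(return_value, "unknown")


def check_application(install_dir: str, timeout: float = 10.0) -> Optional[dict]:
    """Run status.sh from the installation directory (assumed to report via exit code).

    Returns a fault record when the value is 2 or 3, otherwise None.
    """
    result = subprocess.run(["sh", "status.sh"], cwd=install_dir, timeout=timeout)
    state = interpret_status(result.returncode)
    if state in ("stopped", "abnormal"):
        return {"type": "application_fault", "state": state}
    return None
```

An agent would call `check_application` on a fixed schedule and forward any non-empty result as application fault information.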
Step S304: PaaS determines the root cause of the failure of the cloud server according to the hardware resource fault information, the operating system fault information, and the application fault information.
In embodiments of the present invention, PaaS obtains fault information from all layers, so PaaS can combine the fault information of the hardware resource layer, OS layer, and application layer in a comprehensive analysis to obtain the root cause accurately.
Referring to Fig. 5, PaaS defines the initial working state of the cloud computing server as the normal state. When PaaS receives fault information, it sets a preset time T (for example, 2 minutes) starting from the time the fault information was received and continues to collect other fault information within T. After the preset time elapses, PaaS makes a comprehensive judgment based on all the fault information obtained. If the fault information obtained meets a preset condition, PaaS defines the working state of the cloud computing server as the fault state and further determines the root cause. If the fault information obtained does not meet the preset condition, PaaS continues to define the working state of the cloud computing server as the normal state.
As shown in Fig. 5, if after the preset time T the fault information obtained contains both a VM health anomaly and a heartbeat disconnection, PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a failure of the hardware resource layer. If after the preset time T the fault information obtained contains only a VM health anomaly or only a heartbeat disconnection, PaaS discards the fault information and defines the working state of the cloud computing server as the normal state.
Likewise, if after the preset time T the fault information obtained contains both an application anomaly and a heartbeat disconnection, PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a failure of the OS layer. If after the preset time T the fault information obtained contains only an application anomaly or a heartbeat disconnection, PaaS discards the fault information and defines the working state of the cloud computing server as the normal state.
In addition, if after the preset time T the fault information obtained contains only an application anomaly and no other fault information, PaaS defines the working state of the cloud computing server as the fault state, and the root cause is a failure of the application layer.
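The Fig. 5 judgment can be written as a small decision function. This is a sketch under assumptions: the signal names are hypothetical labels for the three kinds of fault information; an application anomaly on its own is treated as an application-layer failure, following the last rule above; and any combination the description does not name is treated as discarded.

```python
from typing import Optional, Set

# Hypothetical labels for the three signals named in the Fig. 5 description.
VM_ABNORMAL = "vm_abnormal"        # VM health anomaly reported by IaaS
HEARTBEAT_LOST = "heartbeat_lost"  # agent heartbeat disconnection
APP_ABNORMAL = "app_abnormal"      # application anomaly reported by the agent


def root_cause(signals: Set[str]) -> Optional[str]:
    """Return 'hardware', 'os', 'application', or None (discard, state stays normal)."""
    if VM_ABNORMAL in signals and HEARTBEAT_LOST in signals:
        return "hardware"
    if APP_ABNORMAL in signals and HEARTBEAT_LOST in signals:
        return "os"
    if signals == {APP_ABNORMAL}:
        return "application"
    # A lone VM anomaly, a lone heartbeat loss, or an unnamed combination is discarded.
    return None
```

The hardware check comes first so that a set containing all three signals is attributed to the lowest layer.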
Step S305: PaaS determines the fault handling strategy according to the root cause.
In one specific application scenario, PaaS determines the specific fault handling strategy based on the type of failure. For example, a fault diagnosis database can be preset in PaaS. The database stores various kinds of fault information, and fault information belonging to the same layer is assigned different fault levels, such as fault level one, fault level two, and fault level three. For example, among the preset fault handling strategies for hardware resource layer failures, the strategy for fault level one is to restart the virtual machine, the strategy for fault level two is to rebuild the virtual machine locally, the strategy for fault level three is to migrate the virtual machine, and so on. After the root cause has been determined, PaaS analyzes the hardware resource layer failure actually obtained, determines its fault level, and determines the corresponding fault handling strategy based on that level.
In another specific application scenario, PaaS assigns different priorities in advance to the different fault handling strategies of the same layer. After determining from the received fault information that the root cause is a failure of a certain layer, PaaS first automatically selects the highest-priority fault handling strategy of that layer as the strategy to execute. If the higher-priority strategy cannot achieve fault recovery, PaaS reselects the strategy with the next lower priority and repeats the above steps.
For example, in one specific application scenario, referring to Fig. 6, the fault handling strategies for the hardware resource layer are, in descending order of priority: restart the virtual machine, rebuild the virtual machine locally, migrate the virtual machine, and report to the network management system. After determining that the root cause is a hardware resource failure, PaaS selects restarting the virtual machine (the highest priority in this layer) as the fault handling strategy. If restarting the virtual machine fails to achieve fault recovery, PaaS reselects local rebuild of the virtual machine as the fault handling strategy. If local rebuild of the virtual machine fails to achieve fault recovery, PaaS reselects migration of the virtual machine as the fault handling strategy. If migration of the virtual machine fails to achieve fault recovery, PaaS reselects reporting to the network management system as the fault handling strategy and terminates the above fault recovery flow after performing the report.
Reporting to the network management system specifically includes: the PaaS generates a fault log based on the fault information and archives the fault log, where the fault log indicates information such as the time and location at which the fault occurred, the fault type, and the fault recovery history. The PaaS reports the fault log to the network management system so that operation and maintenance personnel can discover the fault in time through the network management system and perform manual maintenance.
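As an illustration only, the fault log described above could be modeled as a small record that is archived and then handed to the network management system. The field names, the JSON encoding, the in-memory archive and the sample values are all assumptions made for this sketch; the source names only the kinds of information the log carries.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class FaultLog:
    """Hypothetical fault log record; field names are illustrative."""
    occurred_at: str   # time at which the fault occurred
    location: str      # where the fault occurred, e.g. a virtual machine id
    fault_type: str    # "hardware", "os" or "application"
    recovery_history: List[str] = field(default_factory=list)


def archive_and_report(log: FaultLog, archive: List[str]) -> str:
    """Serialize the log, keep a copy in the archive, and return the payload
    that would be reported to the network management system."""
    payload = json.dumps(asdict(log), sort_keys=True)
    archive.append(payload)
    return payload


if __name__ == "__main__":
    store: List[str] = []
    entry = FaultLog("2017-03-17T08:00:00Z", "vm-01", "hardware",
                     ["restart_vm", "rebuild_vm_locally", "migrate_vm"])
    print(archive_and_report(entry, store))
```

Keeping the recovery history in the log lets maintenance personnel see which automatic strategies were already attempted before the fault was escalated.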
As another example, in another specific application scenario, the troubleshooting strategies for the operating system include at least restarting the virtual machine and reporting to the network management system. When the root cause is an operating system failure, the PaaS selects restarting the virtual machine (the highest priority at this layer) as the troubleshooting strategy. If restarting the virtual machine does not achieve fault recovery, the PaaS reselects the troubleshooting strategy as reporting to the network management system, and ends the above fault recovery flow after performing the reporting operation.
As another example, in another specific application scenario, the troubleshooting strategies for the application are, in descending order of priority: restarting the application, restarting the virtual machine, and reporting to the network management system. When the root cause is an application layer failure, the PaaS selects restarting the application (the highest priority at this layer) as the troubleshooting strategy. If restarting the application does not achieve fault recovery, the PaaS reselects the troubleshooting strategy as restarting the virtual machine. If restarting the virtual machine does not achieve fault recovery, the PaaS reselects the troubleshooting strategy as reporting to the network management system, and ends the above fault recovery flow after performing the reporting operation.
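The priority-based escalation in the three scenarios above can be sketched as follows. This is a minimal sketch, not the patented implementation: the strategy identifiers, the `STRATEGIES` table and the `execute` callback (which performs one strategy and returns whether the fault was recovered) are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

# Strategies per layer, ordered from highest to lowest priority, following
# the three scenarios described above. Identifiers are illustrative.
STRATEGIES: Dict[str, List[str]] = {
    "hardware": ["restart_vm", "rebuild_vm_locally", "migrate_vm", "report_to_nms"],
    "os": ["restart_vm", "report_to_nms"],
    "application": ["restart_app", "restart_vm", "report_to_nms"],
}


def recover(layer: str, execute: Callable[[str], bool]) -> List[str]:
    """Try the strategies of `layer` in priority order and return those attempted.

    `execute` is a hypothetical callback that carries out one strategy and
    returns True if the fault is recovered. Reporting to the network
    management system always ends the flow.
    """
    attempted: List[str] = []
    for strategy in STRATEGIES[layer]:
        attempted.append(strategy)
        recovered = execute(strategy)
        if recovered or strategy == "report_to_nms":
            break
    return attempted


if __name__ == "__main__":
    # Simulate a hardware fault that only migration can fix.
    print(recover("hardware", lambda s: s == "migrate_vm"))
```

Placing the report to the network management system last in every list means manual maintenance is requested only after the automatic options for that layer are exhausted.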
Step S306: the PaaS sends a fault recovery instruction to the IaaS, the fault recovery instruction including the determined troubleshooting strategy, and the IaaS correspondingly executes the operation indicated by the troubleshooting strategy to perform I-layer fault recovery; or the PaaS sends a fault recovery instruction to the second agent application of the SaaS, the fault recovery instruction including the determined troubleshooting strategy, and the second agent application correspondingly executes the operation indicated by the troubleshooting strategy to perform S-layer fault recovery.
Specifically, the PaaS sends a fault recovery instruction to the IaaS, the fault recovery instruction including the determined I-layer troubleshooting strategy. The I-layer fault recovery that the IaaS performs by executing the operation indicated by the troubleshooting strategy includes: restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine, as shown in Fig. 6. After the IaaS performs such an operation, if the PaaS determines that the fault has been recovered, the above operation flow ends. If the PaaS determines that the fault has not been recovered after the IaaS performs the above operations, the PaaS performs the operation of reporting to the network management system.
Specifically, the PaaS sends a fault recovery instruction to the second agent application, the fault recovery instruction including the determined S-layer troubleshooting strategy. The S-layer fault recovery that the second agent application performs by executing the operation indicated by the troubleshooting strategy includes restarting the application. After the second agent application performs this operation, if the PaaS determines that the fault has been recovered, the above operation flow ends. If the PaaS determines that the fault has not been recovered after this operation, the PaaS instructs the second agent application to restart the application after the virtual machine has been restarted; if the fault is still not recovered, the PaaS performs the operation of reporting to the network management system.
It can be seen that, by implementing the embodiment of the present invention, after an enterprise migrates an application to a cloud computing server of the cloud computing platform, the PaaS can monitor failures of the hardware resource layer through the IaaS, and can monitor the running state of the operating system and the running state of the legacy system through the agent applications. When the PaaS obtains a piece of fault information, it continues to obtain other fault information within a preset time. After the preset time, it performs a comprehensive analysis based on all the collected fault information, determines the root cause of the failure, determines a specific troubleshooting strategy based on the root cause, and then calls the IaaS or an agent application to perform the corresponding fault recovery. If none of the troubleshooting strategies achieves fault recovery, the PaaS raises a fault alarm to the network management system for further fault maintenance. This ensures high availability of the legacy system on the cloud computing platform. The high availability (HA) scheme of the embodiment of the present invention is comprehensive, accurate and general.
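The collection step summarized above (wait for a preset time after the first fault message, then analyze everything gathered) might look like the sketch below. The event representation as `(timestamp, layer)` pairs and the function name are assumptions; a real implementation would read from a live message source rather than a list.

```python
from typing import List, Set, Tuple


def collect_window(events: List[Tuple[float, str]], window_s: float) -> Set[str]:
    """Return the fault layers reported within `window_s` seconds of the
    first fault message. `events` holds (timestamp, layer) pairs."""
    if not events:
        return set()
    ordered = sorted(events)      # order by timestamp
    start = ordered[0][0]         # arrival time of the first fault message
    return {layer for ts, layer in ordered if ts - start <= window_s}


if __name__ == "__main__":
    reports = [(100.0, "application"), (101.5, "os"), (110.0, "hardware")]
    # With a 5 second window, the hardware report at t=110 arrives too late.
    print(collect_window(reports, 5.0))
```

Analyzing the whole window rather than the first message alone is what allows a lower-layer fault to be recognized as the cause of the upper-layer symptoms.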
Based on the same inventive concept, an embodiment of the present invention provides a device 70 for implementing fault recovery of a cloud computing server. Referring to Fig. 7, the device 70 (a control node) includes a transmitter 703, a receiver 704, a memory 702, and a processor 701 coupled to the memory 702. The transmitter 703, the receiver 704, the memory 702 and the processor 701 may be connected by a bus or in another manner (connection by a bus is taken as the example in Fig. 7). Wherein:
The processor 701 may be one or more central processing units (Central Processing Unit, CPU); one processor is taken as the example in Fig. 7. When the processor 701 is a single CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory 702 includes, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), an erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), or a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM). The memory 702 is used to store related instructions and data, and is also used to store program code, the program code being specifically used to implement the functions of the control node in the embodiments of Fig. 5 or Fig. 8;
The transmitter 703 is used to send instruction data to the outside;
The receiver 704 is used to receive data from the outside;
Specifically, the processor 701 is used to call the program code stored in the memory 702 and to perform the following steps:
Obtaining, by using the receiver 704, hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform is used to manage the hardware resources of the cloud computing server and is also used to detect the hardware resource fault information of the hardware resources, and the IaaS management platform is independent of the cloud computing server;
Obtaining, by using the receiver 704, operating system fault information of the cloud computing server, where the operating system fault information indicates a failure of the operating system installed on the cloud computing server;
Obtaining, by using the receiver 704, application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
Determining, by the processor 701, the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;
Determining, by the processor 701, a troubleshooting strategy according to the root cause;
Performing fault recovery, by using the transmitter 703, according to the operation indicated by the troubleshooting strategy.
Specifically, a first agent application is also installed on the operating system;
Obtaining the operating system fault information of the cloud computing server by using the receiver 704 includes: determining, by using the receiver 704, the operating system fault information by detecting the heartbeat message of the first agent application, where the heartbeat message indicates whether the operating system has failed.
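A minimal sketch of heartbeat-based detection is given below. It assumes, beyond what the source states, that the operating system is treated as failed when no heartbeat from the first agent application arrives within a timeout; the class name, method names and timeout value are hypothetical.

```python
class HeartbeatMonitor:
    """Tracks the most recent heartbeat from the first agent application."""

    def __init__(self, timeout_s: float, started_at: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen = started_at   # treat the start time as the first beat

    def beat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (in seconds)."""
        self.last_seen = now

    def os_faulty(self, now: float) -> bool:
        """Assumed rule: the operating system is considered failed when the
        last heartbeat is older than the timeout."""
        return now - self.last_seen > self.timeout_s


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_s=5.0, started_at=0.0)
    monitor.beat(now=4.0)
    print(monitor.os_faulty(now=8.0))   # heartbeat is 4 seconds old
    print(monitor.os_faulty(now=12.0))  # heartbeat is 8 seconds old
```

Passing the current time in explicitly keeps the sketch deterministic; a deployed monitor would typically read a monotonic clock instead.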
Specifically, a second agent application is also installed on the operating system;
Obtaining the application fault information of the cloud computing server by using the receiver 704 includes: calling, by using the receiver 704 and through the second agent application, the state detection script of the application, and determining the application fault information according to the return value of the state detection script.
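The following sketch shows one way an agent could run a state detection script and turn its return value into application fault information. The convention that exit code 0 means healthy, the timeout, and the shape of the returned dictionary are assumptions; the source says only that the return value of the script is used.

```python
import subprocess
import sys
from typing import List, Optional


def check_application(command: List[str], timeout_s: float = 10.0) -> Optional[dict]:
    """Run the state detection script given by `command`.

    Returns None when the script exits with 0 (assumed to mean healthy),
    otherwise a small dictionary describing the application fault.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"layer": "application", "reason": "timeout"}
    if result.returncode == 0:
        return None
    return {"layer": "application", "reason": f"exit code {result.returncode}"}


if __name__ == "__main__":
    # A stand-in for a real detection script: a Python one-liner that exits 3.
    print(check_application([sys.executable, "-c", "import sys; sys.exit(3)"]))
```

Treating a timeout as a fault covers the case where the application hangs and the detection script never returns.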
The processor 701 determining the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information includes at least: when both the hardware resource fault information and the operating system fault information are detected within a preset time, the processor 701 determines that the root cause is a hardware resource failure; or when the operating system fault information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, the processor 701 determines that the root cause is an operating system failure; or when only the application fault information is detected within the preset time, the processor 701 determines that the root cause is an application failure.
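The three rules above can be written as a small decision function. This is a sketch under stated assumptions: the layer labels are illustrative, and returning `None` for combinations the text does not cover is a choice made here, not something the source specifies.

```python
from typing import Iterable, Optional

HARDWARE, OS, APPLICATION = "hardware", "os", "application"


def determine_root_cause(detected: Iterable[str]) -> Optional[str]:
    """Map the set of fault types seen within the preset time to a root cause."""
    kinds = set(detected)
    # Hardware and operating system faults both detected: hardware is the cause.
    if HARDWARE in kinds and OS in kinds:
        return HARDWARE
    # Operating system and application faults, but no hardware fault.
    if OS in kinds and APPLICATION in kinds and HARDWARE not in kinds:
        return OS
    # Only an application fault was detected.
    if kinds == {APPLICATION}:
        return APPLICATION
    # Combination not covered by the rules described in the text.
    return None


if __name__ == "__main__":
    print(determine_root_cause(["os", "application"]))
```

The rules attribute the failure to the lowest layer that reported a fault, on the reasoning that a fault at a lower layer usually propagates upward.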
The processor 701 determining the troubleshooting strategy according to the root cause includes:
When the root cause is a hardware resource failure, the troubleshooting strategy includes restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine; or when the root cause is an operating system failure, the troubleshooting strategy includes at least restarting the virtual machine; or when the root cause is an application failure, the troubleshooting strategy includes at least restarting the application and restarting the virtual machine.
Specifically, when the root cause is a hardware resource failure, the troubleshooting strategy including restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine is specifically: when the root cause is a hardware resource failure, the troubleshooting strategy is to restart the virtual machine; when restarting the virtual machine does not recover the hardware resource fault, the troubleshooting strategy is to rebuild the virtual machine locally; when neither restarting the virtual machine nor rebuilding the virtual machine locally recovers the hardware resource fault, the troubleshooting strategy is to migrate the virtual machine.
Specifically, when the root cause is an application failure, the troubleshooting strategy including at least restarting the application and restarting the virtual machine is specifically: when the root cause is an application failure, the troubleshooting strategy is to restart the application; when restarting the application does not recover the application fault, the troubleshooting strategy is to restart the virtual machine.
Specifically, executing the operation indicated by the troubleshooting strategy includes: when the root cause is a hardware resource failure, executing the operation indicated by the troubleshooting strategy includes at least calling the IaaS management platform interface to execute the operation indicated by the corresponding troubleshooting strategy; or when the root cause is an operating system failure, executing the operation indicated by the troubleshooting strategy includes calling the IaaS management platform interface to execute the operation indicated by the corresponding troubleshooting strategy; or when the root cause is an application failure, executing the operation indicated by the troubleshooting strategy includes calling the second agent application to execute the operation indicated by the corresponding troubleshooting strategy.
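The routing described above (hardware resource and operating system faults go to the IaaS management platform interface, application faults go to the second agent application) can be sketched as a dispatcher. The two callables stand in for the real interfaces, whose signatures the source does not give; all names here are hypothetical.

```python
from typing import Callable, Dict

Executor = Callable[[str], bool]   # runs one strategy, returns True on success


def make_dispatcher(iaas_call: Executor, agent_call: Executor) -> Callable[[str, str], bool]:
    """Build a function that routes a strategy to the executor matching the
    root cause, following the mapping described in the text."""
    targets: Dict[str, Executor] = {
        "hardware": iaas_call,      # via the IaaS management platform interface
        "os": iaas_call,            # via the IaaS management platform interface
        "application": agent_call,  # via the second agent application
    }

    def dispatch(root_cause: str, strategy: str) -> bool:
        return targets[root_cause](strategy)

    return dispatch


if __name__ == "__main__":
    dispatch = make_dispatcher(
        lambda s: print(f"IaaS executes {s}") or True,
        lambda s: print(f"second agent executes {s}") or True,
    )
    dispatch("application", "restart_app")
```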
The processor 701 executing the operation indicated by the troubleshooting strategy further includes:
The processor 701 generates a fault log based on the fault information, archives the fault log, and reports the fault log to the network management system by using the transmitter 703, where the fault information includes the hardware resource fault information, the operating system fault information and the application fault information.
It should be noted that, from the detailed description of the foregoing embodiments of Fig. 2 to Fig. 6, those skilled in the art can clearly understand how each functional unit of the device 70 is implemented; for brevity of the specification, details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention provides a device 80 for implementing fault recovery of a cloud computing server. Referring to Fig. 8, the device includes multiple functional modules, each of which is described in detail as follows.
The fault detection module 801 is used to obtain hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform is used to manage the hardware resources of the cloud computing server and is also used to detect the hardware resource fault information of the hardware resources, and the IaaS management platform is independent of the cloud computing server; is also used to obtain operating system fault information of the cloud computing server, where the operating system fault information indicates a failure of the operating system installed on the cloud computing server; and is also used to obtain application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
The failure analysis module 802 is used to determine the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;
The failure strategy module 803 is used to determine a troubleshooting strategy according to the root cause;
The failure recovery module 804 is used to perform fault recovery according to the operation indicated by the troubleshooting strategy.
In a specific embodiment, a first agent application is installed on the operating system; the fault detection module 801 being also used to obtain the operating system fault information of the cloud computing server includes: the fault detection module 801 is also used to determine the operating system fault information by detecting the heartbeat message of the first agent application, where the heartbeat message indicates whether the operating system has failed.
In a specific embodiment, a second agent application is installed on the operating system; the fault detection module 801 being also used to obtain the application fault information of the cloud computing server includes: the fault detection module 801 is also used to call, through the second agent application, the state detection script of the application, and to determine the application fault information according to the return value of the state detection script.
In a specific embodiment, the failure analysis module 802 being used to determine the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information includes at least:
The failure analysis module 802 is used to determine, when both the hardware resource fault information and the operating system fault information are detected within a preset time, that the root cause is a hardware resource failure; or the failure analysis module 802 is used to determine, when the operating system fault information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, that the root cause is an operating system failure; or the failure analysis module 802 is used to determine, when only the application fault information is detected within the preset time, that the root cause is an application failure.
In a specific embodiment, the failure strategy module 803 being used to determine the troubleshooting strategy according to the root cause includes:
When the root cause is a hardware resource failure, the troubleshooting strategy includes restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine; or when the root cause is an operating system failure, the troubleshooting strategy includes at least restarting the virtual machine; or when the root cause is an application failure, the troubleshooting strategy includes at least restarting the application and restarting the virtual machine.
Wherein, when the root cause is a hardware resource failure, the troubleshooting strategy including restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine is specifically:
When the root cause is a hardware resource failure, the troubleshooting strategy is to restart the virtual machine; when restarting the virtual machine does not recover the hardware resource fault, the troubleshooting strategy is to rebuild the virtual machine locally; when neither restarting the virtual machine nor rebuilding the virtual machine locally recovers the hardware resource fault, the troubleshooting strategy is to migrate the virtual machine.
Wherein, when the root cause is an application failure, the troubleshooting strategy including at least restarting the application and restarting the virtual machine is specifically:
When the root cause is an application failure, the troubleshooting strategy is to restart the application; when restarting the application does not recover the application fault, the troubleshooting strategy is to restart the virtual machine.
In a specific embodiment, the failure recovery module 804 being used to perform fault recovery according to the operation indicated by the troubleshooting strategy includes:
When the root cause is a hardware resource failure, executing the operation indicated by the troubleshooting strategy includes at least calling the IaaS management platform interface to execute the operation indicated by the corresponding troubleshooting strategy; or when the root cause is an operating system failure, executing the operation indicated by the troubleshooting strategy includes calling the IaaS management platform interface to execute the operation indicated by the corresponding troubleshooting strategy; or when the root cause is an application failure, executing the operation indicated by the troubleshooting strategy includes calling the second agent application to execute the operation indicated by the corresponding troubleshooting strategy.
In a specific embodiment, the device 80 further includes a fault warning module 805, which is used to generate a fault log based on the fault information, archive the fault log, and report the fault log to the network management system, where the fault information includes the hardware resource fault information, the operating system fault information and the application fault information.
It should be noted that, from the detailed description of the foregoing embodiments of Fig. 2 to Fig. 6, those skilled in the art can clearly understand how each functional unit of the device 80 is implemented; for brevity of the specification, details are not repeated here.
Based on the same inventive concept, an embodiment of the present invention also provides a management system. Referring to Figure 10, the management system includes an IaaS management platform 901, a PaaS management platform 902 and a SaaS service platform 903, where the PaaS management platform 902 includes the fault detection module 801, the failure analysis module 802, the failure strategy module 803 and the failure recovery module 804, and the SaaS service platform 903 includes an agent application 806. The modules of the PaaS management platform 902 are connected to the IaaS management platform 901 through periodic communication interfaces (IF), and the modules of the PaaS management platform 902 are also connected to the SaaS service platform 903 through IF interfaces. The interfaces are described as follows:
Interface name | Interface connection relation |
IF1 | Connects the fault detection module 801 and the failure analysis module 802 |
IF2 | Connects the failure strategy module 803 and the failure analysis module 802 |
IF3 | Connects the failure recovery module 804 and the failure analysis module 802 |
IF4 | Connects the fault detection module 801 and the IaaS management platform 901 |
IF5 | Connects the fault detection module 801 and the agent application 806 |
IF6 | Connects the failure recovery module 804 and the IaaS management platform 901 |
IF7 | Connects the failure recovery module 804 and the agent application 806 |
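For readers who prefer a data view, the interface table above can be expressed as a mapping from interface name to the two endpoints it connects. This is only an illustrative restatement of the table; the dictionary layout and the helper function are assumptions.

```python
from typing import Dict, List, Tuple

# Interface name -> the two components it connects, as listed in the table.
INTERFACES: Dict[str, Tuple[str, str]] = {
    "IF1": ("fault detection module 801", "failure analysis module 802"),
    "IF2": ("failure strategy module 803", "failure analysis module 802"),
    "IF3": ("failure recovery module 804", "failure analysis module 802"),
    "IF4": ("fault detection module 801", "IaaS management platform 901"),
    "IF5": ("fault detection module 801", "agent application 806"),
    "IF6": ("failure recovery module 804", "IaaS management platform 901"),
    "IF7": ("failure recovery module 804", "agent application 806"),
}


def interfaces_of(component: str) -> List[str]:
    """Return the names of all interfaces attached to `component`."""
    return sorted(name for name, ends in INTERFACES.items() if component in ends)


if __name__ == "__main__":
    print(interfaces_of("failure analysis module 802"))
```

Seen this way, the failure analysis module 802 is the hub inside the PaaS management platform, while detection and recovery each have one interface to the IaaS management platform and one to the agent application.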
It should be noted that the functions of each management platform, module and interface in the management system have been described in the foregoing embodiments; for details, refer to the related descriptions of Fig. 2 to Fig. 9, which are not repeated here.
In the above embodiments, implementation may be entirely or partly by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented entirely or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions described in the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line) or a wireless manner (for example, infrared or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive).
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
Claims (19)
1. A fault recovery method for a cloud computing server, characterized in that it is applied to a cloud computing server, the method comprising:
obtaining hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform is used to detect the hardware resource fault information of the hardware resources;
obtaining operating system fault information of the cloud computing server, where the operating system fault information indicates a failure of the operating system installed on the cloud computing server;
obtaining application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
determining the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;
determining a troubleshooting strategy according to the root cause;
performing fault recovery according to the operation indicated by the troubleshooting strategy.
2. The method according to claim 1, characterized in that a first agent application is also installed on the operating system;
obtaining the operating system fault information of the cloud computing server includes:
determining the operating system fault information by detecting the heartbeat message of the first agent application, where the heartbeat message indicates whether the operating system has failed.
3. The method according to claim 1 or 2, characterized in that a second agent application is also installed on the operating system;
obtaining the application fault information of the cloud computing server includes:
calling, through the second agent application, the state detection script of the application, and determining the application fault information according to the return value of the state detection script.
4. The method according to any one of claims 1 to 3, characterized in that determining the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information includes at least:
when both the hardware resource fault information and the operating system fault information are detected within a preset time, determining that the root cause is a hardware resource failure; or
when the operating system fault information and the application fault information are detected within the preset time and the hardware resource fault information is not detected, determining that the root cause is an operating system failure; or
when only the application fault information is detected within the preset time, determining that the root cause is an application failure.
5. The method according to claim 4, characterized in that determining the troubleshooting strategy according to the root cause includes:
when the root cause is a hardware resource failure, the troubleshooting strategy includes one or more of restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine; or
when the root cause is an operating system failure, the troubleshooting strategy includes restarting the virtual machine;
or when the root cause is an application failure, the troubleshooting strategy includes one or both of restarting the application and restarting the virtual machine.
6. The method according to claim 5, characterized in that, when the root cause is a hardware resource failure, the troubleshooting strategy including one or more of restarting the virtual machine, rebuilding the virtual machine locally, and migrating the virtual machine is specifically:
when the root cause is a hardware resource failure, the troubleshooting strategy is to restart the virtual machine;
when restarting the virtual machine does not recover the hardware resource fault, the troubleshooting strategy is to rebuild the virtual machine locally;
when neither restarting the virtual machine nor rebuilding the virtual machine locally recovers the hardware resource fault, the troubleshooting strategy is to migrate the virtual machine.
7. The method according to claim 5, characterized in that, when the root cause is an application failure, the troubleshooting strategy including one or both of restarting the application and restarting the virtual machine is specifically:
when the root cause is an application failure, the troubleshooting strategy is to restart the application;
when restarting the application does not recover the application fault, the troubleshooting strategy is to restart the virtual machine.
8. The method according to claim 5, characterized in that executing the operation indicated by the troubleshooting strategy includes:
when the root cause is a hardware resource failure, executing the operation indicated by the troubleshooting strategy includes at least calling the IaaS management platform interface to execute the operation indicated by the corresponding troubleshooting strategy; or
when the root cause is an operating system failure, executing the operation indicated by the troubleshooting strategy includes calling the IaaS management platform interface to execute the operation indicated by the corresponding troubleshooting strategy; or
when the root cause is an application failure, executing the operation indicated by the troubleshooting strategy includes calling the second agent application to execute the operation indicated by the corresponding troubleshooting strategy.
9. A device for implementing fault recovery of a cloud computing server, characterized by comprising:
a fault detection module, used to obtain hardware resource fault information sent by an Infrastructure as a Service (IaaS) management platform, where the IaaS management platform is used to detect the hardware resource fault information of the hardware resources; also used to obtain operating system fault information of the cloud computing server, where the operating system fault information indicates a failure of the operating system installed on the cloud computing server; and also used to obtain application fault information of the cloud computing server, where the application fault information indicates a failure of an application installed on the operating system;
a failure analysis module, used to determine the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information;
a failure strategy module, used to determine a troubleshooting strategy according to the root cause;
a failure recovery module, used to perform fault recovery according to the operation indicated by the troubleshooting strategy.
10. The device according to claim 9, characterized in that a first agent application is installed in the operating system;
the fault detection module being further configured to obtain the operating system fault information of the cloud computing server comprises:
the fault detection module is further configured to determine the operating system fault information by detecting a heartbeat message of the first agent application, the heartbeat message indicating whether the operating system has failed.
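The heartbeat check described in this claim can be illustrated with a minimal sketch. The claim does not specify how a missing heartbeat is recognized, so the timeout rule, the class name `HeartbeatMonitor`, and the choice to treat "no heartbeat ever received" as a fault are all assumptions made for illustration only.

```python
import time
from typing import Optional


class HeartbeatMonitor:
    """Hypothetical tracker for heartbeats sent by the first agent application."""

    def __init__(self, timeout_s: float) -> None:
        # Assumed rule: the OS is considered failed when no heartbeat
        # has arrived for longer than timeout_s seconds.
        self.timeout_s = timeout_s
        self.last_seen: Optional[float] = None

    def record_heartbeat(self, now: Optional[float] = None) -> None:
        """Store the arrival time of the latest heartbeat message."""
        self.last_seen = time.monotonic() if now is None else now

    def os_fault_detected(self, now: Optional[float] = None) -> bool:
        """Return True when the heartbeat is overdue (or was never seen)."""
        current = time.monotonic() if now is None else now
        if self.last_seen is None:
            return True  # assumption: a never-seen agent counts as a fault
        return current - self.last_seen > self.timeout_s
```

The optional `now` argument exists only so the timing logic can be exercised without waiting; a real detection module would presumably rely on the clock.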
11. The device according to claim 9 or 10, characterized in that a second agent application is installed in the operating system;
the fault detection module being further configured to obtain the application fault information of the cloud computing server comprises:
the fault detection module is further configured to call, through the second agent application, a state detection script of the application, and to determine the application fault information according to the return value of the state detection script.
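As a rough illustration of deriving application fault information from a script's return value, the sketch below runs a state detection command and inspects its exit code. The claim does not define the meaning of particular return values, so the convention used here (zero means healthy, anything else means a fault) and the function name `check_application` are assumptions; a timeout or a script that cannot be started is also treated as a fault.

```python
import subprocess
from typing import Sequence


def check_application(command: Sequence[str], timeout_s: float = 10.0) -> bool:
    """Run a state detection script and return True if a fault is detected.

    Assumed convention: exit code 0 means the application is healthy,
    any other exit code means it has failed.
    """
    try:
        result = subprocess.run(
            command,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        # The script hung or could not be launched: report a fault.
        return True
    return result.returncode != 0
```

In the arrangement the claim describes, a call like this would be made by the second agent application inside the guest operating system, with the result reported back to the fault detection module.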
12. The device according to any one of claims 9 to 11, characterized in that the failure analysis module 802 being configured to determine the root cause of the failure of the cloud computing server according to the obtained hardware resource fault information, operating system fault information and application fault information at least comprises:
the failure analysis module is configured to determine that the root cause of the failure is that the hardware resource has failed, in the case where both the hardware resource fault information and the operating system fault information are detected within a preset time; or
the failure analysis module is configured to determine that the root cause of the failure is that the operating system has failed, in the case where the operating system fault information and the application fault information are detected within a preset time and the hardware resource fault information is not detected; or
the failure analysis module is configured to determine that the root cause of the failure is that the application has failed, in the case where only the application fault information is detected within a preset time.
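The three cases in this claim amount to a small decision table over the fault information seen within the preset time window. The sketch below encodes exactly those cases; the names `RootCause` and `determine_root_cause` are hypothetical, and returning `None` for combinations the claim does not cover is an assumption.

```python
from enum import Enum
from typing import Optional


class RootCause(Enum):
    HARDWARE = "hardware resource failure"
    OPERATING_SYSTEM = "operating system failure"
    APPLICATION = "application failure"


def determine_root_cause(
    hardware_fault: bool, os_fault: bool, app_fault: bool
) -> Optional[RootCause]:
    """Map the fault information seen within the preset time to a root cause."""
    # Hardware and OS fault information both detected: hardware is the root cause.
    if hardware_fault and os_fault:
        return RootCause.HARDWARE
    # OS and application fault information detected, no hardware fault information.
    if os_fault and app_fault and not hardware_fault:
        return RootCause.OPERATING_SYSTEM
    # Only application fault information detected.
    if app_fault and not hardware_fault and not os_fault:
        return RootCause.APPLICATION
    # Combination not covered by the claim (assumption: leave undecided).
    return None
```

The ordering reflects the layering in the claim: a fault in a lower layer is taken to explain fault information reported by the layers above it.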
13. The device according to claim 12, characterized in that the failure strategy module 803 being configured to determine the troubleshooting strategy according to the root cause of the failure comprises:
where the root cause of the failure is that the hardware resource has failed, the troubleshooting strategy comprises one or more of restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine; or
where the root cause of the failure is that the operating system has failed, the troubleshooting strategy at least comprises restarting the virtual machine; or
where the root cause of the failure is that the application has failed, the troubleshooting strategy at least comprises one or both of restarting the application and restarting the virtual machine.
14. The device according to claim 13, characterized in that, where the root cause of the failure is that the hardware resource has failed, the troubleshooting strategy comprising one or more of restarting the virtual machine, locally rebuilding the virtual machine, and migrating the virtual machine is specifically:
where the root cause of the failure is that the hardware resource has failed, the troubleshooting strategy is to restart the virtual machine;
where restarting the virtual machine fails to recover from the hardware resource fault, the troubleshooting strategy is to locally rebuild the virtual machine;
where neither restarting the virtual machine nor locally rebuilding the virtual machine recovers from the hardware resource fault, the troubleshooting strategy is to migrate the virtual machine.
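The escalation order in this claim (restart, then local rebuild, then migration) can be sketched as an ordered list of recovery steps that stops at the first success. The helper `run_escalation` and the idea that each step reports success as a boolean are assumptions for illustration; the claim does not say how recovery is verified.

```python
from typing import Callable, Optional, Sequence, Tuple

# A step is a label plus a callable that returns True when recovery succeeded.
Step = Tuple[str, Callable[[], bool]]


def run_escalation(steps: Sequence[Step]) -> Optional[str]:
    """Try each recovery step in order and return the name of the first
    one that succeeds, or None if every step fails."""
    for name, action in steps:
        if action():
            return name
    return None
```

For a hardware resource fault the list would hold the restart, local rebuild and migration operations in that order; later steps run only when the earlier ones fail to recover the fault.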
15. The device according to claim 13 or 14, characterized in that, where the root cause of the failure is that the application has failed, the troubleshooting strategy at least comprising one or both of restarting the application and restarting the virtual machine is specifically:
where the root cause of the failure is that the application has failed, the troubleshooting strategy is to restart the application;
where restarting the application fails to recover from the application fault, the troubleshooting strategy is to restart the virtual machine.
16. The device according to claim 15, characterized in that the failure recovery module 804 being configured to carry out fault recovery according to the operation indicated by the troubleshooting strategy comprises:
where the root cause of the failure is that the hardware resource has failed, executing the operation indicated by the troubleshooting strategy at least comprises: calling an interface of the IaaS management platform to execute the operation indicated by the corresponding troubleshooting strategy; or
where the root cause of the failure is that the operating system has failed, executing the operation indicated by the troubleshooting strategy comprises: calling an interface of the IaaS management platform to execute the operation indicated by the corresponding troubleshooting strategy; or
where the root cause of the failure is that the application has failed, executing the operation indicated by the troubleshooting strategy comprises: invoking the second agent application to execute the operation indicated by the corresponding troubleshooting strategy.
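The routing in this claim, where hardware and operating system faults go through the IaaS management platform interface and application faults go through the second agent application, can be sketched as a simple dispatch. The function `execute_recovery`, the string labels for the root cause, and the two callables standing in for the IaaS interface and the agent are all hypothetical; no real platform API is implied.

```python
from typing import Callable


def execute_recovery(
    root_cause: str,
    operation: str,
    iaas_call: Callable[[str], bool],
    agent_call: Callable[[str], bool],
) -> bool:
    """Route a recovery operation to the channel named in the claim.

    root_cause is one of "hardware", "operating_system" or "application"
    (labels assumed for this sketch).
    """
    if root_cause in ("hardware", "operating_system"):
        # Hardware and OS faults are handled through the IaaS platform interface.
        return iaas_call(operation)
    if root_cause == "application":
        # Application faults are handled by the second agent application.
        return agent_call(operation)
    raise ValueError(f"unknown root cause: {root_cause!r}")
```

Passing the two channels in as callables keeps the routing rule separate from whatever interface a given IaaS platform or agent actually exposes.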
17. A device for implementing fault recovery of a cloud computing server, characterized by comprising: a memory, and a processor, a transmitter and a receiver coupled to the memory, wherein: the transmitter is configured to send instruction data externally, the receiver is configured to receive externally sent data, the memory is configured to store program code and related data, and the processor is configured to execute the program code stored in the memory so as to perform a fault recovery method of a cloud computing server, wherein the method is the method according to any one of claims 1 to 8.
18. A management system, comprising an IaaS management platform, a PaaS management platform and a SaaS service platform, wherein the PaaS management platform comprises a fault detection module, a failure analysis module, a failure strategy module and a failure recovery module, the SaaS service platform comprises an agent application, and the PaaS management platform is connected to the IaaS management platform and the SaaS service platform through a periodic communication interface. The management system is used to implement the method according to any one of claims 1 to 8.
19. A computer-readable storage medium, characterized by comprising instructions which, when run on a computer, cause the computer to execute the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710160761.7A CN108632057A (en) | 2017-03-17 | 2017-03-17 | A kind of fault recovery method of cloud computing server, device and management system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108632057A true CN108632057A (en) | 2018-10-09 |
Family
ID=63687046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710160761.7A Pending CN108632057A (en) | 2017-03-17 | 2017-03-17 | A kind of fault recovery method of cloud computing server, device and management system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108632057A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111092855A (en) * | 2019-11-14 | 2020-05-01 | 山东中创软件商用中间件股份有限公司 | Server operation and maintenance system, method and device and computer readable storage medium |
CN111309515A (en) * | 2018-12-11 | 2020-06-19 | 华为技术有限公司 | Disaster recovery control method, device and system |
CN111355605A (en) * | 2019-10-18 | 2020-06-30 | 烽火通信科技股份有限公司 | Virtual machine fault recovery method and server of cloud platform |
CN111786827A (en) * | 2020-06-29 | 2020-10-16 | 中国工商银行股份有限公司 | Fault association positioning alarm method and device for distributed cloud computing environment |
CN111970147A (en) * | 2020-07-29 | 2020-11-20 | 苏州浪潮智能科技有限公司 | Method for processing large-scale host faults of cloud platform |
US10887382B2 (en) | 2018-12-18 | 2021-01-05 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
CN112256498A (en) * | 2020-11-17 | 2021-01-22 | 珠海大横琴科技发展有限公司 | Fault processing method and device |
CN112350862A (en) * | 2020-10-30 | 2021-02-09 | 广州市汇聚支付电子科技有限公司 | Monitoring alarm and fault self-healing system |
CN112398668A (en) * | 2019-08-14 | 2021-02-23 | 北京东土科技股份有限公司 | IaaS cluster-based cloud platform and node switching method |
US10958720B2 (en) | 2018-12-18 | 2021-03-23 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud based disaster recovery |
CN112543126A (en) * | 2020-12-22 | 2021-03-23 | 武汉联影医疗科技有限公司 | Cloud platform monitoring method and device, computer equipment and storage medium |
US10983886B2 (en) | 2018-12-18 | 2021-04-20 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
CN112799910A (en) * | 2021-01-26 | 2021-05-14 | 中国工商银行股份有限公司 | Hierarchical monitoring method and device |
CN113438122A (en) * | 2021-05-14 | 2021-09-24 | 济南浪潮数据技术有限公司 | Heartbeat management method and device for server, computer equipment and medium |
US11178221B2 (en) | 2018-12-18 | 2021-11-16 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
US11176002B2 (en) | 2018-12-18 | 2021-11-16 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
CN113890903A (en) * | 2021-09-27 | 2022-01-04 | 中信科移动通信技术股份有限公司 | Alarm information management system and method |
US11252019B2 (en) | 2018-12-18 | 2022-02-15 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
CN114095964A (en) * | 2021-11-19 | 2022-02-25 | 中国联合网络通信集团有限公司 | Fault recovery method and device and computer readable storage medium |
US11489730B2 (en) | 2018-12-18 | 2022-11-01 | Storage Engine, Inc. | Methods, apparatuses and systems for configuring a network environment for a server |
CN115665036A (en) * | 2022-10-14 | 2023-01-31 | 郑州浪潮数据技术有限公司 | Routing strategy fault processing method, device and medium |
EP4254191A1 (en) * | 2022-03-28 | 2023-10-04 | Nuctech Company Limited | Method and apparatus of implementing high availability of cluster virtual machine |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103167004A (en) * | 2011-12-15 | 2013-06-19 | 中国移动通信集团上海有限公司 | Cloud platform host system fault correcting method and cloud platform front control server |
CN104394194A (en) * | 2014-10-31 | 2015-03-04 | 北京思特奇信息技术股份有限公司 | Cloud system operation and maintenance monitoring method and system based on platform-as-a-service (PaaS) platform |
CN104486406A (en) * | 2014-12-15 | 2015-04-01 | 浪潮电子信息产业股份有限公司 | Layered resource monitoring method based on cloud data center |
CN106130809A (en) * | 2016-09-07 | 2016-11-16 | 东南大学 | A kind of IaaS cloud platform network failure locating method based on log analysis and system |
US9516112B1 (en) * | 2012-06-29 | 2016-12-06 | EMC IP Holding Company LLC | Sending alerts from cloud computing systems |
-
2017
- 2017-03-17 CN CN201710160761.7A patent/CN108632057A/en active Pending
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111309515A (en) * | 2018-12-11 | 2020-06-19 | 华为技术有限公司 | Disaster recovery control method, device and system |
CN111309515B (en) * | 2018-12-11 | 2023-11-28 | 华为技术有限公司 | Disaster recovery control method, device and system |
US10958720B2 (en) | 2018-12-18 | 2021-03-23 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud based disaster recovery |
US11489730B2 (en) | 2018-12-18 | 2022-11-01 | Storage Engine, Inc. | Methods, apparatuses and systems for configuring a network environment for a server |
US11252019B2 (en) | 2018-12-18 | 2022-02-15 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
US10887382B2 (en) | 2018-12-18 | 2021-01-05 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
US11176002B2 (en) | 2018-12-18 | 2021-11-16 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
US11178221B2 (en) | 2018-12-18 | 2021-11-16 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
US10983886B2 (en) | 2018-12-18 | 2021-04-20 | Storage Engine, Inc. | Methods, apparatuses and systems for cloud-based disaster recovery |
CN112398668B (en) * | 2019-08-14 | 2022-08-23 | 北京东土科技股份有限公司 | IaaS cluster-based cloud platform and node switching method |
CN112398668A (en) * | 2019-08-14 | 2021-02-23 | 北京东土科技股份有限公司 | IaaS cluster-based cloud platform and node switching method |
CN111355605A (en) * | 2019-10-18 | 2020-06-30 | 烽火通信科技股份有限公司 | Virtual machine fault recovery method and server of cloud platform |
CN111092855A (en) * | 2019-11-14 | 2020-05-01 | 山东中创软件商用中间件股份有限公司 | Server operation and maintenance system, method and device and computer readable storage medium |
CN111786827A (en) * | 2020-06-29 | 2020-10-16 | 中国工商银行股份有限公司 | Fault association positioning alarm method and device for distributed cloud computing environment |
CN111970147A (en) * | 2020-07-29 | 2020-11-20 | 苏州浪潮智能科技有限公司 | Method for processing large-scale host faults of cloud platform |
US11881984B2 (en) | 2020-07-29 | 2024-01-23 | Inspur Suzhou Intelligent Technology Co., Ltd. | Method for handling large-scale host failures on cloud platform |
CN111970147B (en) * | 2020-07-29 | 2022-05-06 | 苏州浪潮智能科技有限公司 | Method for processing large-scale host faults of cloud platform |
CN112350862A (en) * | 2020-10-30 | 2021-02-09 | 广州市汇聚支付电子科技有限公司 | Monitoring alarm and fault self-healing system |
CN112256498A (en) * | 2020-11-17 | 2021-01-22 | 珠海大横琴科技发展有限公司 | Fault processing method and device |
CN112543126A (en) * | 2020-12-22 | 2021-03-23 | 武汉联影医疗科技有限公司 | Cloud platform monitoring method and device, computer equipment and storage medium |
CN112799910A (en) * | 2021-01-26 | 2021-05-14 | 中国工商银行股份有限公司 | Hierarchical monitoring method and device |
CN113438122B (en) * | 2021-05-14 | 2022-05-17 | 济南浪潮数据技术有限公司 | Heartbeat management method and device for server, computer equipment and medium |
CN113438122A (en) * | 2021-05-14 | 2021-09-24 | 济南浪潮数据技术有限公司 | Heartbeat management method and device for server, computer equipment and medium |
CN113890903A (en) * | 2021-09-27 | 2022-01-04 | 中信科移动通信技术股份有限公司 | Alarm information management system and method |
CN114095964A (en) * | 2021-11-19 | 2022-02-25 | 中国联合网络通信集团有限公司 | Fault recovery method and device and computer readable storage medium |
CN114095964B (en) * | 2021-11-19 | 2023-05-26 | 中国联合网络通信集团有限公司 | Fault recovery method and device and computer readable storage medium |
EP4254191A1 (en) * | 2022-03-28 | 2023-10-04 | Nuctech Company Limited | Method and apparatus of implementing high availability of cluster virtual machine |
CN115665036A (en) * | 2022-10-14 | 2023-01-31 | 郑州浪潮数据技术有限公司 | Routing strategy fault processing method, device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108632057A (en) | A kind of fault recovery method of cloud computing server, device and management system | |
US9740546B2 (en) | Coordinating fault recovery in a distributed system | |
EP2710484B1 (en) | Cross-cloud management and troubleshooting | |
CN102346460B (en) | Transaction-based service control system and method | |
CN102231681B (en) | High availability cluster computer system and fault treatment method thereof | |
CN105659562B (en) | It is a kind of for hold barrier method and data processing system and include for holds hinder computer usable code storage equipment | |
CN105095001B (en) | Virtual machine abnormal restoring method under distributed environment | |
CN108270726B (en) | Application instance deployment method and device | |
US20080307258A1 (en) | Distributed Job Manager Recovery | |
CN104408071A (en) | Distributive database high-availability method and system based on cluster manager | |
AU2012259086A1 (en) | Cross-cloud management and troubleshooting | |
CN106559441B (en) | Virtual machine monitoring method, device and system based on cloud computing service | |
CN104516789A (en) | Method and system for failover detection and treatment in checkpoint systems | |
CN112948063B (en) | Cloud platform creation method and device, cloud platform and cloud platform implementation system | |
CN110445662A (en) | OpenStack control node is adaptively switched to the method and device of calculate node | |
Melo et al. | Comparative analysis of migration-based rejuvenation schedules on cloud availability | |
CN101442437A (en) | Method, system and equipment for implementing high availability | |
JP2014048933A (en) | Plant monitoring system, plant monitoring method, and plant monitoring program | |
CN116192885A (en) | High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system | |
Mathews et al. | Service resilience framework for enhanced end-to-end service quality | |
CN111966469B (en) | Cluster virtual machine high availability method and system | |
CN114691304A (en) | Method, device, equipment and medium for realizing high availability of cluster virtual machine | |
CN107122228A (en) | The dispositions method and device of the management platform of super emerging system | |
US10985985B2 (en) | Cloud service system | |
CN107783855B (en) | Fault self-healing control device and method for virtual network element |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181009 |