CN103152414B

CN103152414B - A high-availability system based on cloud computing

Info

Publication number: CN103152414B
Application number: CN201310065647.8A
Authority: CN
Inventors: 王电钢; 常健; 王铁军; 周毅
Original assignee: SICHUAN ELECTRIC POWER Corp INFORMATION COMMUNICATION CO Ltd
Current assignee: SICHUAN ELECTRIC POWER Corp INFORMATION COMMUNICATION CO Ltd
Priority date: 2013-03-01
Filing date: 2013-03-01
Publication date: 2016-03-30
Anticipated expiration: 2033-03-01
Also published as: CN103152414A

Abstract

The invention discloses a high-availability system based on cloud computing and a method for implementing it. The system comprises a central control management service subsystem and an autonomous control agent subsystem, which are interconnected through a protocol. The central control management service subsystem comprises five layers, including kernel service, resource management, and task management; the autonomous control agent subsystem comprises five layers, including core framework, state acquisition, and process monitoring. The method comprises steps such as creating an information application image, monitoring the running state of the application, starting the mirror virtual machine of a failed application, and shutting down the mirror virtual machine once the failed application has recovered. The invention changes the ratio of active to standby servers from the 1:1 of a traditional dual-machine hot-standby high-availability system to N:1, which saves a large amount of standby server resources, improves server resource utilization, and offers good flexibility and extensibility.

Description

A high-availability system based on cloud computing

Technical field

The present invention relates to a high-availability system based on cloud computing and a method for implementing it.

Background art

High availability (HA) means improving the availability of systems and applications by minimizing the downtime caused by routine maintenance (planned) and sudden system crashes (unplanned). An HA system is currently the most effective means for an enterprise to keep its core computer systems from stopping because of failures. As information applications have developed, data has come to be used ever more widely in enterprises, and improving the availability of information applications has become one of the top priorities in building a robust computer system. Information applications usually adopt dual-machine hot-standby technology to improve system availability.

Dual-machine hot standby refers specifically to hot standby (or high availability) based on two servers in a high-availability system. It is a system (software or hardware) that addresses unavoidable planned or unplanned downtime: any fault that causes system downtime and service interruption triggers a corresponding process that diagnoses the error, isolates the fault, and restores the interrupted service online. According to how switching works, dual-machine high availability can be divided into active/standby mode (Active-Standby) and dual-active mode (Active-Active). In active/standby mode, one server is in the active state for a given service and the other server is in the standby state for that service. In dual-active mode, two different services run on the two servers, each server being active for one service and standby for the other (that is, Active-Standby and Standby-Active).

Current dual-machine hot-standby schemes mainly take three forms: a scheme based on shared storage (a disk array), a fully redundant scheme (two machines, two storage devices), and a scheme based on data replication.

The shared-storage (disk array) scheme is the most commonly used. It relies mainly on the disk array to guarantee data integrity and continuity after a switchover. User data is generally placed on the disk array, and after the primary host goes down, the standby host continues to read the existing data from the disk array. The traditional dual-machine hot-standby scheme based on a single storage device consists of one primary server, one standby server, and one disk array. Because it uses only one storage device, practitioners often refer to it as having a disk single point of failure. In general, however, storage is relatively reliable, so if storage device failure is disregarded, this is also the most widely adopted hot-standby scheme in the industry.

The traditional single-storage dual-machine hot-standby scheme does have a single point of failure, so storage high availability, which provides storage redundancy, is increasingly accepted by users. This can be understood as follows: dual-machine hot standby was originally a solution to planned shutdowns and unplanned downtime of servers, but it cannot handle the service outage caused by planned shutdowns or unplanned downtime of the storage device. Because the storage device is the only device that stores data in a dual-machine hot-standby setup, its failure often brings down the entire hot-standby system.

A high-availability dual-machine hot-standby scheme based on two storage devices eliminates the single point of failure caused by the shutdown of a single storage device, yielding a fully redundant dual-machine hot-standby scheme with no single point of failure.

The fully redundant dual-machine hot-standby scheme consists of two storage devices, one primary server, and one standby server. Its advantages are: (1) data replication between the storage devices does not pass through the network but takes place directly between the two storage devices; (2) replication between the two storage devices is fully real-time, with no delay at any time; (3) switching between the primary and standby storage takes less than 500 ms, so that system storage access is not delayed; (4) the disk identifiers of hard disks and partitions do not change when the primary and standby storage switch; (5) a server switchover does not affect initialization, incremental synchronization, or data replication between the storage devices; (6) a planned shutdown of either storage device does not affect the operation of the server dual-machine hot-standby system as a whole; (7) the storage devices use data deduplication to perform incremental synchronization; (8) it is a true 7×24-hour, switchover-capable fully redundant scheme. However, this scheme is costly and complex to manage, and it is not suitable for small-scale information applications.

The data-replication scheme mainly uses data synchronization to keep the data on the active and standby servers consistent. Distributed Replicated Block Device (DRBD) is an open-source data clustering solution that provides dynamic data synchronization between hosts. DRBD receives data, writes it to the local disk, and then sends it to the other host, which in turn stores the data on its own disk. Other required components include a cluster membership service, such as TurboHA or Heartbeat, and applications that can run on a block device, for example raw I/O, a file system with fsck, or a database with recovery capability.
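The write path described above can be pictured with a small in-memory model. The sketch below is not DRBD and does not use any DRBD interface; the class `ReplicatedBlockDevice` and its methods are hypothetical names chosen for illustration. It only shows the order of operations: the primary writes a block to its own disk and then forwards it to the peer, which stores it on its own disk.

```python
from typing import Dict, Optional


class ReplicatedBlockDevice:
    """Toy model of a mirrored block device (hypothetical, not the DRBD API)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.disk: Dict[int, bytes] = {}   # block number -> data on the local disk
        self.peer: Optional["ReplicatedBlockDevice"] = None

    def connect(self, peer: "ReplicatedBlockDevice") -> None:
        self.peer = peer

    def write(self, block: int, data: bytes) -> None:
        # Step 1: write the data to the local disk.
        self.disk[block] = data
        # Step 2: send the same block to the other host.
        if self.peer is not None:
            self.peer.receive(block, data)

    def receive(self, block: int, data: bytes) -> None:
        # Step 3: the other host stores the block on its own disk.
        self.disk[block] = data


primary = ReplicatedBlockDevice("primary")
standby = ReplicatedBlockDevice("standby")
primary.connect(standby)
primary.write(0, b"user data")
```

After the write, both hosts hold the same block, which is what lets the standby take over with consistent data.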

All three of the above dual-machine hot-standby schemes need at least two physical servers and achieve high availability of the information system through redundancy. Most of the time, the redundant equipment sits in standby. As the number of information systems in an enterprise grows, guaranteeing their high availability inevitably brings a large amount of redundant equipment. For small and medium-sized enterprises, this inevitably increases construction and maintenance costs.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a high-availability system based on cloud computing, and a method for implementing it, that reduces enterprise cost while ensuring the high availability of information applications.

The object of the invention is achieved through the following technical solution: a high-availability system based on cloud computing, comprising one central control management service subsystem and at least one autonomous control agent subsystem, the two being interconnected through a protocol; the central control management service subsystem comprises a kernel service layer, a resource management layer, a task management layer, an intelligent scheduling layer, a monitoring and alarm layer, and an image management layer; the autonomous control agent subsystem comprises a core framework layer, a host state acquisition layer, a state acquisition layer, an event management layer, a process monitoring layer, and a Joblet runtime environment layer;

The kernel service layer provides the basic framework on which the system runs, including at least security management, event management, and log management; it establishes communication with the autonomous control agent subsystem, listens for and collects the information sent by all managed servers, manages communication with the underlying Lightweight Directory Access Protocol (LDAP) directory service and the database server, and manages RESTful communication with other systems;

The resource management layer centrally manages the resource inventory, resource usage, and running-state information of all physical machines and virtual machines in the high-availability system;

The task management layer creates and modifies tasks, schedules them, and monitors their execution, ensuring that virtual machines are started, stopped, and migrated when needed;

The intelligent scheduling layer performs intelligent scheduling of the physical machines and virtual machines in the high-availability system, including at least high-availability scheduling, resource-balancing scheduling, and energy-saving scheduling;

The monitoring and alarm layer collects, aggregates, and presents running-state data of information applications and virtual machines, and notifies the person responsible for an abnormal application by raising an alarm;

The image management layer creates, deletes, queries, and modifies virtual machine image files;

The core framework layer corresponds to the kernel service layer of the central control management service subsystem and provides the foundation for system security, logging, network connections, and the RESTful framework in the autonomous control agent subsystem;

The host state acquisition layer periodically collects the running state of the physical machines and virtual machines in the resource pool, including static and dynamic information about CPU, memory, disk, and network, and reports the collected information to the central control management service subsystem through the core framework layer;

The state acquisition layer collects the system running state of the information application servers and uploads the collected information to the central control management service subsystem through the core framework layer;

The event management layer manages the events generated in the autonomous control agent subsystem, including creating and deleting events and querying event state;

The process monitoring layer monitors the critical processes on the information application servers on which the autonomous control agent subsystem is deployed. When it finds that a critical process has failed, it sends a process failure event to the central control management service subsystem to trigger the corresponding virtual machine and thereby ensure the high availability of the information application. Critical processes are the processes that administrators manually configure for monitoring according to the particular information application, including databases and Web services;

The Joblet runtime environment layer provides the foundation for running Joblets in the autonomous control agent subsystem. Joblets correspond one-to-one with Job tasks: a Job executes in the central control management service subsystem and is responsible for initializing the Joblet and managing its execution, while the Joblet is distributed to the physical machines in the resource pool, where it executes and completes the actual task.
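The relationship between a Job and its Joblet can be illustrated with a minimal Python sketch. The patent does not disclose the data structures involved, so the classes `Job`, `Joblet`, and `JobletRuntime`, their fields, and the `start_vm` action name are all assumptions made for illustration. The sketch only shows the division of labour: the Job lives on the central side, creates exactly one Joblet, and tracks its execution, while the Joblet is executed by a runtime on a physical machine.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Joblet:
    """Unit of work executed on a physical machine (hypothetical structure)."""
    job_id: str
    action: str            # e.g. "start_vm"
    target: str            # e.g. name of a mirror virtual machine


@dataclass
class JobletRuntime:
    """Stand-in for the Joblet runtime environment layer on one host."""
    host: str
    handlers: Dict[str, Callable[[str], str]]
    log: List[str] = field(default_factory=list)

    def run(self, joblet: Joblet) -> str:
        # Look up the handler for the requested action and execute it.
        result = self.handlers[joblet.action](joblet.target)
        self.log.append(f"{joblet.job_id}:{result}")
        return result


@dataclass
class Job:
    """Task held by the central service; it owns exactly one Joblet."""
    job_id: str
    action: str
    target: str
    status: str = "created"

    def dispatch(self, runtime: JobletRuntime) -> str:
        joblet = Joblet(self.job_id, self.action, self.target)  # initialize
        self.status = "running"
        result = runtime.run(joblet)                             # manage the run
        self.status = "done"
        return result


runtime = JobletRuntime(
    host="cloud-host-1",
    handlers={"start_vm": lambda vm: f"started {vm}"},
)
job = Job(job_id="job-1", action="start_vm", target="mirror-vm-db")
outcome = job.dispatch(runtime)
```

In a real deployment the Joblet would travel over the network to the agent; here the call is made in-process to keep the example self-contained.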

A method for implementing the high-availability system based on cloud computing comprises the following steps:

(1) Create the information application image, that is, configure on the cloud host server the primary/standby relationship between each information application server and its virtual machine;

(2) Install and deploy the Agent component, that is, configure the Agent information of the cloud host servers and the information application servers;

(3) The Agent layer monitors the running of the applications: the autonomous control agent of the virtual cloud host and the autonomous control agent of the information application host monitor the applications by collecting the running state of the virtual machines, cloud host servers, and information application servers, and report the monitoring information to the central control management service layer;

(4) When an application failure is detected, the central control management service layer sends the Agent layer a Job task that starts the emergency measures;

(5) The Agent layer automatically starts the mirror virtual machine of the failed application according to the instructions carried in the Joblet;

(6) The Agent layer continues to monitor the running of the application;

(7) After detecting that the failed application has recovered, shut down the mirror virtual machine of the failed application.

The method for implementing the high-availability system based on cloud computing further comprises a step of raising an alarm to the person responsible for the abnormal application when a server abnormality is detected.
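Steps (3) to (7), together with the alarm step, amount to a small state machine, which the following Python sketch models in memory. It is an illustration under stated assumptions rather than the patented implementation: the class `FailoverController`, the `on_report` method, and the application and virtual machine names are hypothetical, and starting or stopping a virtual machine is represented simply by adding it to or removing it from a set.

```python
from typing import Dict, List, Set


class FailoverController:
    """Toy model of steps (3)-(7); all names here are illustrative."""

    def __init__(self, mirror_of: Dict[str, str]) -> None:
        self.mirror_of = mirror_of          # step (1): application -> mirror VM
        self.running_mirrors: Set[str] = set()
        self.alarms: List[str] = []

    def on_report(self, app: str, healthy: bool) -> None:
        """Handle one monitoring report from an agent (steps 3 and 6)."""
        mirror = self.mirror_of[app]
        if not healthy and mirror not in self.running_mirrors:
            # Steps (4)-(5): start the mirror VM of the failed application.
            self.running_mirrors.add(mirror)
            # Alarm step: notify the person responsible for the application.
            self.alarms.append(f"{app} failed, {mirror} started")
        elif healthy and mirror in self.running_mirrors:
            # Step (7): the application has recovered, shut the mirror VM down.
            self.running_mirrors.discard(mirror)


controller = FailoverController({"billing-app": "billing-mirror-vm"})
controller.on_report("billing-app", healthy=True)    # normal operation
controller.on_report("billing-app", healthy=False)   # failure detected
during_failure = set(controller.running_mirrors)
controller.on_report("billing-app", healthy=True)    # recovery detected
```

The mirror virtual machine runs only between the failure report and the recovery report, which is why one pool of hosts can back many applications.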

The invention has the beneficial effects as follows:

(1) The invention changes the ratio of active to standby servers from the 1:1 of a traditional dual-machine hot-standby high-availability system to N:1, which saves a large amount of standby server resources and improves server resource utilization;

(2) The invention collects running data from physical servers, virtual machines, and critical processes through autonomous control agents, stores the collected data using the H2 in-memory database, and establishes monitoring data analysis models and fast algorithms adapted to various policies, thereby meeting requirements such as real-time data analysis, anomaly early warning, and resource scheduling;

(3) The invention is functionally extensible and flexible: its task management is based on script editing, and staff only need to write Python scripts to modify and extend its functions;

(4) The invention uses intelligent scheduling of computing resources to treat all physical servers in the resource pool as shared standby resources that provide unified HA support for all information applications. After heartbeat detection finds an abnormality, the system automatically schedules the mirror virtual machine corresponding to the abnormal information application to run on a lightly loaded physical server in the resource pool, taking over from the failed information application;

(5) The invention pushes abnormality alarms in a variety of ways, including e-mail, SMS, instant messaging, and other messaging mechanisms, which ensures that important information reaches the responsible person promptly and reliably, so that operations staff learn of information application failures in time and can take follow-up measures promptly;

(6) The autonomous control agent is deployed in the operating system of the information application and establishes, over the network, a heartbeat connection between the information application and the central control management server. The intelligent agent monitors the running state of the information application in real time according to the specified policy and, when it detects an application abnormality, executes the corresponding policy action;

(7) The invention adopts SIGAR (System Information Gatherer And Reporter), which is open, mature, and lightweight, as its state collection method and integrates the SIGAR component into the intelligent agent, so that monitoring data can be delivered to the central control management service in real time and saved periodically at a given sampling frequency.
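The heartbeat detection mentioned in effects (4) and (6) can be sketched as a timeout check on the central side. The patent does not give the heartbeat interval, the timeout, or the message format, so the `HeartbeatTracker` class, the 15-second timeout, and the application names below are assumptions for illustration only. Timestamps are passed in explicitly so that the example is deterministic.

```python
from typing import Dict, List


class HeartbeatTracker:
    """Records the last heartbeat per application (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s            # assumed value, not from the patent
        self.last_seen: Dict[str, float] = {}

    def beat(self, app: str, now: float) -> None:
        # Called whenever a heartbeat arrives from an agent.
        self.last_seen[app] = now

    def expired(self, now: float) -> List[str]:
        """Return applications whose last heartbeat is older than the timeout."""
        return sorted(
            app for app, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


tracker = HeartbeatTracker(timeout_s=15.0)
tracker.beat("web-portal", now=0.0)
tracker.beat("database", now=0.0)
tracker.beat("web-portal", now=10.0)   # database misses this heartbeat
suspects = tracker.expired(now=20.0)
```

Applications returned by `expired` would be the candidates for having their mirror virtual machine scheduled.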

Brief description of the drawings

Fig. 1 is a block diagram of the system of the invention;

Fig. 2 is a flow diagram of the method of the invention;

Fig. 3 is a diagram of the physical architecture of the system of the invention.

Embodiment

The technical solution of the invention is described in further detail below with reference to the accompanying drawings, but the scope of protection of the invention is not limited to what follows.

As shown in Figure 1, a high-availability system based on cloud computing comprises one central control management service subsystem and at least one autonomous control agent subsystem, the two being interconnected through a protocol; the central control management service subsystem comprises a kernel service layer, a resource management layer, a task management layer, an intelligent scheduling layer, a monitoring and alarm layer, and an image management layer; the autonomous control agent subsystem comprises a core framework layer, a host state acquisition layer, a state acquisition layer, an event management layer, a process monitoring layer, and a Joblet runtime environment layer;

The kernel service layer is the core of the whole high-availability system. It provides the basic framework on which the system runs, including at least functions such as security management, event management, and log management; it also establishes communication with the autonomous control agent subsystem, listens for and collects the information sent by all managed servers, manages communication with the underlying LDAP directory service and the database server, and manages RESTful communication with other systems;

The task management layer handles management work such as creating and modifying tasks, scheduling them, and monitoring their execution, ensuring that virtual machines are started, stopped, and migrated when needed;

The image management layer handles operations on virtual machine image files such as creating, deleting, querying, and modifying them;

The host state acquisition layer periodically collects the running state of the physical machines and virtual machines in the resource pool, including static and dynamic information about CPU, memory, disk, network, and the like, and reports the collected information to the central control management service subsystem through the core framework layer;

The process monitoring layer monitors the critical processes on the information application servers on which the autonomous control agent subsystem is deployed. When it finds that a critical process has failed, it sends a process failure event to the central control management service subsystem to trigger the corresponding virtual machine and thereby ensure the high availability of the information application. Critical processes are the processes that administrators manually configure for monitoring according to the particular information application, such as databases and Web services;
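The check performed by the process monitoring layer can be sketched as a comparison between the configured critical processes and the processes currently running. This is a minimal illustration, not the patented implementation: the `ProcessFailureEvent` fields, the function name `check_critical_processes`, and the process names are assumptions, and the set of running processes is passed in as plain data rather than read from the operating system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ProcessFailureEvent:
    """Event sent to the central service (field names are assumptions)."""
    server: str
    process: str


def check_critical_processes(
    server: str,
    critical: Iterable[str],
    running: Set[str],
) -> List[ProcessFailureEvent]:
    """Return one failure event per configured process that is not running."""
    return [
        ProcessFailureEvent(server, name)
        for name in critical
        if name not in running
    ]


# Administrators configure the critical processes per application server.
critical = ["mysqld", "httpd"]
# In a real agent this set would come from the operating system.
running_now = {"httpd", "sshd"}
events = check_critical_processes("app-server-1", critical, running_now)
```

Each returned event corresponds to the process failure event that the agent would forward to the central control management service subsystem.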

As shown in Figure 2, a method for implementing the high-availability system based on cloud computing comprises the following steps:

(6) The Agent layer continues to monitor the running of the application;

As shown in Figure 3, the high-availability system based on cloud computing can be divided, in terms of physical structure, into three parts: the central control management server, the cloud host server resource pool, and the information application server resource pool. The cloud host server resource pool contains at least one cloud host server, and the information application server resource pool contains at least one information application server; the cloud host servers and information application servers communicate with the central control management server over the network. The servers in the cloud host server resource pool are configured with virtual machines, and these virtual machines have a primary/standby relationship with the servers in the information application server resource pool. The central control management server is responsible for task management and intelligent scheduling of all servers. In terms of logical architecture, the system can be divided into an Agent layer and a Server layer. The Agent layer monitors and reports the running data of the information applications (such as the running load of all servers and virtual machines and the running state of critical processes), and also receives, interprets, and executes commands from the Server layer. The Server layer collects the running state of the hosts and sends control commands to the relevant hosts according to the scheduling algorithm, thereby scheduling and managing resources.
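The patent does not specify the scheduling algorithm used by the Server layer. As one possible reading of scheduling a mirror virtual machine onto a lightly loaded host, the Python sketch below picks the host with the lowest CPU load among those with enough free memory. The `HostStatus` fields, the `pick_host` function, and the selection rule itself are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HostStatus:
    """Status an agent might report for one cloud host (assumed fields)."""
    name: str
    cpu_load: float        # fraction of CPU in use, 0.0 to 1.0
    free_memory_mb: int


def pick_host(hosts: List[HostStatus], vm_memory_mb: int) -> Optional[HostStatus]:
    """Pick the least CPU-loaded host that can still fit the mirror VM."""
    candidates = [h for h in hosts if h.free_memory_mb >= vm_memory_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda h: h.cpu_load)


pool = [
    HostStatus("cloud-host-1", cpu_load=0.72, free_memory_mb=8192),
    HostStatus("cloud-host-2", cpu_load=0.15, free_memory_mb=1024),
    HostStatus("cloud-host-3", cpu_load=0.30, free_memory_mb=4096),
]
chosen = pick_host(pool, vm_memory_mb=2048)
```

Here `cloud-host-2` has the lowest load but too little free memory, so the sketch selects `cloud-host-3`; a production scheduler would likely weigh more metrics than these two.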

Claims

1. A high-availability system based on cloud computing, characterized in that: it comprises one central control management service subsystem and at least one autonomous control agent subsystem, the two being interconnected through a protocol; the central control management service subsystem comprises a kernel service layer, a resource management layer, a task management layer, an intelligent scheduling layer, a monitoring and alarm layer, and an image management layer; the autonomous control agent subsystem comprises a core framework layer, a host state acquisition layer, a state acquisition layer, an event management layer, a process monitoring layer, and a Joblet runtime environment layer;

The kernel service layer is the core of the whole high-availability system. It provides the basic framework on which the system runs, including at least security management, event management, and log management; it establishes communication with the autonomous control agent subsystem, listens for and collects the information sent by all managed servers, manages communication with the underlying Lightweight Directory Access Protocol (LDAP) directory service and the database server, and manages RESTful communication with other systems;

The task management layer creates and modifies tasks, schedules them, and monitors their execution, ensuring that virtual machines are started, stopped, and migrated when needed;

The host state acquisition layer periodically collects the running state of the physical machines and virtual machines in the autonomous control agent subsystem, including static and dynamic information about CPU, memory, disk, and network, and reports the collected information to the central control management service subsystem through the core framework layer;

The process monitoring layer monitors the critical processes on the information application servers on which the autonomous control agent subsystem is deployed. When it finds that a critical process has failed, it sends a process failure event to the central control management service subsystem to trigger the corresponding virtual machine and thereby ensure the high availability of the information application. Critical processes are the processes that administrators manually configure for monitoring according to the particular information application, including database processes and Web service processes;

The Joblet runtime environment layer provides the foundation for running Joblets in the autonomous control agent subsystem. Joblets correspond one-to-one with Job tasks: a Job executes in the central control management service subsystem and is responsible for initializing the Joblet and managing its execution, while the Joblet is distributed to the physical machines in the autonomous control agent subsystem, where it executes and completes the actual task.