CN106330523A - Cluster server disaster recovery system and method, and server node - Google Patents

Cluster server disaster recovery system and method, and server node

Info

Publication number
CN106330523A
Authority
CN
China
Prior art keywords
work
monitoring process
progress
monitoring
node
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510385495.9A
Other languages
Chinese (zh)
Inventor
刘晓峰
梁耿
成勇
彭肇
阳天保
钟余海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Guangxi Co Ltd
Original Assignee
China Mobile Group Guangxi Co Ltd
Application filed by China Mobile Group Guangxi Co Ltd
Priority to CN201510385495.9A
Publication of CN106330523A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0668 Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L1/00 Arrangements for detecting or preventing errors in the information received
    • H04L1/22 Arrangements for detecting or preventing errors in the information received using redundant apparatus to increase reliability
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H04L67/025 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP] for remote control or remote monitoring of applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Abstract

The invention discloses a cluster server disaster recovery system and method, and a server node, which provide a disaster recovery mechanism based on a cluster server and ensure the operational stability and service reliability of the cluster server. The cluster server disaster recovery system comprises multiple monitor nodes and multiple work nodes. Each monitor node runs a first monitoring process. The first monitoring process monitors, at a preset first monitoring frequency, whether each monitoring process in a stored monitoring process list is abnormal, and performs abnormal monitoring process recovery when at least one monitoring process is found to be abnormal. The first monitoring process also monitors, at a preset second monitoring frequency, whether each work process in a stored work process list is abnormal, and performs abnormal work process recovery when at least one work process is found to be abnormal. The work nodes run the work processes. Each work process, according to its stored monitoring process list, reports its own service state to each monitoring process in that list at the second monitoring frequency.

Description

A cluster server disaster recovery system and method, and a server node
Technical field
The present invention relates to the technical field of cloud data processing, and in particular to a cluster server disaster recovery system, a method, and a server node.
Background
A disaster recovery mechanism is one of the indispensable technologies for handling service system faults in a BOSS (Business & Operation Support System), and the mechanism differs according to the computing architecture of the BOSS system. Existing BOSS systems are based on the traditional computing architecture of a single server running multiple processes (threads), and their disaster recovery schemes are divided by fault level into program-level and host-level mechanisms. For program-level disaster recovery, crontab (scheduled tasks) is typically used to check the in-memory active process table at a set frequency, and any process found to have stopped abnormally is restarted. For host-level disaster recovery, when the host running the processes goes down abnormally, HA switching is used to move the processes to a standby host; the standby host and its processes are started so that the processes are restarted within a tolerable time and service continues.
With the development of cluster computing, a cluster-based BOSS service system is bound to become an alternative to the existing single high-performance host. To protect existing hardware investment, a cluster-based BOSS service system has to be deployed across a small number of high-performance hosts and a large number of ordinary machines at the same time. It is characterized by the coexistence of heterogeneous AIX and Linux systems, with a few high-performance nodes and many ordinary nodes jointly providing service in the form of a computing cloud.
Cluster cloud computing is characterized by multiple nodes on different machines taking part in a computation at the same time, and existing cluster computing platforms do not provide a reliable process fault-tolerance mechanism. How to ensure the computing stability of the processes of a cluster-based BOSS service system, and how to avoid the impact of a master node failure on service, has therefore become one of the technical problems to be solved urgently in the prior art.
Summary of the invention
Embodiments of the present invention provide a cluster server disaster recovery system, a method, and a server node, in order to provide a disaster recovery mechanism based on a cluster server and ensure the operational stability and service reliability of the cluster server.
An embodiment of the present invention provides a cluster server disaster recovery system, including multiple monitor nodes and multiple work nodes, wherein:
The monitor node is configured to run a first monitoring process. The first monitoring process monitors, at a preset first monitoring frequency, whether each monitoring process in a stored monitoring process list is abnormal, and performs abnormal monitoring process recovery when at least one monitoring process is found to be abnormal. The first monitoring process also monitors, at a preset second monitoring frequency, whether each work process in a stored work process list is abnormal, and performs abnormal work process recovery when at least one work process is found to be abnormal;
The work node is configured to run work processes. Each work process, according to its stored monitoring process list, reports its own service state to each monitoring process in that list at the second monitoring frequency.
The monitor node is specifically configured to, when at least one monitoring process is found to be abnormal, send a state detection message through the first monitoring process to the other (second) monitoring processes in the monitoring process list; record the identifiers of abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded identifiers are abnormal.
The monitor node is specifically configured to form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, ordered by ascending monitoring process sequence number, and to initiate message passing; if the passed message is not received back within a preset duration, it is determined that at least one monitoring process is abnormal.
The monitor node is specifically configured to select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of work nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, and to create one monitoring process on each selected work node to replace an abnormal monitoring process; to receive, through the first monitoring process, the monitoring process sequence number returned by each newly created monitoring process and the node identifier of its node; to update the monitoring process list and the monitor node list according to the received sequence numbers and node identifiers; to notify each monitoring process in the monitoring process list to update its stored monitoring process list; and to notify each work process in the work process list to update its stored monitoring process list.
The monitor node is specifically configured to send, through the first monitoring process, a monitoring process creation request to any work process running on each selected work node;
The selected work node is configured so that the work process that runs on it and receives the monitoring process creation request creates a child process of that work process and forwards the request to the child process. The child process marks its own process category as monitoring process according to the received request, and returns the sequence number of the newly created monitoring process and the node identifier of its node to the first monitoring process through the work process.
The monitor node is specifically configured to, when at least one work process is found to be abnormal, send a service state detection message through the first monitoring process to each work process in the stored work process list; record the identifiers of abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded identifiers are abnormal.
The monitor node is specifically configured to, after the abnormal work processes are determined, probe through the first monitoring process whether the work nodes hosting the abnormal work processes are abnormal, and record the identifiers of abnormal work nodes according to the probe results; to select, according to the number of abnormal work processes, a corresponding number of nodes from the cluster server node list, excluding the abnormal work nodes, and create one work process on each to replace an abnormal work process; to receive, through the first monitoring process, the work process sequence number returned by each newly created work process and the work node identifier of its work node; to update the stored work process list and work node list according to the received sequence numbers and work node identifiers; to notify each monitoring process in the monitoring process list to update its stored work process list; and to notify each work process in the work process list to update its stored work process list.
The monitor node is specifically configured to send, through the first monitoring process, a work process creation request to any process running on each selected node;
The selected node is configured so that the process that runs on it and receives the work process creation request creates a child process of that process and forwards the request to the child process. The child process marks its own process category as work process according to the received request, and returns the sequence number of the newly created work process and the node identifier of its node to the first monitoring process through the process that received the request.
An embodiment of the present invention provides a cluster server disaster recovery method, including:
A monitor node, through a first monitoring process running on it, monitors at a preset first monitoring frequency whether each monitoring process in a stored monitoring process list is abnormal, and monitors at a preset second monitoring frequency whether each work process in a stored work process list is abnormal;
When at least one monitoring process is found to be abnormal, abnormal monitoring process recovery is performed; and
When at least one work process is found to be abnormal, abnormal work process recovery is performed.
Preferably, the abnormal monitoring processes are determined according to the following procedure:
When at least one monitoring process is found to be abnormal, the monitor node sends, through the first monitoring process, a state detection message to the other (second) monitoring processes in the monitoring process list;
The identifiers of abnormal monitoring processes are recorded according to the detection results;
The monitoring processes corresponding to the recorded identifiers are determined to be abnormal.
Preferably, whether at least one monitoring process is abnormal is detected according to the following method:
The monitor node forms, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, ordered by ascending monitoring process sequence number, and initiates message passing;
If the passed message is not received back within a preset duration, it is determined that at least one monitoring process is abnormal.
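The ring-based detection and the follow-up state detection described above can be illustrated with a minimal Python sketch. It is a single-process simulation, not the patented implementation: the function names and the `is_alive` callback are hypothetical, and a token that is dropped by a dead process stands in for the timeout at the initiating process.

```python
from typing import Callable, List

def build_ring(monitor_ranks: List[int]) -> List[int]:
    """Order monitoring process sequence numbers ascending to form the ring."""
    return sorted(monitor_ranks)

def ring_round_trip_ok(ring: List[int], start: int,
                       is_alive: Callable[[int], bool]) -> bool:
    """Simulate one pass of the message around the ring, starting at `start`.

    Returns False if any process fails to forward the message, which in a
    real deployment would be observed as a timeout at `start` (assumption).
    """
    i = ring.index(start)
    for step in range(1, len(ring)):
        if not is_alive(ring[(i + step) % len(ring)]):
            return False
    return True

def find_abnormal(ring: List[int], start: int,
                  is_alive: Callable[[int], bool]) -> List[int]:
    """Probe every other monitoring process and record the abnormal ones."""
    return [r for r in ring if r != start and not is_alive(r)]
```

The split into two functions mirrors the text: the ring only tells the first monitoring process that something is wrong, and the per-process state detection then identifies which processes are abnormal.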
Preferably, abnormal monitoring process recovery is performed according to the following procedure:
The monitor node selects, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of work nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, and creates one monitoring process on each selected work node to replace an abnormal monitoring process; and
Receives, through the first monitoring process, the monitoring process sequence number returned by each newly created monitoring process and the node identifier of its node;
Updates the monitoring process list and the monitor node list according to the received monitoring process sequence numbers and node identifiers;
Notifies each monitoring process in the monitoring process list to update its stored monitoring process list; and
Notifies each work process in the work process list to update its stored monitoring process list.
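The selection and list update steps of monitoring process recovery can be sketched as follows. This is an illustrative simulation only: `recover_monitors` and `next_rank` are hypothetical names, new sequence numbers are simply assigned in the sketch rather than returned by newly created processes, and dropping the abnormal entries from the list is an assumption the text does not spell out.

```python
from typing import Dict, List

def recover_monitors(cluster_nodes: List[str],
                     monitor_procs: Dict[int, str],
                     abnormal: List[int],
                     next_rank: int) -> Dict[int, str]:
    """Return an updated {sequence number: node} monitoring process list.

    Replacement nodes are taken from the cluster node list, excluding every
    node currently in the monitor node list, one node per abnormal process.
    """
    monitor_nodes = set(monitor_procs.values())
    candidates = [n for n in cluster_nodes if n not in monitor_nodes]
    if len(candidates) < len(abnormal):
        raise RuntimeError("not enough spare work nodes for replacement")
    # Assumption: abnormal monitoring processes are removed from the list.
    updated = {r: n for r, n in monitor_procs.items() if r not in abnormal}
    for offset, node in enumerate(candidates[:len(abnormal)]):
        updated[next_rank + offset] = node  # stand-in for the returned sequence number
    return updated
```

The updated mapping is what would then be pushed to every monitoring process and every work process so that all stored copies of the monitoring process list agree.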
Preferably, the abnormal work processes are determined according to the following procedure:
When at least one work process is found to be abnormal, the monitor node sends, through the first monitoring process, a service state detection message to each work process in the stored work process list;
The identifiers of abnormal work processes are recorded according to the detection results;
The work processes corresponding to the recorded identifiers are determined to be abnormal.
Preferably, abnormal work process recovery is performed according to the following procedure:
After the abnormal work processes are determined, the monitor node probes, through the first monitoring process, whether the work nodes hosting the abnormal work processes are abnormal; and
Records the identifiers of abnormal work nodes according to the probe results; and
Selects, according to the number of abnormal work processes, a corresponding number of nodes from the cluster server node list, excluding the abnormal work nodes, and creates one work process on each to replace an abnormal work process;
The monitor node receives, through the first monitoring process, the work process sequence number returned by each newly created work process and the work node identifier of its work node;
Updates the stored work process list and work node list according to the received work process sequence numbers and work node identifiers; and
Notifies each monitoring process in the monitoring process list to update its stored work process list; and
Notifies each work process in the work process list to update its stored work process list.
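The node probing and target selection steps of work process recovery can be sketched in the same style. The names `plan_work_recovery` and `node_is_up` are hypothetical, and the round-robin reuse of healthy nodes when there are fewer healthy nodes than abnormal processes is an assumption added for the sketch; the text only says that a corresponding number of nodes is selected.

```python
from typing import Callable, Dict, List, Tuple

def plan_work_recovery(cluster_nodes: List[str],
                       work_procs: Dict[int, str],
                       abnormal_procs: List[int],
                       node_is_up: Callable[[str], bool]
                       ) -> Tuple[List[str], List[str]]:
    """Return (abnormal work nodes, target nodes for replacement processes)."""
    # Probe only the nodes that host an abnormal work process.
    abnormal_nodes = sorted({work_procs[r] for r in abnormal_procs
                             if not node_is_up(work_procs[r])})
    healthy = [n for n in cluster_nodes if n not in abnormal_nodes]
    if not healthy:
        raise RuntimeError("no healthy node left in the cluster")
    # One target per abnormal process; wrap around if nodes run out (assumption).
    targets = [healthy[i % len(healthy)] for i in range(len(abnormal_procs))]
    return abnormal_nodes, targets
```

A process can be abnormal while its node is still healthy, which is why the node probe is a separate step: only nodes that fail the probe are excluded from hosting replacements.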
An embodiment of the present invention provides a server node, including:
A monitoring unit, configured to monitor, through a first monitoring process running on the server node, at a preset first monitoring frequency whether each monitoring process in a stored monitoring process list is abnormal, and at a preset second monitoring frequency whether each work process in a stored work process list is abnormal;
An exception processing unit, configured to perform abnormal monitoring process recovery when at least one monitoring process is found to be abnormal, and to perform abnormal work process recovery when at least one work process is found to be abnormal.
The monitoring unit is specifically configured to, when at least one monitoring process is found to be abnormal, send a state detection message through the first monitoring process to the other (second) monitoring processes in the monitoring process list; record the identifiers of abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded identifiers are abnormal.
The monitoring unit is specifically configured to form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, ordered by ascending monitoring process sequence number, and to initiate message passing; if the passed message is not received back within a preset duration, it is determined that at least one monitoring process is abnormal.
The exception processing unit includes:
A first selection subunit, configured to select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of work nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, and to create one monitoring process on each to replace an abnormal monitoring process;
A first receiving subunit, configured to receive, through the first monitoring process, the monitoring process sequence number returned by each newly created monitoring process and the node identifier of its node;
A first updating subunit, configured to update the monitoring process list and the monitor node list according to the received monitoring process sequence numbers and node identifiers;
A first notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored monitoring process list, and to notify each work process in the work process list to update its stored monitoring process list.
The monitoring unit is specifically configured to, when at least one work process is found to be abnormal, send a service state detection message through the first monitoring process to each work process in the stored work process list; record the identifiers of abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded identifiers are abnormal.
The exception processing unit includes:
A probing subunit, configured to probe, through the first monitoring process and after the abnormal work processes are determined, whether the work nodes hosting the abnormal work processes are abnormal, and to record the identifiers of abnormal work nodes according to the probe results;
A second selection subunit, configured to select, according to the number of abnormal work processes, a corresponding number of nodes from the cluster server node list, excluding the abnormal work nodes, and to create one work process on each to replace an abnormal work process;
A second receiving subunit, configured to receive, through the first monitoring process, the work process sequence number returned by each newly created work process and the work node identifier of its work node;
A second updating subunit, configured to update the stored work process list and work node list according to the received work process sequence numbers and work node identifiers;
A second notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored work process list, and to notify each work process in the work process list to update its stored work process list.
In the cluster server disaster recovery system, method, and server node provided by the embodiments of the present invention, a monitor node uses the first monitoring process running on it to monitor, at the first monitoring frequency, whether each monitoring process in its stored monitoring process list is abnormal, and to monitor, at the second monitoring frequency, whether each work process in its stored work process list is abnormal. When at least one monitoring process is found to be abnormal, monitoring process recovery is performed; when at least one work process is found to be abnormal, work process recovery is performed. This procedure implements abnormality monitoring and recovery for the cluster server and ensures its operational stability and service reliability.
Other features and advantages of the present invention will be set forth in the description that follows, and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the written description, the claims, and the accompanying drawings.
Brief description of the drawings
The accompanying drawings described herein are provided for further understanding of the present invention and constitute a part of the present invention. The exemplary embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic structural diagram of the cluster server disaster recovery system in an embodiment of the present invention;
Fig. 2 is a schematic diagram of message passing between monitoring processes and work processes in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of cluster server initialization in an embodiment of the present invention;
Fig. 4 is a schematic flowchart, in an embodiment of the present invention, of performing service state detection on work processes and performing work process recovery when an abnormal work process is found;
Fig. 5 is a schematic flowchart of the cluster server disaster recovery method in an embodiment of the present invention;
Fig. 6 is a schematic flowchart of determining the abnormal monitoring processes in an embodiment of the present invention;
Fig. 7 is a schematic flowchart of detecting that at least one monitoring process is abnormal in an embodiment of the present invention;
Fig. 8 is a schematic flowchart of performing abnormal monitoring process recovery in an embodiment of the present invention;
Fig. 9 is a schematic flowchart of determining the abnormal work processes in an embodiment of the present invention;
Fig. 10 is a schematic flowchart of performing abnormal work process recovery in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the server node in an embodiment of the present invention.
Detailed description of the invention
In order to provide a disaster recovery mechanism based on a cluster server and ensure the operational stability and service reliability of the cluster server, embodiments of the present invention provide a cluster server disaster recovery system, a method, and a server node.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to illustrate and explain the present invention and are not intended to limit it, and that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.
Fig. 1 is a schematic structural diagram of the cluster server disaster recovery system provided by an embodiment of the present invention. Preferably, in a specific implementation, the server nodes of the cluster server are sorted by node performance (for example, the computing capability of the node) from high to low to build a server node list. On this basis, two node lists are built: a monitor node list and a work node list. M server nodes (M being a positive integer greater than or equal to 1) are selected in order from the server node list as monitor nodes, and all server nodes are added to the work node list as work nodes. After the cluster server system starts running, one monitoring process is created on each server node in the monitor node list, and one or more work processes are created on each server node in the work node list. Thus, a server node in the monitor node list runs exactly one monitoring process, and a server node in the work node list runs at least one work process; a server node that is in both the monitor node list and the work node list runs one monitoring process and at least one work process.
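The construction of the three node lists can be sketched in a few lines of Python. The function name `build_node_lists` and the use of a single numeric performance score per node are assumptions made for illustration; the text does not specify how node performance is measured.

```python
from typing import Dict, List, Tuple

def build_node_lists(perf: Dict[str, float],
                     m: int) -> Tuple[List[str], List[str], List[str]]:
    """Return (server node list, monitor node list, work node list).

    `perf` maps a node identifier to a performance score (higher is better).
    """
    if m < 1 or m > len(perf):
        raise ValueError("M must be between 1 and the number of nodes")
    server_nodes = sorted(perf, key=lambda n: perf[n], reverse=True)
    monitor_nodes = server_nodes[:m]   # the M best-performing nodes
    work_nodes = list(server_nodes)    # every node is also a work node
    return server_nodes, monitor_nodes, work_nodes
```

For example, `build_node_lists({"a": 1.0, "b": 3.0, "c": 2.0}, 2)` puts `b` and `c` in the monitor node list and all three nodes in the work node list.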
The cluster server disaster recovery system provided by the embodiment of the present invention works as follows. After the cluster server system starts running, the running processes are divided into a monitoring process group and a work process group. The monitoring processes of the monitoring process group monitor, at a specified frequency, the running state and service state of the work processes in the work process group, and at the same time monitor the running state of the other monitoring processes. The processes of the work process group are responsible for business logic processing, keep the current task list, and report their own running state and service state to the monitoring processes of the monitoring process group at a specified frequency. As shown in Fig. 2, a monitoring process and a work process can be switched from one role to the other.
When the cluster server system is initialized, nodes and processes are started according to the server node capability list. For a machine that is in both the monitor node list and the work node list, at least two processes are generated; for a machine that is only in the work node list, at least one process is generated. The system state is marked as initializing. On each monitor node, the process with the smallest process sequence number (rank) is taken as the monitoring process of that monitor node, its process category is marked as monitoring process, and it is registered in the monitoring process group. The monitoring processes form a chained message-passing structure in ascending order of process sequence number; once all monitoring processes and the nodes they run on have been tested as working normally, the first monitoring frequency of the monitoring process group is specified. All non-monitoring processes are then collected, their process category is marked as work process, and they are registered in the work process group; once all work processes and the nodes they run on have been tested as working normally, the report frequency of the work process group is specified as the second monitoring frequency (staggered in time from the first monitoring frequency). The established monitoring process group broadcasts at this frequency to notify each process of the work process group to save the monitoring process list of the monitoring process group, and also provides the unified report frequency of the work processes (that is, the second monitoring frequency). Each work process sets a clock according to this frequency and sleeps for that interval; after being woken by the system at the specified time, it reports its service status to every monitoring process in the monitoring process list saved locally by the process. Initialization of the cluster server disaster recovery system is then complete, and the system state is marked as working.
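One way a monitoring process might keep track of the periodic reports from work processes is sketched below. This is an assumption-laden illustration, not the mechanism stated in the text: the class name `ReportTracker`, the explicit `now` timestamps, and the rule that a work process is overdue after missing more than `grace` report intervals are all hypothetical.

```python
from typing import Dict, List

class ReportTracker:
    """Track the last report time of each work process (illustrative only)."""

    def __init__(self, work_ranks: List[int], interval: float,
                 grace: float = 2.0) -> None:
        self.interval = interval   # report period derived from the second monitoring frequency
        self.grace = grace         # assumed tolerance, in report intervals
        self.last_seen: Dict[int, float] = {r: 0.0 for r in work_ranks}

    def record(self, rank: int, now: float) -> None:
        """Store the time at which a work process reported its service state."""
        self.last_seen[rank] = now

    def overdue(self, now: float) -> List[int]:
        """Return the work processes whose latest report is too old."""
        limit = self.grace * self.interval
        return sorted(r for r, t in self.last_seen.items() if now - t > limit)
```

Processes returned by `overdue` would be candidates for the service state detection message described earlier, rather than being treated as abnormal outright.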
For a better understanding of the embodiment of the present invention, the initialization procedure of the cluster server system is described in detail below.
First, the cluster server system is set to use the hydra process manager to start and manage processes, and the error handler used for communication access when a process has failed is set to MPI_ERRORS_RETURN, so that a process failure error code is returned. Specifically, as shown in Fig. 3, the following steps may be included:
S31: The process with the smallest process sequence number obtains the process sequence number of each process and the node identifier of the node on which each process runs.
Specifically, the cluster server system is started from a configuration file and all processes are generated. The process with the smallest process sequence number (rank=0) initiates the MonitorProcessGroupCreate call to start building the monitoring process group. MonitorProcessGroupCreate sends a request to obtain process sequence numbers to all server nodes. After the processes running on the server nodes receive the request, each returns its process sequence number and the node identifier of its node to the requesting process.
S32: The process with the smallest process sequence number determines the monitoring processes.
Specifically, according to the obtained process sequence numbers, MonitorProcessGroupCreate sends, to the process with the smallest process sequence number running on each monitor node in the monitor node list, a request to mark that process as belonging to the monitoring process category. After the target process (that is, the process with the smallest process sequence number running on each monitor node) receives the request to modify its process category, it changes its process category to monitoring process and returns the process category information to the rank=0 process.
The minimum process of S33, process sequence number generates monitoring process list and notifies each monitoring process storage prison Control process list.
Concrete, the process of rank=0 generates monitoring process list and stores, and each to monitoring process group Process sends the information preserving monitoring process list, after each process in monitoring process list receives information Storage monitoring process list.
The process establishment progress of work that S34, process sequence number are minimum.
The process of rank=0 is initiated ServiceProcessGroupCreate and is called, and starts building work process group. ServiceProcessGroupCreate obtains the not process sequence number in monitoring process list and place clothes thereof The node identification of business device node.According to the process sequence number obtained, send this process of labelling to corresponding process Request for progress of work classification;Target process (process that the process sequence number that i.e. obtains is corresponding) receives After amendment process category request, revising its process classification is the progress of work, and returns to the process of rank=0 Process classification information.
The minimum process of S35, process sequence number generates progress of work list and notifies that each process storage work is entered Cheng Liebiao and notify each progress of work store monitoring process list.
The process of rank=0 generates progress of work list and stores, send preserve progress of work list to all enter Journey (includes monitoring process and the progress of work), and each target process receives progress of work list storage. The process of rank=0 each progress of work in progress of work list sends the letter preserving monitoring process list Breath, each target process (the most each progress of work) stores monitoring process list after receiving information.
The minimum process of S36, process sequence number notifies monitoring process and the monitoring of its correspondence of the progress of work respectively Frequency.
The process of rank=0 monitoring process in monitoring process list sends the first monitoring frequency, monitoring process Monitoring process in list is monitored process service detection by this frequency;The process of rank=0 is to the progress of work Each progress of work in list sends report frequency (the i.e. second monitoring frequency), in progress of work list The progress of work reports frequency by this frequency configuration;Initialization procedure terminates, and initialization system state is duty, Monitoring process in monitoring process list starts monitoring process and is monitored, and the work in progress of work list is entered Cheng Qidong calculates process and provides business service.
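The grouping performed in steps S31 to S35 can be illustrated with a minimal Python sketch. It is not the MPI implementation described above: the function name build_process_groups and the use of (rank, node identifier) tuples are assumptions made for illustration, and message exchange between processes is omitted.

```python
from collections import defaultdict

def build_process_groups(processes, monitor_nodes):
    """Split processes into a monitoring list and a working list.

    processes: list of (rank, node_id) pairs, as gathered in step S31.
    monitor_nodes: set of node ids in the monitor node list.
    Hypothetical helper; the patent performs these steps by message exchange.
    """
    ranks_by_node = defaultdict(list)
    for rank, node in processes:
        ranks_by_node[node].append(rank)
    # S32/S33: on each monitor node, the smallest rank becomes the monitoring process.
    monitor_list = sorted(
        (min(ranks), node)
        for node, ranks in ranks_by_node.items()
        if node in monitor_nodes
    )
    monitor_ranks = {rank for rank, _ in monitor_list}
    # S34/S35: every process not in the monitoring list is a working process.
    worker_list = sorted(
        (rank, node) for rank, node in processes if rank not in monitor_ranks
    )
    return monitor_list, worker_list

procs = [(0, "A"), (1, "A"), (2, "B"), (3, "B"), (4, "C")]
monitors, workers = build_process_groups(procs, {"A", "B"})
```

In this example, nodes A and B are monitor nodes that also run working processes, so each contributes one monitoring process and one working process, while node C contributes only a working process.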
In the system design of the cluster server system, a clock and a corresponding signal handler function are set up for each working process. The working process sleeps for the interval given by the second monitoring frequency (the report frequency), and the corresponding signal handler function is triggered when the clock wakes it. For a working process, the signal handler function drives the process to collect its own service state and send it to each monitoring process in the monitoring process group, while tracking whether each send succeeded. If sending fails for all monitoring processes, the working process writes a monitoring exception log entry, returns, and continues working; when the next sleep period of the clock elapses, it tries again to send its service information.
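The reporting behavior of the signal handler function can be sketched as follows. This is a minimal illustration under stated assumptions: report_once and the send callable are hypothetical names, the transport is assumed to raise ConnectionError on failure, and the clock that triggers the report is left out.

```python
def report_once(status, monitor_ranks, send, log):
    """Send `status` to every monitoring process; log once if all sends fail.

    `send(rank, status)` is a hypothetical transport callable that raises
    ConnectionError on failure. Returns the number of successful sends.
    """
    delivered = 0
    for rank in monitor_ranks:
        try:
            send(rank, status)
            delivered += 1
        except ConnectionError:
            pass  # this monitoring process is unreachable; try the others
    if delivered == 0:
        # All sends failed: record the exception and keep working.
        log.append("monitoring exception: no monitoring process reachable")
    return delivered
```

A caller would invoke report_once each time the clock wakes the working process, so a failed period is simply retried in the next period.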
The monitoring process group checks, at the configured monitoring frequency, the running state of each monitor node in the monitor node list, the working state of the monitoring processes themselves, and the service state of the working processes. Each monitoring process in the monitoring process group performs monitoring process state detection at the first monitoring frequency. When performing state detection, a monitoring process first sets the system state to the monitoring process group detection state. In order of monitoring process sequence number, it sends a broadcast message to all other monitoring processes to confirm that service on its own server node is normal, and at the same time receives the state information sent by the monitoring processes running on the other monitor nodes in the monitoring process group. If all messages are sent and received normally, monitoring process detection is complete and no exception exists. Otherwise, if sending to or receiving from some monitoring process is abnormal, the detecting monitoring process saves the process sequence number (rank) of the abnormal monitoring process, the node identifier of its node, and the cause of the exception, and sets that process's state to abnormal in its locally stored monitoring process list. After all other monitoring processes have been checked, it uses its retained monitoring process list to send a monitoring process group exception message to the normal monitoring processes (those not marked as abnormal); the normal monitoring processes then exchange monitoring process group information and confirm the abnormal monitoring processes. Each abnormal monitoring process is masked in the monitoring process list. A working node is selected from the server nodes in the server node list other than the monitor nodes in the monitor node list (i.e. from the working nodes), added to the monitor node list, and a new process is generated on it. The category of the new process is set to monitoring process and it is added to the monitoring process list. The local monitor node list and local monitoring process list of each monitoring process in the monitoring process group are updated, the current monitoring process list is sent to each process in the working process group, and a log alarm is written. Finally, the system state is set to the idle state.
Each monitoring process in the monitoring process group periodically checks the service state of the working processes. The clock of a working process wakes it at the report frequency, and the working process reports to the monitoring processes its own process sequence number (rank), its service state, its node, and the node load. If its service is normal it sends normal service information; if its service is abnormal it sends abnormal service information; if the working process has hung, it cannot send anything. On receiving the information reported by the working processes, a monitoring process first sets the system state to the working group detection state and, for each report received, refreshes the corresponding entry in its local working process list. For a working process that is in the local working process list but whose report has not arrived before the timeout, the monitoring process actively initiates process state detection and, if the process is found to be invalid, sets its service state to invalid. After the reports of all working processes have been received or detection has completed, the monitoring process updates the load state of the corresponding working node in the working node list using the average node load in the working process reports. It also exchanges the information reported by each working process with all other monitoring processes in the monitoring process group, updates the load state of the corresponding working node in the working node list using the average node load reported for the same process, and finally updates the server node performance list according to the load state of each working node. For a working process whose service state is abnormal, the monitoring process actively initiates detection of the host state of its working node. If the host of the working node is normal, the working node is notified to remove the abnormal working process and each working process removes it from the working process group; a server node is then selected from the server nodes, a new process is dynamically generated using the cluster interface function, its category is marked as working process, and it is added to the working process list, which is sent to all nodes. If the host of the working node has failed, the working node is masked in the server node list and in the working node list, and the updated working node list is sent to all processes (both monitoring processes and working processes) so that the working process lists stored by each process in the monitoring process group and the working process group are updated. Working process service detection is then complete, and the system state is set to the idle state.
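The timeout check that precedes active process state detection can be sketched as below. The function name overdue_workers, the timestamp dictionary, and the use of seconds are assumptions for illustration; the source does not specify how report times are stored.

```python
def overdue_workers(worker_ranks, last_report, now, timeout):
    """Return ranks in the local working process list with no report in time.

    last_report: {rank: timestamp of the most recent report}; a rank with no
    entry has never reported and is always treated as overdue.
    All names here are hypothetical; times are in seconds.
    """
    return sorted(
        rank
        for rank in worker_ranks
        if now - last_report.get(rank, float("-inf")) > timeout
    )
```

The ranks returned are the ones for which a monitoring process would actively initiate process state detection before marking the service state as invalid.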
Based on the above operating principle, an embodiment of the present invention provides a cluster server disaster recovery system which, as shown in Fig. 1, includes multiple monitor nodes 11 and multiple working nodes 12, wherein:
The monitor node 11 is configured to run a first monitoring process. The first monitoring process monitors, at a preset first monitoring frequency, whether each monitoring process in the stored monitoring process list is abnormal and, upon determining that at least one monitoring process is abnormal, performs abnormal monitoring process recovery. The first monitoring process also monitors, at a preset second monitoring frequency, whether each working process in the stored working process list is abnormal and, upon determining that at least one working process is abnormal, performs abnormal working process recovery;
The working node 12 is configured to run working processes. Each working process reports its own service state, at the second monitoring frequency, to each monitoring process in the stored monitoring process list.
It should be noted that, in a specific implementation, some server nodes may run a monitoring process and a working process at the same time. For ease of distinction, in the embodiments of the present invention such a node is called a monitor node when its monitoring process function is described and a working node when its working process function is described.
Specific implementations of the embodiments of the present invention are described below in terms of the monitoring process and the service disaster recovery process of the cluster server system.
I. Implementation of state monitoring of the monitoring processes by the monitor node
When the system state changes to the working state, initialization ends. The monitoring processes in the monitoring process list start monitoring and the working processes in the working process list start their computation; the two groups look after each other, monitoring the running state of the whole cluster server system at the frequency of their respective process groups.
The monitoring processes initiate monitoring process state detection at the first monitoring frequency, ensuring that the monitoring processes in the monitoring process list are working normally before the working processes report their service state.
Specifically, the monitor node, through the first monitoring process, forms a message-passing ring from the monitoring processes in the monitoring process list in ascending order of their monitoring process sequence numbers and initiates message passing. If the passed message is not received back within a preset duration, it determines that at least one monitoring process is abnormal. On detecting that at least one monitoring process is abnormal, the first monitoring process sends a state detection message to each of the other, second monitoring processes in the monitoring process list, records the identifiers of the abnormal monitoring processes according to the detection results, and determines that the monitoring processes corresponding to the recorded identifiers are abnormal.
Preferably, the first monitoring process may be the monitoring process with the smallest process sequence number in the monitoring process list, in which case the monitor node above is the server node hosting that monitoring process.
In a specific implementation, the monitoring process with the smallest process sequence number in the monitoring process group initiates the MonitorProcessGroupStatusCheck procedure. MonitorProcessGroupStatusCheck builds a message chain (ordered by ascending process sequence number) and initiates message passing. If the message travels around the monitoring process ring and returns normally to the monitoring process with the smallest process sequence number, all monitoring processes in the monitoring process list are determined to be normal, and the monitoring processes wait for the working processes in the working process list to report their service state.
If the ring message times out without returning to the monitoring process with the smallest process sequence number, that monitoring process initiates the MonitorProcessGroupRebuild procedure. MonitorProcessGroupRebuild first sends a state detection message to each monitoring process in the monitoring process list in order. If a process's state is abnormal, an entry of the form <process sequence number of the abnormal monitoring process, node identifier of its monitor node> is added to the abnormal monitoring process list. This is repeated until the state of every other monitoring process in the monitoring process group has been checked.
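The two-stage check described above (a ring pass, followed by individual probing when the ring message does not return) can be sketched in Python. This is a simplified simulation, not MPI code: the set of live ranks stands in for real message delivery and timeouts, and the function names ring_returns and find_abnormal are hypothetical.

```python
def ring_returns(monitor_ranks, alive):
    """Simulate passing a message along the ring in ascending rank order.

    Returns True if the message gets back to the smallest rank, which
    initiates the pass. `alive` is the set of ranks that still respond;
    in the real system a lost message would show up as a timeout.
    """
    chain = sorted(monitor_ranks)
    for rank in chain[1:]:
        if rank not in alive:
            return False  # message lost at this hop; the initiator times out
    return True

def find_abnormal(monitor_list, alive):
    """Probe each (rank, node_id) entry in order and collect the abnormal ones."""
    return [(rank, node) for rank, node in sorted(monitor_list) if rank not in alive]

monitor_list = [(0, "A"), (2, "B"), (5, "C")]
alive = {0, 5}
if not ring_returns([rank for rank, _ in monitor_list], alive):
    abnormal = find_abnormal(monitor_list, alive)
```

The ring pass is cheap in the normal case because it needs only one message per monitoring process; the per-process probe runs only after a failure has been detected.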
After the abnormal monitoring processes have been determined, the monitor node, through the first monitoring process, selects from the cluster server node list a number of working nodes equal to the number of abnormal monitoring processes, excluding the monitor nodes in the monitor node list, and creates one monitoring process on each to replace an abnormal monitoring process. Through the first monitoring process it receives the monitoring process sequence number returned by each newly created monitoring process and the node identifier of its node; updates the monitoring process list and the monitor node list according to the received sequence numbers and node identifiers; notifies each monitoring process in the monitoring process list to update its stored monitoring process list; and notifies each working process in the working process list to update its stored monitoring process list.
In a specific implementation, the monitor node, through the first monitoring process, sends a monitoring process creation request to any working process running on each selected working node. Preferably, it may send the request to the working process with the smallest process sequence number running on the selected working node. The selected working node, through the working process that received the monitoring process creation request, creates a child process of that working process and forwards the request to the child process. The child process marks its own process category as monitoring process according to the request, and the working process returns to the first monitoring process the sequence number of the newly created monitoring process and the node identifier of its node.
Specifically, MonitorProcessGroupRebuild counts the number n of abnormal monitoring processes in the abnormal monitoring process list and obtains the n working nodes that rank highest in the server node performance list and are not in the monitor node list (in a specific implementation, n nodes may instead be chosen arbitrarily from the working nodes other than the monitor nodes; the embodiments of the present invention do not limit this). It sends a query for the working processes on these n nodes and obtains their process sequence numbers (rank). MonitorProcessGroupRebuild sends a dynamic monitoring process creation request to the working process with the smallest process sequence number on each node queried. In a specific implementation, the request may instead be sent to any working process on the selected working node; the embodiments of the present invention do not limit this. After receiving the monitoring process creation request, the target working process (i.e. the working process with the smallest process sequence number on the selected working node) creates a dynamic child process using MPI_Comm_spawn and is connected to the newly created child process through the communicator created between the parent and child process groups. The target working process forwards the monitoring process creation request to the child process, and the child process marks its own process category as monitoring process and sets its state to the initialization state. The target working process returns to MonitorProcessGroupRebuild the tuple <process sequence number of the newly created process, node identifier of its node, process sequence number of the target working process, node identifier of the target working process's node>, where the target working process sequence number serves as the relay for communication between the newly created monitoring process and the monitoring processes in the original monitoring process list. MonitorProcessGroupRebuild deletes the current abnormal monitoring process from the monitoring process list and adds the newly created monitoring process to the list as the replacement for the original abnormal monitoring process. MonitorProcessGroupRebuild repeats this procedure until replacement processes have been created for all abnormal monitoring processes.
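The node selection and list update performed by MonitorProcessGroupRebuild can be sketched as follows. The sketch covers only the bookkeeping; process creation through MPI_Comm_spawn is not modeled, and the function names pick_replacement_nodes and replace_monitor, as well as the representation of the performance list as node identifiers ordered best first, are assumptions.

```python
def pick_replacement_nodes(performance_ranking, monitor_nodes, n):
    """Pick the n best-ranked nodes that are not already monitor nodes.

    performance_ranking: node ids ordered best first (assumed representation
    of the server node performance list).
    """
    candidates = [node for node in performance_ranking if node not in monitor_nodes]
    return candidates[:n]

def replace_monitor(monitor_list, abnormal, replacement):
    """Return a new monitoring process list with `abnormal` swapped out.

    Entries are (rank, node_id) tuples; the result stays sorted by rank.
    """
    kept = [entry for entry in monitor_list if entry != abnormal]
    return sorted(kept + [replacement])

nodes = pick_replacement_nodes(["D", "A", "C", "B"], {"A", "B"}, 1)
new_list = replace_monitor([(0, "A"), (2, "B")], (2, "B"), (6, nodes[0]))
```

Here rank 6 stands for the sequence number that the newly created child process would report back; in the described system that value comes from the target working process rather than being chosen by the caller.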
MonitorProcessGroupRebuild sends all processes (both monitoring processes and working processes) a message to update their stored monitoring process list; each process updates the monitoring process list it stores on receiving this message. MonitorProcessGroupRebuild then sends the parent process of each newly added monitoring process (i.e. the working process with the smallest process sequence number running on the new monitoring process's node) a request to change the working state of the new monitoring process. After receiving the request, the target working process first sends the new monitoring process a request to save the working process list, the monitoring process list, and the server node performance list, and then forwards the request to change its working state. The new monitoring process saves the working process list, the monitoring process list, and the server node performance list, sets its own system state to the working state, and begins monitoring. Once all newly added monitoring processes have entered the working state, monitoring process state recovery is complete and the reports of the working processes in the working process list can be received normally.
The monitoring process starts the ServiceProcessReport procedure at the preset first monitoring frequency. ServiceProcessReport collects the process's own service state and system resource state and sends a report message to each monitoring process in its monitoring process list, reporting its own service information.
After a monitoring process directly or indirectly receives the report messages of the working processes in the working process list, it uses each working process's message to update that process's service state and the performance state of its working node. After updating the report information of all working processes, it sends a message summarizing the monitoring information to the monitoring process with the smallest process sequence number in the current monitoring process list. That monitoring process weights the performance of each working node as obtained by each monitoring process and reorders the server node performance list accordingly.
The monitoring process with the smallest process sequence number in the monitoring process list sends a message to all server nodes to update the server node performance list and sets the system state to the working state, completing the whole monitoring cycle. If an exception occurs in the steps above, that is, if no report is received from some working processes in the working process list, those working processes may be abnormal and the service disaster recovery process must be started.
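The reordering of the server node performance list can be illustrated with the sketch below. The source does not state the weighting formula, so this example assumes a plain average of the loads observed by each monitoring process and ranks lighter-loaded nodes first; the function name rank_nodes_by_load and the data layout are likewise hypothetical.

```python
from statistics import mean

def rank_nodes_by_load(observations):
    """Order node ids from lightest to heaviest average load.

    observations: {monitor_rank: {node_id: load}}, one inner dict per
    monitoring process. Equal weights are an assumption; the patent only
    says the observations are weighted.
    """
    loads_by_node = {}
    for loads in observations.values():
        for node, load in loads.items():
            loads_by_node.setdefault(node, []).append(load)
    # Ties are broken by node id so the ordering is deterministic.
    return sorted(loads_by_node, key=lambda node: (mean(loads_by_node[node]), node))

ranking = rank_nodes_by_load({0: {"C": 0.5, "D": 0.75}, 2: {"C": 0.25, "D": 0.25}})
```

With these observations node C averages 0.375 and node D averages 0.5, so C is placed ahead of D.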
II. Implementation of service state detection of the working processes by the monitor node
In a specific implementation, if a monitoring process finds during monitoring that some working processes have not reported a service state message at the preset second monitoring frequency, those working processes or their working nodes have become abnormal. Specifically, when the monitor node detects that at least one working process is abnormal (i.e. it has not received the service state message reported by that working process), it may, through the first monitoring process, send a service state detection message to each working process in the stored working process list, record the identifiers of the abnormal working processes according to the detection results, and determine that the working processes corresponding to the recorded identifiers are abnormal. After determining the abnormal working processes, the monitor node may, through the first monitoring process, probe whether the working node hosting each abnormal working process is abnormal and record the identifiers of the abnormal working nodes according to the probe results. According to the number of abnormal working processes, it selects from the cluster server node list a corresponding number of nodes, excluding the abnormal working nodes, and creates one working process on each to replace an abnormal working process. Through the first monitoring process it receives the working process sequence number returned by each newly created working process and the identifier of its working node; updates the stored working process list and working node list according to the received sequence numbers and working node identifiers; notifies each monitoring process in the monitoring process list to update its stored working process list; and notifies each working process in the working process list to update its stored working process list.
Here, the monitor node may, through the first monitoring process, send a working process creation request to any process running on each selected node. The selected node, through the process that received the working process creation request, creates a child process of that process and forwards the working process creation request to the child process. The child process marks its own process category as working process according to the request, and the process that received the working process creation request returns to the first monitoring process the sequence number of the newly created working process and the node identifier of its node.
Specifically, as shown in Fig. 4, performing service state detection on the working processes and recovering working processes when abnormal ones are found may include the following steps:
S41. The monitoring process with the smallest process sequence number in the monitoring process list determines the working processes that are in an abnormal state.
The monitoring process with the smallest process sequence number in the monitoring process list initiates the ServiceProcessGroupRebuild procedure. ServiceProcessGroupRebuild first sends a message to all monitoring processes in the monitoring process list, setting them to the disaster recovery state. It then sends a service state detection message to each process in the working process list in sequence. If a process's state is determined to be abnormal, an entry of the form <process sequence number of the abnormal working process, node identifier of its working node> is added to the abnormal working process list; if the working node hosting the abnormal working process is itself abnormal, that node is added to the abnormal working node list. This is repeated until the state of every working process in the working process list has been checked.
S42. The monitoring process with the smallest process sequence number in the monitoring process list selects server nodes and, on each, a process to act as the proxy startup process for creating a new process.
ServiceProcessGroupRebuild counts the number n of abnormal working processes in the abnormal working process list and obtains the n server nodes that rank highest in the server node performance list and are not in the abnormal working node list. It sends a query for the working processes or monitoring processes on these n nodes and selects the process with the smallest process sequence number (rank), which may be a working process or a monitoring process, as the proxy dynamic startup process for the current abnormal working process. It should be noted that, in a specific implementation, any process may instead be selected at random; the embodiments of the present invention do not limit this.
S43. The monitoring process with the smallest process sequence number in the monitoring process list generates the replacement processes for the abnormal working processes through the proxy startup processes.
ServiceProcessGroupRebuild sends a working process creation request to the proxy startup process found for each abnormal working process (i.e. the process with the smallest process sequence number (rank) running on each of the n selected server nodes). After receiving the request, the proxy startup process creates a dynamic child process using MPI_Comm_spawn and is connected to the newly created child process through the communicator created between the parent and child process groups. The proxy startup process forwards the working process creation request sent by the monitoring process with the smallest process sequence number, and the newly created child process marks its process category as working process and sets its state to the initialization state. The proxy startup process returns to the monitoring process with the smallest process sequence number in the monitoring process list the tuple <process sequence number of the newly created process, node identifier of its node, process sequence number of the proxy startup process, node identifier of the proxy startup process's node>, where the process sequence number of the proxy startup process serves as the relay for communication between the newly created working process and the working processes in the original working process list. ServiceProcessGroupRebuild deletes the abnormal working process currently being handled from the working process list and adds the newly created working process to the list as the replacement for the original abnormal working process. ServiceProcessGroupRebuild repeats this procedure until replacement processes have been created for all failed working processes.
S44. The monitoring process with the smallest process sequence number in the monitoring process list notifies all processes to store the new working process list and notifies the newly created processes to change their working state.
ServiceProcessGroupRebuild sends all processes (both working processes and monitoring processes) a message to save the new working process list; each process updates its stored working process list on receiving this message. ServiceProcessGroupRebuild then sends, through the proxy startup process, a request to change the working state of the new working process. After receiving this request, the proxy startup process first sends the new working process a request to store the working process list, the monitoring process list, and the server node performance list, and then forwards the request to change its working state. The new working process stores the working process list, the monitoring process list, and the server node performance list, sets its own system state to the working state, and begins providing service. Once all new working processes have entered the working state, abnormal working process recovery ends and service can be provided normally.
S45. The monitoring process with the smallest process sequence number in the monitoring process list updates the system state.
The disaster recovery process ends, and the monitoring process with the smallest process sequence number in the monitoring process list sets the system state to the working state.
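The selection made in step S42 can be sketched as a small helper. The function name choose_proxies and the dictionary mapping each node to the ranks running on it are assumptions for illustration; spawning the replacement process and exchanging messages are outside the scope of the sketch.

```python
def choose_proxies(performance_ranking, abnormal_nodes, ranks_by_node, n):
    """Pick n healthy nodes by performance and the proxy startup rank on each.

    performance_ranking: node ids ordered best first (assumed representation).
    abnormal_nodes: set of node ids in the abnormal working node list.
    ranks_by_node: {node_id: [ranks running on that node]} (hypothetical).
    Returns {node_id: smallest rank on that node}.
    """
    selected = [node for node in performance_ranking if node not in abnormal_nodes][:n]
    return {node: min(ranks_by_node[node]) for node in selected}

proxies = choose_proxies(["D", "C", "A"], {"C"}, {"A": [0, 1], "C": [4], "D": [6, 7]}, 1)
```

In this example node C is excluded because it is in the abnormal working node list, so the process with rank 6 on node D is chosen as the proxy startup process.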
The cluster server disaster recovery method provided by the embodiments of the present invention is intended for online disaster recovery of server systems based on a cluster computing framework. By establishing a monitoring process list and a working process list, it provides methods for handling exceptions in monitoring processes and working processes, and provides separate online recovery methods for abnormal monitoring processes and abnormal working processes. It can restore the service capability of the cluster server system online, achieves an integrated design and operation of system disaster recovery and service, and improves the operational stability and service reliability of the cluster server.
Based on the same inventive concept, the embodiments of the present invention also provide a cluster server disaster recovery method and a server node. Because the principle by which the method and the device solve the problem is similar to that of the cluster server disaster recovery system, their implementation may refer to the implementation of the system, and repeated details are omitted.
As it is shown in figure 5, the implementing procedure signal of the cluster server disaster recovery method provided for the embodiment of the present invention Figure, may comprise steps of:
S51, monitor node are by the first monitoring process of self-operating respectively according to the first default monitoring frequently Each monitoring process and the work of storage in the monitoring process list of rate and the second monitoring frequency monitoring storage are entered In Cheng Liebiao, each progress of work is the most abnormal.
S52, when determining at least one monitoring process exception, perform abnormal monitoring process resumption;Determining During at least one progress of work exception, perform abnormal work process resumption.
In a specific implementation, in step S52, it may be determined that at least one monitoring process is abnormal according to the process shown in Figure 6:
S61: When the monitor node detects that at least one monitoring process is abnormal, it sends a status detection message, through the first monitoring process, to the other (second) monitoring processes in the monitoring process list;
S62: Record the identifiers of the abnormal monitoring processes according to the detection results;
S63: Determine that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
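Steps S61 to S63 can be sketched as a simple probe over the monitoring process list. In this Python sketch, the function name `find_abnormal_monitors`, the dictionary layout of the list, and the `probe` callable are hypothetical stand-ins; the patent does not define the message format or transport.

```python
from typing import Callable, Dict, List


def find_abnormal_monitors(
    monitor_list: Dict[int, str],
    self_id: int,
    probe: Callable[[str], bool],
) -> List[int]:
    """Probe every other monitoring process and return the ids that failed.

    monitor_list maps a monitoring process sequence number to a node identifier.
    probe is a hypothetical callable that returns True when the peer answers
    the status detection message.
    """
    abnormal: List[int] = []
    for seq, node_id in sorted(monitor_list.items()):
        if seq == self_id:
            continue  # the first monitoring process does not probe itself
        if not probe(node_id):
            abnormal.append(seq)  # S62: record the abnormal identifier
    return abnormal  # S63: these ids are treated as abnormal
```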
In S61, it may be detected that at least one monitoring process is abnormal according to the flow shown in Figure 7:
S71: The monitor node, through the first monitoring process, forms a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and initiates message passing;
S72: If the passed message is not received within a preset duration, it is determined that at least one monitoring process is abnormal.
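The ring check in S71 and S72 can be illustrated with a small simulation. This Python sketch is an assumption-laden model rather than the patented mechanism: the function names, the fixed per-hop cost, and the `alive` map are invented for illustration, and the message is assumed to be lost at the first process that fails to forward it.

```python
from typing import Dict, List, Optional


def ring_order(seq_numbers: List[int], start: int) -> List[int]:
    """Return the ring in ascending sequence-number order, rotated to begin at `start`."""
    ordered = sorted(seq_numbers)
    i = ordered.index(start)
    return ordered[i:] + ordered[:i]


def token_round_trip(
    ring: List[int], alive: Dict[int, bool], hop_cost: float
) -> Optional[float]:
    """Simulate one lap of the message; return elapsed time, or None if it is lost."""
    elapsed = 0.0
    for seq in ring[1:]:  # the initiator ring[0] sends to its successor first
        elapsed += hop_cost
        if not alive.get(seq, False):
            return None  # a dead process never forwards the message
    return elapsed + hop_cost  # final hop back to the initiator


def ring_has_anomaly(
    ring: List[int], alive: Dict[int, bool], hop_cost: float, timeout: float
) -> bool:
    """S72: no message within the preset duration means at least one process is abnormal."""
    elapsed = token_round_trip(ring, alive, hop_cost)
    return elapsed is None or elapsed > timeout
```

A lost or late message only shows that some process on the ring is abnormal; identifying which one is left to the per-process status detection in S61 to S63.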
In a specific implementation, when it is determined that at least one monitoring process is abnormal, abnormal monitoring process recovery may be performed according to the flow shown in Figure 8:
S81: According to the number of abnormal monitoring processes, the monitor node, through the first monitoring process, selects from the cluster server node list a corresponding number of working nodes, excluding the monitor nodes contained in the monitor node list, and each selected working node creates one monitoring process to replace an abnormal monitoring process;
S82: Through the first monitoring process, receive the monitoring process sequence number returned by each newly created monitoring process and the node identifier of the node where it resides;
S83: Update the monitoring process list and the monitor node list, respectively, according to the received monitoring process sequence numbers and node identifiers;
S84: Notify each monitoring process in the monitoring process list to update its stored monitoring process list, and notify each work process in the work process list to update its stored monitoring process list.
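The recovery flow S81 to S84 can be sketched as a pure function over the three lists. In this Python sketch, `recover_monitors`, the `spawn_monitor` callable, and the decision to drop abnormal entries from both lists are assumptions; the patent only states that the lists are updated.

```python
from typing import Callable, Dict, List, Tuple


def recover_monitors(
    cluster_nodes: List[str],
    monitor_nodes: List[str],
    monitor_list: Dict[int, str],
    abnormal: List[int],
    spawn_monitor: Callable[[str], Tuple[int, str]],
) -> Tuple[Dict[int, str], List[str]]:
    """Replace each abnormal monitoring process with one on a working node.

    spawn_monitor is a hypothetical callable that asks a node to create a
    monitoring process and returns (sequence number, node identifier).
    """
    # S81: candidates are cluster nodes that are not already monitor nodes.
    candidates = [n for n in cluster_nodes if n not in monitor_nodes]
    if len(candidates) < len(abnormal):
        raise RuntimeError("not enough working nodes to host replacements")
    # Assumption: abnormal entries are removed when the lists are updated.
    new_list = {s: n for s, n in monitor_list.items() if s not in abnormal}
    new_nodes = [n for n in monitor_nodes if n in new_list.values()]
    for node in candidates[: len(abnormal)]:
        seq, node_id = spawn_monitor(node)  # S82: the replacement reports back
        new_list[seq] = node_id             # S83: update both lists
        new_nodes.append(node_id)
    return new_list, new_nodes              # S84: the caller broadcasts these lists
```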
In a specific implementation, in step S52, it may be determined that at least one work process is abnormal according to the flow shown in Figure 9:
S91: When the monitor node detects that at least one work process is abnormal, it sends a service state detection message, through the first monitoring process, to each work process according to the stored work process list;
S92: Record the identifiers of the abnormal work processes according to the detection results;
S93: Determine that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
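One possible way to realize S91 to S93 is to query all work processes concurrently and treat a timeout or a non-healthy reply as abnormal. This Python sketch uses the standard `asyncio` library; the `query` callable, the `"ok"` state string, and the timeout policy are hypothetical and are not taken from the patent.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def find_abnormal_workers(
    work_list: Dict[int, str],
    query: Callable[[str], Awaitable[str]],
    timeout: float,
) -> List[int]:
    """Send a service state detection message to every work process (S91).

    work_list maps a work process sequence number to a working node identifier.
    query is a hypothetical coroutine that returns the reported service state.
    """

    async def check(node_id: str) -> bool:
        try:
            state = await asyncio.wait_for(query(node_id), timeout)
        except (asyncio.TimeoutError, OSError):
            return False  # no reply in time, or a transport error
        return state == "ok"  # assumed healthy-state marker

    seqs = sorted(work_list)
    results = await asyncio.gather(*(check(work_list[s]) for s in seqs))
    # S92/S93: the ids whose check failed are recorded as abnormal.
    return [s for s, healthy in zip(seqs, results) if not healthy]
```

Querying concurrently keeps one unresponsive work process from delaying the checks on the others, which matters when the list is long.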
In a specific implementation, when it is determined that at least one work process is abnormal, abnormal work process recovery may be performed according to the flow shown in Figure 10:
S101: After determining the abnormal work processes, the monitor node, through the first monitoring process, probes whether the working node hosting each abnormal work process is abnormal;
S102: Record the identifiers of the abnormal working nodes according to the probe results;
S103: According to the number of abnormal work processes, select from the cluster server node list a corresponding number of nodes, excluding the abnormal working nodes, and each selected node creates one work process to replace an abnormal work process;
S104: The monitor node, through the first monitoring process, receives the work process sequence number returned by each newly created work process and the working node identifier of the working node where it resides;
S105: Update the stored work process list and working node list, respectively, according to the received work process sequence numbers and working node identifiers;
S106: Notify each monitoring process in the monitoring process list to update its stored work process list, and notify each work process in the work process list to update its stored work process list.
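Steps S101 to S106 differ from monitoring process recovery mainly in that the host node is probed first, so that replacements avoid nodes that are themselves down. The Python sketch below illustrates this under stated assumptions: `recover_workers`, `node_alive`, and `spawn_worker` are hypothetical names, and deriving the working node list from the updated work process list is a simplification chosen here.

```python
from typing import Callable, Dict, List, Set, Tuple


def recover_workers(
    cluster_nodes: List[str],
    work_list: Dict[int, str],
    abnormal: List[int],
    node_alive: Callable[[str], bool],
    spawn_worker: Callable[[str], Tuple[int, str]],
) -> Tuple[Dict[int, str], List[str], Set[str]]:
    """Replace abnormal work processes, avoiding nodes that are themselves down.

    node_alive and spawn_worker are hypothetical callables: the first probes a
    node, the second asks a node to create a work process and returns
    (sequence number, working node identifier).
    """
    # S101/S102: probe the host of each abnormal work process, record dead hosts.
    abnormal_nodes = {work_list[s] for s in abnormal if not node_alive(work_list[s])}
    # S103: choose replacement hosts, skipping the abnormal working nodes.
    candidates = [n for n in cluster_nodes if n not in abnormal_nodes]
    if len(candidates) < len(abnormal):
        raise RuntimeError("not enough healthy nodes to host replacements")
    new_list = {s: n for s, n in work_list.items() if s not in abnormal}
    for node in candidates[: len(abnormal)]:
        seq, node_id = spawn_worker(node)  # S104: the replacement reports back
        new_list[seq] = node_id            # S105: update the work process list
    working_nodes = sorted(set(new_list.values()))
    return new_list, working_nodes, abnormal_nodes  # S106: the caller broadcasts
```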
Figure 11 is a schematic structural diagram of the server node provided by the embodiment of the present invention, which may include:
A monitoring unit 111, configured to monitor, through the first monitoring process running on the server node, whether each monitoring process in the stored monitoring process list and each work process in the stored work process list is abnormal, according to a preset first monitoring frequency and a preset second monitoring frequency, respectively;
An exception processing unit 112, configured to perform abnormal monitoring process recovery when it is determined that at least one monitoring process is abnormal, and to perform abnormal work process recovery when it is determined that at least one work process is abnormal.
The monitoring unit 111 may be configured to: when at least one monitoring process is detected to be abnormal, send a status detection message, through the first monitoring process, to the other (second) monitoring processes in the monitoring process list; record the identifiers of the abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
Preferably, the monitoring unit 111 may be configured to form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and to initiate message passing; if the passed message is not received within a preset duration, it is determined that at least one monitoring process is abnormal.
In a specific implementation, the exception processing unit 112 may include:
A first selection subunit, configured to select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, each selected working node creating one monitoring process to replace an abnormal monitoring process;
A first receiving subunit, configured to receive, through the first monitoring process, the monitoring process sequence number returned by each newly created monitoring process and the node identifier of the node where it resides;
A first updating subunit, configured to update the monitoring process list and the monitor node list, respectively, according to the received monitoring process sequence numbers and node identifiers;
A first notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored monitoring process list, and to notify each work process in the work process list to update its stored monitoring process list.
In a specific implementation, the monitoring unit 111 may be configured to: when at least one work process is detected to be abnormal, send a service state detection message, through the first monitoring process, to each work process according to the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
In a specific implementation, the exception processing unit 112 may include:
A probing subunit, configured to probe, through the first monitoring process and after the abnormal work processes are determined, whether the working node hosting each abnormal work process is abnormal, and to record the identifiers of the abnormal working nodes according to the probe results;
A second selection subunit, configured to select, according to the number of abnormal work processes, a corresponding number of nodes from the cluster server node list, excluding the abnormal working nodes, each selected node creating one work process to replace an abnormal work process;
A second receiving subunit, configured to receive, through the first monitoring process, the work process sequence number returned by each newly created work process and the working node identifier of the working node where it resides;
A second updating subunit, configured to update the stored work process list and working node list, respectively, according to the received work process sequence numbers and working node identifiers;
A second notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored work process list, and to notify each work process in the work process list to update its stored work process list.
For convenience of description, the parts above are divided by function into modules (or units) and described separately. Of course, when implementing the present invention, the functions of the modules (or units) may be implemented in one or more pieces of software or hardware.
In the cluster server disaster recovery system, method, and server node provided by the embodiments of the present invention, the monitor node, through the first monitoring process running on itself, monitors at the first monitoring frequency whether any monitoring process in its stored monitoring process list is abnormal, and monitors at the second monitoring frequency whether any work process in its stored work process list is abnormal. When at least one monitoring process is detected to be abnormal, monitoring process recovery is performed; when at least one work process is detected to be abnormal, work process recovery is performed. This process implements anomaly monitoring and recovery for the cluster server and ensures its operating stability and service reliability.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.

Claims (20)

1. A cluster server disaster recovery system, characterized by comprising a plurality of monitor nodes and a plurality of working nodes, wherein:
The monitor node is configured to run a first monitoring process; the first monitoring process monitors, according to a preset first monitoring frequency, whether each monitoring process in the stored monitoring process list is abnormal, and performs abnormal monitoring process recovery when it is determined that at least one monitoring process is abnormal; and the first monitoring process monitors, according to a preset second monitoring frequency, whether each work process in the stored work process list is abnormal, and performs abnormal work process recovery when it is determined that at least one work process is abnormal;
The working node is configured to run a work process; the work process reports its own service state, according to the stored monitoring process list and at the second monitoring frequency, to each monitoring process in the monitoring process list.
2. The system according to claim 1, characterized in that:
The monitor node is specifically configured to: when at least one monitoring process is detected to be abnormal, send a status detection message, through the first monitoring process, to the other (second) monitoring processes in the monitoring process list; record the identifiers of the abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
3. The system according to claim 1 or 2, characterized in that:
The monitor node is specifically configured to form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and to initiate message passing; if the passed message is not received within a preset duration, it is determined that at least one monitoring process is abnormal.
4. The system according to claim 2, characterized in that:
The monitor node is specifically configured to: select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, each selected working node creating one monitoring process to replace an abnormal monitoring process; receive, through the first monitoring process, the monitoring process sequence number returned by each newly created monitoring process and the node identifier of the node where it resides; update the monitoring process list and the monitor node list, respectively, according to the received monitoring process sequence numbers and node identifiers; notify each monitoring process in the monitoring process list to update its stored monitoring process list; and notify each work process in the work process list to update its stored monitoring process list.
5. The system according to claim 4, characterized in that:
The monitor node is specifically configured to send, through the first monitoring process, a monitoring process creation request to any work process running on each selected working node;
The selected working node is configured to create, through the work process that runs on it and receives the monitoring process creation request, a subprocess of that work process, and to forward the monitoring process creation request to the created subprocess; the subprocess marks its own process category as monitoring process according to the received monitoring process creation request, and returns, through the work process, the sequence number of the newly created monitoring process and the node identifier of the node where it resides to the first monitoring process.
6. The system according to claim 1, characterized in that:
The monitor node is specifically configured to: when at least one work process is detected to be abnormal, send a service state detection message, through the first monitoring process, to each work process according to the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
7. The system according to claim 6, characterized in that:
The monitor node is specifically configured to: after determining the abnormal work processes, probe, through the first monitoring process, whether the working node hosting each abnormal work process is abnormal; record the identifiers of the abnormal working nodes according to the probe results; select, according to the number of abnormal work processes, a corresponding number of nodes from the cluster server node list, excluding the abnormal working nodes, each selected node creating one work process to replace an abnormal work process; receive, through the first monitoring process, the work process sequence number returned by each newly created work process and the working node identifier of the working node where it resides; update the stored work process list and working node list, respectively, according to the received work process sequence numbers and working node identifiers; notify each monitoring process in the monitoring process list to update its stored work process list; and notify each work process in the work process list to update its stored work process list.
8. The system according to claim 7, characterized in that:
The monitor node is specifically configured to send, through the first monitoring process, a monitoring process creation request to any process running on each selected node;
The selected node is configured to create, through the process that runs on it and receives the work process creation request, a subprocess of that process, and to forward the work process creation request to the created subprocess; the subprocess marks its own process category as work process according to the received work process creation request, and returns, through the process that received the work process creation request, the sequence number of the newly created work process and the node identifier of the node where it resides to the first monitoring process.
9. A cluster server disaster recovery method, characterized by comprising:
A monitor node, through the first monitoring process running on itself, monitoring whether each monitoring process in the stored monitoring process list and each work process in the stored work process list is abnormal, according to a preset first monitoring frequency and a preset second monitoring frequency, respectively;
When it is determined that at least one monitoring process is abnormal, performing abnormal monitoring process recovery; and
When it is determined that at least one work process is abnormal, performing abnormal work process recovery.
10. The method according to claim 9, characterized in that it is determined that at least one monitoring process is abnormal according to the following process:
When the monitor node detects that at least one monitoring process is abnormal, it sends a status detection message, through the first monitoring process, to the other (second) monitoring processes in the monitoring process list;
The identifiers of the abnormal monitoring processes are recorded according to the detection results;
It is determined that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
11. The method according to claim 9 or 10, characterized in that it is detected that at least one monitoring process is abnormal according to the following method:
The monitor node, through the first monitoring process, forms a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and initiates message passing;
If the passed message is not received within a preset duration, it is determined that at least one monitoring process is abnormal.
12. The method according to claim 10, characterized in that abnormal monitoring process recovery is performed according to the following process:
According to the number of abnormal monitoring processes, the monitor node, through the first monitoring process, selects from the cluster server node list a corresponding number of working nodes, excluding the monitor nodes contained in the monitor node list, and each selected working node creates one monitoring process to replace an abnormal monitoring process; and
Through the first monitoring process, the monitoring process sequence number returned by each newly created monitoring process and the node identifier of the node where it resides are received;
The monitoring process list and the monitor node list are updated, respectively, according to the received monitoring process sequence numbers and node identifiers;
Each monitoring process in the monitoring process list is notified to update its stored monitoring process list; and
Each work process in the work process list is notified to update its stored monitoring process list.
13. The method according to claim 9, characterized in that it is determined that at least one work process is abnormal according to the following process:
When the monitor node detects that at least one work process is abnormal, it sends a service state detection message, through the first monitoring process, to each work process according to the stored work process list;
The identifiers of the abnormal work processes are recorded according to the detection results;
It is determined that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
14. The method according to claim 13, characterized in that abnormal work process recovery is performed according to the following process:
After determining the abnormal work processes, the monitor node, through the first monitoring process, probes whether the working node hosting each abnormal work process is abnormal; and
The identifiers of the abnormal working nodes are recorded according to the probe results; and
According to the number of abnormal work processes, a corresponding number of nodes, excluding the abnormal working nodes, are selected from the cluster server node list, and each selected node creates one work process to replace an abnormal work process;
The monitor node, through the first monitoring process, receives the work process sequence number returned by each newly created work process and the working node identifier of the working node where it resides;
The stored work process list and working node list are updated, respectively, according to the received work process sequence numbers and working node identifiers; and
Each monitoring process in the monitoring process list is notified to update its stored work process list; and
Each work process in the work process list is notified to update its stored work process list.
15. A server node, characterized by comprising:
A monitoring unit, configured to monitor, through the first monitoring process running on the server node, whether each monitoring process in the stored monitoring process list and each work process in the stored work process list is abnormal, according to a preset first monitoring frequency and a preset second monitoring frequency, respectively;
An exception processing unit, configured to perform abnormal monitoring process recovery when it is determined that at least one monitoring process is abnormal, and to perform abnormal work process recovery when it is determined that at least one work process is abnormal.
16. The server node according to claim 15, characterized in that:
The monitoring unit is specifically configured to: when at least one monitoring process is detected to be abnormal, send a status detection message, through the first monitoring process, to the other (second) monitoring processes in the monitoring process list; record the identifiers of the abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
17. The server node according to claim 15 or 16, characterized in that:
The monitoring unit is specifically configured to form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and to initiate message passing; if the passed message is not received within a preset duration, it is determined that at least one monitoring process is abnormal.
18. The server node according to claim 16, characterized in that the exception processing unit comprises:
A first selection subunit, configured to select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, each selected working node creating one monitoring process to replace an abnormal monitoring process;
A first receiving subunit, configured to receive, through the first monitoring process, the monitoring process sequence number returned by each newly created monitoring process and the node identifier of the node where it resides;
A first updating subunit, configured to update the monitoring process list and the monitor node list, respectively, according to the received monitoring process sequence numbers and node identifiers;
A first notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored monitoring process list, and to notify each work process in the work process list to update its stored monitoring process list.
19. The server node according to claim 15, characterized in that:
The monitoring unit is specifically configured to: when at least one work process is detected to be abnormal, send a service state detection message, through the first monitoring process, to each work process according to the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
20. The server node according to claim 19, characterized in that the exception processing unit comprises:
A probing subunit, configured to probe, through the first monitoring process and after the abnormal work processes are determined, whether the working node hosting each abnormal work process is abnormal, and to record the identifiers of the abnormal working nodes according to the probe results;
A second selection subunit, configured to select, according to the number of abnormal work processes, a corresponding number of nodes from the cluster server node list, excluding the abnormal working nodes, each selected node creating one work process to replace an abnormal work process;
A second receiving subunit, configured to receive, through the first monitoring process, the work process sequence number returned by each newly created work process and the working node identifier of the working node where it resides;
A second updating subunit, configured to update the stored work process list and working node list, respectively, according to the received work process sequence numbers and working node identifiers;
A second notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored work process list, and to notify each work process in the work process list to update its stored work process list.
CN201510385495.9A 2015-07-03 2015-07-03 Cluster server disaster recovery system and method, and server node Pending CN106330523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510385495.9A CN106330523A (en) 2015-07-03 2015-07-03 Cluster server disaster recovery system and method, and server node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510385495.9A CN106330523A (en) 2015-07-03 2015-07-03 Cluster server disaster recovery system and method, and server node

Publications (1)

Publication Number Publication Date
CN106330523A true CN106330523A (en) 2017-01-11

Family

ID=57727266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510385495.9A Pending CN106330523A (en) 2015-07-03 2015-07-03 Cluster server disaster recovery system and method, and server node

Country Status (1)

Country Link
CN (1) CN106330523A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665158A (en) * 2017-09-22 2018-02-06 郑州云海信息技术有限公司 A kind of storage cluster restoration methods and equipment
CN108776633A (en) * 2018-05-22 2018-11-09 深圳壹账通智能科技有限公司 Method, terminal device and the computer readable storage medium of monitoring process operation
CN108845916A (en) * 2018-07-03 2018-11-20 中国联合网络通信集团有限公司 Platform monitoring and alarm method, device, equipment and computer readable storage medium
CN109150666A (en) * 2018-10-11 2019-01-04 深圳互联先锋科技有限公司 A method of preventing website delay machine
CN109375873A (en) * 2018-09-27 2019-02-22 郑州云海信息技术有限公司 The initial method of data processing finger daemon in a kind of distributed storage cluster
CN109408158A (en) * 2018-11-06 2019-03-01 恒生电子股份有限公司 Method and device, storage medium and the electronic equipment that subprocess is exited with parent process
CN110032487A (en) * 2018-11-09 2019-07-19 阿里巴巴集团控股有限公司 Keep Alive supervision method, apparatus and electronic equipment
CN111506480A (en) * 2020-04-23 2020-08-07 上海达梦数据库有限公司 State detection method, device and system for components in cluster
CN111800304A (en) * 2019-04-09 2020-10-20 安克创新科技股份有限公司 Process running monitoring method, storage medium and virtual device
CN111988188A (en) * 2020-09-03 2020-11-24 深圳壹账通智能科技有限公司 Transaction endorsement method, device and storage medium
CN112035721A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Crawler cluster monitoring method and device, storage medium and computer equipment
CN112994977A (en) * 2021-02-24 2021-06-18 紫光云技术有限公司 Method for high availability of server host
CN113704026A (en) * 2021-10-28 2021-11-26 北京时代正邦科技股份有限公司 Distributed financial memory database security synchronization method, device and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1477509A (en) * 2002-08-19 2004-02-25 万达信息股份有限公司 Process automatic restoring method
US20040073657A1 (en) * 2002-10-11 2004-04-15 John Palmer Indirect measurement of business processes
CN1512375A (en) * 2002-12-31 2004-07-14 联想(北京)有限公司 Fault-tolerance approach using machine group node interacting backup
CN1996257A (en) * 2006-12-26 2007-07-11 华为技术有限公司 Method and system for monitoring process
CN102739435A (en) * 2011-03-31 2012-10-17 微软公司 Fault detection and recovery as service

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107665158A (en) * 2017-09-22 2018-02-06 郑州云海信息技术有限公司 Storage cluster recovery method and device
CN108776633A (en) * 2018-05-22 2018-11-09 深圳壹账通智能科技有限公司 Method, terminal device and the computer readable storage medium of monitoring process operation
CN108776633B (en) * 2018-05-22 2021-07-02 深圳壹账通智能科技有限公司 Method for monitoring process operation, terminal equipment and computer readable storage medium
CN108845916A (en) * 2018-07-03 2018-11-20 中国联合网络通信集团有限公司 Platform monitoring and alarm method, device, equipment and computer readable storage medium
CN109375873A (en) * 2018-09-27 2019-02-22 郑州云海信息技术有限公司 Initialization method for a data processing daemon in a distributed storage cluster
CN109150666A (en) * 2018-10-11 2019-01-04 深圳互联先锋科技有限公司 Method for preventing website downtime
CN109408158A (en) * 2018-11-06 2019-03-01 恒生电子股份有限公司 Method and device, storage medium and the electronic equipment that subprocess is exited with parent process
CN109408158B (en) * 2018-11-06 2022-11-18 恒生电子股份有限公司 Method and device for quitting child process along with parent process, storage medium and electronic equipment
CN110032487A (en) * 2018-11-09 2019-07-19 阿里巴巴集团控股有限公司 Keep Alive supervision method, apparatus and electronic equipment
CN111800304A (en) * 2019-04-09 2020-10-20 安克创新科技股份有限公司 Process running monitoring method, storage medium and virtual device
CN111506480A (en) * 2020-04-23 2020-08-07 上海达梦数据库有限公司 State detection method, device and system for components in cluster
CN111506480B (en) * 2020-04-23 2024-03-08 上海达梦数据库有限公司 Method, device and system for detecting states of components in cluster
CN112035721A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Crawler cluster monitoring method and device, storage medium and computer equipment
WO2022048357A1 (en) * 2020-09-03 2022-03-10 深圳壹账通智能科技有限公司 Transaction endorsement method and apparatus, and storage medium
CN111988188A (en) * 2020-09-03 2020-11-24 深圳壹账通智能科技有限公司 Transaction endorsement method, device and storage medium
CN112994977A (en) * 2021-02-24 2021-06-18 紫光云技术有限公司 Method for high availability of server host
CN113704026B (en) * 2021-10-28 2022-01-25 北京时代正邦科技股份有限公司 Distributed financial memory database security synchronization method, device and medium
CN113704026A (en) * 2021-10-28 2021-11-26 北京时代正邦科技股份有限公司 Distributed financial memory database security synchronization method, device and medium

Similar Documents

Publication Publication Date Title
CN106330523A (en) Cluster server disaster recovery system and method, and server node
KR100658913B1 (en) A scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters
CN108847982B (en) Distributed storage cluster and node fault switching method and device thereof
CN109656742B (en) Node exception handling method and device and storage medium
CN107147540A (en) Fault handling method and troubleshooting cluster in highly available system
CN111953566B (en) Distributed fault monitoring-based method and virtual machine high-availability system
CN113067850B (en) Cluster arrangement system under multi-cloud scene
CN105681077A (en) Fault processing method, device and system
CN108347339B (en) Service recovery method and device
CN110618864A (en) Interrupt task recovery method and device
CN105589756A (en) Batch processing cluster system and method
JP2020115330A (en) System and method of monitoring software application process
JP5855724B1 (en) Virtual device management apparatus, virtual device management method, and virtual device management program
CN108632106A (en) System for monitoring service equipment
JP5285045B2 (en) Failure recovery method, server and program in virtual environment
CN113986450A (en) Virtual machine backup method and device
CN113377535A (en) Distributed timing task allocation method, device, equipment and readable storage medium
CN114598591A (en) Embedded platform node fault recovery system and method
JP2018169920A (en) Management device, management method and management program
CN113672452A (en) Method and system for monitoring operation of data acquisition task
CN112260902A (en) Network equipment monitoring method, device, equipment and storage medium
CN115499296B (en) Cloud desktop hot standby management method, device and system
CN1722627A (en) A method and device for realizing switching between main and backup units in communication equipment
JP2019197352A (en) Service continuing system and service continuing method
CN112328375B (en) Correlation method and device for tracking data segments of distributed system

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170111