CN106330523A - Cluster server disaster recovery system and method, and server node - Google Patents
- Publication number
- CN106330523A (CN201510385495.9A)
- Authority
- CN
- China
- Prior art keywords
- work
- monitoring process
- progress
- monitoring
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0668—Management of faults, events, alarms or notifications using network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L1/00—Arrangements for detecting or preventing errors in the information received
- H04L1/22—Arrangements for detecting or preventing errors in the information received using redundant apparatus to increase reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
- H04L67/025—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP] for remote control or remote monitoring of applications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Abstract
The invention discloses a cluster server disaster recovery system and method, and a server node, and is used for providing a disaster recovery mechanism based on a cluster server and ensuring the operation stability and service reliability of the cluster server. The cluster server disaster recovery system comprises multiple monitor nodes and multiple work nodes. The monitor nodes are used for operating a first monitor process. According to the first monitor process, whether various monitor processes in a stored monitor process list are abnormal or not is monitored according to preset first monitor frequency. When it is monitored that at least one monitor process is abnormal, abnormal monitor process recovery is carried out. According to the first monitor process, whether various work processes in a stored work process list are abnormal or not is monitored according to preset second frequency. When it is monitored that at least one work process is abnormal, abnormal work process recovery is carried out. The work nodes are used for operating the work processes. The work processes report own service statuses to various monitor processes in the monitor process list on the basis of the second monitor frequency according to the stored monitoring process list.
Description
Technical field
The present invention relates to the technical field of cloud data processing, and in particular to a cluster server disaster recovery system, method, and server node.
Background technology
A disaster recovery mechanism is one of the indispensable technologies for emergency handling of faults in a BOSS (Business & Operation Support System) service system, and the mechanism differs with the computing architecture of the BOSS system. Existing BOSS systems are based on the traditional single-server, multi-process (or multi-thread) computing architecture, and their disaster recovery schemes are divided by fault level into program-level and host-level mechanisms. For program-level disaster recovery, crontab (scheduled tasks) is typically used to check the in-memory active process table at a set frequency, and any process found to have stopped abnormally is restarted. For host-level disaster recovery, when the host running a process crashes, HA (high availability) switching is used to move the process to a standby host: the standby emergency host and the process are started, and the process is restarted within a tolerable time so that service continues.
With the development of cluster computing, cluster-based BOSS service systems are bound to become an alternative to the existing single high-performance host. To protect existing hardware investment, a cluster-based BOSS service system has to be deployed on a small number of high-performance hosts and a large number of ordinary machines at the same time. Its characteristics are that heterogeneous AIX and Linux systems coexist, and that a few high-performance nodes and many ordinary nodes jointly provide service in the form of a computing cloud.
Cluster cloud computing is characterized by multiple nodes on different machines taking part in a computation at the same time, yet existing cluster computing platforms do not provide a reliable process fault-tolerance mechanism. How to ensure the computational stability of the processes of a cluster-based BOSS service system, and how to prevent the failure of a host node from affecting service, has therefore become one of the technical problems urgently to be solved in the prior art.
Summary of the invention
Embodiments of the present invention provide a cluster server disaster recovery system, method, and server node, in order to provide a disaster recovery mechanism based on a cluster server and to ensure the operational stability and service reliability of the cluster server.
An embodiment of the present invention provides a cluster server disaster recovery system comprising multiple monitor nodes and multiple work nodes, wherein:
The monitor node is configured to run a first monitor process. The first monitor process monitors, at a preset first monitor frequency, whether each monitor process in a stored monitor process list is abnormal, and performs abnormal monitor process recovery when at least one monitor process is found to be abnormal. The first monitor process also monitors, at a preset second monitor frequency, whether each work process in a stored work process list is abnormal, and performs abnormal work process recovery when at least one work process is found to be abnormal.
The work node is configured to run work processes. According to its stored monitor process list, each work process reports its own service state to every monitor process in that list at the second monitor frequency.
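The reporting relationship between work processes and a monitor process can be sketched in a few lines of Python. This is an illustrative model only, not the patented implementation: the class name `FirstMonitorProcess`, its fields, and the rule that a work process counts as overdue once more than one report period has passed since its last report are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FirstMonitorProcess:
    """Hypothetical model of the first monitor process on a monitor node."""
    work_period: float   # seconds per report cycle (stands in for the second monitor frequency)
    work_list: list      # stored work process list (process sequence numbers)
    last_report: dict = field(default_factory=dict)  # sequence number -> time of last report

    def record_report(self, rank, now):
        """Called when a work process reports its service state."""
        self.last_report[rank] = now

    def overdue_work_processes(self, now):
        """Return work processes with no report within the last period (assumed rule)."""
        never = float("-inf")
        return [r for r in self.work_list
                if now - self.last_report.get(r, never) > self.work_period]


monitor = FirstMonitorProcess(work_period=4.0, work_list=[3, 4, 5])
monitor.record_report(3, now=0.0)
monitor.record_report(4, now=5.0)
suspects = monitor.overdue_work_processes(now=6.0)
print(suspects)
```

In this toy run, process 3 reported too long ago and process 5 never reported, so both would be candidates for the service state detection step.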
The monitor node is specifically configured to, when at least one monitor process is found to be abnormal, send a status detection message through the first monitor process to each of the other (second) monitor processes in the monitor process list; record the identifiers of the abnormal monitor processes according to the detection results; and determine that the monitor processes corresponding to the recorded identifiers are abnormal.
The monitor node is specifically configured to have the first monitor process arrange the monitor processes in the monitor process list into a message-passing ring in ascending order of their monitor process sequence numbers, and to initiate message passing; if the passed message is not received back within a preset duration, it is determined that at least one monitor process is abnormal.
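A minimal sketch of the ring check, written as a single-process Python simulation rather than real inter-process messaging. The function names `build_ring` and `token_returns` are hypothetical, and a dead process dropping the token stands in for the initiator timing out after the preset duration.

```python
def build_ring(monitor_ranks):
    """Map each monitor process to its successor, in ascending sequence-number order."""
    ordered = sorted(monitor_ranks)
    return {rank: ordered[(i + 1) % len(ordered)] for i, rank in enumerate(ordered)}


def token_returns(ring, start, alive):
    """Pass a token around the ring; True only if it comes back to `start`.

    A process that is not in `alive` drops the token, which models the
    initiator not receiving the message within the preset duration.
    """
    current = ring[start]
    while current != start:
        if current not in alive:
            return False
        current = ring[current]
    return True


ring = build_ring([7, 0, 3])
all_ok = token_returns(ring, start=0, alive={0, 3, 7})
one_down = token_returns(ring, start=0, alive={0, 7})
```

A failed round trip only shows that at least one monitor process is abnormal; identifying which one requires the per-process status detection messages described in the text.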
The monitor node is specifically configured to have the first monitor process select from the cluster server node list, according to the number of abnormal monitor processes, a corresponding number of work nodes other than the monitor nodes contained in the monitor node list, and create one monitor process on each selected node to replace an abnormal monitor process; receive, through the first monitor process, the monitor process sequence number returned by each newly created monitor process and the node identifier of the node on which it runs; update the monitor process list and the monitor node list according to the received sequence numbers and node identifiers; notify each monitor process in the monitor process list to update its stored monitor process list; and notify each work process in the work process list to update its stored monitor process list.
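The selection and list-update steps of monitor process recovery might look like the following sketch. It is a simplified model under stated assumptions: the monitor process list and monitor node list are merged into one dictionary, the `spawn` callable is a hypothetical stand-in for creating a process on a node and receiving its sequence number back, and candidates are taken in cluster-list order, which the text does not prescribe.

```python
def recover_monitor_processes(cluster_nodes, monitor_table, abnormal_ranks, spawn):
    """Replace abnormal monitor processes and return the updated monitor table.

    cluster_nodes  -- cluster server node list
    monitor_table  -- dict: monitor process sequence number -> node identifier
    abnormal_ranks -- sequence numbers of the abnormal monitor processes
    spawn          -- hypothetical callable: node -> (new sequence number, node identifier)
    """
    monitor_nodes = set(monitor_table.values())
    # Candidates are nodes that are not already in the monitor node list.
    candidates = [n for n in cluster_nodes if n not in monitor_nodes]
    if len(candidates) < len(abnormal_ranks):
        raise RuntimeError("not enough spare nodes to replace all abnormal monitor processes")

    # Drop the abnormal entries, then add one new monitor process per abnormal one.
    updated = {r: n for r, n in monitor_table.items() if r not in abnormal_ranks}
    for node in candidates[:len(abnormal_ranks)]:
        new_rank, node_id = spawn(node)  # the new process reports its number and node
        updated[new_rank] = node_id
    return updated


table = recover_monitor_processes(
    cluster_nodes=["n1", "n2", "n3", "n4"],
    monitor_table={0: "n1", 4: "n2"},
    abnormal_ranks={4},
    spawn=lambda node: (10, node),
)
```

The notification of the other monitor processes and of the work processes is left out; in the described system the updated list would then be pushed to every process that stores a copy.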
The monitor node is specifically configured to send, through the first monitor process, a monitor process creation request to any work process running on each selected work node.
On each selected work node, the work process that receives the monitor process creation request creates a child process of itself and forwards the request to that child process. The child process, according to the received monitor process creation request, marks its own process category as monitor process, and returns the sequence number of the newly created monitor process and the node identifier of its node to the first monitor process via the work process.
The monitor node is specifically configured to, when at least one work process is found to be abnormal, send a service state detection message through the first monitor process to each work process in the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded identifiers are abnormal.
The monitor node is specifically configured to, after the abnormal work processes have been determined, probe through the first monitor process whether the work node hosting each abnormal work process is itself abnormal, and record the identifiers of the abnormal work nodes according to the probe results; select from the cluster server node list, according to the number of abnormal work processes, a corresponding number of nodes other than the abnormal work nodes, and create one work process on each selected node to replace an abnormal work process; receive, through the first monitor process, the work process sequence number returned by each newly created work process and the identifier of the work node on which it runs; update the stored work process list and work node list according to the received sequence numbers and work node identifiers; notify each monitor process in the monitor process list to update its stored work process list; and notify each work process in the work process list to update its stored work process list.
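The node-probing and node-selection part of work process recovery can be sketched as below. The function name `plan_work_recovery`, the `node_is_alive` callback, and the choice of taking replacement nodes in cluster-list order are assumptions for illustration; the text only requires that abnormal work nodes be excluded and that the number of selected nodes match the number of abnormal work processes.

```python
def plan_work_recovery(cluster_nodes, work_table, abnormal_ranks, node_is_alive):
    """Return (abnormal work nodes, nodes chosen to host replacement work processes).

    work_table    -- dict: work process sequence number -> node identifier
    node_is_alive -- hypothetical probe: node identifier -> bool
    """
    # Probe the node behind each abnormal work process and record the dead ones.
    abnormal_nodes = {work_table[r] for r in abnormal_ranks
                      if not node_is_alive(work_table[r])}
    # Any node that is not itself abnormal may host a replacement.
    candidates = [n for n in cluster_nodes if n not in abnormal_nodes]
    if len(candidates) < len(abnormal_ranks):
        raise RuntimeError("not enough healthy nodes to replace all abnormal work processes")
    return abnormal_nodes, candidates[:len(abnormal_ranks)]


bad_nodes, targets = plan_work_recovery(
    cluster_nodes=["a", "b", "c"],
    work_table={1: "a", 2: "b", 3: "c"},
    abnormal_ranks=[2, 3],
    node_is_alive=lambda node: node != "c",
)
```

Here node `b` is healthy even though its work process failed, so it remains eligible, while node `c` is excluded.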
The monitor node is specifically configured to send, through the first monitor process, a work process creation request to any process running on each selected node.
On each selected node, the process that receives the work process creation request creates a child process of itself and forwards the request to that child process. The child process, according to the received work process creation request, marks its own process category as work process, and returns the sequence number of the newly created work process and the node identifier of its node to the first monitor process via the process that received the request.
An embodiment of the present invention provides a cluster server disaster recovery method, comprising:
a monitor node monitoring, through a first monitor process that it runs, whether each monitor process in a stored monitor process list is abnormal at a preset first monitor frequency, and whether each work process in a stored work process list is abnormal at a preset second monitor frequency;
performing abnormal monitor process recovery when at least one monitor process is found to be abnormal; and
performing abnormal work process recovery when at least one work process is found to be abnormal.
Preferably, the abnormal monitor processes are determined according to the following procedure:
when at least one monitor process is found to be abnormal, the monitor node sends a status detection message through the first monitor process to each of the other (second) monitor processes in the monitor process list;
the identifiers of the abnormal monitor processes are recorded according to the detection results; and
the monitor processes corresponding to the recorded identifiers are determined to be abnormal.
Preferably, it is determined that at least one monitor process has been found abnormal according to the following method:
the monitor node has the first monitor process arrange the monitor processes in the monitor process list into a message-passing ring in ascending order of their monitor process sequence numbers, and initiates message passing;
if the passed message is not received back within a preset duration, it is determined that at least one monitor process is abnormal.
Preferably, abnormal monitor process recovery is performed according to the following procedure:
according to the number of abnormal monitor processes, the monitor node has the first monitor process select from the cluster server node list a corresponding number of work nodes other than the monitor nodes contained in the monitor node list, and create one monitor process on each selected node to replace an abnormal monitor process;
the first monitor process receives the monitor process sequence number returned by each newly created monitor process and the node identifier of the node on which it runs;
the monitor process list and the monitor node list are updated according to the received monitor process sequence numbers and node identifiers;
each monitor process in the monitor process list is notified to update its stored monitor process list; and
each work process in the work process list is notified to update its stored monitor process list.
Preferably, the abnormal work processes are determined according to the following procedure:
when at least one work process is found to be abnormal, the monitor node sends a service state detection message through the first monitor process to each work process in the stored work process list;
the identifiers of the abnormal work processes are recorded according to the detection results; and
the work processes corresponding to the recorded identifiers are determined to be abnormal.
Preferably, abnormal work process recovery is performed according to the following procedure:
after the abnormal work processes have been determined, the monitor node probes, through the first monitor process, whether the work node hosting each abnormal work process is itself abnormal;
the identifiers of the abnormal work nodes are recorded according to the probe results;
according to the number of abnormal work processes, a corresponding number of nodes other than the abnormal work nodes is selected from the cluster server node list, and one work process is created on each selected node to replace an abnormal work process;
the monitor node receives, through the first monitor process, the work process sequence number returned by each newly created work process and the identifier of the work node on which it runs;
the stored work process list and work node list are updated according to the received work process sequence numbers and work node identifiers;
each monitor process in the monitor process list is notified to update its stored work process list; and
each work process in the work process list is notified to update its stored work process list.
An embodiment of the present invention provides a server node, comprising:
a monitoring unit, configured to monitor, through a first monitor process running on the server node, whether each monitor process in a stored monitor process list is abnormal at a preset first monitor frequency, and whether each work process in a stored work process list is abnormal at a preset second monitor frequency; and
an exception handling unit, configured to perform abnormal monitor process recovery when at least one monitor process is found to be abnormal, and to perform abnormal work process recovery when at least one work process is found to be abnormal.
The monitoring unit is specifically configured to, when at least one monitor process is found to be abnormal, send a status detection message through the first monitor process to each of the other (second) monitor processes in the monitor process list; record the identifiers of the abnormal monitor processes according to the detection results; and determine that the monitor processes corresponding to the recorded identifiers are abnormal.
The monitoring unit is specifically configured to have the first monitor process arrange the monitor processes in the monitor process list into a message-passing ring in ascending order of their monitor process sequence numbers, and to initiate message passing; if the passed message is not received back within a preset duration, it is determined that at least one monitor process is abnormal.
The exception handling unit comprises:
a first selection subunit, configured to have the first monitor process select from the cluster server node list, according to the number of abnormal monitor processes, a corresponding number of work nodes other than the monitor nodes contained in the monitor node list, and create one monitor process on each selected node to replace an abnormal monitor process;
a first receiving subunit, configured to receive, through the first monitor process, the monitor process sequence number returned by each newly created monitor process and the node identifier of the node on which it runs;
a first updating subunit, configured to update the monitor process list and the monitor node list according to the received monitor process sequence numbers and node identifiers; and
a first notification subunit, configured to notify each monitor process in the monitor process list to update its stored monitor process list, and to notify each work process in the work process list to update its stored monitor process list.
The monitoring unit is specifically configured to, when at least one work process is found to be abnormal, send a service state detection message through the first monitor process to each work process in the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded identifiers are abnormal.
The exception handling unit comprises:
a probing subunit, configured to, after the abnormal work processes have been determined, probe through the first monitor process whether the work node hosting each abnormal work process is itself abnormal, and to record the identifiers of the abnormal work nodes according to the probe results;
a second selection subunit, configured to select from the cluster server node list, according to the number of abnormal work processes, a corresponding number of nodes other than the abnormal work nodes, and to create one work process on each selected node to replace an abnormal work process;
a second receiving subunit, configured to receive, through the first monitor process, the work process sequence number returned by each newly created work process and the identifier of the work node on which it runs;
a second updating subunit, configured to update the stored work process list and work node list according to the received work process sequence numbers and work node identifiers; and
a second notification subunit, configured to notify each monitor process in the monitor process list to update its stored work process list, and to notify each work process in the work process list to update its stored work process list.
With the cluster server disaster recovery system, method, and server node provided by the embodiments of the present invention, a monitor node, through the first monitor process that it runs, monitors at the first monitor frequency whether any monitor process in its stored monitor process list is abnormal, and monitors at the second monitor frequency whether any work process in its stored work process list is abnormal. When at least one monitor process is found to be abnormal, monitor process recovery is performed; when at least one work process is found to be abnormal, work process recovery is performed. This realizes abnormality monitoring and recovery for the cluster server and ensures its operational stability and service reliability.
Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by implementing the present invention. The objects and other advantages of the present invention can be realized and obtained through the structures particularly pointed out in the written description, the claims, and the accompanying drawings.
Description of the drawings
The accompanying drawings described here are provided for further understanding of the present invention and constitute a part of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a schematic structural diagram of the cluster server disaster recovery system in an embodiment of the present invention;
Fig. 2 is a schematic diagram of message passing between monitor processes and work processes in an embodiment of the present invention;
Fig. 3 is a schematic flow diagram of cluster server initialization in an embodiment of the present invention;
Fig. 4 is a schematic flow diagram, in an embodiment of the present invention, of performing service state detection on work processes and performing work process recovery when an abnormal work process is found;
Fig. 5 is a schematic flow diagram of the cluster server disaster recovery method in an embodiment of the present invention;
Fig. 6 is a schematic flow diagram of determining that at least one monitor process is abnormal in an embodiment of the present invention;
Fig. 7 is a schematic flow diagram of determining that at least one abnormal monitor process has been detected in an embodiment of the present invention;
Fig. 8 is a schematic flow diagram of performing abnormal monitor process recovery in an embodiment of the present invention;
Fig. 9 is a schematic flow diagram of determining that at least one work process is abnormal in an embodiment of the present invention;
Fig. 10 is a schematic flow diagram of performing abnormal work process recovery in an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the server node in an embodiment of the present invention.
Detailed description of the invention
In order to provide a disaster recovery mechanism based on a cluster server and to ensure the operational stability and service reliability of the cluster server, embodiments of the present invention provide a cluster server disaster recovery system, method, and server node.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it, and that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another.
Fig. 1 is a schematic structural diagram of the cluster server disaster recovery system provided by an embodiment of the present invention.
Preferably, in a specific implementation, the server nodes of the cluster server can be ranked from high to low by node performance (for example, the computing capability of each node) to build a server node list. On this basis, two node lists are built: a monitor node list and a work node list. M server nodes (M being a positive integer greater than or equal to 1) are selected in order from the server node list as monitor nodes, and all server nodes are added to the work node list as work nodes. After the cluster server system starts running, one monitor process is created for each server node in the monitor node list, and one or more work processes are created for each server node in the work node list. In this way, a server node in the monitor node list runs exactly one monitor process, and a server node in the work node list runs at least one work process; a server node that is in both the monitor node list and the work node list runs one monitor process and at least one work process.
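The construction of the three node lists can be sketched as follows. The function name `build_node_lists` and the use of a single numeric performance score per node are assumptions for the example; the text only says nodes are ranked from high to low by performance.

```python
def build_node_lists(performance, m):
    """Build the server, monitor, and work node lists.

    performance -- dict: node identifier -> performance score (assumed numeric)
    m           -- number of monitor nodes, a positive integer
    """
    if m < 1:
        raise ValueError("M must be a positive integer")
    # Server node list: all nodes ranked from highest to lowest performance.
    server_nodes = sorted(performance, key=performance.get, reverse=True)
    monitor_nodes = server_nodes[:m]   # the first M nodes become monitor nodes
    work_nodes = list(server_nodes)    # every node is also a work node
    return server_nodes, monitor_nodes, work_nodes


server_nodes, monitor_nodes, work_nodes = build_node_lists(
    {"a": 2, "b": 8, "c": 5}, m=1
)
```

With these scores, node `b` is both the single monitor node and a work node, so it would run one monitor process and at least one work process.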
The cluster server disaster recovery system provided by the embodiment of the present invention works as follows. After the cluster server system starts running, the running processes are divided into a monitor process group and a work process group. The monitor processes of the monitor process group monitor the running state and service state of the work processes of the work process group at a specified frequency, and at the same time monitor the running state of the other monitor processes. The processes of the work process group are responsible for business logic processing; they keep the current task list and report their own running state and service state to the monitor processes of the monitor process group at a specified frequency. As shown in Fig. 2, a monitor process and a work process can be switched from one role to the other.
When the cluster server system initializes, nodes and processes are started according to the capability list of the server nodes. For a machine that is in both the monitor node list and the work node list, at least two processes are generated; for a machine that is only in the work node list, at least one process is generated. The system state is marked as the initialization state. On each monitor node, the process with the minimum sequence number (rank) is taken as the monitor process of that node, its process category is marked as monitor process, and it is registered in the monitor process group. The monitor processes form a chained message-passing structure in ascending order of process sequence number; once it has been verified that all monitor processes and the nodes they run on are working normally, the first monitor frequency of the monitor process group is set. All non-monitor processes are then collected, their process category is marked as work process, and they are registered in the work process group; once it has been verified that all work processes and the nodes they run on are working normally, the report frequency of the work process group is set to the second monitor frequency (staggered in time from the first monitor frequency). The established monitor process group notifies, by broadcast at the set frequency, each process of the work process group to save the monitor process list of the monitor process group, and at the same time provides the unified report frequency of the work processes (that is, the second monitor frequency). Each work process sets a timer to this frequency and sleeps for that interval; after being woken by the system at the specified time, it reports its service status to each monitor process in the monitor process list saved locally by the process. The cluster server disaster recovery system has then completed initialization, and the system state is marked as the working state.
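The sleep-then-report cycle of a work process can be sketched as below. This is a minimal single-process illustration: `run_work_process`, the injected `send` and `sleep` callables, the bounded `cycles` count, and the message fields are all hypothetical and chosen so the loop can be exercised without real timers or messaging.

```python
def run_work_process(rank, monitor_list, period, cycles, send, sleep):
    """Sleep for one report period, then report to every stored monitor process.

    rank         -- sequence number of this work process
    monitor_list -- locally stored monitor process list
    period       -- report interval in seconds (stands in for the second monitor frequency)
    cycles       -- number of report cycles to run (bounded so the sketch terminates)
    send, sleep  -- injected callables, hypothetical stand-ins for messaging and the timer
    """
    for _ in range(cycles):
        sleep(period)  # wait until the timer wakes the process
        for monitor_rank in monitor_list:
            send(monitor_rank, {"from": rank, "state": "serving"})


sent, slept = [], []
run_work_process(
    rank=5, monitor_list=[0, 2], period=3.0, cycles=2,
    send=lambda dest, msg: sent.append((dest, msg["from"])),
    sleep=slept.append,
)
```

Passing `time.sleep` and a real transport in place of the two lambdas would turn the same loop into a periodic heartbeat.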
For a better understanding of the embodiment of the present invention, the initialization procedure of the cluster server system is described in detail below.
First, assume that the cluster server system uses the hydra process manager to start and manage processes, and that the error handler is set to MPI_ERRORS_RETURN so that, when a process has failed, communication calls that access it return a process-failure error code. Specifically, as shown in Fig. 3, initialization may comprise the following steps:
The process sequence number of the process each process of acquisition that S31, process sequence number are minimum and each process place node
Node identification.
Concrete, start aggregated server system by configuration file, generate whole process, pass through process sequence
The process of number minimum (rank=0) is initiated MonitorProcessGroupCreate and is called, and starts component monitor
Process group, MonitorProcessGroupCreate sends and obtains process sequence number request to Servers-all joint
Point.After the process run on Servers-all node receives acquisition process sequence number request, return its process
Serial number, the node identification of place node give visiting process.
S32. The process with the minimum process sequence number determines the monitor processes.
Specifically, according to the obtained process sequence numbers, MonitorProcessGroupCreate sends, to the process with the minimum sequence number running on each monitor node in the monitor node list, a request to mark that process as belonging to the monitor process category. After each target process (that is, the process with the minimum sequence number on each monitor node) receives the request to modify its process category, it changes its process category to monitor process and returns its process category information to the rank=0 process.
S33. The process with the minimum process sequence number generates the monitor process list and notifies each monitor process to store it.
Specifically, the rank=0 process generates and stores the monitor process list and sends each process in the monitor process group a message to save the monitor process list; each process in the monitor process list stores the list after receiving the message.
S34. The process with the lowest process sequence number creates the work processes.
The rank=0 process initiates the ServiceProcessGroupCreate call and begins building the
work process group. ServiceProcessGroupCreate obtains the sequence numbers of the
processes that are not in the monitoring process list, together with the node identifiers
of their server nodes. Based on the process sequence numbers obtained, it sends each
corresponding process a request to mark that process as belonging to the work process
category. After a target process (i.e., a process corresponding to one of the obtained
sequence numbers) receives the request to modify its process category, it changes its
process category to work process and returns its process category information to the
rank=0 process.
S35. The process with the lowest process sequence number generates the work process list,
notifies every process to store the work process list, and notifies every work process to
store the monitoring process list.
The rank=0 process generates and stores the work process list and sends a message to all
processes (monitoring processes and work processes) instructing them to save it; each
target process stores the work process list on receipt. The rank=0 process also sends each
work process in the work process list a message instructing it to save the monitoring
process list, and each target process (i.e., each work process) stores the monitoring
process list after receiving the message.
S36. The process with the lowest process sequence number notifies the monitoring processes
and the work processes of their respective monitoring frequencies.
The rank=0 process sends the first monitoring frequency to the monitoring processes in the
monitoring process list, and those monitoring processes perform monitoring process service
detection at this frequency. The rank=0 process sends the report frequency (i.e., the
second monitoring frequency) to each work process in the work process list, and the work
processes set their report frequency accordingly. The initialization procedure then ends
and the system state is set to the working state: the monitoring processes in the
monitoring process list start the monitoring procedure, and the work processes in the work
process list start the computation procedure and provide business services.
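The classification performed in steps S31 to S35 can be illustrated with a small sketch.
It is not the patent's implementation: it runs in a single Python process with no MPI,
and the function name build_process_groups and the sample node identifiers are
hypothetical. It only shows the rule stated above, namely that the lowest-ranked process
on each monitor node becomes a monitoring process and every remaining process becomes a
work process.

```python
from collections import defaultdict


def build_process_groups(process_nodes, monitor_nodes):
    """Split process ranks into (monitoring list, work list).

    process_nodes: dict mapping process rank -> node identifier (result of S31).
    monitor_nodes: node identifiers contained in the monitor node list.
    Both names are hypothetical; the source does not define a data layout.
    """
    ranks_by_node = defaultdict(list)
    for rank, node in process_nodes.items():
        ranks_by_node[node].append(rank)

    # S32: on every monitor node, the lowest-ranked process becomes a monitor.
    monitoring = sorted(
        min(ranks_by_node[node]) for node in monitor_nodes if node in ranks_by_node
    )
    # S34: every process that is not in the monitoring list becomes a work process.
    monitor_set = set(monitoring)
    work = sorted(rank for rank in process_nodes if rank not in monitor_set)
    return monitoring, work


process_nodes = {0: "node-a", 1: "node-a", 2: "node-b", 3: "node-b", 4: "node-c"}
monitoring, work = build_process_groups(process_nodes, ["node-a", "node-b"])
print(monitoring)  # [0, 2]
print(work)        # [1, 3, 4]
```

In a real deployment the two lists would then be distributed to the processes as
described in S33 and S35.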
In the design of the cluster server system, a timer and a corresponding signal handler
function are set up for each work process. The work process sleeps for the interval given
by the second monitoring frequency (report frequency); when the timer wakes it, the
corresponding signal handler is triggered. The signal handler drives the work process to
collect its own service state and send it to every monitoring process in the monitoring
process group, while tracking whether each send succeeds. If sending fails for all
monitoring processes, a monitoring exception is written to the log, the handler returns,
and the work process continues working; the next attempt to send service information is
made when the timer fires in the following period.
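One report cycle of the signal handler described above might look like the following
sketch. The transport is deliberately abstracted: send is a hypothetical callback
standing in for whatever messaging call the system uses, and it is assumed to raise
OSError when delivery fails. The timer that triggers each cycle is not shown.

```python
import logging

log = logging.getLogger("work_process")


def report_once(status, monitors, send):
    """Send `status` to every monitoring process and return the ranks reached.

    send(rank, status) is a hypothetical transport callback that raises
    OSError when the message cannot be delivered.
    """
    delivered = []
    for rank in monitors:
        try:
            send(rank, status)
            delivered.append(rank)
        except OSError:
            continue  # keep trying the remaining monitoring processes
    if not delivered:
        # Every send failed: log it and carry on until the next report period.
        log.warning("monitoring exception: no monitoring process reachable")
    return delivered


def flaky_send(rank, status):
    """Demo transport in which the monitoring process with rank 2 is unreachable."""
    if rank == 2:
        raise OSError("unreachable")


print(report_once({"service": "ok"}, [0, 2, 5], flaky_send))  # [0, 5]
```

Returning the list of reached monitors rather than a boolean lets the caller decide
whether a partial delivery also deserves a log entry.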
The monitoring process group checks, at the set monitoring frequency, the running
condition of each monitor node in the monitor node list, the working state of the
monitoring processes themselves, and the service state of the work processes. Each
monitoring process in the monitoring process group performs monitoring process state
detection at the first monitoring frequency. When performing state monitoring, a
monitoring process first sets the system state to the monitoring-process-group detection
state. It then sends a broadcast message to all other monitoring processes, in monitoring
process order, to confirm that the service on its own server node is normal, and at the
same time receives the state information sent to it by the monitoring processes running on
the other monitor nodes of the group. If all messages are sent and received normally, the
detection is complete and no exception exists. Otherwise, if sending to or receiving from
some monitoring process fails, the detecting monitoring process records the sequence
number (rank) of the abnormal monitoring process, the node identifier of its node, and the
cause of the exception, and marks that monitoring process as abnormal in its locally
stored monitoring process list. After all other monitoring processes have been checked,
the detecting process uses its local list to send a monitoring-process-group exception
message to the monitoring processes that are still normal (those not marked as abnormal).
The normal monitoring processes then exchange monitoring process group information to
confirm which monitoring processes are abnormal.
Each abnormal monitoring process is masked in the monitoring process list. A working node
is selected from the server nodes in the server node list other than the monitor nodes in
the monitor node list (i.e., from the working nodes), added to the monitor node list, and
a new process is generated on it. The category of the new process is set to monitoring
process and the process is added to the monitoring process list. The local monitor node
list and local monitoring process list of each monitoring process in the group are
updated, the current monitoring process list is sent to each process in the work process
group, and an alarm is written to the log. Finally, the system state is set to idle.
Each monitoring process in the monitoring process group regularly checks the service state
of the work processes. The timer of a work process wakes it at the report frequency, and
the work process periodically reports to the monitoring processes its own sequence number
(rank), its service state, the node it belongs to, and the node load. If its service is
normal it sends normal service information; if abnormal, it sends abnormal service
information; if the work process has hung, it cannot send anything. When a monitoring
process receives the information reported by work processes, it first sets the system
state to the work-group detection state and, for each report received, refreshes the
corresponding work process information in its local work process list. For a work process
that is in the local work process list but whose report has not arrived within the
timeout, the monitoring process actively initiates process state detection and, if the
process is invalid, sets its service state to invalid. After the reports of all work
processes have been received or detection has been completed, the monitoring process
exchanges the information reported by each work process with all other monitoring
processes in the monitoring process group, updates the load state of the corresponding
working node in the working node list according to the node load average in the reports of
the processes on the same node, and finally updates the server node performance list
according to the load state of each working node.
For a work process whose service state is abnormal, the monitoring process actively
initiates detection of the host state of its working node. If the working node host is
normal, the working node is notified to remove the abnormal work process and every work
process removes it from the work process group. A server node is then selected from the
server nodes, a new process is dynamically generated using the cluster interface function,
the category of the new process is marked as work process, and the process is added to the
work process list, which is sent to all nodes. If the working node host has failed, the
working node is masked in the server node list and in the working node list, and the
updated working node list is sent to all processes (monitoring processes and work
processes) so that the lists stored by each process in the monitoring process group and
the work process group are updated. When work process service detection is complete, the
system state is set to idle.
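Two of the bookkeeping steps above, finding work processes whose report is overdue and
averaging the load reported by processes on the same node, can be sketched as follows.
The function names, the use of timestamps in seconds, and the tuple layout of a report
are assumptions made for illustration; the source does not specify them.

```python
def find_overdue(last_report, now, timeout):
    """Return ranks of work processes whose last report is older than `timeout` seconds.

    last_report: dict mapping work process rank -> time of its latest report.
    """
    return sorted(rank for rank, seen in last_report.items() if now - seen > timeout)


def node_load_average(reports):
    """Average the load values reported by work processes that share a node.

    reports: iterable of (rank, node_id, load) tuples (hypothetical layout).
    """
    totals = {}
    for _rank, node, load in reports:
        load_sum, count = totals.get(node, (0.0, 0))
        totals[node] = (load_sum + load, count + 1)
    return {node: load_sum / count for node, (load_sum, count) in totals.items()}


last_report = {1: 100.0, 3: 91.0, 4: 99.5}
print(find_overdue(last_report, now=101.0, timeout=5.0))  # [3]

reports = [(1, "node-a", 0.5), (3, "node-a", 1.5), (4, "node-b", 2.0)]
print(node_load_average(reports))  # {'node-a': 1.0, 'node-b': 2.0}
```

The ranks returned by find_overdue are the ones for which a monitoring process would
actively initiate process state detection.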
Based on the above working principle, an embodiment of the present invention provides a
cluster server disaster recovery system which, as shown in Fig. 1, includes multiple
monitor nodes 11 and multiple working nodes 12, wherein:
Monitor node 11 is configured to run a first monitoring process. The first monitoring
process monitors, at a preset first monitoring frequency, whether each monitoring process
in the stored monitoring process list is abnormal and, when it determines that at least
one monitoring process is abnormal, performs abnormal monitoring process recovery. The
first monitoring process also monitors, at a preset second monitoring frequency, whether
each work process in the stored work process list is abnormal and, when it determines that
at least one work process is abnormal, performs abnormal work process recovery.
Working node 12 is configured to run work processes. Each work process reports its own
service state, at the second monitoring frequency, to each monitoring process in its
stored monitoring process list.
It should be noted that, in a specific implementation, some server nodes may run
monitoring processes and work processes at the same time. For ease of distinction, in the
embodiments of the present invention such a node is called a monitor node when its
monitoring process function is described and a working node when its work process function
is described.
Specific implementations of the embodiments of the present invention are described below
with reference to the monitoring procedure and the service disaster recovery procedure of
the cluster server system.
I. Process by which the monitor node monitors the state of the monitoring processes
When the system state changes to the working state, the initialization procedure has
ended. The monitoring processes in the monitoring process list start and enter the
monitoring procedure, the work processes in the work process list start and enter the
computation procedure, and the two groups watch over each other, monitoring the running
state of the whole cluster server system at the frequency of their respective process
group.
The monitoring processes initiate monitoring process state detection at the first
monitoring frequency, ensuring that the monitoring processes in the monitoring process
list are working normally before the work processes report their service state.
Specifically, the monitor node is configured to use the first monitoring process to form a
message-passing ring from the monitoring processes in the monitoring process list, ordered
by ascending monitoring process sequence number, and to initiate message passing. If the
passed message is not received back within a preset duration, it is determined that at
least one monitoring process is abnormal. When at least one monitoring process is found to
be abnormal, the first monitoring process sends state detection messages to the other
(second) monitoring processes in the monitoring process list, records the identifiers of
abnormal monitoring processes according to the detection results, and determines that the
monitoring processes corresponding to the recorded identifiers are abnormal.
Preferably, the first monitoring process may be the monitoring process with the lowest
process sequence number in the monitoring process list; in that case, the monitor node
above refers to the server node on which that monitoring process runs.
In a specific implementation, the monitoring process with the lowest process sequence
number in the monitoring process group initiates the MonitorProcessGroupStatusCheck
procedure. MonitorProcessGroupStatusCheck builds a message chain (ordered by ascending
process sequence number) and initiates message passing. If the message travels around the
monitoring process ring and returns normally to the monitoring process with the lowest
process sequence number, it can be determined that all monitoring processes in the
monitoring process list are normal, and the monitoring processes wait for the work
processes in the work process list to report their service state.
If the ring transmission times out without the message returning to the monitoring process
with the lowest process sequence number, that monitoring process initiates the
MonitorProcessGroupRebuild procedure. MonitorProcessGroupRebuild first sends a state
detection message to each monitoring process in the monitoring process list in order. If
the state of a process is abnormal, an entry of the form <process sequence number of the
abnormal monitoring process, node identifier of its monitor node> is added to the abnormal
monitoring process list. This is repeated until the state of every other monitoring
process in the monitoring process group has been checked.
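The ring check and the follow-up probing can be simulated without MPI. In the sketch
below, is_alive is a hypothetical probe that stands in for a real send and receive with a
timeout, and both function names are invented for illustration. The first function
mirrors MonitorProcessGroupStatusCheck (does a message started at the lowest rank make it
all the way around?), and the second mirrors the first phase of
MonitorProcessGroupRebuild (probe each other member and record the failures).

```python
def ring_is_healthy(ranks, is_alive):
    """Return True if a message started at the lowest rank would return to it.

    ranks: process sequence numbers in the monitoring process list.
    is_alive(rank): hypothetical probe; False means the hop would time out.
    """
    ring = sorted(ranks)
    # The message visits every other member in ascending order, then returns
    # to ring[0]; a single dead member is enough to break the ring.
    return all(is_alive(rank) for rank in ring[1:])


def find_abnormal_monitors(ranks, nodes, is_alive):
    """Probe every other monitoring process in order.

    Returns <rank, node identifier> pairs for the processes that fail the probe.
    nodes: dict mapping rank -> node identifier.
    """
    ring = sorted(ranks)
    return [(rank, nodes[rank]) for rank in ring[1:] if not is_alive(rank)]


nodes = {0: "node-a", 2: "node-b", 4: "node-c"}
alive = lambda rank: rank != 2  # pretend the monitoring process with rank 2 is down

print(ring_is_healthy(nodes.keys(), alive))                # False
print(find_abnormal_monitors(nodes.keys(), nodes, alive))  # [(2, 'node-b')]
```

The ring pass is cheap when everything is healthy, and the per-process probing is only
needed once the ring has failed, which matches the two-stage order described above.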
After the abnormal monitoring processes have been determined, the monitor node uses the
first monitoring process to select from the cluster server node list, according to the
number of abnormal monitoring processes, a corresponding number of working nodes other
than the monitor nodes in the monitor node list, and a monitoring process is created on
each to replace an abnormal monitoring process. The first monitoring process receives the
monitoring process sequence number and node identifier returned by each newly created
monitoring process, updates the monitoring process list and the monitor node list
accordingly, notifies each monitoring process in the monitoring process list to update its
stored monitoring process list, and notifies each work process in the work process list to
update its stored monitoring process list.
In a specific implementation, the monitor node uses the first monitoring process to send a
monitoring process creation request to any work process running on each selected working
node. Preferably, the request may be sent to the work process with the lowest process
sequence number running on the selected working node. On the selected working node, the
work process that receives the request creates a child process of itself and forwards the
monitoring process creation request to the child. On receiving the request, the child
process marks its own process category as monitoring process, and the work process returns
to the first monitoring process the sequence number of the newly created monitoring
process and the node identifier of its node.
Specifically, the MonitorProcessGroupRebuild procedure counts the number n of abnormal
monitoring processes in the abnormal monitoring process list and obtains the n working
nodes that rank highest by performance in the server list and are not in the monitor node
list (in a specific implementation, n nodes may also be chosen arbitrarily from the
working nodes other than the monitor nodes; the embodiments of the present invention do
not limit this). It sends a query for the work processes of these n nodes and obtains
their work process sequence numbers (ranks). MonitorProcessGroupRebuild then sends a
dynamic monitoring process creation request to the work process with the lowest process
sequence number on each queried working node; in a specific implementation the request may
be sent to any work process on the selected working node, which the embodiments of the
present invention do not limit. After the target work process (i.e., the work process with
the lowest process sequence number on the selected working node) receives the request, it
uses MPI_Comm_spawn to create a dynamic child process and establishes a communication
domain between the parent and child process groups to connect to the new child. The target
work process forwards the monitoring process creation request to the child, and the newly
created child marks its own process category as monitoring process and sets its state to
the initialization state. The target work process returns to MonitorProcessGroupRebuild a
tuple of the form <process sequence number of the newly created process, node identifier
of its node, target work process sequence number, node identifier of the target work
process's node>, where the target work process sequence number is used to relay
communication between the newly created monitoring process and the monitoring processes in
the original monitoring process list. MonitorProcessGroupRebuild deletes the current
abnormal monitoring process from the monitoring process list and adds the newly created
monitoring process to the list as its replacement. MonitorProcessGroupRebuild repeats this
procedure until replacements have been created for all abnormal monitoring processes.
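The node selection and list update performed by MonitorProcessGroupRebuild can be
sketched as plain list operations. The spawning itself (MPI_Comm_spawn on the target work
process) is outside the scope of this sketch; the function names and the (rank, node)
tuple layout are assumptions, and the created entries are supplied by hand to stand in
for what the spawned processes would report back.

```python
def select_replacement_nodes(performance_ranking, monitor_nodes, n):
    """Pick the n best-ranked nodes that are not already monitor nodes.

    performance_ranking: node identifiers ordered best first (assumed layout).
    """
    excluded = set(monitor_nodes)
    candidates = [node for node in performance_ranking if node not in excluded]
    if len(candidates) < n:
        raise ValueError("not enough working nodes to replace every abnormal monitor")
    return candidates[:n]


def replace_monitors(monitor_list, abnormal, created):
    """Drop abnormal (rank, node) entries and append the newly created ones."""
    abnormal_ranks = {rank for rank, _node in abnormal}
    kept = [entry for entry in monitor_list if entry[0] not in abnormal_ranks]
    return kept + list(created)


ranking = ["n3", "n1", "n4", "n2"]
print(select_replacement_nodes(ranking, ["n1", "n2"], 1))  # ['n3']

monitor_list = [(0, "n1"), (5, "n2")]
print(replace_monitors(monitor_list, [(5, "n2")], [(9, "n3")]))  # [(0, 'n1'), (9, 'n3')]
```

Raising an error when too few working nodes remain is a choice made for this sketch; the
source does not say how that situation is handled.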
MonitorProcessGroupRebuild sends all processes (monitoring processes and work processes) a
message instructing them to store the new monitoring process list, and each process
updates its stored monitoring process list on receiving the message.
MonitorProcessGroupRebuild then sends, to the parent process of each newly added
monitoring process (i.e., the work process with the lowest process sequence number running
on the node of the newly added monitoring process), a request to change the working state
of the newly added monitoring process. After receiving the request, the target work
process first sends the newly added monitoring process a request to save the work process
list, the monitoring process list, and the server node performance list, and then forwards
the request to change its working state. The newly added monitoring process saves the work
process list, the monitoring process list, and the server node performance list, sets its
own system state to the working state, and begins monitoring. Once all newly added
monitoring processes have entered the working state, monitoring process state recovery is
complete and the reports of the work processes in the work process list can be received
normally.
The monitoring process starts the ServiceProcessReport procedure at the preset first
monitoring frequency. ServiceProcessReport collects the process's own service state and
system resource state and sends a report message to each monitoring process in its
monitoring process list, reporting its own service information.
After a monitoring process has received, directly or indirectly, the report messages of
the work processes in the work process list, it uses each message to update the service
state of the corresponding work process and the performance state of the working node on
which the process runs. After the report information of all work processes has been
updated, it sends a monitoring summary message to the monitoring process with the lowest
process sequence number in the current monitoring process list. That monitoring process
weights the performance of each working node as obtained by each monitoring process and
reorders the server node performance list accordingly.
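The source does not say how the per-monitor observations are weighted, so the following
sketch simply assumes a weighted mean with equal weights by default, and assumes that the
performance figure is a load value where lower is better. The function name rank_nodes
and the dictionary layout are hypothetical.

```python
def rank_nodes(observations, weights=None):
    """Combine per-monitor load observations and return node ids, best first.

    observations: {monitor_rank: {node_id: load}}; lower load is treated as
                  better performance (an assumption, not stated in the source).
    weights:      optional {monitor_rank: weight}; equal weights when omitted.
    """
    totals, weight_sums = {}, {}
    for monitor, per_node in observations.items():
        weight = 1.0 if weights is None else weights[monitor]
        for node, load in per_node.items():
            totals[node] = totals.get(node, 0.0) + weight * load
            weight_sums[node] = weight_sums.get(node, 0.0) + weight
    scores = {node: totals[node] / weight_sums[node] for node in totals}
    # Ties are broken by node identifier so the ordering is deterministic.
    return sorted(scores, key=lambda node: (scores[node], node))


observations = {
    0: {"node-a": 0.2, "node-b": 0.9},
    2: {"node-a": 0.4, "node-b": 0.5},
}
print(rank_nodes(observations))  # ['node-a', 'node-b']
```

Any other weighting scheme can be substituted by passing a weights dictionary keyed by
monitoring process rank.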
The monitoring process with the lowest process sequence number in the monitoring process
list sends a message to all server nodes to update the server node performance list and
sets the system state to the working state, completing the whole monitoring procedure. If
an exception occurs in the above steps and the reports of some work processes in the work
process list are not received, those work processes may be abnormal and the service
disaster recovery procedure needs to be started.
II. Process by which the monitor node detects the service state of the work processes
In a specific implementation, if a monitoring process finds during monitoring that some
work processes have not reported a service state message at the preset second monitoring
frequency, those work processes or their working nodes have become abnormal. Specifically,
when the monitor node detects that at least one work process is abnormal (for example, the
service state message of that work process has not been received), it may use the first
monitoring process to send a service state detection message to each work process
according to the stored work process list, record the identifiers of abnormal work
processes according to the detection results, and determine that the work processes
corresponding to the recorded identifiers are abnormal. After the abnormal work processes
have been determined, the monitor node may use the first monitoring process to detect
whether the working node of each abnormal work process is abnormal and record the
identifiers of abnormal working nodes according to the result. According to the number of
abnormal work processes, it selects from the cluster server node list a corresponding
number of nodes other than the abnormal working nodes, and a work process is created on
each to replace an abnormal work process. The first monitoring process receives the work
process sequence number and working node identifier returned by each newly created work
process, updates the stored work process list and working node list accordingly, notifies
each monitoring process in the monitoring process list to update its stored work process
list, and notifies each work process in the work process list to update its stored work
process list.
The monitor node may use the first monitoring process to send a work process creation
request to any process running on each selected node. On the selected node, the process
that receives the work process creation request creates a child process of itself and
forwards the request to the child. On receiving the work process creation request, the
child process marks its own process category as work process, and the process that
received the request returns to the first monitoring process the sequence number of the
newly created work process and the node identifier of its node.
Specifically, as shown in Fig. 4, performing service state detection on the work processes
and recovering work processes when abnormal ones are found may include the following
steps:
S41. The monitoring process with the lowest process sequence number in the monitoring
process list determines the work processes that are in an abnormal state.
That monitoring process initiates the ServiceProcessGroupRebuild procedure.
ServiceProcessGroupRebuild first sends a message to all monitoring processes in the
monitoring process list, setting them to the disaster recovery state. It then sends a
service state detection message to each work process in sequence. If the state of a
process is determined to be abnormal, an entry of the form <process sequence number of the
abnormal work process, node identifier of its working node> is added to the abnormal work
process list; if the working node of the abnormal work process is itself abnormal, the
node is added to the abnormal working node list. This is repeated until the state of every
work process in the work process list has been checked.
S42. The monitoring process with the lowest process sequence number in the monitoring
process list selects server nodes and, on each, a process to act as the agent launch
process that creates the new process.
ServiceProcessGroupRebuild counts the number n of abnormal work processes in the abnormal
work process list, obtains the n server nodes that rank highest in the server node
performance list and are not in the abnormal working node list, sends a query for the work
processes or monitoring processes of these n nodes, and selects on each the process with
the lowest process sequence number (rank), which may be a work process or a monitoring
process, as the agent launch process for the current abnormal work process. It should be
noted that, in a specific implementation, any process may also be chosen at random; the
embodiments of the present invention do not limit this.
S43. The monitoring process with the lowest process sequence number in the monitoring
process list generates replacement processes for the abnormal work processes through the
agent launch processes.
ServiceProcessGroupRebuild sends a work process creation request to the agent launch
process found for each abnormal work process (i.e., the process with the lowest process
sequence number (rank) running on each of the n selected server nodes). After receiving
the request, the agent launch process uses MPI_Comm_spawn to create a dynamic child
process and establishes a communication domain between the parent and child process groups
to connect to the new child. The agent launch process forwards the work process creation
request sent by the monitoring process with the lowest process sequence number; the newly
created child marks its process category as work process and sets its state to the
initialization state. The agent launch process returns, to the monitoring process with the
lowest process sequence number in the monitoring process list, a tuple of the form
<process sequence number of the newly created process, node identifier of its node,
process sequence number of the agent launch process, node identifier of the agent launch
process's node>, where the process sequence number of the agent launch process is used to
relay communication between the newly created work process and the work processes in the
original work process list. ServiceProcessGroupRebuild deletes the abnormal work process
currently being handled from the work process list and adds the newly created work process
to the list as its replacement. ServiceProcessGroupRebuild repeats this procedure until
replacements have been set up for all failed work processes.
S44. The monitoring process with the lowest process sequence number in the monitoring
process list notifies all processes to store the new work process list and notifies the
newly created processes to change their working state.
ServiceProcessGroupRebuild sends all processes (work processes and monitoring processes) a
message instructing them to save the new work process list, and each process updates its
stored work process list on receiving the message. ServiceProcessGroupRebuild then sends,
through the agent launch process, a request to change the working state of each new work
process. After receiving this request, the agent launch process first sends the new work
process a request to store the work process list, the monitoring process list, and the
server node performance list, and then forwards the request to change its working state.
The new work process stores the work process list, the monitoring process list, and the
server node performance list, sets its own system state to the working state, and begins
providing service. Once all new work processes have entered the working state, abnormal
work process recovery ends and service can be provided normally.
S45. The monitoring process with the lowest process sequence number in the monitoring
process list updates the system state.
The disaster recovery procedure ends, and the monitoring process with the lowest process
sequence number in the monitoring process list sets the system state to the working state.
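The decision-making part of steps S41 and S42 (which work processes are abnormal, which
nodes must be avoided, and which process on each chosen node acts as the agent launch
process) can be sketched as a pure function. The callbacks probe and node_ok are
hypothetical stand-ins for the service state and host state detection messages, and
skipping a node that has no surviving process is an assumption added for this sketch.
The actual spawning in S43 and the notifications in S44 and S45 are not modelled.

```python
def plan_work_recovery(work_list, probe, node_ok, ranking, ranks_by_node):
    """Return (abnormal work processes, abnormal nodes, agent launch processes).

    work_list:      list of (rank, node_id) for every work process.
    probe(rank):    hypothetical service-state check; False means abnormal.
    node_ok(node):  hypothetical host check; False means the node is down.
    ranking:        node ids from the performance list, best first.
    ranks_by_node:  {node_id: [ranks of all processes on that node]}.
    """
    # S41: probe each work process in sequence and record the failures.
    abnormal = [(rank, node) for rank, node in work_list if not probe(rank)]
    bad_nodes = sorted({node for _rank, node in abnormal if not node_ok(node)})
    abnormal_ranks = {rank for rank, _node in abnormal}

    # S42: walk the ranking, skip abnormal nodes, and take the lowest-ranked
    # surviving process on each chosen node as the agent launch process.
    agents = []
    for node in ranking:
        if len(agents) == len(abnormal):
            break
        if node in bad_nodes:
            continue
        survivors = [r for r in ranks_by_node.get(node, []) if r not in abnormal_ranks]
        if survivors:
            agents.append((min(survivors), node))
    return abnormal, bad_nodes, agents


result = plan_work_recovery(
    work_list=[(1, "a"), (3, "b"), (4, "c")],
    probe=lambda rank: rank != 3,
    node_ok=lambda node: node != "b",
    ranking=["b", "a", "c"],
    ranks_by_node={"a": [0, 1], "b": [2, 3], "c": [4]},
)
print(result)  # ([(3, 'b')], ['b'], [(0, 'a')])
```

In the example the agent is rank 0, a monitoring process, which is consistent with the
statement that the agent launch process may be either a work process or a monitoring
process.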
The cluster server disaster recovery method provided by the embodiments of the present
invention is intended for online disaster recovery of server systems based on a cluster
computing framework. By establishing a monitoring process list and a work process list, it
provides a way to identify abnormal monitoring processes and abnormal work processes,
together with an online recovery method for each, so that the service capability of the
cluster server system can be restored online. This integrates the design and operation of
system disaster recovery with service delivery and improves the operational stability and
service reliability of the cluster server.
Based on the same inventive concept, the embodiments of the present invention further
provide a cluster server disaster recovery method and a server node. Because the principle
by which the method and device solve the problem is similar to that of the cluster server
disaster recovery system, their implementation may refer to the implementation of the
system, and repeated details are omitted.
Fig. 5 is a schematic flowchart of the cluster server disaster recovery method provided by
an embodiment of the present invention, which may include the following steps:
S51. The monitor node uses the first monitoring process running on it to monitor, at the
preset first monitoring frequency and second monitoring frequency respectively, whether
each monitoring process in the stored monitoring process list and each work process in the
stored work process list is abnormal.
S52. When it is determined that at least one monitoring process is abnormal, abnormal
monitoring process recovery is performed; when it is determined that at least one work
process is abnormal, abnormal work process recovery is performed.
In a specific implementation, in step S52 it may be determined that at least one
monitoring process is abnormal according to the flow shown in Fig. 6:
S61. When the monitor node detects that at least one monitoring process is abnormal, it
uses the first monitoring process to send state detection messages to the other (second)
monitoring processes in the monitoring process list;
S62. The identifiers of abnormal monitoring processes are recorded according to the
detection results;
S63. The monitoring processes corresponding to the recorded abnormal monitoring process
identifiers are determined to be abnormal.
In S61, it may be detected that at least one monitoring process is abnormal according to the flow shown in Fig. 7:
S71. The monitor node, through the first monitoring process, forms a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and initiates message passing;
S72. If the passed message is not received within a preset duration, it is determined that at least one monitoring process is abnormal.
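The ring check in S71 and S72 can be illustrated with a small simulation. It is a sketch under stated assumptions: the function names are hypothetical, and the preset duration is modelled simply as "the message never comes back", whereas a real deployment would use a timer on the initiator.

```python
from typing import Dict, List, Set


def build_ring(monitoring_list: List[int]) -> Dict[int, int]:
    """S71: map each monitoring process sequence number to its successor,
    ordering the processes by ascending sequence number and closing the loop."""
    ordered = sorted(monitoring_list)
    return {seq: ordered[(i + 1) % len(ordered)] for i, seq in enumerate(ordered)}


def message_returns(ring: Dict[int, int], initiator: int, alive: Set[int]) -> bool:
    """S72: simulate one lap; True if the message gets back to the initiator."""
    current = ring[initiator]
    while current != initiator:
        if current not in alive:
            return False  # stands in for "not received within the preset duration"
        current = ring[current]
    return True
```

A lost message only shows that some process on the ring is abnormal, not which one, which is consistent with the separate state detection in S61 to S63 being used to identify the abnormal processes.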
In a specific implementation, when at least one monitoring process is determined to be abnormal, abnormal monitoring process recovery may be performed according to the flow shown in Fig. 8:
S81. According to the number of abnormal monitoring processes, the monitor node selects, through the first monitoring process, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, and a monitoring process is created on each selected working node to replace an abnormal monitoring process;
S82. The first monitoring process receives the monitoring process sequence number and the node identifier of the hosting node returned by each newly created monitoring process;
S83. The monitoring process list and the monitor node list are updated according to the received monitoring process sequence numbers and node identifiers, respectively;
S84. Each monitoring process in the monitoring process list is notified to update its stored monitoring process list, and each work process in the work process list is notified to update its stored monitoring process list.
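The recovery flow S81 to S84 can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the names `ClusterState`, `create_monitor` and `broadcast` are hypothetical, each monitor node is assumed to host one monitoring process, and the entries of the abnormal processes are assumed to be dropped from both lists when they are updated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ClusterState:
    cluster_nodes: List[str]          # cluster server node list
    monitor_nodes: List[str]          # monitor node list
    monitoring_list: Dict[int, str]   # monitoring process sequence number -> hosting node


def recover_monitoring_processes(state: ClusterState,
                                 abnormal: List[int],
                                 create_monitor: Callable[[str], Tuple[int, str]],
                                 broadcast: Callable[[Dict[int, str]], None]) -> None:
    # S81: pick as many working nodes as there are abnormal monitoring processes,
    # excluding every node already in the monitor node list.
    candidates = [n for n in state.cluster_nodes if n not in state.monitor_nodes]
    if len(candidates) < len(abnormal):
        raise RuntimeError("not enough working nodes to host replacements")
    for old_seq, node in zip(abnormal, candidates):
        # S82: the new monitoring process reports its sequence number and node id.
        new_seq, node_id = create_monitor(node)
        # S83: update both lists (dropping the old entries is an assumption).
        old_node = state.monitoring_list.pop(old_seq)
        if old_node in state.monitor_nodes:
            state.monitor_nodes.remove(old_node)
        state.monitoring_list[new_seq] = node_id
        state.monitor_nodes.append(node_id)
    # S84: push the updated list to all monitoring processes and work processes.
    broadcast(dict(state.monitoring_list))
```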
In a specific implementation, in step S52, it may be determined that at least one work process is abnormal according to the flow shown in Fig. 9:
S91. When the monitor node detects that at least one work process is abnormal, it sends, through the first monitoring process, a service state detection message to each work process according to the stored work process list;
S92. The identifiers of the abnormal work processes are recorded according to the detection results;
S93. The work processes corresponding to the recorded abnormal work process identifiers are determined to be abnormal.
In a specific implementation, when at least one work process is determined to be abnormal, abnormal work process recovery may be performed according to the flow shown in Fig. 10:
S101. After determining the abnormal work processes, the monitor node probes, through the first monitoring process, whether the working node hosting each abnormal work process is abnormal;
S102. The identifiers of the abnormal working nodes are recorded according to the probe results;
S103. According to the number of abnormal work processes, a corresponding number of nodes other than the abnormal working nodes are selected from the cluster server node list, and a work process is created on each selected node to replace an abnormal work process;
S104. The monitor node receives, through the first monitoring process, the work process sequence number and the working node identifier of the hosting working node returned by each newly created work process;
S105. The stored work process list and working node list are updated according to the received work process sequence numbers and working node identifiers, respectively;
S106. Each monitoring process in the monitoring process list is notified to update its stored work process list, and each work process in the work process list is notified to update its stored work process list.
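Steps S101 to S106 differ from monitoring process recovery mainly in that the hosting node is probed first and abnormal nodes are excluded from the candidates. The sketch below is illustrative only; `node_is_abnormal`, `create_worker` and `broadcast` are hypothetical callables, and removing abnormal nodes from the working node list during the update is an assumption.

```python
from typing import Callable, Dict, List, Set, Tuple


def recover_work_processes(cluster_nodes: List[str],
                           working_nodes: List[str],
                           work_list: Dict[int, str],
                           abnormal_work: List[int],
                           node_is_abnormal: Callable[[str], bool],
                           create_worker: Callable[[str], Tuple[int, str]],
                           broadcast: Callable[[Dict[int, str]], None]) -> Set[str]:
    # S101/S102: probe the node hosting each abnormal work process and
    # record the identifiers of the nodes that are themselves abnormal.
    hosts = {work_list[seq] for seq in abnormal_work}
    abnormal_nodes = {node for node in hosts if node_is_abnormal(node)}
    # S103: choose replacement nodes from the cluster, skipping abnormal nodes.
    candidates = [n for n in cluster_nodes if n not in abnormal_nodes]
    if len(candidates) < len(abnormal_work):
        raise RuntimeError("not enough healthy nodes to host replacements")
    for old_seq, node in zip(abnormal_work, candidates):
        # S104: the new work process reports its sequence number and node id.
        new_seq, node_id = create_worker(node)
        # S105: update the work process list and the working node list.
        del work_list[old_seq]
        work_list[new_seq] = node_id
        if node_id not in working_nodes:
            working_nodes.append(node_id)
    for node in abnormal_nodes:  # assumption: abnormal nodes leave the list
        if node in working_nodes:
            working_nodes.remove(node)
    # S106: push the updated list to all monitoring processes and work processes.
    broadcast(dict(work_list))
    return abnormal_nodes
```

Note that a work process can be abnormal while its node is healthy; in that case the node stays eligible as a replacement host in this sketch.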
Fig. 11 is a schematic structural diagram of the server node provided by an embodiment of the present invention. The server node may include:
a monitoring unit 111, configured to monitor, through the first monitoring process running on the server node, whether each monitoring process in the stored monitoring process list is abnormal according to a preset first monitoring frequency, and whether each work process in the stored work process list is abnormal according to a preset second monitoring frequency;
an exception processing unit 112, configured to perform abnormal monitoring process recovery when at least one monitoring process is determined to be abnormal, and to perform abnormal work process recovery when at least one work process is determined to be abnormal.
The monitoring unit 111 may be configured to: when it is detected that at least one monitoring process is abnormal, send, through the first monitoring process, a state detection message to each of the other second monitoring processes in the monitoring process list; record the identifiers of the abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
Preferably, the monitoring unit 111 may be configured to: form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and initiate message passing; and, if the passed message is not received within a preset duration, determine that at least one monitoring process is abnormal.
In a specific implementation, the exception processing unit 112 may include:
a first selection subunit, configured to select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, so that a monitoring process is created on each selected working node to replace an abnormal monitoring process;
a first receiving subunit, configured to receive, through the first monitoring process, the monitoring process sequence number and the node identifier of the hosting node returned by each newly created monitoring process;
a first updating subunit, configured to update the monitoring process list and the monitor node list according to the received monitoring process sequence numbers and node identifiers, respectively;
a first notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored monitoring process list, and to notify each work process in the work process list to update its stored monitoring process list.
In a specific implementation, the monitoring unit 111 may be configured to: when it is detected that at least one work process is abnormal, send, through the first monitoring process, a service state detection message to each work process according to the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
In a specific implementation, the exception processing unit 112 may include:
a probing subunit, configured to probe, through the first monitoring process and after the abnormal work processes have been determined, whether the working node hosting each abnormal work process is abnormal, and to record the identifiers of the abnormal working nodes according to the probe results;
a second selection subunit, configured to select, according to the number of abnormal work processes, a corresponding number of nodes other than the abnormal working nodes from the cluster server node list, so that a work process is created on each selected node to replace an abnormal work process;
a second receiving subunit, configured to receive, through the first monitoring process, the work process sequence number and the working node identifier of the hosting working node returned by each newly created work process;
a second updating subunit, configured to update the stored work process list and working node list according to the received work process sequence numbers and working node identifiers, respectively;
a second notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored work process list, and to notify each work process in the work process list to update its stored work process list.
For convenience of description, the parts above are described as modules (or units) divided by function. Of course, when implementing the present invention, the functions of the modules (or units) may be realized in one or more pieces of software or hardware.
In the cluster server disaster recovery system, method, and server node provided by the embodiments of the present invention, the monitor node, through the first monitoring process running on it, monitors at the first monitoring frequency whether any monitoring process in its stored monitoring process list is abnormal, and monitors at the second monitoring frequency whether any work process in its stored work process list is abnormal. When at least one monitoring process is found to be abnormal, monitoring process recovery is performed; when at least one work process is found to be abnormal, work process recovery is performed. This process achieves abnormality monitoring and recovery for the cluster server, ensuring its operational stability and service reliability.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (20)
1. A cluster server disaster recovery system, characterized by comprising a plurality of monitor nodes and a plurality of working nodes, wherein:
each monitor node is configured to run a first monitoring process; the first monitoring process monitors, at a preset first monitoring frequency, whether each monitoring process in a stored monitoring process list is abnormal, and when at least one monitoring process is determined to be abnormal, abnormal monitoring process recovery is performed; the first monitoring process also monitors, at a preset second monitoring frequency, whether each work process in a stored work process list is abnormal, and when at least one work process is determined to be abnormal, abnormal work process recovery is performed;
each working node is configured to run a work process; the work process reports its own service state, at the second monitoring frequency, to each monitoring process in its stored monitoring process list.
2. The system according to claim 1, characterized in that
the monitor node is specifically configured to: when it is detected that at least one monitoring process is abnormal, send, through the first monitoring process, a state detection message to each of the other second monitoring processes in the monitoring process list; record the identifiers of the abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
3. The system according to claim 1 or 2, characterized in that
the monitor node is specifically configured to: form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and initiate message passing; and, if the passed message is not received within a preset duration, determine that at least one monitoring process is abnormal.
4. The system according to claim 2, characterized in that
the monitor node is specifically configured to: select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, so that a monitoring process is created on each selected working node to replace an abnormal monitoring process; receive, through the first monitoring process, the monitoring process sequence number and the node identifier of the hosting node returned by each newly created monitoring process; update the monitoring process list and the monitor node list according to the received monitoring process sequence numbers and node identifiers, respectively; notify each monitoring process in the monitoring process list to update its stored monitoring process list; and notify each work process in the work process list to update its stored monitoring process list.
5. The system according to claim 4, characterized in that
the monitor node is specifically configured to send, through the first monitoring process, a monitoring process creation request to any work process running on each selected working node;
the selected working node is configured so that the work process running on it that receives the monitoring process creation request creates a child process of that work process and forwards the monitoring process creation request to the created child process; the child process, according to the received monitoring process creation request, marks its own process category as monitoring process, and returns, through the work process, the sequence number of the newly created monitoring process and the node identifier of the hosting node to the first monitoring process.
6. The system according to claim 1, characterized in that
the monitor node is specifically configured to: when it is detected that at least one work process is abnormal, send, through the first monitoring process, a service state detection message to each work process according to the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
7. The system according to claim 6, characterized in that
the monitor node is specifically configured to: after the abnormal work processes have been determined, probe, through the first monitoring process, whether the working node hosting each abnormal work process is abnormal, and record the identifiers of the abnormal working nodes according to the probe results; select, according to the number of abnormal work processes, a corresponding number of nodes other than the abnormal working nodes from the cluster server node list, so that a work process is created on each selected node to replace an abnormal work process; receive, through the first monitoring process, the work process sequence number and the working node identifier of the hosting working node returned by each newly created work process; update the stored work process list and working node list according to the received work process sequence numbers and working node identifiers, respectively; notify each monitoring process in the monitoring process list to update its stored work process list; and notify each work process in the work process list to update its stored work process list.
8. The system according to claim 7, characterized in that
the monitor node is specifically configured to send, through the first monitoring process, a work process creation request to any process running on each selected node;
the selected node is configured so that the process running on it that receives the work process creation request creates a child process of that process and forwards the work process creation request to the created child process; the child process, according to the received work process creation request, marks its own process category as work process, and returns, through the process that received the work process creation request, the sequence number of the newly created work process and the node identifier of the hosting node to the first monitoring process.
9. A cluster server disaster recovery method, characterized by comprising:
monitoring, by a monitor node through a first monitoring process running on it, whether each monitoring process in a stored monitoring process list is abnormal according to a preset first monitoring frequency, and whether each work process in a stored work process list is abnormal according to a preset second monitoring frequency;
performing abnormal monitoring process recovery when at least one monitoring process is determined to be abnormal; and
performing abnormal work process recovery when at least one work process is determined to be abnormal.
10. The method according to claim 9, characterized in that it is determined that at least one monitoring process is abnormal according to the following procedure:
when the monitor node detects that at least one monitoring process is abnormal, it sends, through the first monitoring process, a state detection message to each of the other second monitoring processes in the monitoring process list;
the identifiers of the abnormal monitoring processes are recorded according to the detection results;
the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are determined to be abnormal.
11. The method according to claim 9 or 10, characterized in that it is detected that at least one monitoring process is abnormal according to the following method:
the monitor node, through the first monitoring process, forms a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and initiates message passing;
if the passed message is not received within a preset duration, it is determined that at least one monitoring process is abnormal.
12. The method according to claim 10, characterized in that abnormal monitoring process recovery is performed according to the following procedure:
according to the number of abnormal monitoring processes, the monitor node selects, through the first monitoring process, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, and a monitoring process is created on each selected working node to replace an abnormal monitoring process;
the first monitoring process receives the monitoring process sequence number and the node identifier of the hosting node returned by each newly created monitoring process;
the monitoring process list and the monitor node list are updated according to the received monitoring process sequence numbers and node identifiers, respectively;
each monitoring process in the monitoring process list is notified to update its stored monitoring process list; and
each work process in the work process list is notified to update its stored monitoring process list.
13. The method according to claim 9, characterized in that it is determined that at least one work process is abnormal according to the following procedure:
when the monitor node detects that at least one work process is abnormal, it sends, through the first monitoring process, a service state detection message to each work process according to the stored work process list;
the identifiers of the abnormal work processes are recorded according to the detection results;
the work processes corresponding to the recorded abnormal work process identifiers are determined to be abnormal.
14. The method according to claim 13, characterized in that abnormal work process recovery is performed according to the following procedure:
after determining the abnormal work processes, the monitor node probes, through the first monitoring process, whether the working node hosting each abnormal work process is abnormal;
the identifiers of the abnormal working nodes are recorded according to the probe results;
according to the number of abnormal work processes, a corresponding number of nodes other than the abnormal working nodes are selected from the cluster server node list, and a work process is created on each selected node to replace an abnormal work process;
the monitor node receives, through the first monitoring process, the work process sequence number and the working node identifier of the hosting working node returned by each newly created work process;
the stored work process list and working node list are updated according to the received work process sequence numbers and working node identifiers, respectively;
each monitoring process in the monitoring process list is notified to update its stored work process list; and
each work process in the work process list is notified to update its stored work process list.
15. A server node, characterized by comprising:
a monitoring unit, configured to monitor, through a first monitoring process running on the server node, whether each monitoring process in a stored monitoring process list is abnormal according to a preset first monitoring frequency, and whether each work process in a stored work process list is abnormal according to a preset second monitoring frequency;
an exception processing unit, configured to perform abnormal monitoring process recovery when at least one monitoring process is determined to be abnormal, and to perform abnormal work process recovery when at least one work process is determined to be abnormal.
16. The server node according to claim 15, characterized in that
the monitoring unit is specifically configured to: when it is detected that at least one monitoring process is abnormal, send, through the first monitoring process, a state detection message to each of the other second monitoring processes in the monitoring process list; record the identifiers of the abnormal monitoring processes according to the detection results; and determine that the monitoring processes corresponding to the recorded abnormal monitoring process identifiers are abnormal.
17. The server node according to claim 15 or 16, characterized in that
the monitoring unit is specifically configured to: form, through the first monitoring process, a message-passing ring from the monitoring processes contained in the monitoring process list, in ascending order of their monitoring process sequence numbers, and initiate message passing; and, if the passed message is not received within a preset duration, determine that at least one monitoring process is abnormal.
18. The server node according to claim 16, characterized in that the exception processing unit comprises:
a first selection subunit, configured to select, through the first monitoring process and according to the number of abnormal monitoring processes, a corresponding number of working nodes from the cluster server node list, excluding the monitor nodes contained in the monitor node list, so that a monitoring process is created on each selected working node to replace an abnormal monitoring process;
a first receiving subunit, configured to receive, through the first monitoring process, the monitoring process sequence number and the node identifier of the hosting node returned by each newly created monitoring process;
a first updating subunit, configured to update the monitoring process list and the monitor node list according to the received monitoring process sequence numbers and node identifiers, respectively;
a first notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored monitoring process list, and to notify each work process in the work process list to update its stored monitoring process list.
19. The server node according to claim 15, characterized in that
the monitoring unit is specifically configured to: when it is detected that at least one work process is abnormal, send, through the first monitoring process, a service state detection message to each work process according to the stored work process list; record the identifiers of the abnormal work processes according to the detection results; and determine that the work processes corresponding to the recorded abnormal work process identifiers are abnormal.
20. The server node according to claim 19, characterized in that the exception processing unit comprises:
a probing subunit, configured to probe, through the first monitoring process and after the abnormal work processes have been determined, whether the working node hosting each abnormal work process is abnormal, and to record the identifiers of the abnormal working nodes according to the probe results;
a second selection subunit, configured to select, according to the number of abnormal work processes, a corresponding number of nodes other than the abnormal working nodes from the cluster server node list, so that a work process is created on each selected node to replace an abnormal work process;
a second receiving subunit, configured to receive, through the first monitoring process, the work process sequence number and the working node identifier of the hosting working node returned by each newly created work process;
a second updating subunit, configured to update the stored work process list and working node list according to the received work process sequence numbers and working node identifiers, respectively;
a second notification subunit, configured to notify each monitoring process in the monitoring process list to update its stored work process list, and to notify each work process in the work process list to update its stored work process list.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510385495.9A CN106330523A (en) | 2015-07-03 | 2015-07-03 | Cluster server disaster recovery system and method, and server node |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510385495.9A CN106330523A (en) | 2015-07-03 | 2015-07-03 | Cluster server disaster recovery system and method, and server node |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106330523A true CN106330523A (en) | 2017-01-11 |
Family
ID=57727266
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510385495.9A Pending CN106330523A (en) | 2015-07-03 | 2015-07-03 | Cluster server disaster recovery system and method, and server node |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106330523A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107665158A (en) * | 2017-09-22 | 2018-02-06 | 郑州云海信息技术有限公司 | A kind of storage cluster restoration methods and equipment |
CN108776633A (en) * | 2018-05-22 | 2018-11-09 | 深圳壹账通智能科技有限公司 | Method, terminal device and the computer readable storage medium of monitoring process operation |
CN108845916A (en) * | 2018-07-03 | 2018-11-20 | 中国联合网络通信集团有限公司 | Platform monitoring and alarm method, device, equipment and computer readable storage medium |
CN109150666A (en) * | 2018-10-11 | 2019-01-04 | 深圳互联先锋科技有限公司 | A method of preventing website delay machine |
CN109375873A (en) * | 2018-09-27 | 2019-02-22 | 郑州云海信息技术有限公司 | The initial method of data processing finger daemon in a kind of distributed storage cluster |
CN109408158A (en) * | 2018-11-06 | 2019-03-01 | 恒生电子股份有限公司 | Method and device, storage medium and the electronic equipment that subprocess is exited with parent process |
CN110032487A (en) * | 2018-11-09 | 2019-07-19 | 阿里巴巴集团控股有限公司 | Keep Alive supervision method, apparatus and electronic equipment |
CN111506480A (en) * | 2020-04-23 | 2020-08-07 | 上海达梦数据库有限公司 | State detection method, device and system for components in cluster |
CN111800304A (en) * | 2019-04-09 | 2020-10-20 | 安克创新科技股份有限公司 | Process running monitoring method, storage medium and virtual device |
CN111988188A (en) * | 2020-09-03 | 2020-11-24 | 深圳壹账通智能科技有限公司 | Transaction endorsement method, device and storage medium |
CN112035721A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Crawler cluster monitoring method and device, storage medium and computer equipment |
CN112994977A (en) * | 2021-02-24 | 2021-06-18 | 紫光云技术有限公司 | Method for high availability of server host |
CN113704026A (en) * | 2021-10-28 | 2021-11-26 | 北京时代正邦科技股份有限公司 | Distributed financial memory database security synchronization method, device and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1477509A (en) * | 2002-08-19 | 2004-02-25 | 万达信息股份有限公司 | Process automatic restoring method |
US20040073657A1 (en) * | 2002-10-11 | 2004-04-15 | John Palmer | Indirect measurement of business processes |
CN1512375A (en) * | 2002-12-31 | 2004-07-14 | 联想(北京)有限公司 | Fault-tolerance approach using machine group node interacting buckup |
CN1996257A (en) * | 2006-12-26 | 2007-07-11 | 华为技术有限公司 | Method and system for monitoring process |
CN102739435A (en) * | 2011-03-31 | 2012-10-17 | 微软公司 | Fault detection and recovery as service |
-
2015
- 2015-07-03 CN CN201510385495.9A patent/CN106330523A/en active Pending
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107665158A (en) * | 2017-09-22 | 2018-02-06 | 郑州云海信息技术有限公司 | Storage cluster recovery method and device |
CN108776633A (en) * | 2018-05-22 | 2018-11-09 | 深圳壹账通智能科技有限公司 | Method for monitoring process operation, terminal equipment and computer readable storage medium |
CN108776633B (en) * | 2018-05-22 | 2021-07-02 | 深圳壹账通智能科技有限公司 | Method for monitoring process operation, terminal equipment and computer readable storage medium |
CN108845916A (en) * | 2018-07-03 | 2018-11-20 | 中国联合网络通信集团有限公司 | Platform monitoring and alarm method, device, equipment and computer readable storage medium |
CN109375873A (en) * | 2018-09-27 | 2019-02-22 | 郑州云海信息技术有限公司 | Initialization method for a data processing daemon in a distributed storage cluster |
CN109150666A (en) * | 2018-10-11 | 2019-01-04 | 深圳互联先锋科技有限公司 | Method for preventing website downtime |
CN109408158A (en) * | 2018-11-06 | 2019-03-01 | 恒生电子股份有限公司 | Method and device for quitting child process along with parent process, storage medium and electronic equipment |
CN109408158B (en) * | 2018-11-06 | 2022-11-18 | 恒生电子股份有限公司 | Method and device for quitting child process along with parent process, storage medium and electronic equipment |
CN110032487A (en) * | 2018-11-09 | 2019-07-19 | 阿里巴巴集团控股有限公司 | Keep Alive supervision method, apparatus and electronic equipment |
CN111800304A (en) * | 2019-04-09 | 2020-10-20 | 安克创新科技股份有限公司 | Process running monitoring method, storage medium and virtual device |
CN111506480A (en) * | 2020-04-23 | 2020-08-07 | 上海达梦数据库有限公司 | State detection method, device and system for components in cluster |
CN111506480B (en) * | 2020-04-23 | 2024-03-08 | 上海达梦数据库有限公司 | Method, device and system for detecting states of components in cluster |
CN112035721A (en) * | 2020-07-22 | 2020-12-04 | 大箴(杭州)科技有限公司 | Crawler cluster monitoring method and device, storage medium and computer equipment |
WO2022048357A1 (en) * | 2020-09-03 | 2022-03-10 | 深圳壹账通智能科技有限公司 | Transaction endorsement method and apparatus, and storage medium |
CN111988188A (en) * | 2020-09-03 | 2020-11-24 | 深圳壹账通智能科技有限公司 | Transaction endorsement method, device and storage medium |
CN112994977A (en) * | 2021-02-24 | 2021-06-18 | 紫光云技术有限公司 | Method for high availability of server host |
CN113704026B (en) * | 2021-10-28 | 2022-01-25 | 北京时代正邦科技股份有限公司 | Distributed financial memory database security synchronization method, device and medium |
CN113704026A (en) * | 2021-10-28 | 2021-11-26 | 北京时代正邦科技股份有限公司 | Distributed financial memory database security synchronization method, device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106330523A (en) | | Cluster server disaster recovery system and method, and server node |
KR100658913B1 (en) | | A scalable method of continuous monitoring the remotely accessible resources against the node failures for very large clusters |
CN108847982B (en) | | Distributed storage cluster and node fault switching method and device thereof |
CN109656742B (en) | | Node exception handling method and device and storage medium |
CN107147540A (en) | | Fault handling method and troubleshooting cluster in highly available system |
CN111953566B (en) | | Distributed fault monitoring-based method and virtual machine high-availability system |
CN105681077A (en) | | Fault processing method, device and system |
CN108347339B (en) | | Service recovery method and device |
CN110618864A (en) | | Interrupt task recovery method and device |
CN105589756A (en) | | Batch processing cluster system and method |
JP2020115330A (en) | | System and method of monitoring software application process |
JP5855724B1 (en) | | Virtual device management apparatus, virtual device management method, and virtual device management program |
CN108632106A (en) | | System for monitoring service equipment |
CN113067850A (en) | | Cluster arrangement system under multi-cloud scene |
JP5285045B2 (en) | | Failure recovery method, server and program in virtual environment |
CN113986450A (en) | | Virtual machine backup method and device |
CN113377535A (en) | | Distributed timing task allocation method, device, equipment and readable storage medium |
CN115712521A (en) | | Cluster node fault processing method, system and medium |
CN114598591A (en) | | Embedded platform node fault recovery system and method |
CN111221620A (en) | | Storage method, storage device and storage medium |
JP2018169920A (en) | | Management device, management method and management program |
CN113672452A (en) | | Method and system for monitoring operation of data acquisition task |
CN112260902A (en) | | Network equipment monitoring method, device, equipment and storage medium |
CN115499296B (en) | | Cloud desktop hot standby management method, device and system |
CN1722627A (en) | | A method and device for realizing switching between main and backup units in communication equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170111 |