CN102957563A - Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system - Google Patents

Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system

Info

Publication number
CN102957563A
Authority
CN
China
Prior art keywords
node
data information
information acquisition
automatic recovery
data
Prior art date
Legal status
Granted
Application number
CN2012100312095A
Other languages
Chinese (zh)
Other versions
CN102957563B (en)
Inventor
单联瑜
丛龙水
董涛
李战强
孙世为
邢占军
孙友凯
段淼
刘玉梅
徐香明
赵军民
付巧娟
吴敏
车晓萍
刘芳
卢晋平
董倩
尚新民
侯树杰
郭见乐
Current Assignee
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Original Assignee
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Priority date
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp, Geophysical Research Institute of Sinopec Shengli Oilfield Co filed Critical China Petroleum and Chemical Corp
Priority to CN201210031209.5A
Publication of CN102957563A
Application granted
Publication of CN102957563B
Status: Active

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a Linux cluster fault automatic recovery method comprising the following steps: performing data information acquisition and judging whether a fault has occurred; if a fault has occurred, restarting the node; after the node has restarted, performing the data information acquisition again and, if the fault persists, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, if the fault persists, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, if the fault still persists, escalating to manual handling. The method greatly reduces manual effort, completes fault recovery of a cluster node system automatically, quickly and efficiently, meets the differing needs of heterogeneous clusters, supports multiple operating system versions, and improves the utilization of cluster resources.

Description

Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system
Technical field
The present invention relates to the optimization and application of large-scale cluster resource management systems, and in particular to a Linux cluster fault automatic recovery method.
Background technology
With the growth of computing demand, the scale of commodity computer clusters keeps expanding, and managing large clusters efficiently has become a pressing problem. Computer manufacturers at home and abroad have invested heavily in developing cluster-related products, ranging from free software to commercial software of widely varying functionality. These products concentrate on system management and monitoring but lack intelligent, automated tooling, so both the manageability and the availability of clusters suffer. Under the existing model, administrators must locate and diagnose fault points from their own experience, which is often time-consuming and makes it difficult to resolve problems quickly and return faulty nodes to service. For this reason we have invented a new Linux cluster fault automatic recovery method that solves the above technical problems.
Summary of the invention
The purpose of the present invention is to provide a Linux cluster fault automatic recovery method that can complete fault recovery of a cluster node system automatically, quickly and efficiently.
The purpose of the present invention can be achieved by the following technical measures: a Linux cluster fault automatic recovery method comprising performing data information acquisition and judging whether a fault has occurred; when a fault is judged to have occurred, restarting the node; after the node has restarted, performing the data information acquisition again and, when a fault is still judged to exist, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, escalating to manual handling.
The purpose of the present invention can also be achieved by the following technical measures:
The data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
The dynamic data acquisition and the static information acquisition obtain system information by reading the system's /proc file system.
The system service status data acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes their states into a database.
The application information data acquisition proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database.
The Linux cluster fault automatic recovery method sets a maximum update interval according to the update time of the data information acquisition; when the refresh time of the data information acquisition exceeds this maximum interval, a fault is judged to have occurred.
The method further includes, after the step of restarting the node, setting a flag bit indicating that the node has restarted, and clearing this flag bit when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing maintenance integration of the faulty node, setting a flag bit indicating that the node is maintenance-integrated, and clearing this flag bit when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing installation integration of the faulty node, setting a flag bit indicating that the node is installation-integrated, and clearing this flag bit when the data information acquisition is performed again and no fault is judged to exist.
The step of performing maintenance integration of the faulty node comprises setting the node to the maintenance state at the server end and restarting it; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state.
The step of performing installation integration of the faulty node comprises setting the node to the installation-integration state at the server end and restarting it; during startup the node reads a boot image from the network, enters network installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system.
The purpose of the present invention can also be achieved by the following technical measures: a Linux cluster fault automatic recovery system, characterized in that it comprises a data information acquisition and judgment module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgment module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart nodes, the maintenance integration module is used to perform maintenance integration of faulty nodes, and the installation integration module is used to perform installation integration of faulty nodes.
The purpose of the present invention can further be achieved by the following technical measures:
The data information acquisition and judgment module performs the data information acquisition and judges whether a fault has occurred. When it judges that a fault has occurred, the node restart module restarts the node. After the node has restarted, the data information acquisition and judgment module performs the acquisition again; if a fault is still judged to exist, the maintenance integration module performs maintenance integration of the faulty node. After the maintenance integration of the faulty node, the data information acquisition and judgment module performs the acquisition again; if a fault is still judged to exist, the installation integration module performs installation integration of the faulty node. After the installation integration of the faulty node, the data information acquisition and judgment module performs the acquisition again; if a fault is still judged to exist, it sends a message so that the fault can be handled manually.
The data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
The data information acquisition and judgment module obtains system information by reading the system's /proc file system for the dynamic data acquisition and the static information acquisition.
The data information acquisition and judgment module detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, for the system service status data acquisition, and writes their states into a database.
For the application information data acquisition, the data information acquisition and judgment module proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database.
The data information acquisition and judgment module sets a maximum update interval according to the update time of the data information acquisition; when the refresh time of the data information acquisition exceeds this maximum interval, the module judges that a fault has occurred.
When performing maintenance integration of a faulty node, the maintenance integration module sets the node to the maintenance state at the server end and restarts it; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state.
When performing installation integration of a faulty node, the installation integration module sets the node to the installation-integration state at the server end and restarts it; during startup the node reads a boot image from the network, enters network installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system.
The Linux cluster fault automatic recovery method of the present invention can collect the key operating information of a cluster system and store it centrally, establish an early-warning mechanism, handle cluster faults automatically at multiple levels, and provide detailed reference data for management decisions. It greatly reduces manual effort and returns faulty nodes to production use as quickly as possible. The method completes fault recovery of a cluster node system automatically, quickly and efficiently, meets the differing needs of heterogeneous clusters, supports multiple operating system versions, speeds up the return of cluster nodes to operation, is convenient to use, and improves the utilization of cluster resources.
Description of drawings
Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention;
Fig. 2 is a flow chart of the application information data acquisition step in Fig. 1;
Fig. 3 is a block diagram of the Linux cluster fault automatic recovery system of the present invention.
Embodiment
In order that the above and other objects, features and advantages of the present invention may be more clearly understood, preferred embodiments are set out below and described in detail with reference to the accompanying drawings.
As shown in Fig. 1, Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention. In step 101, data information acquisition is performed. Information acquisition can be divided into dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition. The information gathered by dynamic data acquisition includes total memory, used memory, free memory, shared memory, total swap, used swap, free swap, disk IOs per second, disk read rate, disk bytes read, disk write rate, disk bytes written, and so on. It is obtained by reading and parsing files in the /proc file system such as meminfo, stat, loadavg and snmp. The collected dynamic data is stored centrally: a service running on each collected node is responsible for writing the gathered information into a database.
The information gathered by static information acquisition includes the node's identifier, CPU model, CPU frequency, CPU count, cores per CPU, memory size, disk size, local file system names, the sizes of the file systems in the corresponding FSNames field, and so on. Static information acquisition reads and parses files in the /proc file system such as cpuinfo, partitions and mounts. The collected static data is likewise stored centrally: each node in the cluster runs a service that provides static information and listens for user requests. When the static information of a node needs to be collected, a remote command is executed and the collected data is returned over the network and stored centrally in the database. Both dynamic data acquisition and static information acquisition obtain system information by reading the /proc file system. This approach is fast and efficient, well suited to gathering information from a large number of nodes in parallel, and because the content of this file system changes little across kernel versions, it favors compatible programming.
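By way of illustration, the /proc-based collection described above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the fields parsed and the dictionary layout are illustrative choices, and a real collector would also read stat, snmp, partitions and mounts as listed above.

```python
import time

def read_proc_kv(path):
    """Parse colon-separated files such as /proc/meminfo into a dict."""
    info = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                key, _, value = line.partition(":")
                info[key.strip()] = value.strip()
    return info

def collect_dynamic():
    """Memory, swap and load figures, as in the dynamic data acquisition."""
    mem = read_proc_kv("/proc/meminfo")
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    return {"ts": time.time(),
            "mem_total": mem.get("MemTotal"),   # values as reported, e.g. "16337776 kB"
            "mem_free": mem.get("MemFree"),
            "swap_total": mem.get("SwapTotal"),
            "swap_free": mem.get("SwapFree"),
            "load1": load1}

def collect_static():
    """CPU model and logical CPU count, as in the static information acquisition."""
    model, cpus = None, 0
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("model name") and model is None:
                model = line.partition(":")[2].strip()
            elif line.startswith("processor"):
                cpus += 1
    return {"cpu_model": model, "cpu_count": cpus}
```

Because /proc is an in-memory kernel interface, such reads are cheap, which is what makes parallel collection across many nodes practical.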
System service status data acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes their states into the database, so that users can easily query the service states of these servers.
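A hedged sketch of such a probe follows. The patent does not specify the check command, the daemon names or the database schema; `systemctl is-active`, the unit names and the SQLite table used here are illustrative assumptions only.

```python
import sqlite3
import subprocess
import time

SERVICES = ["named", "ypserv", "ntpd"]  # assumed daemons for DNS, NIS, NTP

def probe(service):
    """Return 'UP' if the service is reported active, else 'DOWN'."""
    r = subprocess.run(["systemctl", "is-active", service],
                       capture_output=True, text=True)
    return "UP" if r.stdout.strip() == "active" else "DOWN"

def record_service_states(db_path="cluster.db"):
    """Write the current state of each monitored service into the database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS service_state "
                "(service TEXT, state TEXT, ts REAL)")
    for svc in SERVICES:
        con.execute("INSERT INTO service_state VALUES (?, ?, ?)",
                    (svc, probe(svc), time.time()))
    con.commit()
    con.close()
```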
Application information data acquisition proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database, so that users can easily query the application service states of these servers.
The collected information can also be displayed through a graphical interface. The graphical interface accesses the centrally stored collection data in the database through a unified data interface and customizes the display mode on demand, giving the system administrator a convenient and intuitive way to monitor the cluster.
After the dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition have been performed, the flow proceeds to step 102.
In step 102, it is judged whether a fault has occurred. In one embodiment, a maximum update interval is set as a threshold according to the update times of the acquired data (node, service and application information). When the refresh time of the monitored information exceeds this maximum interval, the node, service or application is deemed faulty and the flow proceeds to step 103; when it does not, no fault has occurred and the flow returns to step 101.
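A minimal sketch of this timeout rule follows, assuming a table node_state(node, ts) holding the refresh times of each node's monitoring data; the threshold value is a deployment-specific assumption.

```python
import sqlite3
import time

MAX_INTERVAL = 120.0  # assumed maximum update interval, in seconds

def stale_entries(db_path="cluster.db"):
    """Return the nodes whose collected data has not refreshed in time."""
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT node, MAX(ts) FROM node_state GROUP BY node")
    now = time.time()
    faulty = [node for node, ts in rows if ts is None or now - ts > MAX_INTERVAL]
    con.close()
    return faulty
```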
In step 103, the node is restarted. That is, the node is restarted by remote control and flagged as restarted, and the flow proceeds to step 104.
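The patent says only that a remote control method is used; the sketch below assumes password-less root ssh as that channel (IPMI power cycling would serve equally well) and keeps the restarted flag in memory rather than in the database.

```python
import subprocess

rebooted_flags = set()  # stand-in for the per-node "restarted" flag bit

def restart_node(node):
    """Step 103: reboot a node remotely and flag it as restarted."""
    subprocess.run(["ssh", "root@" + node, "reboot"], timeout=30)
    rebooted_flags.add(node)

def clear_restart_flag(node):
    """Step 106: clear the flag once the node passes the fault check."""
    rebooted_flags.discard(node)
```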
In step 104, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 105.
In step 105, as in step 102, it is judged whether a fault has occurred. If no fault exists, the flow proceeds to step 106; if a fault exists, the flow proceeds to step 107.
In step 106, the flag bit indicating that the node has restarted is cleared, and the flow returns to step 101.
In step 107, maintenance integration of the faulty node is performed. That is, for a node that has already been restarted but whose fault cannot be eliminated, the node is set to the maintenance state at the server end and restarted. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state. After the maintenance integration completes, the system restarts to bring the maintenance content into effect, the node is marked as maintenance-integrated, and the flow proceeds to step 108.
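One plausible realization of the server-side switch is rewriting the node's PXE boot entry before the reboot, as sketched below. The tftp root, label names, image paths and kernel arguments are assumptions, since the patent states only that the node is set to the maintenance state and reads a boot image from the network.

```python
import subprocess

TFTP_CFG = "/var/lib/tftpboot/pxelinux.cfg"  # assumed pxelinux layout

def set_boot_target(mac, label, kernel, append):
    """Point a node's next PXE boot at the given image."""
    path = "%s/01-%s" % (TFTP_CFG, mac.replace(":", "-").lower())
    with open(path, "w") as f:
        f.write("DEFAULT %s\nLABEL %s\n  KERNEL %s\n  APPEND %s\n"
                % (label, label, kernel, append))

def enter_maintenance(node, mac):
    """Step 107: switch the node to the maintenance image and reboot it."""
    set_boot_target(mac, "maint", "images/maint/vmlinuz",
                    "initrd=images/maint/initrd.img mode=restore-config")
    subprocess.run(["ssh", "root@" + node, "reboot"], timeout=30)
```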
In step 108, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 109.
In step 109, as in step 102, it is judged whether a fault has occurred. If no fault exists, the flow proceeds to step 110; if a fault exists, the flow proceeds to step 111.
In step 110, the node-maintenance-integrated flag bit is cleared, and the flow returns to step 101.
In step 111, installation integration of the faulty node is performed. For a node that still cannot operate normally after maintenance integration, the node is set to the installation-integration state at the server end and restarted. During startup the node reads a boot image from the network and enters network installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system. After the installation integration completes, the system restarts to bring the new system into effect, the node is marked as installation-integrated, and the flow proceeds to step 112.
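Under the same assumptions, installation integration differs from the maintenance case only in the boot target: the node is pointed at a network installer that fetches its packages over the network. A kickstart-style install is assumed in this sketch, and the kickstart path is illustrative.

```python
import subprocess

def enter_install(node, mac):
    """Step 111: switch the node to the network installer and reboot it."""
    # Reuses set_boot_target from the maintenance sketch above.
    set_boot_target(mac, "install", "images/install/vmlinuz",
                    "initrd=images/install/initrd.img "
                    "ks=nfs:server:/ks/default.cfg")
    subprocess.run(["ssh", "root@" + node, "reboot"], timeout=30)
```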
In step 112, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 113.
In step 113, as in step 102, it is judged whether a fault has occurred. If no fault exists, the flow proceeds to step 114; if a fault exists, the flow proceeds to step 115.
In step 114, the node-installation-integrated flag bit is cleared, and the flow returns to step 101.
In step 115, because the node is detected to be still unable to operate normally, a message is sent to the system administrator, who handles the fault manually.
Referring to Fig. 2, Fig. 2 is a flow chart of the application information data acquisition within the data information acquisition step of Fig. 1. Application information data acquisition proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then queried according to its node name and written into the database, so that users can easily query the application service states of these servers. It mainly comprises the following steps.
In step 201, services and processes are named in the database. That is, the key services and processes are determined by studying the various applications in production, and names are defined for them in the database. The flow proceeds to step 202.
In step 202, a state is defined for each of the different situations of a service or process: UP, DOWN or DEGRADE. UP means the state is normal and is shown in green on the interface; DOWN means the service is unavailable and is shown in red; DEGRADE means the service is usable but has problems and is shown in yellow. The flow proceeds to step 203.
In step 203, the host name is obtained, and the flow proceeds to step 204.
In step 204, the records corresponding to each applied service or process, and their number, are read, and the flow proceeds to step 205.
In step 205, the node name in one of the records is taken out, and the flow proceeds to step 206.
In step 206, it is judged whether the node name matches the host name. If it matches, the flow proceeds to step 207; if it does not, the flow returns to step 205.
In step 207, the state of the corresponding service or process is collected and written into the database, and the flow proceeds to step 208.
In step 208, it is judged from the number of records read whether the loop is complete, i.e. whether the node names in all records have been taken out. If the loop is complete, the flow proceeds to step 209; if not, the flow returns to step 205 after waiting a fixed time interval.
In step 209, the results of the status poll are sent to the remote database server for centralized storage, and the flow proceeds to step 210.
In step 210, the state of each application server is refreshed and displayed in the application server status bar of the interface when that bar is clicked. The flow then ends.
In Fig. 2, steps 205 to 208 are executed by the application information acquisition module, which runs as a daemon on each application server in the cluster and performs the status poll at a fixed time interval.
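The polling loop of steps 203-208 can be condensed into a daemon like the one below. The table layout, the pgrep-based check and the poll interval are illustrative assumptions; per step 209, a real implementation would ship its results to the remote central database rather than to the local file used here.

```python
import socket
import sqlite3
import subprocess
import time

POLL_INTERVAL = 60  # assumed fixed interval between polling passes

def process_state(proc_name):
    """Step 207's check: UP if a matching process exists, else DOWN."""
    r = subprocess.run(["pgrep", "-x", proc_name], capture_output=True)
    return "UP" if r.returncode == 0 else "DOWN"

def poll_once(db_path="cluster.db"):
    """One pass over the manually entered records (steps 203-208)."""
    host = socket.gethostname()  # step 203: get the host name
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS app_state "
                "(node TEXT, proc TEXT, state TEXT, ts REAL)")
    # Step 204: read the records entered in steps 201-202; the app_services
    # table (node, proc) is assumed to have been populated manually.
    rows = con.execute("SELECT node, proc FROM app_services").fetchall()
    for node, proc in rows:  # steps 205-206: match node name to host name
        if node != host:
            continue
        con.execute("INSERT INTO app_state VALUES (?, ?, ?, ?)",
                    (node, proc, process_state(proc), time.time()))  # step 207
    con.commit()
    con.close()

if __name__ == "__main__":
    while True:  # step 208: repeat over the records at a fixed interval
        poll_once()
        time.sleep(POLL_INTERVAL)
```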
In step 115 of Fig. 1, when the administrator takes manual control, the following steps may be performed: first, the node is marked for installation at the server end; then, by remote control, the node is restarted or reset; the node boots over the network via PXE and determines that a network installation is required; when a network installation is needed, the software packages to be installed are read via NFS, the system and all network components are installed, and the system is configured after installation; finally, fault detection and judgment are performed again and handled accordingly.
Fig. 3 is a block diagram of the Linux cluster fault automatic recovery system of the present invention. The system comprises a data information acquisition and judgment module 301, a node restart module 302, a maintenance integration module 303 and an installation integration module 304. The data information acquisition and judgment module 301 performs data information acquisition and judges whether a fault has occurred. Information acquisition can be divided into dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition. In one embodiment, the module 301 obtains system information by reading the system's /proc file system for the dynamic data acquisition and the static information acquisition. It detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, for the system service status data acquisition, and writes their states into a database. For the application information data acquisition, the module 301 proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database. The module 301 sets a maximum update interval as a threshold according to the update times of the acquired data (node, service and application information); when the refresh time of the monitored information exceeds this maximum interval, the node, service or application is deemed faulty.
When the data information acquisition and judgment module 301 judges that a fault has occurred, the node restart module 302 restarts the node by remote control and flags it as restarted. The module 301 then performs the data information acquisition again and judges whether a fault exists: if no fault exists, it clears the restarted flag bit; if a fault exists, the maintenance integration module 303 performs maintenance integration of the faulty node.
For a node that has been restarted but whose fault cannot be eliminated, the maintenance integration module 303 sets the node to the maintenance state at the server end and restarts it. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state. After the maintenance integration completes, the system restarts to bring the maintenance content into effect, and the maintenance integration module 303 marks the node as maintenance-integrated. The module 301 then performs the data information acquisition again and judges whether a fault exists: if no fault exists, it clears the node-maintenance-integrated flag bit; if a fault exists, the installation integration module 304 performs installation integration of the faulty node.
For a node that still cannot operate normally after maintenance integration, the installation integration module 304 sets the node to the installation-integration state at the server end and restarts it. During startup the node reads a boot image from the network and enters network installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system. After the installation integration completes, the system restarts to bring the new system into effect, and the installation integration module 304 marks the node as installation-integrated. The module 301 then performs the data information acquisition again and judges whether a fault exists: if no fault exists, the installation integration module 304 clears the node-installation-integrated flag bit; if a fault exists, the data information acquisition and judgment module 301 sends a message to the system administrator, who handles the fault manually.
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit it; the protection scope of the present invention is defined by the appended claims. Those skilled in the art may make various modifications or equivalent replacements to the present invention within its essence and protection scope, and such modifications or equivalent replacements shall also be deemed to fall within the protection scope of the present invention.

Claims (10)

1. A Linux cluster fault automatic recovery method, characterized in that the method comprises:
performing data information acquisition and judging whether a fault has occurred;
when a fault is judged to have occurred, restarting the node;
after the node has restarted, performing the data information acquisition again and, when a fault is still judged to exist, performing maintenance integration of the faulty node;
after the maintenance integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, performing installation integration of the faulty node; and
after the installation integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, performing manual handling.
2. The Linux cluster fault automatic recovery method according to claim 1, characterized in that the data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
3. The Linux cluster fault automatic recovery method according to claim 2, characterized in that the dynamic data acquisition and the static information acquisition obtain system information by reading the system's /proc file system.
4. The Linux cluster fault automatic recovery method according to claim 2, characterized in that the system service status data acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes their states into a database.
5. A Linux cluster fault automatic recovery system, characterized in that it comprises a data information acquisition and judgment module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgment module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart nodes, the maintenance integration module is used to perform maintenance integration of faulty nodes, and the installation integration module is used to perform installation integration of faulty nodes.
6. The Linux cluster fault automatic recovery system according to claim 5, characterized in that the data information acquisition and judgment module performs the data information acquisition and judges whether a fault has occurred; when it judges that a fault has occurred, the node restart module restarts the node; after the node has restarted, the data information acquisition and judgment module performs the acquisition again and, when a fault is still judged to exist, the maintenance integration module performs maintenance integration of the faulty node; after the maintenance integration of the faulty node, the data information acquisition and judgment module performs the acquisition again and, when a fault is still judged to exist, the installation integration module performs installation integration of the faulty node; and after the installation integration of the faulty node, the data information acquisition and judgment module performs the acquisition again and, when a fault is still judged to exist, sends a message so that the fault can be handled manually.
7. The Linux cluster fault automatic recovery system according to claim 5, characterized in that the data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
8. The Linux cluster fault automatic recovery system according to claim 7, characterized in that the data information acquisition and judgment module obtains system information by reading the system's /proc file system for the dynamic data acquisition and the static information acquisition.
9. The Linux cluster fault automatic recovery system according to claim 7, characterized in that the data information acquisition and judgment module detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, for the system service status data acquisition, and writes their states into a database.
10. The Linux cluster fault automatic recovery system according to claim 5, characterized in that the data information acquisition and judgment module sets a maximum update interval according to the update time of the data information acquisition, and judges that a fault has occurred when the refresh time of the data information acquisition exceeds this maximum interval.
CN201210031209.5A 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system Active CN102957563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
CN2011102345474 2011-08-16
CN201110234547.4 2011-08-16
CN201110234547 2011-08-16
CN201110331264 2011-10-27
CN2011103312641 2011-10-27
CN201110331264.1 2011-10-27
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Publications (2)

Publication Number Publication Date
CN102957563A 2013-03-06
CN102957563B CN102957563B (en) 2016-07-06

Family

ID=47765829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210031209.5A Active CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Country Status (1)

Country Link
CN (1) CN102957563B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852152A (en) * 2005-11-16 2006-10-25 华为技术有限公司 Method for recovering primary coufiguration performance of network terminal
CN101207519A (en) * 2007-12-13 2008-06-25 上海华为技术有限公司 Version server, operation maintenance unit and method for restoring failure
CN101403983A (en) * 2008-11-25 2009-04-08 北京航空航天大学 Resource monitoring method and system for multi-core processor based on virtual machine
CN101741619A (en) * 2009-12-24 2010-06-16 中国人民解放军信息工程大学 Self-curing J2EE application server for intrusion tolerance and self-curing method thereof

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103197992A (en) * 2013-04-08 2013-07-10 汉柏科技有限公司 Automatic recovering method of Gluster FS (File System) split-brain
CN103197992B (en) * 2013-04-08 2016-05-18 汉柏科技有限公司 The automation restoration methods of GlusterFS fissure
CN103475734A (en) * 2013-09-25 2013-12-25 浪潮电子信息产业股份有限公司 Linux cluster user backup migration method
CN104123192A (en) * 2014-08-04 2014-10-29 浪潮电子信息产业股份有限公司 Performance optimization method based on memory subsystem in linux system
CN107391335A (en) * 2016-03-31 2017-11-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for checking cluster health status
CN107391335B (en) * 2016-03-31 2021-09-03 阿里巴巴集团控股有限公司 Method and equipment for checking health state of cluster
CN111193616A (en) * 2019-12-13 2020-05-22 广州朗国电子科技有限公司 Automatic operation and maintenance method, device and system, storage medium and automatic operation and maintenance server

Also Published As

Publication number Publication date
CN102957563B (en) 2016-07-06


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant