CN102957563B - Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system - Google Patents

Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Info

Publication number
CN102957563B
CN102957563B CN201210031209.5A CN201210031209A
Authority
CN
China
Prior art keywords
node
information acquisition
data information
integrated
installation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210031209.5A
Other languages
Chinese (zh)
Other versions
CN102957563A (en)
Inventor
单联瑜
丛龙水
董涛
李战强
孙世为
邢占军
孙友凯
段淼
刘玉梅
徐香明
赵军民
付巧娟
吴敏
车晓萍
刘芳
卢晋平
董倩
尚新民
侯树杰
郭见乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Original Assignee
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp and Geophysical Research Institute of Sinopec Shengli Oilfield Co
Priority to CN201210031209.5A priority Critical patent/CN102957563B/en
Publication of CN102957563A publication Critical patent/CN102957563A/en
Application granted granted Critical
Publication of CN102957563B publication Critical patent/CN102957563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The present invention provides a Linux cluster fault automatic recovery method, which includes: performing data information acquisition and judging whether a fault has occurred; when a fault is judged to have occurred, restarting the node; after the node is restarted, performing the data information acquisition again and, if a fault is still judged to exist, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, handing the node over for manual treatment. This Linux cluster fault automatic recovery method greatly reduces manual effort, can complete fault recovery of cluster node systems automatically, quickly and efficiently, satisfies the differing requirements of heterogeneous clusters, supports multiple operating system versions, and improves the utilization of cluster resources.

Description

Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system
Technical field
The present invention relates to the optimization and application of large-scale cluster resource management systems, and in particular to a Linux cluster fault automatic recovery method.
Background technology
As computing demands grow, the scale of PC clusters keeps expanding, and managing a large-scale cluster efficiently has become an urgent and difficult problem. Computer manufacturers at home and abroad have invested substantial research and development effort in cluster-related products, ranging from free software to commercial software with differing functionality; their main functions concentrate on system administration and monitoring, but the tools lack intelligence and automation, so the manageability and availability of clusters are strongly affected. Under the existing model, administrators rely on their own experience to locate and diagnose points of failure, which is often time-consuming and makes it difficult to handle problems quickly and return faulty nodes to production. For this reason we have invented a new Linux cluster fault automatic recovery method that solves the above technical problems.
Summary of the invention
The object of the present invention is to provide a Linux cluster fault automatic recovery method that can complete fault recovery of cluster node systems automatically, quickly and efficiently.
The object of the present invention can be achieved by the following technical measures: a Linux cluster fault automatic recovery method, which includes: performing data information acquisition and judging whether a fault has occurred; when a fault is judged to have occurred, restarting the node; after the node is restarted, performing the data information acquisition again and, if a fault is still judged to exist, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, carrying out manual treatment.
The object of the present invention can also be achieved by the following technical measures:
The data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition.
The dynamic data acquisition and static information acquisition obtain system information by reading the /proc file system.
The system service state acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes those states into a database.
The application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is detected according to its node name and written into the database.
The Linux cluster fault automatic recovery method sets a maximum update interval based on the update time of the data information acquisition; when the time since the last update of the acquired data exceeds this maximum interval, a fault is judged to have occurred.
The method further includes, after the step of restarting the node, marking the node with a 'restarted' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing maintenance integration of the faulty node, marking the node with a 'maintenance integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing installation integration of the faulty node, marking the node with an 'installation integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
The step of performing maintenance integration of the faulty node includes setting the node to the maintenance state on the server side and restarting the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration.
The step of performing installation integration of the faulty node includes setting the node to the installation integration state on the server side and restarting the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system.
The object of the present invention can also be achieved by the following technical measures: a Linux cluster fault automatic recovery system, characterized in that the system includes a data information acquisition and judgement module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgement module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart a node, the maintenance integration module is used to perform maintenance integration of a faulty node, and the installation integration module is used to perform installation integration of a faulty node.
The object of the present invention can also be achieved by the following technical measures:
The data information acquisition and judgement module performs data information acquisition and judges whether a fault has occurred. When it judges that a fault has occurred, the node restart module restarts the node. After the node restart module has restarted the node, if the data information acquisition and judgement module performs the acquisition again and still judges that a fault exists, the maintenance integration module performs maintenance integration of the faulty node. After the maintenance integration module has performed maintenance integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the installation integration module performs installation integration of the faulty node. After the installation integration module has performed installation integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the data information acquisition and judgement module sends a message so that manual treatment can be carried out.
The data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition.
The data information acquisition and judgement module obtains system information by reading the /proc file system, thereby performing the dynamic data acquisition and static information acquisition.
The data information acquisition and judgement module detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, to perform the system service state acquisition, and writes those states into a database.
Proceeding from the actual situation of production applications, the data information acquisition and judgement module first enumerates the requirements arising in practical use; the node name and application services of each application server are then entered manually as required and saved into the database; and the module then detects the application service state of each server according to its node name and writes that state into the database, thereby performing the application information acquisition.
The data information acquisition and judgement module sets a maximum update interval based on the update time of the data information acquisition; when the time since the last update of the acquired data exceeds this maximum interval, the module judges that a fault has occurred.
When performing maintenance integration of a faulty node, the maintenance integration module sets the node to the maintenance state on the server side and restarts the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration.
When performing installation integration of a faulty node, the installation integration module sets the node to the installation integration state on the server side and restarts the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system.
The Linux cluster fault automatic recovery method of the present invention can collect and centrally store the key information produced while the cluster system runs, establish an early-warning mechanism, handle cluster faults automatically at multiple levels, and provide detailed reference data for management decisions; it greatly reduces manual effort and returns faulty nodes to production as quickly as possible. The method can complete fault recovery of cluster node systems automatically, quickly and efficiently, satisfies the differing requirements of heterogeneous clusters, supports multiple operating system versions, speeds up the return of cluster nodes to operation, is convenient to use, and improves the utilization of cluster resources.
Brief description of the drawings
Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention;
Fig. 2 is a flow chart of the application information acquisition step in Fig. 1;
Fig. 3 is a module diagram of the Linux cluster fault automatic recovery system of the present invention.
Detailed description of the invention
To make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are set out below and described in detail with reference to the accompanying drawings.
Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention. In step 101, data information acquisition is performed; the acquisition can be divided into dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition. The information gathered by the dynamic data acquisition mainly includes total memory, used memory, free memory, shared memory, total swap space, used swap space, free swap space, disk I/O operations per second, disk read rate, bytes read from disk, disk write rate and bytes written to disk. The dynamic data acquisition obtains this information by reading and parsing files such as meminfo, stat, loadavg and snmp in the /proc file system. The collected dynamic data is stored centrally: a service runs on the node being monitored and is responsible for storing the collected information in the database.
The information gathered by the static information acquisition mainly includes the node name, the CPU identifier, model, frequency and frequency unit, the number of CPUs, the number of cores per CPU, the memory size, the disk size, the local file system names and the sizes of the file systems listed in the corresponding FSNames field. The static information acquisition obtains this information by parsing files such as cpuinfo, partitions and mounts in the /proc file system. The collected static data is likewise stored centrally; each node in the cluster runs a service that provides static information and listens for user requests. When the static information of a node needs to be gathered, a command is executed remotely and the collected data is returned over the network and stored centrally in the database. Obtaining system information by reading the /proc file system makes the dynamic and static acquisition fast and efficient and well suited to collecting information from a large number of nodes in parallel; the contents of this file system change relatively little across kernel versions, which helps program compatibility.
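By way of illustration only (not part of the claimed subject matter), a minimal Python sketch of this kind of /proc-based collection is given below; the selection of files and field names is an assumption made for the example.

```python
# Minimal sketch of /proc-based collection (illustrative only; field choices are assumptions).
def read_meminfo():
    """Parse /proc/meminfo into a {key: value-in-kB} dictionary."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key.strip()] = int(rest.split()[0])  # values are reported in kB
    return info

def read_loadavg():
    """Return the 1/5/15-minute load averages from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def read_cpu_model():
    """Return the CPU model name from /proc/cpuinfo (static information)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    mem = read_meminfo()
    print("total/free memory (kB):", mem.get("MemTotal"), mem.get("MemFree"))
    print("load averages:", read_loadavg())
    print("CPU model:", read_cpu_model())
```

Because the values come straight from the kernel's /proc interface, such a collector needs no extra daemons on the node beyond the storage service described above.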
The system service state acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes those states into the database so that users can conveniently query the service states of these servers.
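As an illustrative sketch only, the service probes could be driven by the standard client commands and the results written to a small table; the probe commands, the example master server name, the table layout and the use of SQLite are all assumptions made for the example.

```python
# Illustrative sketch: probe DNS/NIS/NTP servers and record their state in a database.
import sqlite3
import subprocess
import time

CHECKS = {
    "DNS": ["host", "-W", "2", "master.example.com"],  # hypothetical DNS master name
    "NIS": ["ypwhich"],                                 # reports the currently bound NIS server
    "NTP": ["ntpq", "-p"],                              # lists the NTP peers of the local daemon
}

def probe(cmd):
    """Return 'UP' if the probe command exits successfully, otherwise 'DOWN'."""
    try:
        ok = subprocess.run(cmd, capture_output=True, timeout=5).returncode == 0
        return "UP" if ok else "DOWN"
    except (OSError, subprocess.TimeoutExpired):
        return "DOWN"

def record_service_states(db_path="cluster.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS service_state "
                "(service TEXT, state TEXT, updated REAL)")
    for name, cmd in CHECKS.items():
        con.execute("INSERT INTO service_state VALUES (?, ?, ?)",
                    (name, probe(cmd), time.time()))
    con.commit()
    con.close()
```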
The application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is detected according to its node name and written into the database, so that users can conveniently query the application service states of these servers.
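For illustration only, the manually entered node-name/application-service records could live in a registry table such as the following sketch; the table name, columns and example values are assumptions.

```python
# Illustrative sketch of the manually maintained application registry (schema is an assumption).
import sqlite3

con = sqlite3.connect("cluster.db")
con.execute("CREATE TABLE IF NOT EXISTS app_registry (node_name TEXT, app_service TEXT)")
# An operator registers which node is expected to run which application service, e.g.:
con.execute("INSERT INTO app_registry VALUES (?, ?)", ("node017", "license_server"))
con.commit()
con.close()
```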
The collected information can also be displayed through a graphical interface. The graphical interface accesses the centrally stored acquisition data in the database through a unified data interface, and its display can be customized as required, providing the system administrator with a convenient and intuitive way of monitoring.
After the dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition have been performed, the flow proceeds to step 102.
In step 102, it is judged whether a fault has occurred. In one embodiment, a maximum update-interval threshold is set based on the update time of the data information acquisition (information about nodes, services and applications); when the time since the collected information was last refreshed exceeds this maximum interval, the node, service or application is deemed to have failed and the flow proceeds to step 103; when it does not exceed the maximum interval, no fault has occurred and the flow returns to step 101.
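A minimal sketch of this staleness test follows, assuming the last update is stored as a timestamp; the threshold value is an assumption.

```python
# Illustrative sketch of the staleness check of step 102: a node, service or application
# is treated as faulty when its last recorded update is older than the maximum interval.
import time

MAX_UPDATE_INTERVAL = 300  # seconds; the threshold value is an assumption

def is_faulty(last_update_timestamp, now=None):
    now = time.time() if now is None else now
    return (now - last_update_timestamp) > MAX_UPDATE_INTERVAL
```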
In step 103, the node is restarted. That is, the node is restarted using a remote-control method and is marked with a 'restarted' flag, and the flow proceeds to step 104.
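The patent does not specify the remote-control mechanism; purely as a sketch, the reboot could be issued through the node's BMC (ipmitool) or over ssh and the flag recorded in the database. The tool choice, credentials, host names and flag table are assumptions.

```python
# Illustrative sketch of step 103: reboot a node remotely and mark it as restarted.
import sqlite3
import subprocess

def remote_reboot(node, bmc_addr=None):
    """Reboot a node, preferring its BMC (IPMI) and falling back to ssh."""
    if bmc_addr:
        cmd = ["ipmitool", "-H", bmc_addr, "-U", "admin", "-P", "admin",
               "chassis", "power", "cycle"]
    else:
        cmd = ["ssh", node, "reboot"]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def set_flag(node, flag, db_path="cluster.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS node_flags (node TEXT, flag TEXT)")
    con.execute("INSERT INTO node_flags VALUES (?, ?)", (node, flag))
    con.commit()
    con.close()

# Example use: remote_reboot("node017", bmc_addr="10.0.1.17"); set_flag("node017", "restarted")
```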
In step 104, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 105.
In step 105, as in step 102, it is judged whether a fault has occurred; when no fault is judged to exist, the flow proceeds to step 106; when a fault is judged to exist, the flow proceeds to step 107.
In step 106, the 'restarted' flag of the node is cleared, and the flow returns to step 101.
In step 107, maintenance integration of the faulty node is performed. That is, for a node that has already been restarted but whose fault cannot be eliminated, the node is set to the maintenance state on the server side and then restarted. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration. After the maintenance integration is complete, the system is restarted again so that the maintenance-integrated content takes effect, the node is marked as maintenance integrated, and the flow proceeds to step 108.
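The patent only states that the node boots an image from the network; one common way to realize "setting the node to the maintenance state on the server side" is to rewrite its per-node PXE boot entry before rebooting it. The sketch below illustrates this under stated assumptions (the pxelinux.cfg layout, image paths and kickstart URL are all hypothetical); the same mechanism can point the node at a full reinstall target for the installation integration of step 111.

```python
# Illustrative sketch: switch a node's network-boot target on the server before rebooting it.
import os

PXE_DIR = "/var/lib/tftpboot/pxelinux.cfg"  # assumed TFTP/pxelinux layout

TEMPLATES = {
    "maintenance": (
        "DEFAULT maintenance\n"
        "LABEL maintenance\n"
        "  KERNEL images/maint/vmlinuz\n"
        "  APPEND initrd=images/maint/initrd.img mode=restore-config\n"
    ),
    "install": (
        "DEFAULT install\n"
        "LABEL install\n"
        "  KERNEL images/os/vmlinuz\n"
        "  APPEND initrd=images/os/initrd.img ks=nfs:server:/ks/node.cfg\n"
    ),
}

def set_boot_target(node_mac, target):
    """Write the per-node PXE config (file named 01-<mac>) so the next boot uses `target`."""
    name = "01-" + node_mac.lower().replace(":", "-")
    with open(os.path.join(PXE_DIR, name), "w") as f:
        f.write(TEMPLATES[target])

# Example use: set_boot_target("aa:bb:cc:dd:ee:01", "maintenance"), then reboot the node remotely.
```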
In step 108, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 109.
In step 109, as in step 102, it is judged whether a fault has occurred; when no fault is judged to exist, the flow proceeds to step 110; when a fault is judged to exist, the flow proceeds to step 111.
In step 110, the 'maintenance integrated' flag of the node is cleared, and the flow returns to step 101.
In step 111, installation integration of the faulty node is performed. For a node that is detected as still not running normally after maintenance integration, the node is set to the installation integration state on the server side and restarted. During startup the node reads a boot image from the network and enters networked node installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system. After the installation integration is complete, the system is restarted again so that the new system takes effect, the node is marked as installation integrated, and the flow proceeds to step 112.
In step 112, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 113.
In step 113, as in step 102, it is judged whether a fault has occurred; when no fault is judged to exist, the flow proceeds to step 114; when a fault is judged to exist, the flow proceeds to step 115.
In step 114, the 'installation integrated' flag of the node is cleared, and the flow returns to step 101.
In step 115, since the node is detected as still not running normally, a message is sent to the system administrator, who carries out manual treatment.
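Purely as an illustration, the escalation order of steps 101 to 115 (re-collect after each recovery level, and escalate only while the fault persists) can be summarized in a short sketch; the callable arguments are stand-ins for the collection, restart, maintenance integration, installation integration and notification operations described above, and flag handling is omitted.

```python
# Compact sketch of the Fig. 1 escalation logic. The recovery actions are stand-in callables;
# only the ordering (restart -> maintenance -> installation -> manual) follows the flow above.
def recover(node, collect, is_faulty, restart, maintain, reinstall, notify_admin):
    """Escalate through the recovery levels until the node reports healthy."""
    if not is_faulty(collect(node)):
        return "ok"
    for action, label in ((restart, "restarted"),
                          (maintain, "maintenance-integrated"),
                          (reinstall, "install-integrated")):
        action(node)                      # perform this recovery level
        if not is_faulty(collect(node)):  # re-collect and re-check (steps 104/105, 108/109, 112/113)
            return label                  # the corresponding flag would be cleared here
    notify_admin(node)                    # all automatic levels failed (step 115)
    return "manual"
```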
Referring to Fig. 2, Fig. 2 is a flow chart of the application information acquisition within the data information acquisition step of Fig. 1. The application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is queried according to its node name and written into the database, so that users can conveniently query the application service states of these servers. It mainly includes the following steps.
In step 201, names are defined in the database for services and processes; that is, the applications used in production are examined, the key services and processes among them are identified, and names are defined for them in the database. The flow proceeds to step 202.
In step 202, corresponding states are defined for the different conditions of each service and process: UP, DOWN and DEGRADE. UP means the state is normal and is shown in green on the interface; DOWN means the service is unavailable and is shown in red; DEGRADE means the service is available but has problems and is shown in yellow. The flow proceeds to step 203.
In step 203, the host name is obtained, and the flow proceeds to step 204.
In step 204, the records corresponding to each application service or process, and their number, are read, and the flow proceeds to step 205.
In step 205, the node name in one of the records is taken out, and the flow proceeds to step 206.
In step 206, it is judged whether the node name matches the host name; when they match, the flow proceeds to step 207; when they do not match, the flow returns to step 205.
In step 207, the state of the corresponding service or process is gathered and written into the database, and the flow proceeds to step 208.
In step 208, it is judged from the number of records read whether the loop is complete, that is, whether the node names in all records have been taken out; when the loop is complete, the flow proceeds to step 209; when it is not yet complete, the flow returns to step 205 after waiting a fixed time interval.
In step 209, the results of the status poll are sent to the remote database server for centralized storage. The flow proceeds to step 210.
In step 210, in the application server status column of the interface, the state of each application server is updated and displayed when the column is clicked. This flow then ends.
In Fig. 2, steps 205 to 208 are run in daemon mode by the application information acquisition module on each application server in the cluster, polling the status at a fixed time interval.
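By way of illustration only, such a per-node polling daemon (steps 205 to 208) might look like the sketch below; the registry and state tables, the crude process-name probe, and the polling interval are assumptions, and the fixed interval is applied once per polling pass rather than per record.

```python
# Illustrative sketch of the per-node application-status polling daemon (steps 205-208).
import socket
import sqlite3
import subprocess
import time

POLL_INTERVAL = 60  # seconds; assumed fixed time interval

def service_state(proc_name):
    """Crude probe: UP if a process with this exact name is running, otherwise DOWN."""
    found = subprocess.run(["pgrep", "-x", proc_name], capture_output=True).returncode == 0
    return "UP" if found else "DOWN"

def poll_forever(db_path="cluster.db"):
    host = socket.gethostname()
    while True:
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS app_state "
                    "(node TEXT, service TEXT, state TEXT, updated REAL)")
        rows = con.execute("SELECT node_name, app_service FROM app_registry").fetchall()
        for node_name, app_service in rows:
            if node_name != host:          # step 206: only probe services assigned to this host
                continue
            con.execute("INSERT INTO app_state VALUES (?, ?, ?, ?)",
                        (host, app_service, service_state(app_service), time.time()))
        con.commit()
        con.close()
        time.sleep(POLL_INTERVAL)
```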
In step 115 of Fig. 1, when the administrator performs manual handling, the following steps may be included: first, the installation server marks the node for installation; then, through remote control, the node is restarted or reset; the node boots over the network via PXE and determines that a network installation is required; when a network installation is needed, the software packages to be installed are read via NFS, the system installation is carried out, and the network and system configuration are performed after installation; finally, fault detection and judgement are performed again and handled accordingly.
Fig. 3 is a module diagram of the Linux cluster fault automatic recovery system of the present invention. The system includes a data information acquisition and judgement module 301, a node restart module 302, a maintenance integration module 303 and an installation integration module 304. The data information acquisition and judgement module 301 performs data information acquisition and judges whether a fault has occurred. The acquisition can be divided into dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition. In one embodiment, the data information acquisition and judgement module 301 obtains system information by reading the /proc file system, thereby performing the dynamic data acquisition and static information acquisition. The module 301 detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, to perform the system service state acquisition, and writes those states into the database. Proceeding from the actual situation of production applications, the module 301 first enumerates the requirements arising in practical use; the node name and application services of each application server are then entered manually as required and saved into the database; the module then detects the application service state of each server according to its node name and writes that state into the database, thereby performing the application information acquisition. The data information acquisition and judgement module 301 sets a maximum update-interval threshold based on the update time of the data information acquisition (information about nodes, services and applications); when the time since the collected information was last refreshed exceeds this maximum interval, the node, service or application is deemed to have failed.
When the data information acquisition and judgement module 301 judges that a fault has occurred, the node restart module 302 restarts the node by remote control and marks it as restarted. The module 301 then performs the data information acquisition again and judges whether a fault still exists; when no fault is judged to exist, the module 301 clears the node's 'restarted' flag; when a fault is judged to exist, the maintenance integration module 303 performs maintenance integration of the faulty node.
For a node that has been restarted but whose fault cannot be eliminated, the maintenance integration module 303 sets the node to the maintenance state on the server side and restarts it. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration. After the maintenance integration is complete, the system is restarted again so that the maintenance-integrated content takes effect, and the maintenance integration module 303 marks the node as maintenance integrated. The data information acquisition and judgement module 301 then performs the data information acquisition again and judges whether a fault still exists; when no fault is judged to exist, the module 301 clears the node's 'maintenance integrated' flag; when a fault is judged to exist, the installation integration module 304 performs installation integration of the faulty node.
For a node that is detected as still not running normally after maintenance integration, the installation integration module 304 sets the node to the installation integration state on the server side and restarts it. During startup the node reads a boot image from the network and enters networked node installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system. After the installation integration is complete, the system is restarted again so that the new system takes effect, and the installation integration module 304 marks the node as installation integrated. The data information acquisition and judgement module 301 then performs the data information acquisition again and judges whether a fault still exists; when no fault is judged to exist, the installation integration module 304 clears the node's 'installation integrated' flag; when a fault is judged to exist, the data information acquisition and judgement module 301 sends a message to the system administrator, who carries out manual treatment.
The above embodiments are merely exemplary embodiments of the present invention and are not intended to limit it; the scope of protection of the present invention is defined by the appended claims. Those skilled in the art may make various modifications or equivalent substitutions within the spirit and scope of the present invention, and such modifications or equivalent substitutions shall also be regarded as falling within the scope of protection of the present invention.

Claims (5)

1. A Linux cluster fault automatic recovery method, characterized in that the method includes:
performing data information acquisition and judging whether a fault has occurred;
when a fault is judged to have occurred, restarting the node;
after the node is restarted, performing the data information acquisition again and, if a fault is still judged to exist, performing maintenance integration of the faulty node;
after the maintenance integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, performing installation integration of the faulty node; and
after the installation integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, carrying out manual treatment;
wherein the data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition;
the dynamic data acquisition and static information acquisition obtain system information by reading the /proc file system;
the step of performing maintenance integration of the faulty node includes setting the node to the maintenance state on the server side and restarting the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration;
the step of performing installation integration of the faulty node includes setting the node to the installation integration state on the server side and restarting the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system;
the method sets a maximum update interval based on the update time of the data information acquisition, and when the time since the last update of the acquired data exceeds this maximum interval, a fault is judged to have occurred;
the method further includes, after the step of restarting the node, marking the node with a 'restarted' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist;
the method further includes, after the step of performing maintenance integration of the faulty node, marking the node with a 'maintenance integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist;
the method further includes, after the step of performing installation integration of the faulty node, marking the node with an 'installation integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
2. The Linux cluster fault automatic recovery method according to claim 1, characterized in that the system service state acquisition detects the service states of the DNS, NIS and NTP master and slave servers of the whole cluster, and writes those states into a database.
3. The Linux cluster fault automatic recovery method according to claim 1, characterized in that the application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is detected according to its node name and written into the database.
4. A Linux cluster fault automatic recovery system, characterized in that the system includes a data information acquisition and judgement module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgement module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart a node, the maintenance integration module is used to perform maintenance integration of a faulty node, and the installation integration module is used to perform installation integration of a faulty node;
wherein the data information acquisition and judgement module performs data information acquisition and judges whether a fault has occurred; when it judges that a fault has occurred, the node restart module restarts the node; after the node restart module has restarted the node, if the data information acquisition and judgement module performs the acquisition again and still judges that a fault exists, the maintenance integration module performs maintenance integration of the faulty node; after the maintenance integration module has performed maintenance integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the installation integration module performs installation integration of the faulty node; and after the installation integration module has performed installation integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the data information acquisition and judgement module sends a message so that manual treatment can be carried out;
the data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition;
the data information acquisition and judgement module sets a maximum update interval based on the update time of the data information acquisition, and when the time since the last update of the acquired data exceeds this maximum interval, the module judges that a fault has occurred;
when performing maintenance integration of a faulty node, the maintenance integration module sets the node to the maintenance state on the server side and restarts the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration;
when performing installation integration of a faulty node, the installation integration module sets the node to the installation integration state on the server side and restarts the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system;
the data information acquisition and judgement module obtains system information by reading the /proc file system, thereby performing the dynamic data acquisition and static information acquisition;
the data information acquisition and judgement module detects the service states of the DNS, NIS and NTP master and slave servers of the whole cluster to perform the system service state acquisition, and writes those states into a database.
5. The Linux cluster fault automatic recovery system according to claim 4, characterized in that, proceeding from the actual situation of production applications, the data information acquisition and judgement module first enumerates the requirements arising in practical use; the node name and application services of each application server are then entered manually as required and saved into the database; and the module then detects the application service state of each server according to its node name and writes that state into the database, thereby performing the application information acquisition.
CN201210031209.5A 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system Active CN102957563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
CN201110234547 2011-08-16
CN2011102345474 2011-08-16
CN201110234547.4 2011-08-16
CN2011103312641 2011-10-27
CN201110331264 2011-10-27
CN201110331264.1 2011-10-27
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Publications (2)

Publication Number Publication Date
CN102957563A CN102957563A (en) 2013-03-06
CN102957563B (en) 2016-07-06

Family

ID=47765829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210031209.5A Active CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Country Status (1)

Country Link
CN (1) CN102957563B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103197992B (en) * 2013-04-08 2016-05-18 汉柏科技有限公司 The automation restoration methods of GlusterFS fissure
CN103475734A (en) * 2013-09-25 2013-12-25 浪潮电子信息产业股份有限公司 Linux cluster user backup migration method
CN104123192A (en) * 2014-08-04 2014-10-29 浪潮电子信息产业股份有限公司 Performance optimization method based on memory subsystem in linux system
CN107391335B (en) * 2016-03-31 2021-09-03 阿里巴巴集团控股有限公司 Method and equipment for checking health state of cluster
CN111193616A (en) * 2019-12-13 2020-05-22 广州朗国电子科技有限公司 Automatic operation and maintenance method, device and system, storage medium and automatic operation and maintenance server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852152A (en) * 2005-11-16 2006-10-25 华为技术有限公司 Method for recovering primary coufiguration performance of network terminal
CN101207519A (en) * 2007-12-13 2008-06-25 上海华为技术有限公司 Version server, operation maintenance unit and method for restoring failure
CN101403983A (en) * 2008-11-25 2009-04-08 北京航空航天大学 Resource monitoring method and system for multi-core processor based on virtual machine
CN101741619A (en) * 2009-12-24 2010-06-16 中国人民解放军信息工程大学 Self-curing J2EE application server for intrusion tolerance and self-curing method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852152A (en) * 2005-11-16 2006-10-25 华为技术有限公司 Method for recovering primary coufiguration performance of network terminal
CN101207519A (en) * 2007-12-13 2008-06-25 上海华为技术有限公司 Version server, operation maintenance unit and method for restoring failure
CN101403983A (en) * 2008-11-25 2009-04-08 北京航空航天大学 Resource monitoring method and system for multi-core processor based on virtual machine
CN101741619A (en) * 2009-12-24 2010-06-16 中国人民解放军信息工程大学 Self-curing J2EE application server for intrusion tolerance and self-curing method thereof

Also Published As

Publication number Publication date
CN102957563A (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN100465919C (en) Techniques for health monitoring and control of application servers
US8006134B2 (en) Method for analyzing fault caused in virtualized environment, and management server
US9275172B2 (en) Systems and methods for analyzing performance of virtual environments
CN101996106B (en) Method for monitoring software running state
US9116897B2 (en) Techniques for power analysis
CN104360878B (en) A kind of method and device of application software deployment
CN108932184B (en) Monitoring device and method
CN102957563B (en) Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system
US20030226059A1 (en) Systems and methods for remote tracking of reboot status
WO2023142054A1 (en) Container microservice-oriented performance monitoring and alarm method and alarm system
CN100472468C (en) Computer system, computer network and method
CN111046011B (en) Log collection method, system, device, electronic equipment and readable storage medium
CN112667362B (en) Method and system for deploying Kubernetes virtual machine cluster on Kubernetes
CN114328102B (en) Equipment state monitoring method, equipment state monitoring device, equipment and computer readable storage medium
CN102110035B (en) DMI redundancy in multiple processor computer systems
CN105659562A (en) Tolerating failures using concurrency in a cluster
CN106201527B (en) A kind of Application Container system of logic-based subregion
EP2498186A1 (en) Operation management device and operation management method
CN102638378A (en) Mass storage system monitoring method integrating heterogeneous storage devices
CN108009004B (en) Docker-based method for realizing measurement and monitoring of availability of service application
WO2020015116A1 (en) Database monitoring method and terminal device
CN117130730A (en) Metadata management method for federal Kubernetes cluster
CN114064217B (en) OpenStack-based node virtual machine migration method and device
WO2015192664A1 (en) Device monitoring method and apparatus
CN103178977A (en) Computer system and starting-up management method of same

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant