CN102957563B - Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system - Google Patents

Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Info

Publication number
CN102957563B
CN102957563B CN201210031209.5A CN201210031209A
Authority
CN
China
Prior art keywords
node
information acquisition
data information
integrated
installation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210031209.5A
Other languages
Chinese (zh)
Other versions
CN102957563A (en)
Inventor
单联瑜
丛龙水
董涛
李战强
孙世为
邢占军
孙友凯
段淼
刘玉梅
徐香明
赵军民
付巧娟
吴敏
车晓萍
刘芳
卢晋平
董倩
尚新民
侯树杰
郭见乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Original Assignee
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp and Geophysical Research Institute of Sinopec Shengli Oilfield Co
Priority to CN201210031209.5A priority Critical patent/CN102957563B/en
Publication of CN102957563A publication Critical patent/CN102957563A/en
Application granted granted Critical
Publication of CN102957563B publication Critical patent/CN102957563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The present invention provides a Linux cluster fault automatic recovery method, which includes: performing data information acquisition and judging whether a fault has occurred; when a fault is judged to have occurred, restarting the node; after the node is restarted, performing the data information acquisition again and, if a fault is still judged to exist, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, handing the node over for manual treatment. This Linux cluster fault automatic recovery method greatly reduces manual effort, can complete fault recovery of cluster node systems automatically, quickly and efficiently, satisfies the differing requirements of heterogeneous clusters, supports multiple operating system versions, and improves the utilization of cluster resources.

Description

Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system
Technical field
The present invention relates to the optimization and application of large-scale cluster resource management systems, and in particular to a Linux cluster fault automatic recovery method.
Background technology
As computing demands grow, the scale of PC clusters keeps expanding, and managing a large-scale cluster efficiently has become an urgent and difficult problem. Computer manufacturers at home and abroad have invested substantial research and development effort in cluster-related products, ranging from free software to commercial software with differing functionality; their main functions concentrate on system administration and monitoring, but the tools lack intelligence and automation, so the manageability and availability of clusters are strongly affected. Under the existing model, administrators rely on their own experience to locate and diagnose points of failure, which is often time-consuming and makes it difficult to handle problems quickly and return faulty nodes to production. For this reason we have invented a new Linux cluster fault automatic recovery method that solves the above technical problems.
Summary of the invention
The object of the present invention is to provide a Linux cluster fault automatic recovery method that can complete fault recovery of cluster node systems automatically, quickly and efficiently.
The object of the present invention can be achieved by the following technical measures: a Linux cluster fault automatic recovery method, which includes: performing data information acquisition and judging whether a fault has occurred; when a fault is judged to have occurred, restarting the node; after the node is restarted, performing the data information acquisition again and, if a fault is still judged to exist, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, carrying out manual treatment.
The object of the present invention can also be achieved by the following technical measures:
The data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition.
The dynamic data acquisition and static information acquisition obtain system information by reading the /proc file system.
The system service state acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes those states into a database.
The application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is detected according to its node name and written into the database.
The Linux cluster fault automatic recovery method sets a maximum update interval based on the update time of the data information acquisition; when the time since the last update of the acquired data exceeds this maximum interval, a fault is judged to have occurred.
The method further includes, after the step of restarting the node, marking the node with a 'restarted' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing maintenance integration of the faulty node, marking the node with a 'maintenance integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing installation integration of the faulty node, marking the node with an 'installation integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
The step of performing maintenance integration of the faulty node includes setting the node to the maintenance state on the server side and restarting the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration.
The step of performing installation integration of the faulty node includes setting the node to the installation integration state on the server side and restarting the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system.
The object of the present invention can also be achieved by the following technical measures: a Linux cluster fault automatic recovery system, characterized in that the system includes a data information acquisition and judgement module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgement module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart a node, the maintenance integration module is used to perform maintenance integration of a faulty node, and the installation integration module is used to perform installation integration of a faulty node.
The object of the present invention can also be achieved by the following technical measures:
The data information acquisition and judgement module performs data information acquisition and judges whether a fault has occurred. When it judges that a fault has occurred, the node restart module restarts the node. After the node restart module has restarted the node, if the data information acquisition and judgement module performs the acquisition again and still judges that a fault exists, the maintenance integration module performs maintenance integration of the faulty node. After the maintenance integration module has performed maintenance integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the installation integration module performs installation integration of the faulty node. After the installation integration module has performed installation integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the data information acquisition and judgement module sends a message so that manual treatment can be carried out.
The data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition.
The data information acquisition and judgement module obtains system information by reading the /proc file system, thereby performing the dynamic data acquisition and static information acquisition.
The data information acquisition and judgement module detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, to perform the system service state acquisition, and writes those states into a database.
Proceeding from the actual situation of production applications, the data information acquisition and judgement module first enumerates the requirements arising in practical use; the node name and application services of each application server are then entered manually as required and saved into the database; and the module then detects the application service state of each server according to its node name and writes that state into the database, thereby performing the application information acquisition.
The data information acquisition and judgement module sets a maximum update interval based on the update time of the data information acquisition; when the time since the last update of the acquired data exceeds this maximum interval, the module judges that a fault has occurred.
When performing maintenance integration of a faulty node, the maintenance integration module sets the node to the maintenance state on the server side and restarts the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration.
When performing installation integration of a faulty node, the installation integration module sets the node to the installation integration state on the server side and restarts the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system.
The Linux cluster fault automatic recovery method of the present invention can collect and centrally store the key information produced while the cluster system runs, establish an early-warning mechanism, handle cluster faults automatically at multiple levels, and provide detailed reference data for management decisions; it greatly reduces manual effort and returns faulty nodes to production as quickly as possible. The method can complete fault recovery of cluster node systems automatically, quickly and efficiently, satisfies the differing requirements of heterogeneous clusters, supports multiple operating system versions, speeds up the return of cluster nodes to operation, is convenient to use, and improves the utilization of cluster resources.
Brief description of the drawings
Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention;
Fig. 2 is a flow chart of the application information acquisition step in Fig. 1;
Fig. 3 is a module diagram of the Linux cluster fault automatic recovery system of the present invention.
Detailed description of the invention
To make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are set out below and described in detail with reference to the accompanying drawings.
Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention. In step 101, data information acquisition is performed; the acquisition can be divided into dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition. The information gathered by the dynamic data acquisition mainly includes total memory, used memory, free memory, shared memory, total swap space, used swap space, free swap space, disk I/O operations per second, disk read rate, bytes read from disk, disk write rate and bytes written to disk. The dynamic data acquisition obtains this information by reading and parsing files such as meminfo, stat, loadavg and snmp in the /proc file system. The collected dynamic data is stored centrally: a service runs on the node being monitored and is responsible for storing the collected information in the database.
The information gathered by the static information acquisition mainly includes the node name, the CPU identifier, model, frequency and frequency unit, the number of CPUs, the number of cores per CPU, the memory size, the disk size, the local file system names and the sizes of the file systems listed in the corresponding FSNames field. The static information acquisition obtains this information by parsing files such as cpuinfo, partitions and mounts in the /proc file system. The collected static data is likewise stored centrally; each node in the cluster runs a service that provides static information and listens for user requests. When the static information of a node needs to be gathered, a command is executed remotely and the collected data is returned over the network and stored centrally in the database. Obtaining system information by reading the /proc file system makes the dynamic and static acquisition fast and efficient and well suited to collecting information from a large number of nodes in parallel; the contents of this file system change relatively little across kernel versions, which helps program compatibility.
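By way of illustration only (not part of the claimed subject matter), a minimal Python sketch of this kind of /proc-based collection is given below; the selection of files and field names is an assumption made for the example.

```python
# Minimal sketch of /proc-based collection (illustrative only; field choices are assumptions).
def read_meminfo():
    """Parse /proc/meminfo into a {key: value-in-kB} dictionary."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key.strip()] = int(rest.split()[0])  # values are reported in kB
    return info

def read_loadavg():
    """Return the 1/5/15-minute load averages from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        one, five, fifteen = f.read().split()[:3]
    return float(one), float(five), float(fifteen)

def read_cpu_model():
    """Return the CPU model name from /proc/cpuinfo (static information)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    mem = read_meminfo()
    print("total/free memory (kB):", mem.get("MemTotal"), mem.get("MemFree"))
    print("load averages:", read_loadavg())
    print("CPU model:", read_cpu_model())
```

Because the values come straight from the kernel's /proc interface, such a collector needs no extra daemons on the node beyond the storage service described above.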
The system service state acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes those states into the database so that users can conveniently query the service states of these servers.
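As an illustrative sketch only, the service probes could be driven by the standard client commands and the results written to a small table; the probe commands, the example master server name, the table layout and the use of SQLite are all assumptions made for the example.

```python
# Illustrative sketch: probe DNS/NIS/NTP servers and record their state in a database.
import sqlite3
import subprocess
import time

CHECKS = {
    "DNS": ["host", "-W", "2", "master.example.com"],  # hypothetical DNS master name
    "NIS": ["ypwhich"],                                 # reports the currently bound NIS server
    "NTP": ["ntpq", "-p"],                              # lists the NTP peers of the local daemon
}

def probe(cmd):
    """Return 'UP' if the probe command exits successfully, otherwise 'DOWN'."""
    try:
        ok = subprocess.run(cmd, capture_output=True, timeout=5).returncode == 0
        return "UP" if ok else "DOWN"
    except (OSError, subprocess.TimeoutExpired):
        return "DOWN"

def record_service_states(db_path="cluster.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS service_state "
                "(service TEXT, state TEXT, updated REAL)")
    for name, cmd in CHECKS.items():
        con.execute("INSERT INTO service_state VALUES (?, ?, ?)",
                    (name, probe(cmd), time.time()))
    con.commit()
    con.close()
```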
The application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is detected according to its node name and written into the database, so that users can conveniently query the application service states of these servers.
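For illustration only, the manually entered node-name/application-service records could live in a registry table such as the following sketch; the table name, columns and example values are assumptions.

```python
# Illustrative sketch of the manually maintained application registry (schema is an assumption).
import sqlite3

con = sqlite3.connect("cluster.db")
con.execute("CREATE TABLE IF NOT EXISTS app_registry (node_name TEXT, app_service TEXT)")
# An operator registers which node is expected to run which application service, e.g.:
con.execute("INSERT INTO app_registry VALUES (?, ?)", ("node017", "license_server"))
con.commit()
con.close()
```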
The collected information can also be displayed through a graphical interface. The graphical interface accesses the centrally stored acquisition data in the database through a unified data interface, and its display can be customized as required, providing the system administrator with a convenient and intuitive way of monitoring.
After the dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition have been performed, the flow proceeds to step 102.
In step 102, it is judged whether a fault has occurred. In one embodiment, a maximum update-interval threshold is set based on the update time of the data information acquisition (information about nodes, services and applications); when the time since the collected information was last refreshed exceeds this maximum interval, the node, service or application is deemed to have failed and the flow proceeds to step 103; when it does not exceed the maximum interval, no fault has occurred and the flow returns to step 101.
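A minimal sketch of this staleness test follows, assuming the last update is stored as a timestamp; the threshold value is an assumption.

```python
# Illustrative sketch of the staleness check of step 102: a node, service or application
# is treated as faulty when its last recorded update is older than the maximum interval.
import time

MAX_UPDATE_INTERVAL = 300  # seconds; the threshold value is an assumption

def is_faulty(last_update_timestamp, now=None):
    now = time.time() if now is None else now
    return (now - last_update_timestamp) > MAX_UPDATE_INTERVAL
```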
In step 103, the node is restarted. That is, the node is restarted using a remote-control method and is marked with a 'restarted' flag, and the flow proceeds to step 104.
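The patent does not specify the remote-control mechanism; purely as a sketch, the reboot could be issued through the node's BMC (ipmitool) or over ssh and the flag recorded in the database. The tool choice, credentials, host names and flag table are assumptions.

```python
# Illustrative sketch of step 103: reboot a node remotely and mark it as restarted.
import sqlite3
import subprocess

def remote_reboot(node, bmc_addr=None):
    """Reboot a node, preferring its BMC (IPMI) and falling back to ssh."""
    if bmc_addr:
        cmd = ["ipmitool", "-H", bmc_addr, "-U", "admin", "-P", "admin",
               "chassis", "power", "cycle"]
    else:
        cmd = ["ssh", node, "reboot"]
    return subprocess.run(cmd, capture_output=True).returncode == 0

def set_flag(node, flag, db_path="cluster.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS node_flags (node TEXT, flag TEXT)")
    con.execute("INSERT INTO node_flags VALUES (?, ?)", (node, flag))
    con.commit()
    con.close()

# Example use: remote_reboot("node017", bmc_addr="10.0.1.17"); set_flag("node017", "restarted")
```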
In step 104, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 105.
In step 105, as in step 102, it is judged whether a fault has occurred; when no fault is judged to exist, the flow proceeds to step 106; when a fault is judged to exist, the flow proceeds to step 107.
In step 106, the 'restarted' flag of the node is cleared, and the flow returns to step 101.
In step 107, maintenance integration of the faulty node is performed. That is, for a node that has already been restarted but whose fault cannot be eliminated, the node is set to the maintenance state on the server side and then restarted. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration. After the maintenance integration is complete, the system is restarted again so that the maintenance-integrated content takes effect, the node is marked as maintenance integrated, and the flow proceeds to step 108.
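The patent only states that the node boots an image from the network; one common way to realize "setting the node to the maintenance state on the server side" is to rewrite its per-node PXE boot entry before rebooting it. The sketch below illustrates this under stated assumptions (the pxelinux.cfg layout, image paths and kickstart URL are all hypothetical); the same mechanism can point the node at a full reinstall target for the installation integration of step 111.

```python
# Illustrative sketch: switch a node's network-boot target on the server before rebooting it.
import os

PXE_DIR = "/var/lib/tftpboot/pxelinux.cfg"  # assumed TFTP/pxelinux layout

TEMPLATES = {
    "maintenance": (
        "DEFAULT maintenance\n"
        "LABEL maintenance\n"
        "  KERNEL images/maint/vmlinuz\n"
        "  APPEND initrd=images/maint/initrd.img mode=restore-config\n"
    ),
    "install": (
        "DEFAULT install\n"
        "LABEL install\n"
        "  KERNEL images/os/vmlinuz\n"
        "  APPEND initrd=images/os/initrd.img ks=nfs:server:/ks/node.cfg\n"
    ),
}

def set_boot_target(node_mac, target):
    """Write the per-node PXE config (file named 01-<mac>) so the next boot uses `target`."""
    name = "01-" + node_mac.lower().replace(":", "-")
    with open(os.path.join(PXE_DIR, name), "w") as f:
        f.write(TEMPLATES[target])

# Example use: set_boot_target("aa:bb:cc:dd:ee:01", "maintenance"), then reboot the node remotely.
```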
In step 108, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 109.
In step 109, as in step 102, it is judged whether a fault has occurred; when no fault is judged to exist, the flow proceeds to step 110; when a fault is judged to exist, the flow proceeds to step 111.
In step 110, the 'maintenance integrated' flag of the node is cleared, and the flow returns to step 101.
In step 111, installation integration of the faulty node is performed. For a node that is detected as still not running normally after maintenance integration, the node is set to the installation integration state on the server side and restarted. During startup the node reads a boot image from the network and enters networked node installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system. After the installation integration is complete, the system is restarted again so that the new system takes effect, the node is marked as installation integrated, and the flow proceeds to step 112.
In step 112, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 113.
In step 113, as in step 102, it is judged whether a fault has occurred; when no fault is judged to exist, the flow proceeds to step 114; when a fault is judged to exist, the flow proceeds to step 115.
In step 114, the 'installation integrated' flag of the node is cleared, and the flow returns to step 101.
In step 115, since the node is detected as still not running normally, a message is sent to the system administrator, who carries out manual treatment.
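Purely as an illustration, the escalation order of steps 101 to 115 (re-collect after each recovery level, and escalate only while the fault persists) can be summarized in a short sketch; the callable arguments are stand-ins for the collection, restart, maintenance integration, installation integration and notification operations described above, and flag handling is omitted.

```python
# Compact sketch of the Fig. 1 escalation logic. The recovery actions are stand-in callables;
# only the ordering (restart -> maintenance -> installation -> manual) follows the flow above.
def recover(node, collect, is_faulty, restart, maintain, reinstall, notify_admin):
    """Escalate through the recovery levels until the node reports healthy."""
    if not is_faulty(collect(node)):
        return "ok"
    for action, label in ((restart, "restarted"),
                          (maintain, "maintenance-integrated"),
                          (reinstall, "install-integrated")):
        action(node)                      # perform this recovery level
        if not is_faulty(collect(node)):  # re-collect and re-check (steps 104/105, 108/109, 112/113)
            return label                  # the corresponding flag would be cleared here
    notify_admin(node)                    # all automatic levels failed (step 115)
    return "manual"
```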
Referring to Fig. 2, Fig. 2 is a flow chart of the application information acquisition within the data information acquisition step of Fig. 1. The application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is queried according to its node name and written into the database, so that users can conveniently query the application service states of these servers. It mainly includes the following steps.
In step 201, names are defined in the database for services and processes; that is, the applications used in production are examined, the key services and processes among them are identified, and names are defined for them in the database. The flow proceeds to step 202.
In step 202, corresponding states are defined for the different conditions of each service and process: UP, DOWN and DEGRADE. UP means the state is normal and is shown in green on the interface; DOWN means the service is unavailable and is shown in red; DEGRADE means the service is available but has problems and is shown in yellow. The flow proceeds to step 203.
In step 203, the host name is obtained, and the flow proceeds to step 204.
In step 204, the records corresponding to each application service or process, and their number, are read, and the flow proceeds to step 205.
In step 205, the node name in one of the records is taken out, and the flow proceeds to step 206.
In step 206, it is judged whether the node name matches the host name; when they match, the flow proceeds to step 207; when they do not match, the flow returns to step 205.
In step 207, the state of the corresponding service or process is gathered and written into the database, and the flow proceeds to step 208.
In step 208, it is judged from the number of records read whether the loop is complete, that is, whether the node names in all records have been taken out; when the loop is complete, the flow proceeds to step 209; when it is not yet complete, the flow returns to step 205 after waiting a fixed time interval.
In step 209, the results of the status poll are sent to the remote database server for centralized storage. The flow proceeds to step 210.
In step 210, in the application server status column of the interface, the state of each application server is updated and displayed when the column is clicked. This flow then ends.
In Fig. 2, steps 205 to 208 are run in daemon mode by the application information acquisition module on each application server in the cluster, polling the status at a fixed time interval.
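By way of illustration only, such a per-node polling daemon (steps 205 to 208) might look like the sketch below; the registry and state tables, the crude process-name probe, and the polling interval are assumptions, and the fixed interval is applied once per polling pass rather than per record.

```python
# Illustrative sketch of the per-node application-status polling daemon (steps 205-208).
import socket
import sqlite3
import subprocess
import time

POLL_INTERVAL = 60  # seconds; assumed fixed time interval

def service_state(proc_name):
    """Crude probe: UP if a process with this exact name is running, otherwise DOWN."""
    found = subprocess.run(["pgrep", "-x", proc_name], capture_output=True).returncode == 0
    return "UP" if found else "DOWN"

def poll_forever(db_path="cluster.db"):
    host = socket.gethostname()
    while True:
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS app_state "
                    "(node TEXT, service TEXT, state TEXT, updated REAL)")
        rows = con.execute("SELECT node_name, app_service FROM app_registry").fetchall()
        for node_name, app_service in rows:
            if node_name != host:          # step 206: only probe services assigned to this host
                continue
            con.execute("INSERT INTO app_state VALUES (?, ?, ?, ?)",
                        (host, app_service, service_state(app_service), time.time()))
        con.commit()
        con.close()
        time.sleep(POLL_INTERVAL)
```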
In step 115 of Fig. 1, when the administrator performs manual handling, the following steps may be included: first, the installation server marks the node for installation; then, through remote control, the node is restarted or reset; the node boots over the network via PXE and determines that a network installation is required; when a network installation is needed, the software packages to be installed are read via NFS, the system installation is carried out, and the network and system configuration are performed after installation; finally, fault detection and judgement are performed again and handled accordingly.
Fig. 3 is a module diagram of the Linux cluster fault automatic recovery system of the present invention. The system includes a data information acquisition and judgement module 301, a node restart module 302, a maintenance integration module 303 and an installation integration module 304. The data information acquisition and judgement module 301 performs data information acquisition and judges whether a fault has occurred. The acquisition can be divided into dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition. In one embodiment, the data information acquisition and judgement module 301 obtains system information by reading the /proc file system, thereby performing the dynamic data acquisition and static information acquisition. The module 301 detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, to perform the system service state acquisition, and writes those states into the database. Proceeding from the actual situation of production applications, the module 301 first enumerates the requirements arising in practical use; the node name and application services of each application server are then entered manually as required and saved into the database; the module then detects the application service state of each server according to its node name and writes that state into the database, thereby performing the application information acquisition. The data information acquisition and judgement module 301 sets a maximum update-interval threshold based on the update time of the data information acquisition (information about nodes, services and applications); when the time since the collected information was last refreshed exceeds this maximum interval, the node, service or application is deemed to have failed.
When the data information acquisition and judgement module 301 judges that a fault has occurred, the node restart module 302 restarts the node by remote control and marks it as restarted. The module 301 then performs the data information acquisition again and judges whether a fault still exists; when no fault is judged to exist, the module 301 clears the node's 'restarted' flag; when a fault is judged to exist, the maintenance integration module 303 performs maintenance integration of the faulty node.
For a node that has been restarted but whose fault cannot be eliminated, the maintenance integration module 303 sets the node to the maintenance state on the server side and restarts it. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration. After the maintenance integration is complete, the system is restarted again so that the maintenance-integrated content takes effect, and the maintenance integration module 303 marks the node as maintenance integrated. The data information acquisition and judgement module 301 then performs the data information acquisition again and judges whether a fault still exists; when no fault is judged to exist, the module 301 clears the node's 'maintenance integrated' flag; when a fault is judged to exist, the installation integration module 304 performs installation integration of the faulty node.
For a node that is detected as still not running normally after maintenance integration, the installation integration module 304 sets the node to the installation integration state on the server side and restarts it. During startup the node reads a boot image from the network and enters networked node installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system. After the installation integration is complete, the system is restarted again so that the new system takes effect, and the installation integration module 304 marks the node as installation integrated. The data information acquisition and judgement module 301 then performs the data information acquisition again and judges whether a fault still exists; when no fault is judged to exist, the installation integration module 304 clears the node's 'installation integrated' flag; when a fault is judged to exist, the data information acquisition and judgement module 301 sends a message to the system administrator, who carries out manual treatment.
The above embodiments are merely exemplary embodiments of the present invention and are not intended to limit it; the scope of protection of the present invention is defined by the appended claims. Those skilled in the art may make various modifications or equivalent substitutions within the spirit and scope of the present invention, and such modifications or equivalent substitutions shall also be regarded as falling within the scope of protection of the present invention.

Claims (5)

1. A Linux cluster fault automatic recovery method, characterized in that the method includes:
performing data information acquisition and judging whether a fault has occurred;
when a fault is judged to have occurred, restarting the node;
after the node is restarted, performing the data information acquisition again and, if a fault is still judged to exist, performing maintenance integration of the faulty node;
after the maintenance integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, performing installation integration of the faulty node; and
after the installation integration of the faulty node, performing the data information acquisition again and, if a fault is still judged to exist, carrying out manual treatment;
wherein the data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition;
the dynamic data acquisition and static information acquisition obtain system information by reading the /proc file system;
the step of performing maintenance integration of the faulty node includes setting the node to the maintenance state on the server side and restarting the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration;
the step of performing installation integration of the faulty node includes setting the node to the installation integration state on the server side and restarting the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system;
the method sets a maximum update interval based on the update time of the data information acquisition, and when the time since the last update of the acquired data exceeds this maximum interval, a fault is judged to have occurred;
the method further includes, after the step of restarting the node, marking the node with a 'restarted' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist;
the method further includes, after the step of performing maintenance integration of the faulty node, marking the node with a 'maintenance integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist;
the method further includes, after the step of performing installation integration of the faulty node, marking the node with an 'installation integrated' flag, and clearing that flag when the data information acquisition is performed again and no fault is judged to exist.
2. The Linux cluster fault automatic recovery method according to claim 1, characterized in that the system service state acquisition detects the service states of the DNS, NIS and NTP master and slave servers of the whole cluster, and writes those states into a database.
3. The Linux cluster fault automatic recovery method according to claim 1, characterized in that the application information acquisition proceeds from the actual situation of production applications: first the requirements arising in practical use are enumerated, then the node name and application services of each application server are entered manually as required and saved into the database, and then the application service state of each server is detected according to its node name and written into the database.
4. A Linux cluster fault automatic recovery system, characterized in that the system includes a data information acquisition and judgement module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgement module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart a node, the maintenance integration module is used to perform maintenance integration of a faulty node, and the installation integration module is used to perform installation integration of a faulty node;
wherein the data information acquisition and judgement module performs data information acquisition and judges whether a fault has occurred; when it judges that a fault has occurred, the node restart module restarts the node; after the node restart module has restarted the node, if the data information acquisition and judgement module performs the acquisition again and still judges that a fault exists, the maintenance integration module performs maintenance integration of the faulty node; after the maintenance integration module has performed maintenance integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the installation integration module performs installation integration of the faulty node; and after the installation integration module has performed installation integration of the faulty node, if the acquisition is performed again and a fault is still judged to exist, the data information acquisition and judgement module sends a message so that manual treatment can be carried out;
the data information acquisition includes dynamic data acquisition, static information acquisition, system service state acquisition and application information acquisition;
the data information acquisition and judgement module sets a maximum update interval based on the update time of the data information acquisition, and when the time since the last update of the acquired data exceeds this maximum interval, the module judges that a fault has occurred;
when performing maintenance integration of a faulty node, the maintenance integration module sets the node to the maintenance state on the server side and restarts the node; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration;
when performing installation integration of a faulty node, the installation integration module sets the node to the installation integration state on the server side and restarts the node; during startup the node reads a boot image from the network, enters networked node installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node's operating system;
the data information acquisition and judgement module obtains system information by reading the /proc file system, thereby performing the dynamic data acquisition and static information acquisition;
the data information acquisition and judgement module detects the service states of the DNS, NIS and NTP master and slave servers of the whole cluster to perform the system service state acquisition, and writes those states into a database.
5. The Linux cluster fault automatic recovery system according to claim 4, characterized in that, proceeding from the actual situation of production applications, the data information acquisition and judgement module first enumerates the requirements arising in practical use; the node name and application services of each application server are then entered manually as required and saved into the database; and the module then detects the application service state of each server according to its node name and writes that state into the database, thereby performing the application information acquisition.
CN201210031209.5A 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system Active CN102957563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
CN201110234547 2011-08-16
CN2011102345474 2011-08-16
CN201110234547.4 2011-08-16
CN2011103312641 2011-10-27
CN201110331264 2011-10-27
CN201110331264.1 2011-10-27
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Publications (2)

Publication Number Publication Date
CN102957563A CN102957563A (en) 2013-03-06
CN102957563B (en) 2016-07-06

Family

ID=47765829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210031209.5A Active CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Country Status (1)

Country Link
CN (1) CN102957563B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103197992B (en) * 2013-04-08 2016-05-18 汉柏科技有限公司 The automation restoration methods of GlusterFS fissure
CN103475734A (en) * 2013-09-25 2013-12-25 浪潮电子信息产业股份有限公司 Linux cluster user backup migration method
CN104123192A (en) * 2014-08-04 2014-10-29 浪潮电子信息产业股份有限公司 Performance optimization method based on memory subsystem in linux system
CN107391335B (en) * 2016-03-31 2021-09-03 阿里巴巴集团控股有限公司 Method and equipment for checking health state of cluster
CN111193616A (en) * 2019-12-13 2020-05-22 广州朗国电子科技有限公司 Automatic operation and maintenance method, device and system, storage medium and automatic operation and maintenance server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852152A (en) * 2005-11-16 2006-10-25 华为技术有限公司 Method for recovering primary coufiguration performance of network terminal
CN101207519A (en) * 2007-12-13 2008-06-25 上海华为技术有限公司 Version server, operation maintenance unit and method for restoring failure
CN101403983A (en) * 2008-11-25 2009-04-08 北京航空航天大学 Resource monitoring method and system for multi-core processor based on virtual machine
CN101741619A (en) * 2009-12-24 2010-06-16 中国人民解放军信息工程大学 Self-curing J2EE application server for intrusion tolerance and self-curing method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852152A (en) * 2005-11-16 2006-10-25 华为技术有限公司 Method for recovering primary coufiguration performance of network terminal
CN101207519A (en) * 2007-12-13 2008-06-25 上海华为技术有限公司 Version server, operation maintenance unit and method for restoring failure
CN101403983A (en) * 2008-11-25 2009-04-08 北京航空航天大学 Resource monitoring method and system for multi-core processor based on virtual machine
CN101741619A (en) * 2009-12-24 2010-06-16 中国人民解放军信息工程大学 Self-curing J2EE application server for intrusion tolerance and self-curing method thereof

Also Published As

Publication number Publication date
CN102957563A (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN100465919C (en) Techniques for health monitoring and control of application servers
US8006134B2 (en) Method for analyzing fault caused in virtualized environment, and management server
US9275172B2 (en) Systems and methods for analyzing performance of virtual environments
CN101996106B (en) Method for monitoring software running state
US9116897B2 (en) Techniques for power analysis
CN104360878B (en) A kind of method and device of application software deployment
CN108932184B (en) Monitoring device and method
CN102957563B (en) Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system
US20030226059A1 (en) Systems and methods for remote tracking of reboot status
WO2023142054A1 (en) Container microservice-oriented performance monitoring and alarm method and alarm system
CN100472468C (en) Computer system, computer network and method
CN111046011B (en) Log collection method, system, device, electronic equipment and readable storage medium
CN112667362B (en) Method and system for deploying Kubernetes virtual machine cluster on Kubernetes
CN114328102B (en) Equipment state monitoring method, equipment state monitoring device, equipment and computer readable storage medium
CN102110035B (en) DMI redundancy in multiple processor computer systems
CN105659562A (en) Tolerating failures using concurrency in a cluster
CN106201527B (en) A kind of Application Container system of logic-based subregion
EP2498186A1 (en) Operation management device and operation management method
CN102638378A (en) Mass storage system monitoring method integrating heterogeneous storage devices
CN108009004B (en) Docker-based method for realizing measurement and monitoring of availability of service application
WO2020015116A1 (en) Database monitoring method and terminal device
CN117130730A (en) Metadata management method for federal Kubernetes cluster
CN114064217B (en) OpenStack-based node virtual machine migration method and device
WO2015192664A1 (en) Device monitoring method and apparatus
CN103178977A (en) Computer system and starting-up management method of same

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant