CN102957563A - Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system - Google Patents

Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system

Info

Publication number
CN102957563A
Authority
CN
China
Prior art keywords
node
data information
information acquisition
automatic recovery
data
Prior art date
Legal status
Granted
Application number
CN2012100312095A
Other languages
Chinese (zh)
Other versions
CN102957563B (en)
Inventor
单联瑜
丛龙水
董涛
李战强
孙世为
邢占军
孙友凯
段淼
刘玉梅
徐香明
赵军民
付巧娟
吴敏
车晓萍
刘芳
卢晋平
董倩
尚新民
侯树杰
郭见乐
Current Assignee
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Original Assignee
China Petroleum and Chemical Corp
Geophysical Research Institute of Sinopec Shengli Oilfield Co
Priority date
Filing date
Publication date
Application filed by China Petroleum and Chemical Corp, Geophysical Research Institute of Sinopec Shengli Oilfield Co filed Critical China Petroleum and Chemical Corp
Priority to CN201210031209.5A
Publication of CN102957563A
Application granted
Publication of CN102957563B
Status: Active

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention provides a Linux cluster fault automatic recovery method comprising the following steps: performing data information acquisition and judging whether a fault has occurred; if a fault has occurred, restarting the node; after the node has restarted, performing the data information acquisition again and, if the fault persists, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, if the fault persists, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, if the fault still persists, escalating to manual handling. The method greatly reduces manual effort, completes fault recovery of a cluster node system automatically, quickly and efficiently, meets the differing needs of heterogeneous clusters, supports multiple operating system versions, and improves the utilization of cluster resources.

Description

Linux cluster fault automatic recovery method and Linux cluster fault automatic recovery system
Technical field
The present invention relates to the optimization and application of large-scale cluster resource management systems, and in particular to a Linux cluster fault automatic recovery method.
Background technology
With the growth of computing demand, the scale of commodity computer clusters keeps expanding, and managing large clusters efficiently has become a pressing problem. Computer manufacturers at home and abroad have invested heavily in developing cluster-related products, ranging from free software to commercial software of widely varying functionality. These products concentrate on system management and monitoring but lack intelligent, automated tooling, so both the manageability and the availability of clusters suffer. Under the existing model, administrators must locate and diagnose fault points from their own experience, which is often time-consuming and makes it difficult to resolve problems quickly and return faulty nodes to service. For this reason we have invented a new Linux cluster fault automatic recovery method that solves the above technical problems.
Summary of the invention
The purpose of the present invention is to provide a Linux cluster fault automatic recovery method that can complete fault recovery of a cluster node system automatically, quickly and efficiently.
The purpose of the present invention can be achieved by the following technical measures: a Linux cluster fault automatic recovery method comprising performing data information acquisition and judging whether a fault has occurred; when a fault is judged to have occurred, restarting the node; after the node has restarted, performing the data information acquisition again and, when a fault is still judged to exist, performing maintenance integration of the faulty node; after the maintenance integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, performing installation integration of the faulty node; and after the installation integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, escalating to manual handling.
The purpose of the present invention can also be achieved by the following technical measures:
The data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
The dynamic data acquisition and the static information acquisition obtain system information by reading the system's /proc file system.
The system service status data acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes their states into a database.
The application information data acquisition proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database.
The Linux cluster fault automatic recovery method sets a maximum update interval according to the update time of the data information acquisition; when the refresh time of the data information acquisition exceeds this maximum interval, a fault is judged to have occurred.
The method further includes, after the step of restarting the node, setting a flag bit indicating that the node has restarted, and clearing this flag bit when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing maintenance integration of the faulty node, setting a flag bit indicating that the node is maintenance-integrated, and clearing this flag bit when the data information acquisition is performed again and no fault is judged to exist.
The method further includes, after the step of performing installation integration of the faulty node, setting a flag bit indicating that the node is installation-integrated, and clearing this flag bit when the data information acquisition is performed again and no fault is judged to exist.
The step of performing maintenance integration of the faulty node comprises setting the node to the maintenance state at the server end and restarting it; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state.
The step of performing installation integration of the faulty node comprises setting the node to the installation-integration state at the server end and restarting it; during startup the node reads a boot image from the network, enters network installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system.
The purpose of the present invention can also be achieved by the following technical measures: a Linux cluster fault automatic recovery system, characterized in that it comprises a data information acquisition and judgment module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgment module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart nodes, the maintenance integration module is used to perform maintenance integration of faulty nodes, and the installation integration module is used to perform installation integration of faulty nodes.
The purpose of the present invention can further be achieved by the following technical measures:
The data information acquisition and judgment module performs the data information acquisition and judges whether a fault has occurred. When it judges that a fault has occurred, the node restart module restarts the node. After the node has restarted, the data information acquisition and judgment module performs the acquisition again; if a fault is still judged to exist, the maintenance integration module performs maintenance integration of the faulty node. After the maintenance integration of the faulty node, the data information acquisition and judgment module performs the acquisition again; if a fault is still judged to exist, the installation integration module performs installation integration of the faulty node. After the installation integration of the faulty node, the data information acquisition and judgment module performs the acquisition again; if a fault is still judged to exist, it sends a message so that the fault can be handled manually.
The data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
The data information acquisition and judgment module obtains system information by reading the system's /proc file system for the dynamic data acquisition and the static information acquisition.
The data information acquisition and judgment module detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, for the system service status data acquisition, and writes their states into a database.
For the application information data acquisition, the data information acquisition and judgment module proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database.
The data information acquisition and judgment module sets a maximum update interval according to the update time of the data information acquisition; when the refresh time of the data information acquisition exceeds this maximum interval, the module judges that a fault has occurred.
When performing maintenance integration of a faulty node, the maintenance integration module sets the node to the maintenance state at the server end and restarts it; during startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state.
When performing installation integration of a faulty node, the installation integration module sets the node to the installation-integration state at the server end and restarts it; during startup the node reads a boot image from the network, enters network installation, reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system.
The Linux cluster fault automatic recovery method of the present invention can collect the key operating information of a cluster system and store it centrally, establish an early-warning mechanism, handle cluster faults automatically at multiple levels, and provide detailed reference data for management decisions. It greatly reduces manual effort and returns faulty nodes to production use as quickly as possible. The method completes fault recovery of a cluster node system automatically, quickly and efficiently, meets the differing needs of heterogeneous clusters, supports multiple operating system versions, speeds up the return of cluster nodes to operation, is convenient to use, and improves the utilization of cluster resources.
Description of drawings
Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention;
Fig. 2 is a flow chart of the application information data acquisition step in Fig. 1;
Fig. 3 is a block diagram of the Linux cluster fault automatic recovery system of the present invention.
Embodiment
In order that the above and other objects, features and advantages of the present invention may be more clearly understood, preferred embodiments are set out below and described in detail with reference to the accompanying drawings.
As shown in Fig. 1, Fig. 1 is a flow chart of the Linux cluster fault automatic recovery method of the present invention. In step 101, data information acquisition is performed. Information acquisition can be divided into dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition. The information gathered by dynamic data acquisition includes total memory, used memory, free memory, shared memory, total swap, used swap, free swap, disk IOs per second, disk read rate, disk bytes read, disk write rate, disk bytes written, and so on. It is obtained by reading and parsing files in the /proc file system such as meminfo, stat, loadavg and snmp. The collected dynamic data is stored centrally: a service running on each collected node is responsible for writing the gathered information into a database.
The information gathered by static information acquisition includes the node's identifier, CPU model, CPU frequency, CPU count, cores per CPU, memory size, disk size, local file system names, the sizes of the file systems in the corresponding FSNames field, and so on. Static information acquisition reads and parses files in the /proc file system such as cpuinfo, partitions and mounts. The collected static data is likewise stored centrally: each node in the cluster runs a service that provides static information and listens for user requests. When the static information of a node needs to be collected, a remote command is executed and the collected data is returned over the network and stored centrally in the database. Both dynamic data acquisition and static information acquisition obtain system information by reading the /proc file system. This approach is fast and efficient, well suited to gathering information from a large number of nodes in parallel, and because the content of this file system changes little across kernel versions, it favors compatible programming.
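By way of illustration, the /proc-based collection described above can be sketched as follows. This is a minimal sketch, not the patent's implementation: the fields parsed and the dictionary layout are illustrative choices, and a real collector would also read stat, snmp, partitions and mounts as listed above.

```python
import time

def read_proc_kv(path):
    """Parse colon-separated files such as /proc/meminfo into a dict."""
    info = {}
    with open(path) as f:
        for line in f:
            if ":" in line:
                key, _, value = line.partition(":")
                info[key.strip()] = value.strip()
    return info

def collect_dynamic():
    """Memory, swap and load figures, as in the dynamic data acquisition."""
    mem = read_proc_kv("/proc/meminfo")
    with open("/proc/loadavg") as f:
        load1 = float(f.read().split()[0])
    return {"ts": time.time(),
            "mem_total": mem.get("MemTotal"),   # values as reported, e.g. "16337776 kB"
            "mem_free": mem.get("MemFree"),
            "swap_total": mem.get("SwapTotal"),
            "swap_free": mem.get("SwapFree"),
            "load1": load1}

def collect_static():
    """CPU model and logical CPU count, as in the static information acquisition."""
    model, cpus = None, 0
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("model name") and model is None:
                model = line.partition(":")[2].strip()
            elif line.startswith("processor"):
                cpus += 1
    return {"cpu_model": model, "cpu_count": cpus}
```

Because /proc is an in-memory kernel interface, such reads are cheap, which is what makes parallel collection across many nodes practical.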
System service status data acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes their states into the database, so that users can easily query the service states of these servers.
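A hedged sketch of such a probe follows. The patent does not specify the check command, the daemon names or the database schema; `systemctl is-active`, the unit names and the SQLite table used here are illustrative assumptions only.

```python
import sqlite3
import subprocess
import time

SERVICES = ["named", "ypserv", "ntpd"]  # assumed daemons for DNS, NIS, NTP

def probe(service):
    """Return 'UP' if the service is reported active, else 'DOWN'."""
    r = subprocess.run(["systemctl", "is-active", service],
                       capture_output=True, text=True)
    return "UP" if r.stdout.strip() == "active" else "DOWN"

def record_service_states(db_path="cluster.db"):
    """Write the current state of each monitored service into the database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS service_state "
                "(service TEXT, state TEXT, ts REAL)")
    for svc in SERVICES:
        con.execute("INSERT INTO service_state VALUES (?, ?, ?)",
                    (svc, probe(svc), time.time()))
    con.commit()
    con.close()
```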
Application information data acquisition proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database, so that users can easily query the application service states of these servers.
The collected information can also be displayed through a graphical interface. The graphical interface accesses the centrally stored collection data in the database through a unified data interface and customizes the display mode on demand, giving the system administrator a convenient and intuitive way to monitor the cluster.
After the dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition have been performed, the flow proceeds to step 102.
In step 102, it is judged whether a fault has occurred. In one embodiment, a maximum update interval is set as a threshold according to the update times of the acquired data (node, service and application information). When the refresh time of the monitored information exceeds this maximum interval, the node, service or application is deemed faulty and the flow proceeds to step 103; when it does not, no fault has occurred and the flow returns to step 101.
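A minimal sketch of this timeout rule follows, assuming a table node_state(node, ts) holding the refresh times of each node's monitoring data; the threshold value is a deployment-specific assumption.

```python
import sqlite3
import time

MAX_INTERVAL = 120.0  # assumed maximum update interval, in seconds

def stale_entries(db_path="cluster.db"):
    """Return the nodes whose collected data has not refreshed in time."""
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT node, MAX(ts) FROM node_state GROUP BY node")
    now = time.time()
    faulty = [node for node, ts in rows if ts is None or now - ts > MAX_INTERVAL]
    con.close()
    return faulty
```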
In step 103, the node is restarted. That is, the node is restarted by remote control and flagged as restarted, and the flow proceeds to step 104.
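The patent says only that a remote control method is used; the sketch below assumes password-less root ssh as that channel (IPMI power cycling would serve equally well) and keeps the restarted flag in memory rather than in the database.

```python
import subprocess

rebooted_flags = set()  # stand-in for the per-node "restarted" flag bit

def restart_node(node):
    """Step 103: reboot a node remotely and flag it as restarted."""
    subprocess.run(["ssh", "root@" + node, "reboot"], timeout=30)
    rebooted_flags.add(node)

def clear_restart_flag(node):
    """Step 106: clear the flag once the node passes the fault check."""
    rebooted_flags.discard(node)
```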
In step 104, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 105.
In step 105, as in step 102, it is judged whether a fault has occurred. If no fault exists, the flow proceeds to step 106; if a fault exists, the flow proceeds to step 107.
In step 106, the flag bit indicating that the node has restarted is cleared, and the flow returns to step 101.
In step 107, maintenance integration of the faulty node is performed. That is, for a node that has already been restarted but whose fault cannot be eliminated, the node is set to the maintenance state at the server end and restarted. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state. After the maintenance integration completes, the system restarts to bring the maintenance content into effect, the node is marked as maintenance-integrated, and the flow proceeds to step 108.
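One plausible realization of the server-side switch is rewriting the node's PXE boot entry before the reboot, as sketched below. The tftp root, label names, image paths and kernel arguments are assumptions, since the patent states only that the node is set to the maintenance state and reads a boot image from the network.

```python
import subprocess

TFTP_CFG = "/var/lib/tftpboot/pxelinux.cfg"  # assumed pxelinux layout

def set_boot_target(mac, label, kernel, append):
    """Point a node's next PXE boot at the given image."""
    path = "%s/01-%s" % (TFTP_CFG, mac.replace(":", "-").lower())
    with open(path, "w") as f:
        f.write("DEFAULT %s\nLABEL %s\n  KERNEL %s\n  APPEND %s\n"
                % (label, label, kernel, append))

def enter_maintenance(node, mac):
    """Step 107: switch the node to the maintenance image and reboot it."""
    set_boot_target(mac, "maint", "images/maint/vmlinuz",
                    "initrd=images/maint/initrd.img mode=restore-config")
    subprocess.run(["ssh", "root@" + node, "reboot"], timeout=30)
```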
In step 108, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 109.
In step 109, as in step 102, it is judged whether a fault has occurred. If no fault exists, the flow proceeds to step 110; if a fault exists, the flow proceeds to step 111.
In step 110, the node-maintenance-integrated flag bit is cleared, and the flow returns to step 101.
In step 111, installation integration of the faulty node is performed. For a node that still cannot operate normally after maintenance integration, the node is set to the installation-integration state at the server end and restarted. During startup the node reads a boot image from the network and enters network installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system. After the installation integration completes, the system restarts to bring the new system into effect, the node is marked as installation-integrated, and the flow proceeds to step 112.
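Under the same assumptions, installation integration differs from the maintenance case only in the boot target: the node is pointed at a network installer that fetches its packages over the network. A kickstart-style install is assumed in this sketch, and the kickstart path is illustrative.

```python
import subprocess

def enter_install(node, mac):
    """Step 111: switch the node to the network installer and reboot it."""
    # Reuses set_boot_target from the maintenance sketch above.
    set_boot_target(mac, "install", "images/install/vmlinuz",
                    "initrd=images/install/initrd.img "
                    "ks=nfs:server:/ks/default.cfg")
    subprocess.run(["ssh", "root@" + node, "reboot"], timeout=30)
```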
In step 112, the same data information acquisition as in step 101 is performed, and the flow proceeds to step 113.
In step 113, as in step 102, it is judged whether a fault has occurred. If no fault exists, the flow proceeds to step 114; if a fault exists, the flow proceeds to step 115.
In step 114, the node-installation-integrated flag bit is cleared, and the flow returns to step 101.
In step 115, because the node is detected to be still unable to operate normally, a message is sent to the system administrator, who handles the fault manually.
Referring to Fig. 2, Fig. 2 is a flow chart of the application information data acquisition within the data information acquisition step of Fig. 1. Application information data acquisition proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then queried according to its node name and written into the database, so that users can easily query the application service states of these servers. It mainly comprises the following steps.
In step 201, services and processes are named in the database. That is, the key services and processes are determined by studying the various applications in production, and names are defined for them in the database. The flow proceeds to step 202.
In step 202, a state is defined for each of the different situations of a service or process: UP, DOWN or DEGRADE. UP means the state is normal and is shown in green on the interface; DOWN means the service is unavailable and is shown in red; DEGRADE means the service is usable but has problems and is shown in yellow. The flow proceeds to step 203.
In step 203, the host name is obtained, and the flow proceeds to step 204.
In step 204, the records corresponding to each applied service or process, and their number, are read, and the flow proceeds to step 205.
In step 205, the node name in one of the records is taken out, and the flow proceeds to step 206.
In step 206, it is judged whether the node name matches the host name. If it matches, the flow proceeds to step 207; if it does not, the flow returns to step 205.
In step 207, the state of the corresponding service or process is collected and written into the database, and the flow proceeds to step 208.
In step 208, it is judged from the number of records read whether the loop is complete, i.e. whether the node names in all records have been taken out. If the loop is complete, the flow proceeds to step 209; if not, the flow returns to step 205 after waiting a fixed time interval.
In step 209, the results of the status poll are sent to the remote database server for centralized storage, and the flow proceeds to step 210.
In step 210, the state of each application server is refreshed and displayed in the application server status bar of the interface when that bar is clicked. The flow then ends.
In Fig. 2, steps 205 to 208 are executed by the application information acquisition module, which runs as a daemon on each application server in the cluster and performs the status poll at a fixed time interval.
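The polling loop of steps 203-208 can be condensed into a daemon like the one below. The table layout, the pgrep-based check and the poll interval are illustrative assumptions; per step 209, a real implementation would ship its results to the remote central database rather than to the local file used here.

```python
import socket
import sqlite3
import subprocess
import time

POLL_INTERVAL = 60  # assumed fixed interval between polling passes

def process_state(proc_name):
    """Step 207's check: UP if a matching process exists, else DOWN."""
    r = subprocess.run(["pgrep", "-x", proc_name], capture_output=True)
    return "UP" if r.returncode == 0 else "DOWN"

def poll_once(db_path="cluster.db"):
    """One pass over the manually entered records (steps 203-208)."""
    host = socket.gethostname()  # step 203: get the host name
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS app_state "
                "(node TEXT, proc TEXT, state TEXT, ts REAL)")
    # Step 204: read the records entered in steps 201-202; the app_services
    # table (node, proc) is assumed to have been populated manually.
    rows = con.execute("SELECT node, proc FROM app_services").fetchall()
    for node, proc in rows:  # steps 205-206: match node name to host name
        if node != host:
            continue
        con.execute("INSERT INTO app_state VALUES (?, ?, ?, ?)",
                    (node, proc, process_state(proc), time.time()))  # step 207
    con.commit()
    con.close()

if __name__ == "__main__":
    while True:  # step 208: repeat over the records at a fixed interval
        poll_once()
        time.sleep(POLL_INTERVAL)
```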
In step 115 of Fig. 1, when the administrator takes manual control, the following steps may be performed: first, the node is marked for installation at the server end; then, by remote control, the node is restarted or reset; the node boots over the network via PXE and determines that a network installation is required; when a network installation is needed, the software packages to be installed are read via NFS, the system and all network components are installed, and the system is configured after installation; finally, fault detection and judgment are performed again and handled accordingly.
Fig. 3 is a block diagram of the Linux cluster fault automatic recovery system of the present invention. The system comprises a data information acquisition and judgment module 301, a node restart module 302, a maintenance integration module 303 and an installation integration module 304. The data information acquisition and judgment module 301 performs data information acquisition and judges whether a fault has occurred. Information acquisition can be divided into dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition. In one embodiment, the module 301 obtains system information by reading the system's /proc file system for the dynamic data acquisition and the static information acquisition. It detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, for the system service status data acquisition, and writes their states into a database. For the application information data acquisition, the module 301 proceeds from the actual conditions of production applications: the various needs arising in practical applications are first enumerated; the node name and application service of each application server are then entered manually as needed and saved into the database; the application service state of each server is then detected according to its node name and written into the database. The module 301 sets a maximum update interval as a threshold according to the update times of the acquired data (node, service and application information); when the refresh time of the monitored information exceeds this maximum interval, the node, service or application is deemed faulty.
When the data information acquisition and judgment module 301 judges that a fault has occurred, the node restart module 302 restarts the node by remote control and flags it as restarted. The module 301 then performs the data information acquisition again and judges whether a fault exists: if no fault exists, it clears the restarted flag bit; if a fault exists, the maintenance integration module 303 performs maintenance integration of the faulty node.
For a node that has been restarted but whose fault cannot be eliminated, the maintenance integration module 303 sets the node to the maintenance state at the server end and restarts it. During startup the node reads a boot image from the network, enters its maintenance state, and restores its system configuration to the initial configuration state. After the maintenance integration completes, the system restarts to bring the maintenance content into effect, and the maintenance integration module 303 marks the node as maintenance-integrated. The module 301 then performs the data information acquisition again and judges whether a fault exists: if no fault exists, it clears the node-maintenance-integrated flag bit; if a fault exists, the installation integration module 304 performs installation integration of the faulty node.
For a node that still cannot operate normally after maintenance integration, the installation integration module 304 sets the node to the installation-integration state at the server end and restarts it. During startup the node reads a boot image from the network and enters network installation: it reads installation packages from the network, performs the system installation and configuration, and reinstalls the node system. After the installation integration completes, the system restarts to bring the new system into effect, and the installation integration module 304 marks the node as installation-integrated. The module 301 then performs the data information acquisition again and judges whether a fault exists: if no fault exists, the installation integration module 304 clears the node-installation-integrated flag bit; if a fault exists, the data information acquisition and judgment module 301 sends a message to the system administrator, who handles the fault manually.
The above embodiments are only exemplary embodiments of the present invention and are not intended to limit it; the protection scope of the present invention is defined by the appended claims. Those skilled in the art may make various modifications or equivalent replacements to the present invention within its essence and protection scope, and such modifications or equivalent replacements shall also be deemed to fall within the protection scope of the present invention.

Claims (10)

1. A Linux cluster fault automatic recovery method, characterized in that the method comprises:
performing data information acquisition and judging whether a fault has occurred;
when a fault is judged to have occurred, restarting the node;
after the node has restarted, performing the data information acquisition again and, when a fault is still judged to exist, performing maintenance integration of the faulty node;
after the maintenance integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, performing installation integration of the faulty node; and
after the installation integration of the faulty node, performing the data information acquisition again and, when a fault is still judged to exist, performing manual handling.
2. The Linux cluster fault automatic recovery method according to claim 1, characterized in that the data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
3. The Linux cluster fault automatic recovery method according to claim 2, characterized in that the dynamic data acquisition and the static information acquisition obtain system information by reading the system's /proc file system.
4. The Linux cluster fault automatic recovery method according to claim 2, characterized in that the system service status data acquisition detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, and writes their states into a database.
5. A Linux cluster fault automatic recovery system, characterized in that it comprises a data information acquisition and judgment module, a node restart module, a maintenance integration module and an installation integration module; the data information acquisition and judgment module is used to perform data information acquisition and judge whether a fault has occurred, the node restart module is used to restart nodes, the maintenance integration module is used to perform maintenance integration of faulty nodes, and the installation integration module is used to perform installation integration of faulty nodes.
6. The Linux cluster fault automatic recovery system according to claim 5, characterized in that the data information acquisition and judgment module performs the data information acquisition and judges whether a fault has occurred; when it judges that a fault has occurred, the node restart module restarts the node; after the node has restarted, the data information acquisition and judgment module performs the acquisition again and, when a fault is still judged to exist, the maintenance integration module performs maintenance integration of the faulty node; after the maintenance integration of the faulty node, the data information acquisition and judgment module performs the acquisition again and, when a fault is still judged to exist, the installation integration module performs installation integration of the faulty node; and after the installation integration of the faulty node, the data information acquisition and judgment module performs the acquisition again and, when a fault is still judged to exist, sends a message so that the fault can be handled manually.
7. The Linux cluster fault automatic recovery system according to claim 5, characterized in that the data information acquisition comprises dynamic data acquisition, static information acquisition, system service status data acquisition and application information data acquisition.
8. The Linux cluster fault automatic recovery system according to claim 7, characterized in that the data information acquisition and judgment module obtains system information by reading the system's /proc file system for the dynamic data acquisition and the static information acquisition.
9. The Linux cluster fault automatic recovery system according to claim 7, characterized in that the data information acquisition and judgment module detects the service states of the master and slave servers of the whole cluster, such as DNS, NIS and NTP, for the system service status data acquisition, and writes their states into a database.
10. The Linux cluster fault automatic recovery system according to claim 5, characterized in that the data information acquisition and judgment module sets a maximum update interval according to the update time of the data information acquisition, and judges that a fault has occurred when the refresh time of the data information acquisition exceeds this maximum interval.
CN201210031209.5A 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system Active CN102957563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
CN2011102345474 2011-08-16
CN201110234547.4 2011-08-16
CN201110234547 2011-08-16
CN201110331264 2011-10-27
CN2011103312641 2011-10-27
CN201110331264.1 2011-10-27
CN201210031209.5A CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Publications (2)

Publication Number Publication Date
CN102957563A 2013-03-06
CN102957563B CN102957563B (en) 2016-07-06

Family

ID=47765829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210031209.5A Active CN102957563B (en) 2011-08-16 2012-02-13 Linux clustering fault automatic recovery method and Linux clustering fault automatic recovery system

Country Status (1)

Country Link
CN (1) CN102957563B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1852152A (en) * 2005-11-16 2006-10-25 华为技术有限公司 Method for recovering primary coufiguration performance of network terminal
CN101207519A (en) * 2007-12-13 2008-06-25 上海华为技术有限公司 Version server, operation maintenance unit and method for restoring failure
CN101403983A (en) * 2008-11-25 2009-04-08 北京航空航天大学 Resource monitoring method and system for multi-core processor based on virtual machine
CN101741619A (en) * 2009-12-24 2010-06-16 中国人民解放军信息工程大学 Self-curing J2EE application server for intrusion tolerance and self-curing method thereof

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103197992A (en) * 2013-04-08 2013-07-10 汉柏科技有限公司 Automatic recovering method of Gluster FS (File System) split-brain
CN103197992B (en) * 2013-04-08 2016-05-18 汉柏科技有限公司 The automation restoration methods of GlusterFS fissure
CN103475734A (en) * 2013-09-25 2013-12-25 浪潮电子信息产业股份有限公司 Linux cluster user backup migration method
CN104123192A (en) * 2014-08-04 2014-10-29 浪潮电子信息产业股份有限公司 Performance optimization method based on memory subsystem in linux system
CN107391335A (en) * 2016-03-31 2017-11-24 阿里巴巴集团控股有限公司 A kind of method and apparatus for checking cluster health status
CN107391335B (en) * 2016-03-31 2021-09-03 阿里巴巴集团控股有限公司 Method and equipment for checking health state of cluster
CN111193616A (en) * 2019-12-13 2020-05-22 广州朗国电子科技有限公司 Automatic operation and maintenance method, device and system, storage medium and automatic operation and maintenance server

Also Published As

Publication number Publication date
CN102957563B (en) 2016-07-06


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant