CN103368771A - Collecting method and device for fault site information of multi-node server system - Google Patents

Collecting method and device for fault site information of multi-node server system

Info

Publication number
CN103368771A
Authority
CN
China
Prior art keywords
fault
information
type
subregion
situ
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013102528953A
Other languages
Chinese (zh)
Inventor
雷舒莹
吴登奔
廖义祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN2013102528953A
Publication of CN103368771A
Priority to PCT/CN2014/072262 (WO2014206099A1)
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/069: Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The embodiment of the invention relates to the technical field of computers, and discloses a collecting method and device for fault site information of a multi-node server system. The collecting method for the fault site information of the multi-node server system comprises the steps of receiving the fault information reported by partition nodes; inquiring the fault type matched with the fault information according to the fault information; deciding the type of the fault site information needed to be collected according to the fault type; and collecting corresponding fault site information according to the type of the fault site information needed to be collected. The embodiment of the invention provides an effective collecting mechanism for the fault site information, and can effectively collect fault site information.

Description

A method and device for collecting fault site information of a multi-node server system
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for collecting fault site information of a multi-node server system.
Background
In a multi-node server system, when a partition node fails, collecting its fault site information provides very important support for fault analysis. Because a multi-node server system is relatively complex and its partition nodes are strongly interrelated, a partition node may fail for many reasons: the failure may be caused by improper user operation, by an abnormal surrounding environment, or by other partition nodes. To help maintenance personnel accurately locate the root cause of a fault and handle faults more efficiently, it is necessary, when a fault occurs, to collect not only the fault information of the node but also the fault site information (such as the user operation log, SEL log, system environment temperature, and fan speed), and to provide this information to a fault analysis module or directly to maintenance personnel.
However, practice shows that an effective mechanism for collecting fault site information is lacking. How to effectively collect fault site information when a partition node fails is therefore a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
The embodiments of the present invention disclose a method and device for collecting fault site information of a multi-node server system, which provide an effective mechanism for collecting fault site information.
A first aspect of the embodiments of the present invention discloses a method for collecting fault site information of a multi-node server system, comprising:
receiving fault information reported by a partition node;
obtaining, according to the fault information, a fault type that matches the fault information;
extracting, according to the fault type, the types of fault site information that need to be collected;
collecting the corresponding fault site information according to the types of fault site information that need to be collected.
In a first possible implementation of the first aspect, obtaining, according to the fault information, the fault type that matches the fault information comprises:
obtaining, according to the fault information, the fault type that matches the fault information from the matching relationship between fault information and fault types stored in a fault type module.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, extracting, according to the fault type, the types of fault site information that need to be collected comprises:
extracting, according to the fault type, the types of fault site information that need to be collected from the public information type library stored in an information type module and from the private information type library that matches the fault type.
With reference to the first aspect, or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, if the partition node is a partition slave node and the types of fault site information that need to be collected comprise the user operation log, SEL log, system environment temperature, fan speed, and power, then collecting the corresponding fault site information according to the types of fault site information that need to be collected comprises:
sending an information request to a log module, to trigger the log module to collect the user operation log and SEL log of the partition slave node;
and sending an information request to an environment monitoring module, to trigger the environment monitoring module to collect the system environment temperature, fan speed, and power of the partition slave node.
With reference to the first aspect, or the first or second possible implementation of the first aspect, in a fourth possible implementation of the first aspect, if the partition node is a partition master node and the types of fault site information that need to be collected comprise the user operation log, SEL log, system environment temperature, fan speed, and power, then collecting the corresponding fault site information according to the types of fault site information that need to be collected comprises:
sending an information request to a log module, to trigger the log module to collect the user operation logs and SEL logs of all partition nodes in the partition to which the partition master node belongs;
and sending an information request to an environment monitoring module, to trigger the environment monitoring module to collect the system environment temperature, fan speed, and power of all partition nodes in the partition to which the partition master node belongs.
A second aspect of the embodiments of the present invention discloses a device for collecting fault site information of a multi-node server system, comprising a fault management module, wherein the fault management module comprises:
a fault processing module, configured to receive fault information reported by a partition node and to obtain, according to the fault information, a fault type that matches the fault information;
an information module, configured to extract, according to the fault type obtained by the fault processing module, the types of fault site information that need to be collected, and to collect the corresponding fault site information according to those types.
In a first possible implementation of the second aspect, the collecting device further comprises:
a fault type module, configured to store the matching relationship between fault information and fault types;
wherein the fault processing module obtaining, according to the fault information, the fault type that matches the fault information comprises:
the fault processing module being configured to obtain, according to the fault information, the fault type that matches the fault information from the matching relationship between fault information and fault types stored in the fault type module.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the collecting device further comprises:
an information type module, configured to store a public information type library and the private information type libraries that match the fault types;
wherein the information module extracting, according to the fault type obtained by the fault processing module, the types of fault site information that need to be collected comprises:
the information module extracting, according to the fault type obtained by the fault processing module, the types of fault site information that need to be collected from the public information type library stored in the information type module and from the private information type library that matches the fault type.
With reference to the second aspect, or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, if the partition node is a partition slave node and the types of fault site information that need to be collected comprise the user operation log, SEL log, system environment temperature, fan speed, and power, then the collecting device further comprises a log module and an environment monitoring module:
the information module collecting the corresponding fault site information according to the types of fault site information that need to be collected comprises:
the information module being configured to send an information request to the log module, to trigger the log module to collect the user operation log and SEL log of the partition slave node;
and the information module being configured to send an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system environment temperature, fan speed, and power of the partition slave node.
With reference to the second aspect, or the first or second possible implementation of the second aspect, in a fourth possible implementation of the second aspect, if the partition node is a partition master node and the types of fault site information that need to be collected comprise the user operation log, SEL log, system environment temperature, fan speed, and power, then the collecting device further comprises a log module and an environment monitoring module:
the information module collecting the corresponding fault site information according to the types of fault site information that need to be collected comprises:
the information module being configured to send an information request to the log module, to trigger the log module to collect the user operation logs and SEL logs of all partition nodes in the partition to which the partition master node belongs;
and the information module being configured to send an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system environment temperature, fan speed, and power of all partition nodes in the partition to which the partition master node belongs.
In the embodiments of the present invention, after the fault information reported by a partition node is received, the fault type that matches the fault information can be obtained according to the fault information, the types of fault site information that need to be collected can be extracted according to the fault type, and the corresponding fault site information can then be collected according to those types. The embodiments of the present invention thereby provide an effective mechanism for collecting fault site information.
Description of drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention;
Fig. 2 is a flow chart of another method for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention;
Fig. 3 is a schematic diagram of the public information type library and the private information type libraries matching the fault types, as stored in an information type module disclosed in an embodiment of the present invention;
Fig. 4 is a flow chart of another method for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention;
Fig. 5 is a structural diagram of a device for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention;
Fig. 6 is a structural diagram of another device for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The embodiments of the present invention disclose a method and device for collecting fault site information of a multi-node server system, which provide an effective mechanism for collecting fault site information. Each is described in detail below.
Referring to Fig. 1, Fig. 1 is a flow chart of a method for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention. As shown in Fig. 1, the method may comprise the following steps.
101. Receive the fault information reported by a partition node.
In one embodiment, the fault information reported by the baseboard management controller (BMC) of the partition node can be received, where the fault information may be a fault number or a simple character string.
102. According to the fault information, obtain the fault type that matches the fault information.
In one embodiment, obtaining the fault type that matches the fault information according to the fault information may comprise:
obtaining, according to the fault information, the fault type that matches the fault information from the matching relationship between fault information and fault types stored in a fault type module.
Further, in the embodiments of the present invention, the detailed fault description information for the fault information can also be obtained from the fault type module.
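As an illustration of step 102, the matching relationship held by the fault type module can be pictured as a lookup table keyed by the reported fault number or string. The following is a minimal sketch only; the table contents, the names `FaultTypeEntry`, `FAULT_TYPE_TABLE` and `match_fault_type`, and the example fault codes are all hypothetical and do not come from the patent.

```python
from typing import Dict, NamedTuple, Optional


class FaultTypeEntry(NamedTuple):
    """One row of the (hypothetical) matching relationship."""
    fault_type: str
    description: str  # detailed fault description information


# Hypothetical table: fault information (a number or short string) -> fault type.
FAULT_TYPE_TABLE: Dict[str, FaultTypeEntry] = {
    "0x0101": FaultTypeEntry("fault_type_1", "hypothetical: CPU over-temperature"),
    "FAN_STALL": FaultTypeEntry("fault_type_2", "hypothetical: fan stopped rotating"),
}


def match_fault_type(fault_info: str) -> Optional[FaultTypeEntry]:
    """Return the entry matching the reported fault information, or None if unknown."""
    return FAULT_TYPE_TABLE.get(fault_info)
```

Returning `None` for unknown fault information is a design choice of this sketch; the patent does not say how unmatched fault information is handled.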
103. According to the fault type, extract the types of fault site information that need to be collected.
In the embodiments of the present invention, different fault types require different types of fault site information to be collected; the types of fault site information that need to be collected are therefore extracted according to the fault type.
In one embodiment, extracting the types of fault site information that need to be collected according to the fault type may comprise:
extracting, according to the fault type, the types of fault site information that need to be collected from the public information type library stored in an information type module and from the private information type library that matches the fault type.
104. According to the types of fault site information that need to be collected, collect the corresponding fault site information.
In the embodiments of the present invention, the types of fault site information that need to be collected may include the user operation log, SEL log, system environment temperature, fan speed, power, and so on.
In the method shown in Fig. 1, after the fault information reported by a partition node is received, the fault type that matches the fault information can be obtained according to the fault information, the types of fault site information that need to be collected can be extracted according to the fault type, and the corresponding fault site information can then be collected according to those types. The method shown in Fig. 1 thereby provides an effective mechanism for collecting fault site information.
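Steps 101 to 104 can be summarised as a small pipeline. The sketch below is an assumption-laden illustration, not the patented implementation: the function name `collect_fault_site_info`, the dictionary-based tables and the per-type collector callables are all hypothetical.

```python
from typing import Callable, Dict, List


def collect_fault_site_info(
    fault_info: str,
    fault_type_table: Dict[str, str],
    info_types_by_fault_type: Dict[str, List[str]],
    collectors: Dict[str, Callable[[], object]],
) -> Dict[str, object]:
    """Run steps 102-104 for fault information already received in step 101."""
    # Step 102: obtain the fault type that matches the fault information.
    fault_type = fault_type_table[fault_info]
    # Step 103: extract the types of fault site information to collect.
    info_types = info_types_by_fault_type[fault_type]
    # Step 104: collect the corresponding fault site information, one type at a time.
    return {info_type: collectors[info_type]() for info_type in info_types}
```

Only the collectors for the extracted types are invoked, which mirrors the point that different fault types require different fault site information.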
Referring to Fig. 2, Fig. 2 is a flow chart of another method for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention. As shown in Fig. 2, the method may comprise the following steps.
201. Receive the fault information reported by a partition node.
In one embodiment, the fault information reported by the BMC of the partition node can be received, where the fault information may be a fault number or a simple character string.
202. According to the fault information, obtain the fault type that matches the fault information.
In one embodiment, obtaining the fault type that matches the fault information according to the fault information may comprise:
obtaining, according to the fault information, the fault type that matches the fault information from the matching relationship between fault information and fault types stored in a fault type module.
Further, in the embodiments of the present invention, the detailed fault description information for the fault information can also be obtained from the fault type module.
203. According to the fault type, extract the types of fault site information that need to be collected, where the types of fault site information that need to be collected comprise the user operation log, SEL log, system environment temperature, fan speed, and power.
In the embodiments of the present invention, different fault types require different types of fault site information to be collected; the types of fault site information that need to be collected are therefore extracted according to the fault type.
In one embodiment, extracting the types of fault site information that need to be collected according to the fault type may comprise:
extracting, according to the fault type, the types of fault site information that need to be collected from the public information type library stored in an information type module and from the private information type library that matches the fault type.
In the embodiments of the present invention, as shown in Fig. 3, the information type module can store a public information type library and the private information type libraries that match the fault types. For example, the information type module can store a public information type library (containing fault site information types 1 to 3) and the private information type libraries that match fault types 1 to 3 respectively: private information type library 1 (containing fault site information types 4, 5 and 7), private information type library 2 (containing fault site information types 4 and 6) and private information type library 3 (containing fault site information types 5, 7 and 11). Fault site information types 1 to 3 in the public information type library are the types that must be collected for every fault type. For instance, when the fault type is fault type 1, fault site information types 1 to 3 are extracted from the public information type library stored in the information type module, and fault site information types 4, 5 and 7 are extracted from private information type library 1, which matches fault type 1.
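The Fig. 3 example can be written out directly. In the sketch below the numeric contents of the libraries are taken from the example in the text; the names `PUBLIC_LIBRARY`, `PRIVATE_LIBRARIES` and `extract_info_types`, and the choice to return an ordered list, are assumptions made for illustration.

```python
from typing import Dict, List

# Public information type library: collected for every fault type.
PUBLIC_LIBRARY: List[int] = [1, 2, 3]

# Private information type libraries, keyed by the fault type they match.
PRIVATE_LIBRARIES: Dict[int, List[int]] = {
    1: [4, 5, 7],
    2: [4, 6],
    3: [5, 7, 11],
}


def extract_info_types(fault_type: int) -> List[int]:
    """Return the public types followed by the private types for this fault type."""
    # Assumption: a fault type without a private library gets only the public types.
    return PUBLIC_LIBRARY + PRIVATE_LIBRARIES.get(fault_type, [])
```

With this layout, `extract_info_types(1)` yields types 1 to 3 plus 4, 5 and 7, matching the worked example for fault type 1.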
204. If the partition node is a partition slave node, send an information request to the log module, to trigger the log module to collect the user operation log and SEL log of the partition slave node; and send an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system environment temperature, fan speed, and power of the partition slave node.
205. Store the collected user operation log, SEL log, system environment temperature, fan speed, and power of the partition slave node.
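Steps 204 and 205 for a partition slave node can be sketched as two requests followed by a store. The classes `LogModule` and `EnvironmentMonitoringModule` below are stand-ins that return canned, hypothetical values; the field names, units and the dictionary used as storage are likewise assumptions, since the patent does not specify data formats.

```python
from typing import Dict


class LogModule:
    """Stand-in log module; returns canned data for illustration only."""

    def collect(self, node_id: str) -> Dict[str, object]:
        return {
            "user_operation_log": [f"{node_id}: hypothetical user operation"],
            "sel_log": [f"{node_id}: hypothetical SEL record"],
        }


class EnvironmentMonitoringModule:
    """Stand-in environment monitoring module with hypothetical readings."""

    def collect(self, node_id: str) -> Dict[str, object]:
        return {"system_temperature_c": 45.0, "fan_speed_rpm": 6000, "power_w": 350.0}


def collect_for_slave_node(
    node_id: str,
    log_module: LogModule,
    env_module: EnvironmentMonitoringModule,
    store: Dict[str, Dict[str, object]],
) -> Dict[str, object]:
    """Step 204: request both modules for one slave node; step 205: store the result."""
    record: Dict[str, object] = {}
    record.update(log_module.collect(node_id))   # user operation log and SEL log
    record.update(env_module.collect(node_id))   # temperature, fan speed, power
    store[node_id] = record                      # step 205
    return record
```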
The method shown in Fig. 2 thereby provides an effective mechanism for collecting fault site information.
Referring to Fig. 4, Fig. 4 is a flow chart of another method for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention. As shown in Fig. 4, the method may comprise the following steps.
401. Receive the fault information reported by a partition node.
In one embodiment, the fault information reported by the BMC of the partition node can be received, where the fault information may be a fault number or a simple character string.
402. According to the fault information, obtain the fault type that matches the fault information.
In one embodiment, obtaining the fault type that matches the fault information according to the fault information may comprise:
obtaining, according to the fault information, the fault type that matches the fault information from the matching relationship between fault information and fault types stored in a fault type module.
Further, in the embodiments of the present invention, the detailed fault description information for the fault information can also be obtained from the fault type module.
403. According to the fault type, extract the types of fault site information that need to be collected, where the types of fault site information that need to be collected comprise the user operation log, SEL log, system environment temperature, fan speed, and power.
In the embodiments of the present invention, different fault types require different types of fault site information to be collected; the types of fault site information that need to be collected are therefore extracted according to the fault type.
In one embodiment, extracting the types of fault site information that need to be collected according to the fault type may comprise:
extracting, according to the fault type, the types of fault site information that need to be collected from the public information type library stored in an information type module and from the private information type library that matches the fault type.
In the embodiments of the present invention, as shown in Fig. 3, the information type module can store a public information type library and the private information type libraries that match the fault types. For example, the information type module can store a public information type library (containing fault site information types 1 to 3) and the private information type libraries that match fault types 1 to 3 respectively: private information type library 1 (containing fault site information types 4, 5 and 7), private information type library 2 (containing fault site information types 4 and 6) and private information type library 3 (containing fault site information types 5, 7 and 11). Fault site information types 1 to 3 in the public information type library are the types that must be collected for every fault type. For instance, when the fault type is fault type 2, fault site information types 1 to 3 are extracted from the public information type library stored in the information type module, and fault site information types 4 and 6 are extracted from private information type library 2, which matches fault type 2.
404. If the partition node is a partition master node, send an information request to the log module, to trigger the log module to collect the user operation logs and SEL logs of all partition nodes in the partition to which the partition master node belongs; and send an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system environment temperature, fan speed, and power of all partition nodes in the partition to which the partition master node belongs.
405. Store the collected user operation logs, SEL logs, system environment temperature, fan speed, and power of all partition nodes in the partition to which the partition master node belongs.
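The master-node case in steps 404 and 405 differs from the slave-node case only in that collection covers every node of the partition. A minimal sketch follows, assuming the partition membership is available as a mapping from a master node identifier to the node identifiers of its partition; that mapping, the function name `collect_for_master_node` and the callable-based collectors are hypothetical.

```python
from typing import Callable, Dict, List

NodeRecord = Dict[str, object]


def collect_for_master_node(
    master_id: str,
    partition_nodes: Dict[str, List[str]],
    collect_logs: Callable[[str], NodeRecord],
    collect_env: Callable[[str], NodeRecord],
) -> Dict[str, NodeRecord]:
    """Collect logs and environment data for every node in the master's partition."""
    if master_id not in partition_nodes:
        raise ValueError(f"{master_id} is not a known partition master node")
    collected: Dict[str, NodeRecord] = {}
    for node_id in partition_nodes[master_id]:
        record = dict(collect_logs(node_id))   # user operation log and SEL log
        record.update(collect_env(node_id))    # temperature, fan speed, power
        collected[node_id] = record            # kept per node, ready for storage (step 405)
    return collected
```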
The method shown in Fig. 4 thereby provides an effective mechanism for collecting fault site information.
Referring to Fig. 5, Fig. 5 is a structural diagram of a device for collecting fault site information of a multi-node server system disclosed in an embodiment of the present invention. In the embodiments of the present invention, the multi-node server system may comprise m partitions, each consisting of n partition nodes, where partition node 1 is the partition master node and the other partition nodes are partition slave nodes. When a partition node fails, it can report a fault signal to the collecting device shown in Fig. 5, and the collecting device then performs the collection of fault site information disclosed in the embodiments of the present invention. As shown in Fig. 5, the collecting device comprises a fault management module 500, and the fault management module 500 comprises:
a fault processing module 501, configured to receive the fault information reported by a partition node and to obtain, according to the fault information, the fault type that matches the fault information;
an information module 502, configured to extract, according to the fault type obtained by the fault processing module 501, the types of fault site information that need to be collected, and to collect the corresponding fault site information according to those types.
In the embodiments of the present invention, the collecting device shown in Fig. 5 further comprises:
a fault type module 503, configured to store the matching relationship between fault information and fault types;
wherein the fault processing module 501 obtaining, according to the fault information, the fault type that matches the fault information comprises:
the fault processing module 501 being configured to obtain, according to the fault information, the fault type that matches the fault information from the matching relationship between fault information and fault types stored in the fault type module 503.
In the embodiments of the present invention, the collecting device shown in Fig. 5 further comprises:
an information type module 504, configured to store a public information type library and the private information type libraries that match the fault types;
wherein the information module 502 extracting, according to the fault type obtained by the fault processing module 501, the types of fault site information that need to be collected comprises:
the information module 502 extracting, according to the fault type obtained by the fault processing module 501, the types of fault site information that need to be collected from the public information type library stored in the information type module 504 and from the private information type library that matches the fault type.
In the embodiments of the present invention, in the collecting device shown in Fig. 5, if the partition node is a partition slave node and the types of fault site information that need to be collected comprise the user operation log, SEL log, system environment temperature, fan speed, and power, the collecting device further comprises a log module 505 and an environment monitoring module 506:
wherein the information module 502 collecting the corresponding fault site information according to the types of fault site information that need to be collected comprises:
the information module 502 being configured to send an information request to the log module 505, to trigger the log module 505 to collect the user operation log and SEL log of the partition slave node;
and the information module 502 being configured to send an information request to the environment monitoring module 506, to trigger the environment monitoring module 506 to collect the system environment temperature, fan speed, and power of the partition slave node.
In the embodiment of the invention, in the gathering device for fault in-situ information of the multi-node server system shown in Figure 5, if the subregion node is a subregion host node and the fault in-situ information types that need to be collected comprise the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, the information module 502 collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected comprises:
The information module 502 is configured to send an information request to the log module 505, to trigger the log module 505 to collect the user operation logs and SEL logs of all subregion nodes of the subregion to which the subregion host node belongs;
And the information module 502 is configured to send an information request to the environment monitoring module 506, to trigger the environment monitoring module 506 to collect the system ambient temperature, fan speed and power of all subregion nodes of the subregion to which the subregion host node belongs.
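The routing of information requests to the log module and the environment monitoring module can be illustrated as follows. This is only a minimal sketch under stated assumptions: the function names, the `ROUTES` table and the information type keys are hypothetical, and the two module functions return placeholder strings instead of reading real logs or sensors.

```python
from typing import Callable, Dict, Iterable, List

# A collector takes a node number and an information type and returns the data.
Collector = Callable[[int, str], str]


def log_module(node: int, info_type: str) -> str:
    # Hypothetical stand-in for the log module: returns a placeholder string.
    return f"log:{info_type}@node{node}"


def environment_monitoring_module(node: int, info_type: str) -> str:
    # Hypothetical stand-in for the environment monitoring module.
    return f"env:{info_type}@node{node}"


# Assumed routing table: which module serves which information type.
ROUTES: Dict[str, Collector] = {
    "user_operation_log": log_module,
    "sel_log": log_module,
    "ambient_temperature": environment_monitoring_module,
    "fan_speed": environment_monitoring_module,
    "power": environment_monitoring_module,
}


def collect(nodes: Iterable[int], info_types: Iterable[str]) -> List[str]:
    """Send one request per (node, information type) to the responsible module."""
    wanted = list(info_types)  # materialize so it can be reused for every node
    return [ROUTES[t](node, t) for node in nodes for t in wanted]
```

For a subregion slave node, `nodes` would contain only that node; for a subregion host node, it would contain every node of the subregion.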
In the embodiment of the invention, the gathering device for fault in-situ information of the multi-node server system shown in Figure 5 works as follows:
1) Fault type matching:
The fault signal that the fault processing module 501 receives from a subregion node is a fault number or a simple character string. After the subregion node reports the fault message, the fault processing module 501 first accesses the fault type module 503 to match the fault type: the gathering device inputs the fault signal to the fault type module 503, and the fault type module 503 returns the fault type and detailed fault description information.
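The matching step can be pictured as a table lookup. The following is a minimal sketch, assuming the fault type module keeps an in-memory mapping keyed by the reported fault number; `FaultType`, `FAULT_TYPE_TABLE`, `match_fault_type` and the sample entries are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FaultType:
    name: str         # fault type identifier
    description: str  # detailed fault description information


# Hypothetical matching relationship between fault signals and fault types.
FAULT_TYPE_TABLE: Dict[str, FaultType] = {
    "0x01": FaultType("cpu_fault", "CPU reported an uncorrectable error"),
    "0x02": FaultType("fan_fault", "Fan speed is below the threshold"),
}


def match_fault_type(fault_signal: str) -> Optional[FaultType]:
    """Return the fault type matching the reported signal, or None if unknown."""
    return FAULT_TYPE_TABLE.get(fault_signal.strip())
```

Returning `None` for an unknown signal is a design choice of this sketch; the embodiment does not say how an unmatched fault message is handled.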
2) Fault in-situ information type decision:
Different fault types correspond to different fault in-situ information types. According to the fault type obtained by the fault processing module 501, the information module 502 determines from the information type module 504 which fault in-situ information types to collect.
The information type module 504 can store the fault in-situ information types that each fault type needs to collect, placing each fault in-situ information type either in the public information type library or in the private information type library matched with that fault type (each fault type is matched with one private information type library). The public information type library is identified by public, and a private information type library is identified by private. The public information type library stores the fault in-situ information types that all fault types need to collect, while a private information type library stores the fault in-situ information types that a particular fault type additionally needs to collect.
The information module 502 can access the information type module 504 and input the fault type to it, and the information type module 504 returns to the information module 502 the set of fault in-situ information types to collect. The information module 502 can traverse this set. If the set comprises the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, the information module 502 sends information requests to the log module 505 and the environment monitoring module 506 respectively, to trigger the log module 505 to collect the user operation log and the SEL log, and to trigger the environment monitoring module 506 to collect the system ambient temperature, fan speed and power.
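The decision step amounts to taking the union of the public library and the private library matched with the fault type. The sketch below assumes simple in-memory lists; which information types are public and which are private is a hypothetical choice made only for illustration.

```python
from typing import Dict, List

# Hypothetical "public" library: types that every fault type needs to collect.
PUBLIC_INFO_TYPES: List[str] = ["user_operation_log", "sel_log"]

# Hypothetical "private" libraries: extra types needed by one fault type only.
PRIVATE_INFO_TYPES: Dict[str, List[str]] = {
    "fan_fault": ["fan_speed", "ambient_temperature"],
    "power_fault": ["power"],
}


def info_types_for(fault_type: str) -> List[str]:
    """Return the public types followed by the fault type's private types."""
    merged = PUBLIC_INFO_TYPES + PRIVATE_INFO_TYPES.get(fault_type, [])
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(merged))
```

A fault type without a private library falls back to the public types alone in this sketch.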
3) Fault in-situ information collection:
In the embodiment of the invention, a multi-node server system can be divided into a plurality of subregions according to the user's needs, and the gathering device can store the correspondence between each subregion node and the subregion in which it is located. After a subregion node reports a fault message, the gathering device first determines, according to the subregion node number, the subregion to which the subregion node belongs. The subregion node may be a subregion host node or a subregion slave node. When the subregion node that reports the fault message is a subregion slave node, the gathering device can collect the fault in-situ information of that subregion slave node (such as the user operation log, SEL log, system ambient temperature, fan speed and power). When the subregion node that reports the fault message is a subregion host node, it cannot be determined whether the fault was caused by the subregion host node itself or by other subregion nodes; therefore, the gathering device can collect the fault in-situ information of all subregion nodes of the subregion to which the subregion host node belongs (such as the user operation log, SEL log, system ambient temperature, fan speed and power).
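The choice of collection scope can be sketched as follows. The partition layout, the set of host nodes and the function names are hypothetical; the sketch only shows that a slave node yields itself while a host node yields every node of its subregion.

```python
from typing import Dict, List, Set

# Hypothetical correspondence between subregions and their node numbers.
PARTITIONS: Dict[str, List[int]] = {"p0": [0, 1, 2], "p1": [3, 4]}
# Hypothetical set of subregion host nodes (one per subregion).
HOST_NODES: Set[int] = {0, 3}


def partition_of(node: int) -> str:
    """Look up the subregion to which a node number belongs."""
    for partition_id, nodes in PARTITIONS.items():
        if node in nodes:
            return partition_id
    raise KeyError(f"node {node} is not assigned to any subregion")


def nodes_to_collect(reporting_node: int) -> List[int]:
    """Host node: all nodes of its subregion. Slave node: only itself."""
    partition_id = partition_of(reporting_node)
    if reporting_node in HOST_NODES:
        return list(PARTITIONS[partition_id])
    return [reporting_node]
```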
4) Information storage:
In the embodiment of the invention, the gathering device can save the collected fault in-situ information to a database. The fault in-situ information can be set to be retained for one week or one month, or an upper limit can be set on the number of fault in-situ information records. When the set time or the upper limit on the number of records is exceeded, the earliest fault in-situ information is overwritten or the database is backed up.
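The retention policy can be sketched with a bounded in-memory queue. This is an illustration under assumptions, not the embodiment's database: the class name `FaultSiteStore` and its parameters are hypothetical, timestamps are assumed to be non-decreasing, and the sketch discards the earliest records rather than backing up the database.

```python
from collections import deque
from typing import Deque, List, Tuple

Record = Tuple[float, str]  # (timestamp in seconds, fault in-situ information)


class FaultSiteStore:
    """Keeps records until they exceed an age limit or a count limit."""

    def __init__(self, max_age_s: float, max_records: int) -> None:
        self.max_age_s = max_age_s
        self.max_records = max_records
        self._records: Deque[Record] = deque()

    def save(self, timestamp: float, info: str) -> None:
        self._records.append((timestamp, info))
        self._evict(now=timestamp)

    def _evict(self, now: float) -> None:
        # Drop records older than the retention time, earliest first.
        while self._records and now - self._records[0][0] > self.max_age_s:
            self._records.popleft()
        # Then drop the earliest records beyond the count limit.
        while len(self._records) > self.max_records:
            self._records.popleft()

    def records(self) -> List[Record]:
        return list(self._records)
```

A one-week retention would correspond to `max_age_s=7 * 24 * 3600` in this sketch.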
Implementing the device shown in Figure 5 provides an effective mechanism for collecting fault in-situ information, so that fault in-situ information can be collected effectively.
Referring to Fig. 6, Fig. 6 is a structural diagram of another gathering device for fault in-situ information of a multi-node server system disclosed in the embodiment of the invention, which is used to perform the collection method for fault in-situ information of a multi-node server system disclosed in the embodiment of the invention. As shown in Fig. 6, the gathering device 600 for fault in-situ information of the multi-node server system comprises: at least one processor 601 (for example, a CPU), at least one network interface 604 or other user interface 603, a memory 605, and at least one communication bus 602. The communication bus 602 is used to implement connection and communication between these components. The user interface 603 may optionally comprise a USB interface, other standard interfaces and wired interfaces. The network interface 604 may optionally comprise a Wi-Fi interface and other wireless interfaces. The memory 605 may comprise a high-speed RAM, and may also comprise a non-volatile memory, for example at least one disk memory. The memory 605 may optionally comprise at least one storage device located remotely from the aforementioned processor 601.
In some embodiments, the memory 605 stores the following elements, executable modules or data structures, or a subset or a superset of them:
Operating system 6051, comprising various hypervisors, used to implement the collection of fault in-situ information;
Application module 6052, comprising the stored data and the matching relationships.
Specifically, the processor 601 is configured to call the program stored in the memory 605 to perform the following operations:
Receive the fault message reported by a subregion node;
According to the fault message, obtain the fault type that matches the fault message;
According to the fault type, extract the fault in-situ information types that need to be collected;
According to the fault in-situ information types that need to be collected, collect the corresponding fault in-situ information.
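The four operations above can be chained into one handler. The sketch below is a hedged illustration: `handle_fault_message` and its three callable parameters are hypothetical names, and the matching, decision and collection steps are injected so that the sketch makes no claim about how each step is implemented.

```python
from typing import Callable, Dict, List


def handle_fault_message(
    fault_message: str,
    match_type: Callable[[str], str],
    decide_info_types: Callable[[str], List[str]],
    collect: Callable[[str], str],
) -> Dict[str, str]:
    """Run the steps that follow receipt of a fault message."""
    fault_type = match_type(fault_message)       # obtain the matching fault type
    info_types = decide_info_types(fault_type)   # extract the types to collect
    # Collect the corresponding information for each required type.
    return {info_type: collect(info_type) for info_type in info_types}
```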
In the embodiment of the invention, the processor 601 obtaining, according to the fault message, the fault type that matches the fault message comprises:
The processor 601 is configured to obtain, according to the fault message, the fault type that matches the fault message from the matching relationship between fault messages and fault types stored in the fault type module.
In the embodiment of the invention, the processor 601 extracting, according to the fault type, the fault in-situ information types that need to be collected comprises:
The processor 601 is configured to extract, according to the fault type, the fault in-situ information types that need to be collected from the public information type library stored in the information type module and the private information type library matched with the fault type.
In the embodiment of the invention, if the subregion node is a subregion slave node and the fault in-situ information types that need to be collected comprise the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, the processor 601 collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected comprises:
The processor 601 sends an information request to the log module, to trigger the log module to collect the user operation log and the SEL log of the subregion slave node;
And sends an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system ambient temperature, fan speed and power of the subregion slave node.
In the embodiment of the invention, if the subregion node is a subregion host node and the fault in-situ information types that need to be collected comprise the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, the processor 601 collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected comprises:
The processor 601 sends an information request to the log module, to trigger the log module to collect the user operation logs and SEL logs of all subregion nodes of the subregion to which the subregion host node belongs;
And sends an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system ambient temperature, fan speed and power of all subregion nodes of the subregion to which the subregion host node belongs.
Implementing the device shown in Figure 6 provides an effective mechanism for collecting fault in-situ information, so that fault in-situ information can be collected effectively.
In the embodiment of the invention, the fault in-situ information may also comprise other information in addition to the user operation log, SEL log, system ambient temperature, fan speed and power.
A person of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments can be completed by a program instructing relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium can comprise: a flash disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disc, or the like.
The collection method and device for fault in-situ information of a multi-node server system disclosed in the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method of the invention and its core idea. Meanwhile, a person of ordinary skill in the art may, according to the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this description should not be construed as limiting the invention.

Claims (10)

1. A collection method for fault in-situ information of a multi-node server system, characterized by comprising:
Receiving a fault message reported by a subregion node;
Obtaining, according to the fault message, a fault type that matches the fault message;
Extracting, according to the fault type, the fault in-situ information types that need to be collected;
Collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected.
2. The collection method for fault in-situ information of a multi-node server system according to claim 1, characterized in that obtaining, according to the fault message, the fault type that matches the fault message comprises:
Obtaining, according to the fault message, the fault type that matches the fault message from the matching relationship between fault messages and fault types stored in a fault type module.
3. The collection method for fault in-situ information of a multi-node server system according to claim 2, characterized in that extracting, according to the fault type, the fault in-situ information types that need to be collected comprises:
Extracting, according to the fault type, the fault in-situ information types that need to be collected from the public information type library stored in an information type module and the private information type library matched with the fault type.
4. The collection method for fault in-situ information of a multi-node server system according to any one of claims 1 to 3, characterized in that, if the subregion node is a subregion slave node and the fault in-situ information types that need to be collected comprise the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected comprises:
Sending an information request to a log module, to trigger the log module to collect the user operation log and the SEL log of the subregion slave node;
And sending an information request to an environment monitoring module, to trigger the environment monitoring module to collect the system ambient temperature, fan speed and power of the subregion slave node.
5. The collection method for fault in-situ information of a multi-node server system according to any one of claims 1 to 3, characterized in that, if the subregion node is a subregion host node and the fault in-situ information types that need to be collected comprise the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected comprises:
Sending an information request to a log module, to trigger the log module to collect the user operation logs and SEL logs of all subregion nodes of the subregion to which the subregion host node belongs;
And sending an information request to an environment monitoring module, to trigger the environment monitoring module to collect the system ambient temperature, fan speed and power of all subregion nodes of the subregion to which the subregion host node belongs.
6. A gathering device for fault in-situ information of a multi-node server system, characterized by comprising a fault management module, the fault management module comprising:
A fault processing module, configured to receive a fault message reported by a subregion node and to obtain, according to the fault message, a fault type that matches the fault message;
An information module, configured to extract, according to the fault type obtained by the fault processing module, the fault in-situ information types that need to be collected, and to collect the corresponding fault in-situ information according to the fault in-situ information types that need to be collected.
7. The gathering device for fault in-situ information of a multi-node server system according to claim 6, characterized in that the gathering device further comprises:
A fault type module, configured to store the matching relationship between fault messages and fault types;
Wherein the fault processing module obtaining, according to the fault message, the fault type that matches the fault message comprises:
The fault processing module is configured to obtain, according to the fault message, the fault type that matches the fault message from the matching relationship between fault messages and fault types stored in the fault type module.
8. The gathering device for fault in-situ information of a multi-node server system according to claim 7, characterized in that the gathering device further comprises:
An information type module, configured to store the public information type library and the private information type libraries matched with the fault types;
Wherein the information module extracting, according to the fault type obtained by the fault processing module, the fault in-situ information types that need to be collected comprises:
The information module extracts, according to the fault type obtained by the fault processing module, the fault in-situ information types that need to be collected from the public information type library stored in the information type module and the private information type library matched with the fault type.
9. The gathering device for fault in-situ information of a multi-node server system according to any one of claims 6 to 8, characterized in that, if the subregion node is a subregion slave node and the fault in-situ information types that need to be collected comprise the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, the gathering device further comprises a log module and an environment monitoring module:
Wherein the information module collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected comprises:
The information module is configured to send an information request to the log module, to trigger the log module to collect the user operation log and the SEL log of the subregion slave node;
And the information module is configured to send an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system ambient temperature, fan speed and power of the subregion slave node.
10. The gathering device for fault in-situ information of a multi-node server system according to any one of claims 6 to 8, characterized in that, if the subregion node is a subregion host node and the fault in-situ information types that need to be collected comprise the user operation log, the SEL log, the system ambient temperature, the fan speed and the power, the gathering device further comprises a log module and an environment monitoring module:
Wherein the information module collecting the corresponding fault in-situ information according to the fault in-situ information types that need to be collected comprises:
The information module is configured to send an information request to the log module, to trigger the log module to collect the user operation logs and SEL logs of all subregion nodes of the subregion to which the subregion host node belongs;
And the information module is configured to send an information request to the environment monitoring module, to trigger the environment monitoring module to collect the system ambient temperature, fan speed and power of all subregion nodes of the subregion to which the subregion host node belongs.
CN2013102528953A 2013-06-24 2013-06-24 Collecting method and device for fault site information of multi-node server system Pending CN103368771A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN2013102528953A CN103368771A (en) 2013-06-24 2013-06-24 Collecting method and device for fault site information of multi-node server system
PCT/CN2014/072262 WO2014206099A1 (en) 2013-06-24 2014-02-19 Method and device for collecting fault site information about multi-node server system

Publications (1)

Publication Number Publication Date
CN103368771A true CN103368771A (en) 2013-10-23

Family

ID=49369360

Country Status (2)

Country Link
CN (1) CN103368771A (en)
WO (1) WO2014206099A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000115168A (en) * 1998-09-30 2000-04-21 Toshiba Corp Fault management system applied to network and network management system
CN101227324A (en) * 2008-01-10 2008-07-23 华为技术有限公司 Fault information gathering method of communication equipment as well as communication equipment and system thereof
CN102855369A (en) * 2011-06-30 2013-01-02 上海西门子医疗器械有限公司 Method and system for collecting failure information and medical equipment
CN102571452A (en) * 2012-02-20 2012-07-11 华为技术有限公司 Multi-node management method and system

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014206099A1 (en) * 2013-06-24 2014-12-31 华为技术有限公司 Method and device for collecting fault site information about multi-node server system
CN105245600A (en) * 2015-10-15 2016-01-13 珠海格力电器股份有限公司 Unit data uploading method and system of air conditioning system
CN105245600B (en) * 2015-10-15 2019-10-22 珠海格力电器股份有限公司 A kind of the unit data uploading method and system of air-conditioning system
CN105306272B (en) * 2015-11-10 2019-01-25 中国建设银行股份有限公司 Information system fault scenes formation gathering method and system
CN105306272A (en) * 2015-11-10 2016-02-03 中国建设银行股份有限公司 Method and system for collecting fault scene information of information system
WO2016188100A1 (en) * 2015-11-10 2016-12-01 中国建设银行股份有限公司 Information system fault scenario information collection method and system
US10545807B2 (en) 2015-11-10 2020-01-28 China Construction Bank Corporation Method and system for acquiring parameter sets at a preset time interval and matching parameters to obtain a fault scenario type
CN106100879A (en) * 2016-06-07 2016-11-09 青岛海信移动通信技术股份有限公司 Mobile terminal journal obtaining method and device
CN108289034A (en) * 2017-06-21 2018-07-17 新华三大数据技术有限公司 A kind of fault discovery method and apparatus
CN109062758A (en) * 2018-07-19 2018-12-21 郑州云海信息技术有限公司 A kind of server system delay machine processing method, system, medium and equipment
US11269717B2 (en) * 2019-09-24 2022-03-08 Sap Se Issue-resolution automation
CN111931011A (en) * 2020-07-04 2020-11-13 华电联合(北京)电力工程有限公司 Accident information collection method, collection device, collection system and computer readable storage medium
CN111931011B (en) * 2020-07-04 2023-12-08 华电联合(北京)电力工程有限公司 Accident information collection method, collection device, collection system and computer readable storage medium

Also Published As

Publication number Publication date
WO2014206099A1 (en) 2014-12-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131023