Summary of the invention
The invention provides an alarm processing method for IT operations and maintenance (O&M). Its main steps are: 1) a collection point gathers system running state and performance metrics; 2) the collection point uploads the data to a processing server; 3) the processing server judges, according to predefined rules, whether an abnormal condition exists, and if so, generates an anomaly; 4) correlation analysis is performed on the newly generated anomaly to determine whether a new alarm should be produced; 5) for each newly produced alarm, operations such as sending an SMS, driving an alarm light, and sending an instant message are carried out.
The present invention also provides a device built according to the above method, as shown in Figure 1. The device comprises three parts: a collection unit, an alarm processing unit, and an alarm notification unit. The collection unit is responsible for gathering the state and performance data of the IT infrastructure. The alarm processing unit comprises four subunits, including: anomaly judgment, which analyzes the data according to predefined rules and determines whether an anomaly has occurred; correlation analysis, which analyzes a new anomaly against the events already detected and judges whether the anomaly should trigger a new event; and information expansion, which enriches the alarm content. An original alarm may carry only some basic information; after expansion its content is richer, so that O&M personnel can understand the alarm more effectively and make the best judgment.
The data gathered by the collection unit comprises status data and performance data. The unit supports multiple collection methods, including SNMP, Telnet/SSH, JDBC, and JMX, and covers a wide range of IT infrastructure such as networks, servers, databases, and middleware.
In IT O&M, automatically judging whether a system is running abnormally is very important. Some faults, such as a system being unreachable, prevent business from being processed, and users will report them as complaints. Other problems are latent: users do not perceive them, but they can be identified from relevant knowledge. For example, if the night-time traffic on a certain link is normally below 1M, then traffic exceeding 1M or going even higher may indicate an anomaly. The anomaly judgment unit identifies problems in the running system according to rules. In a rule, every piece of data gathered by the collection unit is called a "value", and each value carries attributes such as the corresponding device, module, metric, and collection time. A rule is an expression that computes whether a value satisfies a condition; the expression is composed of macros, identifiers, and operators. The anomaly judgment unit performs macro substitution on each value it receives and then evaluates the expression; if the result is true, an anomaly has occurred. The flexibility of expressions allows this judgment method to meet the needs of many different types of devices, metrics, and scenarios.
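The rule evaluation described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `${name}` macro syntax, the `Value` field names, and the restriction to comparisons joined by `and`/`or` are all assumptions made for the example, and the night-time check is simplified to a string comparison on the collection time.

```python
import ast
import operator
import re
from dataclasses import dataclass


@dataclass
class Value:
    """One collected value with the attributes named in the text (field names are hypothetical)."""
    device: str
    module: str
    metric: str
    collected_at: str  # "HH:MM", simplified for the example
    reading: float     # same "M" unit as the threshold in the text


_COMPARATORS = {
    ast.Gt: operator.gt, ast.GtE: operator.ge,
    ast.Lt: operator.lt, ast.LtE: operator.le,
    ast.Eq: operator.eq, ast.NotEq: operator.ne,
}


def substitute_macros(expression: str, value: Value) -> str:
    """Replace each ${name} macro (assumed syntax) with the literal of that attribute."""
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: repr(getattr(value, m.group(1))),
                  expression)


def _evaluate(node):
    """Evaluate a small, safe subset of expression syntax."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_evaluate(node.operand)
    if isinstance(node, ast.BoolOp):
        results = [_evaluate(v) for v in node.values]
        return all(results) if isinstance(node.op, ast.And) else any(results)
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        compare = _COMPARATORS.get(type(node.ops[0]))
        if compare is not None:
            return compare(_evaluate(node.left), _evaluate(node.comparators[0]))
    raise ValueError("unsupported expression element")


def is_abnormal(expression: str, value: Value) -> bool:
    """Macro-substitute, then evaluate; a true result means an anomaly."""
    substituted = substitute_macros(expression, value)
    return bool(_evaluate(ast.parse(substituted, mode="eval")))


# Hypothetical rule for the link-traffic example: night-time traffic above 1M.
NIGHT_TRAFFIC_RULE = ('${metric} == "traffic" and ${collected_at} >= "20:00" '
                      'and ${reading} > 1.0')
```

With this sketch, `is_abnormal(NIGHT_TRAFFIC_RULE, Value("link-1", "eth0", "traffic", "23:15", 2.5))` evaluates the substituted expression and reports an anomaly, while a reading of 0.4 does not. Parsing into a syntax tree rather than calling `eval` keeps user-written rules from executing arbitrary code.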
The original alarm information includes only attributes such as the alarm source, time of occurrence, and content. Because business systems are increasingly complex, the information expansion unit extends the attributes of the alarm information to help O&M personnel better grasp the risks or problems an alarm may entail, its impact on the business, and so on.
In an IT system, resources such as networks, servers, and databases are interconnected. When one component becomes abnormal, the components associated with it may also report the same anomaly, producing a series of alarms. Finding the true cause and location of the fault by analyzing the correlation among this series of alarms is key to ensuring that alarms are effective.
After an alarm occurs, the O&M personnel who need to know about it must be notified in time. To suit different levels of urgency, the alarm notification unit provides multiple notification methods such as SMS, email, alarm lights, and messages. SMS, lights, and messages are suitable for urgent alarms with strict real-time requirements; email is suitable for general alarms.
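The mapping from urgency to notification method can be illustrated with a small lookup. This is only a sketch: the two urgency levels and the channel names are assumptions chosen to mirror the paragraph above, not identifiers from the invention.

```python
from enum import Enum


class Urgency(Enum):
    """Hypothetical urgency levels; the text distinguishes urgent and general alarms."""
    URGENT = "urgent"
    GENERAL = "general"


# Channel names are illustrative labels for the methods listed in the text.
CHANNELS_BY_URGENCY = {
    Urgency.URGENT: ("sms", "light", "message"),
    Urgency.GENERAL: ("mail",),
}


def channels_for(urgency: Urgency) -> tuple:
    """Return the notification channels appropriate for the given urgency."""
    return CHANNELS_BY_URGENCY[urgency]
```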
In addition, according to embodiments of the invention, the collection point consists of a robot and a plurality of probes; the robot is responsible for scheduling the probes to carry out collection.
In addition, according to embodiments of the invention, the collection methods supported by the probes include SNMP, Telnet, SSH, JDBC, JMX, and so on.
In addition, according to embodiments of the invention, collection points can be distributed and installed at multiple sites, while the data is stored centrally.
In addition, according to embodiments of the invention, the collection probes are divided into SNMP probes, JDBC probes, Telnet/SSH probes, JMX probes, and so on.
In addition, according to embodiments of the invention, the collection unit and the alarm processing unit are connected by a data bus and a message bus; the data bus is used to report data, and the message bus is used to issue collection instructions.
In addition, according to embodiments of the invention, one alarm processing unit can receive data from multiple collection units.
In addition, according to embodiments of the invention, when a transmission fault occurs, the collection unit can try one or more backup alarm processing units.
In addition, according to embodiments of the invention, when data cannot be transmitted, the collection unit can retain the data of the most recent period until transmission recovers.
In addition, according to embodiments of the invention, when the alarm processing unit finds that data needs to be collected again, it can notify the collection unit through the message bus to re-collect.
In addition, according to embodiments of the invention, anomaly judgment is computed by conditional expressions, and the conditional expressions reference metric values, environment values, and so on through macro definitions.
In addition, according to embodiments of the invention, information expansion identifies the alarm set by a conditional expression and defines the values of the extension fields by value expressions.
In addition, according to embodiments of the invention, correlation analysis uses rules to define the resource correlation, temporal correlation, and business correlation between alarms.
In addition, according to embodiments of the invention, correlation analysis implements shielding, compression, escalation, and association operations.
In addition, according to embodiments of the invention, the alarm notification unit and the alarm processing unit transmit alarms over the TCP protocol; the alarm processing unit can push alarms to multiple alarm notification units.
In addition, according to embodiments of the invention, the alarm notification unit controls the on/off state, flashing, and color of the alarm light through serial-port signal levels.
In addition, according to embodiments of the invention, the alarm notification unit sends alarms by controlling an SMS modem through a serial port.
Embodiment
All features disclosed in this specification, and all steps of any method or process disclosed, may be combined in any manner, except for mutually exclusive features and/or steps.
Any feature disclosed in this specification (including any accompanying claims, abstract, and drawings) may, unless specifically stated otherwise, be replaced by other equivalent or alternative features serving a similar purpose. That is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.
The present invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the device of the present invention comprises a collection unit, an alarm processing unit, and an alarm notification unit. The collection unit comprises a robot and various probes such as SNMP and Telnet probes. Depending on the technical interfaces a device supports, the probes gather the device's running state by different technical means. The collection unit passes the gathered data to the alarm processing unit over the data bus. The collection unit also receives instructions from the alarm processing unit and, when collected data is found to be erroneous, performs operations such as re-collection and supplementary collection. The connection between the collection unit and the alarm processing unit supports backup: when the collection unit finds that it cannot communicate with the alarm processing unit currently in use, it automatically connects to a backup alarm processing unit. If no alarm processing unit can be reached, the collection unit retains the data of the most recent period until the remaining disk space falls below a specified size. When the remaining space is insufficient, the collection unit discards the oldest data. In this way the timeliness and accuracy of alarms are guaranteed as far as possible.
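The failover and local retention behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Collector` class and its `senders` callables are hypothetical, a transmission failure is modeled as a `ConnectionError`, and a fixed record count stands in for the remaining-disk-space threshold mentioned in the text.

```python
from collections import deque


class Collector:
    """Hypothetical collection unit that fails over to backups and buffers unsent data."""

    def __init__(self, senders, max_buffered):
        # senders: callables for the primary processing unit first, then the backups.
        self.senders = senders
        # A bounded deque silently discards the oldest record when full,
        # standing in for "discard old data when space runs low".
        self.buffer = deque(maxlen=max_buffered)

    def _send(self, record):
        """Try each processing unit in order; return True once one accepts the record."""
        for send in self.senders:
            try:
                send(record)
                return True
            except ConnectionError:
                continue  # this unit is unreachable, try the next backup
        return False

    def report(self, record):
        """Queue the record, then flush oldest-first; return how many records remain buffered."""
        self.buffer.append(record)
        while self.buffer:
            if not self._send(self.buffer[0]):
                break  # nothing reachable right now, keep the data for later
            self.buffer.popleft()
        return len(self.buffer)
```

Flushing oldest-first preserves the order in which values were collected once a processing unit becomes reachable again, which matters for time-based anomaly rules.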
After the alarm processing unit receives the raw data, it first analyzes, according to the preset anomaly judgment rules, whether an anomaly has occurred. An anomaly may concern a specific technical metric of an IT resource or business system, or a measure of user experience; it may also be a judgment derived from a combined computation over multiple metrics. To accommodate the complexity of different devices and different business systems, anomaly rules are described by anomaly expressions. Users can describe abnormal conditions with expressions according to their own understanding of the IT system. Because macro substitution and evaluation of expressions can be time-consuming, the anomaly judgment module records the performance of expression processing and analyzes it periodically, and uses the results to tune the number of concurrent threads used for expression processing.
To make alarms more readable and help O&M personnel analyze them more accurately, the information expansion unit extends the alarm fields. In this device, extension fields are reserved in the alarm information. As shown in Figure 4, the system first defines a conditional expression that determines the set of alarms satisfying the condition, and then defines value expressions for one or more extension fields. For every alarm, the system judges whether its attributes satisfy the conditional expression; if they do, the alarm's original attributes, environment information, business information, device maintenance information, and so on are substituted as macros into the value expressions, and the values of the extension fields are computed.
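The expansion step can be sketched as follows. This is an illustrative sketch only: the `ExpansionRule` structure, the use of `{name}` placeholders as macros, and the example field names are assumptions, not details taken from Figure 4.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExpansionRule:
    """Hypothetical rule: a condition selecting alarms plus value templates for extension fields."""
    condition: Callable[[dict], bool]
    fields: Dict[str, str]  # extension field name -> template with {macro} placeholders


def expand(alarm: dict, rules: list, context: dict) -> dict:
    """Return a copy of the alarm with extension fields filled in by every matching rule."""
    # Macros may refer to the alarm's own attributes or to environment/business information.
    macros = {**context, **alarm}
    expanded = dict(alarm)
    for rule in rules:
        if rule.condition(alarm):
            for name, template in rule.fields.items():
                expanded[name] = template.format_map(macros)
    return expanded


# Example: database alarms gain an impact description and a contact.
db_rule = ExpansionRule(
    condition=lambda a: a["source"].startswith("db-"),
    fields={"impact": "May affect {business}", "contact": "{owner}"},
)
```

For an alarm `{"source": "db-01", "content": "tablespace full"}` and context `{"business": "billing", "owner": "DBA team"}`, the expanded alarm keeps its original attributes and gains `impact` and `contact` fields; alarms that do not match the condition are returned unchanged.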
For a newly produced alarm, the correlation analysis unit compares it with historical alarms to determine whether a correlation exists between these events, and identifies the root alarm and the derived alarms. The correlation includes temporal correlation, resource correlation, and business correlation. As shown in Figure 5, correlation processing comprises the following steps: 1) the user establishes correlation rules and determines the priority of each rule; the rules provided by the system can describe correlations of time, resource, and business; 2) the system reads the preset rules; 3) after a new alarm is produced, the system computes an alarm set from the alarm's attributes and the correlation rules; if the set contains more than one element, the alarm is correlated with other alarms, and the root alarm and derived alarms are analyzed further (by default, the alarm produced first is the root alarm); 4) for correlated alarms, operations such as shielding, compression, and escalation are carried out according to the actions predefined in the rule; 5) correlated alarms can be displayed as a group on the display unit.
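Step 3 of the procedure above, computing the correlated set and picking the root alarm, can be sketched as follows. The sketch is an assumption-laden simplification: the `Alarm` fields, the `depends_on` resource map, and the single time window are hypothetical stand-ins for the user-defined rules, and business correlation and rule priority are omitted.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    """Hypothetical alarm record with only the attributes this sketch needs."""
    alarm_id: str
    resource: str
    timestamp: float  # seconds


def correlate(new_alarm, history, depends_on, window_seconds):
    """Return (root, derived) when the new alarm is correlated with history, else None.

    depends_on maps a resource to the set of resources it depends on (resource correlation);
    window_seconds bounds how far apart two alarms may be (temporal correlation).
    """
    related = [new_alarm]
    for old in history:
        close_in_time = abs(new_alarm.timestamp - old.timestamp) <= window_seconds
        linked = (
            old.resource == new_alarm.resource
            or old.resource in depends_on.get(new_alarm.resource, ())
            or new_alarm.resource in depends_on.get(old.resource, ())
        )
        if close_in_time and linked:
            related.append(old)
    if len(related) < 2:
        return None  # the set holds only the new alarm: no correlation
    # Default from the text: the alarm produced first is the root alarm.
    related.sort(key=lambda a: a.timestamp)
    return related[0], related[1:]
```

Given the returned pair, the rule's predefined action would then be applied, for example shielding the derived alarms so that only the root alarm is notified.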
After expansion and correlation analysis, the relevant O&M personnel must be notified of the alarm, by means including query, SMS, email, and alarm lights. As shown in Figure 1, in this device the alarm notification unit and the alarm processing unit communicate over TCP, and the alarm processing unit pushes alarms to the alarm notification unit. The alarm notification unit is connected to the SMS modem, the alarm light, and so on through serial ports. It communicates with the SMS modem over a serial protocol to send SMS messages, and it controls the switching of the alarm light through high and low signal levels.
The present invention is not limited to the foregoing embodiments. The invention extends to any new feature or any new combination of features disclosed in this specification, and to any new method or process step, or any new combination thereof, that is disclosed.