CN113821412A - Equipment operation and maintenance management method and device - Google Patents

Equipment operation and maintenance management method and device

Info

Publication number
CN113821412A
Authority
CN
China
Prior art keywords
alarm
alarm information
target
information
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111128350.2A
Other languages
Chinese (zh)
Inventor
杨赭
王璐璐
朱斌
李晓宇
徐育全
陈其刚
黄明罡
蔡元飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp filed Critical China Construction Bank Corp
Priority to CN202111128350.2A priority Critical patent/CN113821412A/en
Publication of CN113821412A publication Critical patent/CN113821412A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display

Abstract

The invention provides a device operation and maintenance management method and apparatus. The method comprises: when initial alarm information sent by a master alarm server is received, splitting the initial alarm information to obtain a target alarm field; writing the target alarm field into a preset alarm information table to generate target alarm information; determining the alarm level of the standardized target alarm information; and, once the alarm level of the target alarm information is determined, executing the operation corresponding to that alarm level. With this scheme, manual status inspection is no longer needed. When any component of any device at any site fails, the initial alarm information sent by the alarm center at the site where the fault occurred is processed to obtain a target alarm field, the required target alarm field is written into the preset alarm information table to generate target alarm information, and the corresponding operation is executed according to the alarm level of the target alarm information. This reduces operation and maintenance cost and allows alarm analysis to be carried out quickly.

Description

Equipment operation and maintenance management method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for managing operation and maintenance of equipment.
Background
Most core systems of large banks run on hardware from IBM (International Business Machines Corporation), such as mainframes, storage, switches and tape libraries. As large banks concentrate their data and seek to use this equipment efficiently, the hardware must be operated and maintained centrally in the data center. With the construction of a disaster recovery environment of two places and three centers, the number of hardware devices has multiplied and the connections between devices have become more complicated, which places higher demands on device operation and maintenance management.
At present, hardware operation and maintenance personnel are assigned to enter the machine room of each of the three centers for routine inspection every day. When an abnormal device state is found, they look up the topological relationships and analyze the impact on banking business. Because this approach requires a large amount of labor and the connections between devices are complex, it increases operation and maintenance cost and does not allow alarm analysis to be carried out quickly.
Disclosure of Invention
In view of this, embodiments of the present invention provide a device operation and maintenance management method and apparatus, so as to solve the problems in the prior art that labor cost is increased and alarm analysis cannot be performed quickly.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
the first aspect of the embodiment of the present invention shows a device operation and maintenance management method, where the method includes:
when initial alarm information sent by a main alarm server is received, splitting processing is carried out based on the initial alarm information to obtain a target alarm field;
writing the target alarm field into a preset alarm information table to generate target alarm information;
determining the alarm level of the standardized target alarm information;
and when the alarm level of the target alarm information is determined, executing operation corresponding to the alarm level.
Optionally, the splitting based on the initial alarm information to obtain a target alarm field includes:
splitting the initial alarm information to obtain an initial alarm field;
and performing information supplement on the initial alarm field to obtain a target alarm field.
Optionally, the determining the alarm level of the standardized target alarm information includes:
standardizing the target alarm information to obtain standardized target alarm information;
identifying the target alarm information, and comparing the numerical value corresponding to the target alarm information with a first threshold, a second threshold, a third threshold and a fourth threshold, wherein the first threshold is smaller than the second threshold, the second threshold is smaller than the third threshold, and the third threshold is smaller than the fourth threshold;
when the numerical value corresponding to the target alarm information is smaller than the first threshold, determining that the alarm level of the target alarm information is level zero;
when the numerical value corresponding to the target alarm information is larger than the first threshold, determining that the alarm level of the target alarm information is level two;
when the numerical value corresponding to the target alarm information is larger than the second threshold, determining that the alarm level of the target alarm information is level three;
when the numerical value corresponding to the target alarm information is larger than the third threshold, determining that the alarm level of the target alarm information is level four;
and when the numerical value corresponding to the target alarm information is larger than the fourth threshold, determining that the alarm level of the target alarm information is level five.
Optionally, the method further includes:
and when the target alarm information cannot be identified, determining that the alarm level of the target alarm information is level one.
Optionally, when the alarm level of the target alarm information is determined, executing an operation corresponding to the alarm level, including:
when the alarm level of the target alarm information is determined to be greater than or equal to level two, executing the corresponding alarm operation based on the alarm level to prompt operation and maintenance personnel;
and when the alarm level of the target alarm information is determined to be lower than level two, not executing an alarm operation.
Optionally, the method further includes:
and when the master alarm server is determined to be abnormal, receiving initial alarm information sent by the slave alarm server, and executing the splitting processing based on the initial alarm information to obtain a target alarm field.
A second aspect of the embodiments of the present invention shows an apparatus for managing operation and maintenance of a device, where the apparatus includes:
the processing unit is used for splitting based on initial alarm information when the initial alarm information sent by the main alarm server is received to obtain a target alarm field;
the generating unit is used for writing the target alarm field into a preset alarm information table to generate target alarm information;
the determining unit is used for determining the alarm level of the standardized target alarm information;
and the execution unit is used for executing the operation corresponding to the alarm level when the alarm level of the target alarm information is determined.
Optionally, the processing unit that performs splitting processing based on the initial alarm information to obtain a target alarm field is specifically configured to: splitting the initial alarm information to obtain an initial alarm field; and performing information supplement on the initial alarm field to obtain a target alarm field.
Optionally, the determining unit is specifically configured to: standardize the target alarm information to obtain standardized target alarm information; identify the target alarm information, and compare the numerical value corresponding to the target alarm information with a first threshold, a second threshold, a third threshold and a fourth threshold, wherein the first threshold is smaller than the second threshold, the second threshold is smaller than the third threshold, and the third threshold is smaller than the fourth threshold; when the numerical value is smaller than the first threshold, determine that the alarm level of the target alarm information is level zero; when the numerical value is larger than the first threshold, determine that the alarm level is level two; when the numerical value is larger than the second threshold, determine that the alarm level is level three; when the numerical value is larger than the third threshold, determine that the alarm level is level four; and when the numerical value is larger than the fourth threshold, determine that the alarm level is level five.
Optionally, the processing unit is further configured to: when the master alarm server is determined to be abnormal, receive initial alarm information sent by the slave alarm server.
Based on the above method and apparatus for managing operation and maintenance of a device provided by the embodiments of the present invention, the method includes: when initial alarm information sent by a master alarm server is received, splitting the initial alarm information to obtain a target alarm field; writing the target alarm field into a preset alarm information table to generate target alarm information; determining the alarm level of the standardized target alarm information; and, once the alarm level is determined, executing the operation corresponding to the alarm level. In the embodiment of the invention, manual status inspection is no longer needed. When any component of any device at any site fails, the initial alarm information sent by the master alarm server at the site of the fault is received and processed to obtain a target alarm field, the required target alarm field is written into the preset alarm information table to generate target alarm information, and the corresponding operation is executed according to the alarm level of the target alarm information. This reduces operation and maintenance cost and allows alarm analysis to be carried out quickly.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is an application architecture diagram of an alarm center and an alarm service provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a deployment of an alarm center and an alarm service in a two-place-three-center system according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data center in which an alarm center is deployed according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a device operation and maintenance management method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the data flow in which alarm information is written into a preset alarm information table according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of another device operation and maintenance management method according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a data flow for determining a target alarm field according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a device operation and maintenance management apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of understanding, terms appearing in the embodiments of the present invention are explained below:
OMNIbus: an alarm event processing tool.
TPC: a disk performance monitoring tool.
SNMP: an SNMP probe that receives SNMP traps.
Impact: an information enrichment module.
Syslog: a Syslog probe that receives Syslog events.
JDBC: transfers events from OMNIbus to a DB2 database for permanent retention.
BI: the bidirectional gateway, which synchronizes events from the master OMNIbus to the standby OMNIbus.
Two places and three centers: three data centers located in two cities.
Event integration platform: a platform system responsible for forwarding information by mail or short message.
Large host hardware device: IBM mainframe related hardware devices.
Referring to FIG. 1, an application architecture diagram of an alert center 10 and an alert service 20 is shown for an embodiment of the present invention.
Wherein, the alarm center 10 can be arranged in a two-place-three center, and the alarm center 10 comprises a master alarm server 101 and a slave alarm server 102.
The alarm service 20 acts as a single, unified alarm center and is configured to receive the initial alarm information sent by the alarm center of each location.
The alarm service 20 includes a unified portal for hardware monitoring systems, alarm queries, topology checks, and maintenance of alarm information, which is a window for operations and maintenance personnel and general users.
Specifically, the window adopts a B/S architecture, is accessed in a Web mode, can define users, roles and organizational structures in a flexible data organization display mode, provides different functional modules for different roles, and distributes different display contents to different users.
Optionally, operation and maintenance personnel in each region may click the corresponding data center in the alarm service 20 to jump to the topology interface and the performance data interface of that data center.
Optionally, since logging in to the alarm service 20 and following such a page jump requires the login authentication of the host hardware monitoring system to be modified in step, unified authority must be adapted, for example by allocating login accounts.
It should be noted that the alarm service 20 unifies alarms, topology entries, index data viewing entries and permissions across the two places and three centers. Alarm unification is realized by OMNIbus. For example, the alarm service 20 and a local alarm center 10 are deployed in data center A1, while data center B, the Beijing Ocean Bridge data center and data center A2 each have their own alarm center 10, as shown in FIG. 2.
When receiving initial alarm information sent by the master alarm server 101 of the alarm center 10 in any region, the alarm service 20 splits the initial alarm information to obtain a target alarm field; writes the target alarm field into a preset alarm information table to generate target alarm information; determines the alarm level of the standardized target alarm information; and, once the alarm level is determined, executes the operation corresponding to the alarm level.
Optionally, a master alarm server 101 and a slave alarm server 102 are deployed in the data center, a master service Primary Omnibus is deployed on the master alarm server 101, a Backup Omnibus component is deployed on the slave alarm server 102, and alarm synchronization is performed through a gateway of a bidirectional BI, as shown in fig. 3.
The data center comprises devices managed by operation and maintenance, specifically an optical fiber switch, a storage disk, a mainframe and a tape library.
Optionally, to minimize the impact of a computer failure, so that alarm monitoring keeps operating normally when one machine has a problem, multiple computers are used in the standard multi-layer architecture of the data center. The components in the architecture are located in three layers: a collection layer, an aggregation layer and a display layer.
Each layer shows the physical computers on which the master alarm server (ObjectServer), the slave alarm server (ObjectServer) and the ObjectServer gateways are located. The end of a gateway connected to the ObjectServer of the master alarm server is called the reader and reads data from the master alarm server; the end of the gateway connected to the target is called the writer and writes data to the target. The bidirectional gateway in the aggregation layer has a reader and a writer at both ends.
The collection layer includes a master alarm server and a slave alarm server to which the probes are connected. This configuration shows a pair of collector level objectservers, but more pairs may be added if necessary, that is, the number of alert servers may be set according to the actual situation, and the embodiment of the present invention is not limited thereto.
It should be noted that the ObjectServer of each alarm server in the collection layer has its own dedicated unidirectional ObjectServer gateway, which connects that ObjectServer to the aggregation layer. The reader of each collection gateway is connected in a fixed manner to its dedicated collection ObjectServer, while the writer of each gateway is connected to the virtual aggregation ObjectServer pair. Thus, while the writer may fail over and fail back between the primary and backup aggregation-layer ObjectServers, the reader only maintains a connection to its dedicated collection ObjectServer.
The aggregation layer includes a pair of object servers, a master alert server and a slave alert server, which are connected by a bi-directional object server gateway to keep them synchronized. The bidirectional ObjectServer Gateway is running on the backup host.
All incoming collection gateway writers and all outgoing display gateway readers are connected to a virtual aggregation pair (named AGG_V) so that the writers and readers can fail over and fail back when the primary aggregation ObjectServer computer is not available.
The display layer includes two separate display objectservers to which both the desktop event list user and the Web GUI user are connected. The configuration includes two display layers ObjectServer, but other display layers ObjectServer may be added if necessary, that is, the number of alert servers may be set according to actual situations, and the embodiment of the present invention is not limited thereto.
In the embodiment of the invention, manual status inspection is no longer needed. When any component of any device at any site fails, the initial alarm information sent by the master alarm server at the site of the fault is received and processed to obtain a target alarm field, the required target alarm field is written into the preset alarm information table to generate target alarm information, and the corresponding operation is executed according to the alarm level of the target alarm information. This reduces operation and maintenance cost and allows alarm analysis to be carried out quickly.
Referring to fig. 4, a schematic flow chart of an operation and maintenance management method for a device according to an embodiment of the present invention is shown, where the method includes:
Step S401: when initial alarm information sent by a master alarm server is received, processing is performed based on the initial alarm information to determine a target alarm field.
Optionally, when the Trap probe of the master alarm server of any center receives the initial alarm information of a device, the initial alarm information is stored in the OMNIbus library of the alarm service. That is, the alarm centers in each region send alarms from the mainframe, the storage, the tape library and the TPC thresholds through an SNMP Trap probe to the alarm management module OMNIbus of the alarm service for processing. Specifically, alarm information generated by alarm conditions customized in the TPC is sent to the EIF Probe of OMNIbus; mainframe, tape library, storage and optical communication event messages are sent to the Trap Probe of OMNIbus through the Trap configured on the device, and from there to the alarm management module OMNIbus of the alarm service.
Specific contents of S401: the alarm service uses acquisition probes in OMNIbus, such as EIF and SNMP Trap, to collect the initial alarm information of the various device faults from the alarm centers in each region. The alarm management module receives the initial alarm information with the SnmpTrap Probe, splits the alarm information as it is received, and defines the meaning of each field through componentized programming rules to obtain the target alarm field.
Optionally, alarm messages on storage and optical communication performance are generated by the monitoring system (TPC) of each local alarm center and are sent to the OMNIbus library of the alarm service through the EIF Probe.
It should be noted that the initial alarm information includes information such as device IP, alarm type, alarm source, date, time, alarm name, alarm level, and alarm description.
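The splitting and information supplementing of step S401 can be sketched as follows. This is only a minimal illustration under stated assumptions: the pipe-delimited input format, the field names in `FIELD_NAMES`, the `DEVICE_LOCATIONS` lookup and the added `location` field are all hypothetical and are not taken from the OMNIbus probe rules described here.

```python
# Hypothetical field order, following the contents of the initial alarm
# information listed above (device IP, alarm type, alarm source, ...).
FIELD_NAMES = [
    "device_ip", "alarm_type", "alarm_source", "date",
    "time", "alarm_name", "alarm_level", "alarm_description",
]

# Hypothetical enrichment data: which data center a device IP belongs to.
DEVICE_LOCATIONS = {"10.0.0.5": "data center A1"}


def split_initial_alarm(raw: str, sep: str = "|") -> dict:
    """Split a raw alarm string into named initial alarm fields."""
    parts = [p.strip() for p in raw.split(sep)]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(parts)}")
    return dict(zip(FIELD_NAMES, parts))


def supplement_fields(initial: dict) -> dict:
    """Add supplementary information to obtain the target alarm fields."""
    target = dict(initial)  # leave the initial fields untouched
    target["location"] = DEVICE_LOCATIONS.get(initial["device_ip"], "unknown")
    return target
```

Calling `supplement_fields(split_initial_alarm(raw))` on one raw message yields a dictionary of target alarm fields that a later step can write into the alarm information table.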
Step S402: and writing the target alarm field into a preset alarm information table to generate target alarm information.
Specific contents of S402: firstly, determining a target alarm field corresponding to the basic field name in the preset alarm information table, and filling the target alarm field in the preset alarm information table respectively, namely filling the target alarm field in the position corresponding to the basic field name.
Note that the process of filling in is defined in the rule document for each probe.
If the target alarm information contains several items with the same meaning, these items are unified under the corresponding field name.
The preset alarm information table is created by Omnibus in the installation process.
The preset alarm information table includes the following basic field names, as shown in table 1.
Table 1:
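[Table 1 appears only as images in the original publication (Figures BDA0003279610330000081, BDA0003279610330000091 and BDA0003279610330000101); the basic field names are not recoverable from the text.]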
It should be noted that the table is created by OMNIbus during installation, and operation and maintenance personnel can later modify the fields in the table through database operations.
For example: the target alarm information comprises IP address information, and the target alarm information is determined to correspond to the Node field name in the preset alarm information table, so that the IP address information is filled to the position corresponding to the Node field name; or, the target alarm information further includes alarm content, and the alarm content is determined to correspond to the Summary field name in the preset alarm information table, so that the alarm content is filled in the position corresponding to the Summary field name.
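The filling step of S402 can be illustrated with a short sketch. The column names `Node` and `Summary` come from the example above; the source-side field names and the alias handling are assumptions made for illustration and do not reproduce the probe rule documents.

```python
# Mapping from target alarm field names to basic field names of the preset
# alarm information table. "Node" and "Summary" come from the example in the
# text; the keys on the left are hypothetical.
FIELD_TO_COLUMN = {
    "device_ip": "Node",
    "ip_address": "Node",          # same meaning, unified under Node
    "alarm_description": "Summary",
    "alarm_content": "Summary",    # same meaning, unified under Summary
}


def build_alarm_row(target_fields: dict) -> dict:
    """Fill each target alarm field into the position of its matching column."""
    row = {}
    for name, value in target_fields.items():
        column = FIELD_TO_COLUMN.get(name)
        if column is None:
            continue  # no matching basic field name in this sketch
        row.setdefault(column, value)  # first value wins when aliases collide
    return row
```

For instance, `build_alarm_row({"ip_address": "10.0.0.5", "alarm_content": "Drive 3 offline"})` produces a row whose `Node` and `Summary` positions are filled.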
Accordingly, based on the contents shown in the above steps S401 to S402, the embodiment of the present invention also corresponds to the corresponding data flow diagram shown in fig. 5.
Step S403: and determining the alarm level of the standardized target alarm information.
Specific contents of S403: the target alarm information is standardized to obtain standardized target alarm information; the target alarm information is identified, and the numerical value corresponding to the target alarm information is compared with a first threshold, a second threshold, a third threshold and a fourth threshold. When the numerical value is smaller than the first threshold, the alarm level of the target alarm information is determined to be level zero; when the numerical value is larger than the first threshold, the alarm level is determined to be level two; when the numerical value is larger than the second threshold, the alarm level is determined to be level three; when the numerical value is larger than the third threshold, the alarm level is determined to be level four; and when the numerical value is larger than the fourth threshold, the alarm level is determined to be level five.
Optionally, when it is determined that the target alarm information cannot be identified, the alarm level of the target alarm information is determined to be level one.
It should be noted that the first threshold is smaller than the second threshold, the second threshold is smaller than the third threshold, and the third threshold is smaller than the fourth threshold.
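The level determination of S403 can be sketched as a simple threshold comparison. Two points are assumptions of this sketch rather than statements of the text: the highest threshold exceeded decides the level (so a value above the second threshold is level three, not level two), and a value exactly equal to the first threshold, which the text leaves unspecified, is treated as level zero. Representing unidentifiable alarm information as `None` is also a hypothetical convention.

```python
from typing import Optional, Sequence


def determine_alarm_level(value: Optional[float], thresholds: Sequence[float]) -> int:
    """Return the alarm level (0 to 5) for a standardized alarm value.

    thresholds holds the first to fourth thresholds in strictly increasing
    order. A value of None stands for alarm information that cannot be
    identified.
    """
    t1, t2, t3, t4 = thresholds
    if not (t1 < t2 < t3 < t4):
        raise ValueError("thresholds must be strictly increasing")
    if value is None:
        return 1  # unidentifiable alarm information is level one
    if value > t4:
        return 5
    if value > t3:
        return 4
    if value > t2:
        return 3
    if value > t1:
        return 2
    # Assumption: value == first threshold is unspecified in the text and is
    # treated as level zero here.
    return 0
```

Checking the thresholds from highest to lowest keeps the overlapping conditions ("larger than the first threshold", "larger than the second threshold", and so on) unambiguous.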
Step S404: and when the alarm level of the target alarm information is determined, executing operation corresponding to the alarm level.
In implementing step S404, when the alarm level of the target alarm information is determined to be greater than or equal to level two, the corresponding alarm operation is executed based on the alarm level to prompt operation and maintenance personnel; when the alarm level of the target alarm information is determined to be lower than level two, no alarm operation is executed.
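The rule of S404 can be sketched as a small dispatch function. The `notify` callback is a hypothetical stand-in for whatever channel prompts operation and maintenance personnel (for example mail or short message); the message format is likewise illustrative.

```python
def handle_alarm(level: int, summary: str, notify=print) -> bool:
    """Run the alarm operation for level two and above.

    Returns True if an alarm operation was executed, False otherwise.
    """
    if level >= 2:
        notify(f"[level {level}] {summary}")
        return True
    return False  # levels zero and one trigger no alarm operation
```

Passing a list's `append` method as `notify` is a convenient way to capture the prompts when trying the function out.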
It can be understood that the alarms of different levels correspond to different states of the device and correspond to different alarm notification modes, such as the emergency degree of alarm processing and whether human intervention is needed. As shown in table 2 below.
Table 2:
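[Table 2 appears only as an image in the original publication (Figure BDA0003279610330000111); its contents are not recoverable from the text.]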
Optionally, a color corresponding to the alarm level is displayed for the operation and maintenance staff to view.
Specifically, when the alarm level of the target alarm information is level zero, the displayed color is green; when it is level one, the displayed color is purple; when it is level two, the displayed color is blue; when it is level three, the displayed color is yellow; when it is level four, the displayed color is orange; and when it is level five, the displayed color is red.
It can be seen that alarm levels are distinguished by color. The colors assigned to the levels may be changed according to the actual situation, which is not limited in the embodiment of the present invention.
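The correspondence between alarm level and display color can be written as a lookup table. The default colors below are the ones listed in this embodiment; the `overrides` parameter is a hypothetical way of modelling the statement that the colors may be modified.

```python
from typing import Dict, Optional

# Default colors per alarm level, as listed in this embodiment.
LEVEL_COLORS: Dict[int, str] = {
    0: "green", 1: "purple", 2: "blue",
    3: "yellow", 4: "orange", 5: "red",
}


def color_for_level(level: int, overrides: Optional[Dict[int, str]] = None) -> str:
    """Return the display color for an alarm level, honoring any overrides."""
    if overrides and level in overrides:
        return overrides[level]
    return LEVEL_COLORS[level]  # raises KeyError for levels outside 0 to 5
```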
In the embodiment of the invention, manual status inspection is no longer needed. When any component of any equipment at any site fails, the initial alarm information sent by the main alarm server at that site is received and processed to obtain a target alarm field, and the required target alarm field is written into a preset alarm information table to generate target alarm information; the corresponding operation is then executed according to the alarm level of the target alarm information. This not only reduces operation and maintenance costs but also allows alarms to be analyzed quickly.
The method for managing operation and maintenance of devices shown based on the above embodiment of the present invention, with reference to fig. 4 and fig. 6, further includes:
Step S601: when it is determined that the master alarm server is abnormal, receive the initial alarm information sent by the slave alarm server, and execute step S401.
In the embodiment of the invention, the master alarm service is set to master mode (Master) and the slave alarm service is set to slave mode (slave) in their respective properties files. In the specific implementation of step S601, under normal conditions the master-mode Trap probe receives the alarm information of the equipment and stores it in the Omnibus database; when the master alarm server is determined to be abnormal, the slave-mode probe also receives the alarm information and, at that point, stores the alarms in a file cache.
It should be noted that the abnormality of the main alarm server means that the probe of the main alarm server loses heartbeat information.
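The master/slave switch-over described above, where the master alarm server counts as abnormal once its probe loses heartbeat information, might be modeled as in the sketch below. The class `ProbeHeartbeat`, the function `select_alarm_source` and the 30-second timeout are all assumptions made for illustration; the original does not specify a timeout or any interface.

```python
import time
from typing import Optional

# Assumed value; the source does not state how long a heartbeat may be missing.
HEARTBEAT_TIMEOUT_S = 30.0


class ProbeHeartbeat:
    """Tracks the last heartbeat seen from the master alarm server's probe."""

    def __init__(self, timeout_s: float = HEARTBEAT_TIMEOUT_S) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Optional[float] = None

    def beat(self, now: Optional[float] = None) -> None:
        """Record a heartbeat; `now` can be injected to make tests deterministic."""
        self.last_seen = time.monotonic() if now is None else now

    def is_lost(self, now: Optional[float] = None) -> bool:
        """The heartbeat is lost if none was ever seen or the last one is too old."""
        current = time.monotonic() if now is None else now
        return self.last_seen is None or (current - self.last_seen) > self.timeout_s


def select_alarm_source(master: ProbeHeartbeat, now: Optional[float] = None) -> str:
    """Return 'slave' when the master is abnormal (step S601), else 'master'."""
    return "slave" if master.is_lost(now) else "master"
```

A monotonic clock is used so that wall-clock adjustments cannot make a healthy master look abnormal.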
In the embodiment of the invention, manual status inspection is no longer needed. When any component of any equipment at any site fails and the main alarm server is determined to be abnormal, the initial alarm information sent by the slave alarm server is received and processed to obtain a target alarm field, and the required target alarm field is written into a preset alarm information table to generate target alarm information; the corresponding operation is then executed according to the alarm level of the target alarm information. This not only reduces operation and maintenance costs but also allows alarms to be analyzed quickly.
Optionally, based on the device operation and maintenance management method shown in the foregoing embodiment of the present invention, before performing the splitting process on the initial alarm information in step S401, the method further includes:
Step S11: initialize the initial alarm information to obtain the processed initial alarm information.
Specifically, in step S11, information filtering and information compression are performed on the collected alarm information to obtain the processed initial alarm information.
The above-described processes such as information filtering and information compression are steps for initializing the initial alarm information.
In the embodiment of the present invention, the process of filtering the collected alarm information is specifically as follows: initial alarm information meeting the preset conditions is masked and filtered out, so that information the monitoring personnel do not consider important is removed from the initial alarm information extracted from the bottom layer. This reduces interference from minor alarms and improves monitoring and handling efficiency.
It should be noted that the preset condition is a condition that the operation and maintenance staff set in the rule file of the corresponding collector according to the alarm object, the alarm level, the alarm content, or a combination of the three.
For example: up/down events of switch ports connecting access equipment are filtered through correlation processing; by setting the alarm content, events that need no attention are filtered directly in the default strategy; by setting the alarm object, events of certain monitored objects that need no attention are filtered in a specific monitoring strategy.
Optionally, the event filtering policy in the Omnibus core database allows multiple conditions to be combined, so the filtering settings are flexible, and different filtering conditions can be set as required to mask information resource management events that need no attention.
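Filtering by alarm object, alarm level, alarm content, or a combination of them can be sketched as composable predicates. This is only an illustration under stated assumptions: the `Alarm` record, the helper names and the sample device names are hypothetical, and real Omnibus filtering is configured in probe rule files rather than in Python.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Alarm:
    obj: str      # alarm object, e.g. a device or port name (hypothetical field)
    level: int    # alarm level
    content: str  # alarm content / description


Condition = Callable[[Alarm], bool]


def by_object(name: str) -> Condition:
    return lambda a: a.obj == name


def below_level(level: int) -> Condition:
    return lambda a: a.level < level


def content_contains(text: str) -> Condition:
    return lambda a: text in a.content


def all_of(*conds: Condition) -> Condition:
    """Combine several conditions; the result matches only if all of them match."""
    return lambda a: all(c(a) for c in conds)


def filter_alarms(alarms: Iterable[Alarm], suppress: Iterable[Condition]) -> List[Alarm]:
    """Drop every alarm that matches at least one suppression condition."""
    rules = list(suppress)
    return [a for a in alarms if not any(rule(a) for rule in rules)]
```

For instance, `all_of(by_object("switch-1"), content_contains("link up"))` suppresses port up events of one switch while leaving its other alarms visible.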
In the embodiment of the present invention, the specific process of compressing the collected alarm information is as follows: when the same alarm occurs repeatedly, the monitoring tool repeatedly sends the same alarm to the active in-memory database Omnibus. Based on the De-duplication automatic compression function of Omnibus, identical alarms are compressed into one alarm, and only the repetition count, the last occurrence time and the alarm description are updated. Specifically, after Omnibus receives an alarm through a Probe, the alarm's unique identification field, the Identifier, is first determined as defined in the rule file of that Probe. When the Identifier fields of two alarms are the same, the system regards them as the same alarm and compresses them.
It should be noted that the fields of the Identifier are formed by combining several fields, and the different types of alarm identifiers have different combination modes.
For example: the fields of the Identifier of a mainframe alarm comprise $ClassName, $adapter_host, $msg_id and $msg; the fields of the Identifier of a performance alarm comprise @Node, $KeyType, @Summary, @AlertGroup and @AlertKey; the fields of the Identifier of a library alarm comprise $ClassName, @Node, $KeyType, @Summary, @AlertGroup and @AlertKey; the fields of the Identifier of a storage and optical cross-talk alarm comprise @Node, $KeyType, @Summary, @AlertGroup and @AlertKey.
It should be further noted that the combination of the Identifier fields is defined in the Rules file of each Probe at the time of alarm reception, and belongs to a part of alarm standardization.
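How an Identifier could be composed from a type-specific combination of fields is sketched below for two of the alarm types listed above. The dictionary `IDENTIFIER_FIELDS`, the function `build_identifier`, the `|` separator and the dropping of the `$`/`@` prefixes are assumptions for illustration; in practice the combination is written in each Probe's rule file.

```python
from typing import Dict, Mapping, Tuple

# Field combinations taken from the examples in the text, without the $/@ prefixes.
IDENTIFIER_FIELDS: Dict[str, Tuple[str, ...]] = {
    "mainframe": ("ClassName", "adapter_host", "msg_id", "msg"),
    "performance": ("Node", "KeyType", "Summary", "AlertGroup", "AlertKey"),
}

SEPARATOR = "|"  # assumed separator; the source does not specify one


def build_identifier(alarm_type: str, fields: Mapping[str, object]) -> str:
    """Join the type-specific fields into one string that identifies the alarm.

    Raises KeyError if the alarm type is unknown or a required field is missing.
    """
    names = IDENTIFIER_FIELDS[alarm_type]
    return SEPARATOR.join(str(fields[name]) for name in names)
```

Two alarms that differ only in fields outside the combination, such as the occurrence time, yield the same Identifier and are therefore candidates for compression.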
It will be appreciated that the active memory database Omnibus processes the duplicate event compression faster than inserting a new alarm message because the probe will use the identification field to uniquely identify the fault while collecting the fault, so that the duplicate event compression can be done quickly when the fault is sent to the Omnibus core database.
The active memory database Omnibus can flexibly set identification fields according to specific network event types aiming at different network/system management software of different data centers and probes of different event sources. The rules of information compression are customized in the probe in advance, and when a fault is sent to the core database, the compression of repeated events can be completed quickly. All probe rule files (rulesfile) provided by Omnibus include definitions of compression rules.
Further, it should be noted that the customized modification of the compression rule can be realized by modifying the rulesfile.
For example: after Omnibus receives an alarm through probes, the unique identifier of the alarm is firstly determined by defining in the rule file of each Probe: an Identifier field. When the Identifier fields are the same, the system will consider the alarm as the same alarm and can compress the alarm. The Identifier field is formed by combining a plurality of fields, and the different types of alarm identifiers have different combination modes.
Optionally, in the process of information compression, a Trigger (complement Trigger) is defined in the Automation. When the Identifiers of two alarms are the same, the new alarm is discarded and the following fields of the old alarm are updated:
@Tally = @Tally + 1 // number of repetitions + 1;
@LastOccurrence = @EventTime // the last occurrence time equals the occurrence time of the latest alarm;
@Summary // the alarm description is updated to the new alarm description.
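The compression rule above (discard the new alarm, then update the repetition count, last occurrence time and description of the old one) can be illustrated with a small in-memory table keyed by Identifier. This is a sketch under assumptions: `AlarmRow` and `AlarmTable` are hypothetical names, and a Python dictionary stands in for the Omnibus database and its trigger.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AlarmRow:
    identifier: str
    summary: str
    first_occurrence: float
    last_occurrence: float
    tally: int = 1  # number of times this alarm has been received


class AlarmTable:
    """Toy stand-in for the alarm table: one row per Identifier."""

    def __init__(self) -> None:
        self.rows: Dict[str, AlarmRow] = {}

    def insert(self, identifier: str, summary: str, event_time: float) -> AlarmRow:
        row = self.rows.get(identifier)
        if row is None:
            # First occurrence: store a new row.
            row = AlarmRow(identifier, summary, event_time, event_time)
            self.rows[identifier] = row
        else:
            # Duplicate: the new alarm is not stored; the old row is updated.
            row.tally += 1
            row.last_occurrence = event_time
            row.summary = summary
        return row
```

Because a duplicate only touches three fields of an existing row, handling it is cheaper than inserting a new alarm, which matches the rationale given in the text.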
In the embodiment of the invention, manual status inspection is no longer needed. When any component of any equipment at any site fails, the initial alarm information sent by the main alarm server at that site is received and first initialized; the initialized initial alarm information is then processed to obtain a target alarm field, and the required target alarm field is written into a preset alarm information table to generate target alarm information; the corresponding operation is then executed according to the alarm level of the target alarm information. This not only reduces operation and maintenance costs but also allows alarms to be analyzed quickly.
Based on the above-mentioned device operation and maintenance management method shown in the embodiment of the present invention, in the process of performing processing based on the initial alarm information and determining a target alarm field in the specific implementation step S401, the method includes:
Step S21: split the initial alarm information to obtain an initial alarm field.
In the process of implementing step S21 specifically, the alarm management module receives the initial alarm information corresponding to the alarm by using the SnmpTrap Probe, and completes the work of splitting the alarm information while receiving the initial alarm information, so as to obtain an initial alarm field.
Step S22: supplement the initial alarm field with additional information to obtain a target alarm field.
In the embodiment of the present invention, before the alarm information is written into the preset alarm information table, i.e. the alert status table, if the original alarm information lacks certain contents that are very meaningful for alarm handling, the initial alarm field is supplemented with that information. Specifically, in step S22: in the componentized programming rules, the information in an external table can be referenced directly by keyword index, the required values are assigned to fields in the initial alarm field, and the target alarm field is thereby determined, as shown in fig. 7, so that it can be written into the preset alarm information table.
Note that the path and file name of the external table are defined in the rules file so that the table can be referenced.
The external table is the default information supplement method of Omnibus products: rich information is stored in a lookup text file, with the items of each record separated by TAB characters, and this file is called the external table.
The external table contains the Chinese names, equipment types, locations, contacts, maintenance providers, equipment models, key resource information, related services and the like of the related event nodes. When a fault alarm is received, this information helps managers quickly learn which resources, personnel and services are related to the fault and respond quickly.
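The information supplement step, in which a TAB-separated lookup text file is indexed by a key and the matching values are assigned to alarm fields, can be sketched as follows. The functions `load_lookup` and `enrich`, the column names and the choice of `Node` as the key are hypothetical; the layout of a real lookup file is defined by the rules file that references it.

```python
from typing import Dict, List, Mapping, Sequence


def load_lookup(text: str) -> Dict[str, List[str]]:
    """Parse TAB-separated lines into {key: [value, ...]}; the first item is the key."""
    table: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        table[parts[0]] = parts[1:]
    return table


def enrich(
    alarm: Mapping[str, str],
    table: Mapping[str, Sequence[str]],
    columns: Sequence[str],
    key_field: str = "Node",
) -> Dict[str, str]:
    """Return a copy of the alarm with lookup values added under `columns`.

    If the key is absent from the table, the alarm is returned unchanged.
    """
    enriched = dict(alarm)
    values = table.get(alarm.get(key_field, ""))
    if values is not None:
        enriched.update(zip(columns, values))
    return enriched
```

Keeping the supplementary data in a separate text file means new fields such as location or contact can be added without changing the alarm source.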
It should be further noted that the related event node can be flexibly expanded, and hundreds of fields required for management are added, and the related information is associated with the building.
Optionally, in addition to the management structure, the display can also be customized: the added extension fields can be included in the display content of an event, and different display formats can be flexibly defined for different event classifications. For example, an event on a switch-related line may display the node, port, drop device, contact, line number, etc., and a performance event may display the node, performance parameter, current performance value, device location, contact, etc., so that the operation and maintenance management personnel can conveniently monitor and check the fault information.
In the embodiment of the invention, manual status inspection is no longer needed. When any component of any equipment at any site fails, the initial alarm information sent by the main alarm server at that site is received and split to obtain the initial alarm field; the initial alarm field is then supplemented with information to obtain a target alarm field; the required target alarm field is written into a preset alarm information table to generate target alarm information; the corresponding operation is then executed according to the alarm level of the target alarm information. This not only reduces operation and maintenance costs but also allows alarms to be analyzed quickly.
Corresponding to the device operation and maintenance management method shown in the above embodiment of the present invention, an embodiment of the present invention also discloses a device operation and maintenance management apparatus. Fig. 8 is a schematic structural diagram of the apparatus, which includes:
The processing unit 801 is configured to, when initial alarm information sent by the master alarm server is received, perform splitting processing based on the initial alarm information to obtain a target alarm field.
Optionally, the processing unit 801 is further configured to: when it is determined that the master alarm server is abnormal, receive the initial alarm information sent by the slave alarm server.
A generating unit 802, configured to write the target alarm field into a preset alarm information table, and generate target alarm information.
The determining unit 803 is configured to determine the alarm level of the standardized target alarm information.
The execution unit 804 is configured to execute the operation corresponding to the alarm level once the alarm level of the target alarm information is determined.
Optionally, the execution unit 804 is specifically configured to: when the alarm level of the target alarm information is determined to be greater than or equal to the second level, executing corresponding alarm operation based on the alarm level to prompt operation and maintenance personnel; and when the alarm level of the target alarm information is determined to be less than the second level, not executing alarm operation.
It should be noted that, the specific principle and the execution process of each unit in the device operation and maintenance management apparatus disclosed in the foregoing embodiment of the present invention are the same as the device operation and maintenance management method implemented in the foregoing embodiment of the present invention, and reference may be made to corresponding parts in the device operation and maintenance management method disclosed in the foregoing embodiment of the present invention, and details are not described here again.
In the embodiment of the invention, manual status inspection is no longer needed. When any component of any equipment at any site fails, the initial alarm information sent by the main alarm server at that site is received and processed to obtain a target alarm field, and the required target alarm field is written into a preset alarm information table to generate target alarm information; the corresponding operation is then executed according to the alarm level of the target alarm information. This not only reduces operation and maintenance costs but also allows alarms to be analyzed quickly.
Based on the device operation and maintenance management apparatus shown in the above embodiment of the present invention, the processing unit 801 that performs splitting processing based on the initial alarm information to obtain a target alarm field is specifically configured to: splitting the initial alarm information to obtain an initial alarm field; and performing information supplement on the initial alarm field to obtain a target alarm field.
The determining unit 803 is specifically configured to: standardize the target alarm information to obtain standardized target alarm information; identify the target alarm information, and compare the numerical value corresponding to the target alarm information with a first threshold, a second threshold, a third threshold and a fourth threshold, wherein the first threshold is smaller than the second threshold, the second threshold is smaller than the third threshold, and the third threshold is smaller than the fourth threshold; when the numerical value corresponding to the target alarm information is smaller than the first threshold, determine that the alarm level of the target alarm information is level zero; when the numerical value is larger than the first threshold, determine that the alarm level is level two; when the numerical value is larger than the second threshold, determine that the alarm level is level three; when the numerical value is larger than the third threshold, determine that the alarm level is level four; and when the numerical value is larger than the fourth threshold, determine that the alarm level is level five.
Optionally, the determining unit 803 is further configured to determine that the alarm level of the target alarm information is level one when it is determined that the target alarm information cannot be identified.
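The threshold comparison performed by the determining unit can be sketched as a single function. Several points are assumptions rather than statements of the original: the name `determine_level` is hypothetical, unidentifiable alarm information is modeled as `None`, the overlapping "larger than" conditions are read as "the highest threshold exceeded decides the level", and a value exactly equal to the first threshold, which the text leaves unspecified, is treated as level two.

```python
from typing import Optional, Sequence


def determine_level(value: Optional[float], thresholds: Sequence[float]) -> int:
    """Map a numerical alarm value to an alarm level from 0 to 5.

    `thresholds` holds the first to fourth thresholds in strictly increasing order.
    `None` stands for alarm information that cannot be identified (level one).
    """
    if value is None:
        return 1
    t1, t2, t3, t4 = thresholds
    if not (t1 < t2 < t3 < t4):
        raise ValueError("thresholds must be strictly increasing")
    if value < t1:
        return 0
    # Check from the highest threshold down so the largest one exceeded wins.
    if value > t4:
        return 5
    if value > t3:
        return 4
    if value > t2:
        return 3
    # Assumption: anything from t1 up to and including t2 is level two.
    return 2
```

Combined with the rule that alarm operations run only at the second level or above, levels zero and one produce no notification under this reading.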
In the embodiment of the invention, manual status inspection is no longer needed. When any component of any equipment at any site fails, the initial alarm information sent by the main alarm server at that site is received and processed to obtain a target alarm field, and the required target alarm field is written into a preset alarm information table to generate target alarm information; the corresponding operation is then executed according to the alarm level of the target alarm information. This not only reduces operation and maintenance costs but also allows alarms to be analyzed quickly.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An equipment operation and maintenance management method is characterized by comprising the following steps:
when initial alarm information sent by a main alarm server is received, splitting processing is carried out based on the initial alarm information to obtain a target alarm field;
writing the target alarm field into a preset alarm information table to generate target alarm information;
determining the alarm level of the standardized target alarm information;
and when the alarm level of the target alarm information is determined, executing operation corresponding to the alarm level.
2. The method of claim 1, wherein the performing the splitting process based on the initial alarm information to obtain a target alarm field comprises:
splitting the initial alarm information to obtain an initial alarm field;
and performing information supplement on the initial alarm field to obtain a target alarm field.
3. The method of claim 1, wherein the determining the alarm level of the standardized target alarm information comprises:
standardizing the target alarm information to obtain standardized target alarm information;
identifying the target alarm information, and determining the values corresponding to the target alarm information and the sizes of a first threshold, a second threshold, a third threshold and a fourth threshold, wherein the first threshold is smaller than the second threshold, the second threshold is smaller than the third threshold, and the third threshold is smaller than the fourth threshold;
when the numerical value corresponding to the target alarm information is smaller than a first threshold value, determining that the alarm level of the target alarm information is zero level;
when the numerical value corresponding to the target alarm information is determined to be larger than a first threshold value, determining the alarm level of the target alarm information to be two levels;
when the numerical value corresponding to the target alarm information is determined to be larger than a second threshold value, determining the alarm level of the target alarm information to be three levels;
when the numerical value corresponding to the target alarm information is determined to be larger than a third threshold value, determining that the alarm level of the target alarm information is four levels;
and when the numerical value corresponding to the target alarm information is determined to be larger than a fourth threshold value, determining that the alarm level of the target alarm information is five levels.
4. The method of claim 3, further comprising:
and when the target alarm information cannot be identified, determining that the alarm level of the target information is a first level.
5. The method according to claims 3 and 4, wherein when determining the alarm level of the target alarm information, performing an operation corresponding to the alarm level includes:
when the alarm level of the target alarm information is determined to be greater than or equal to the second level, executing corresponding alarm operation based on the alarm level to prompt operation and maintenance personnel;
and when the alarm level of the target alarm information is determined to be less than the second level, not executing alarm operation.
6. The method of claim 1, further comprising:
and when the master alarm server is determined to be abnormal, receiving initial alarm information sent by the slave alarm server, and executing splitting processing based on the initial alarm information to obtain a target alarm field.
7. An apparatus for managing operation and maintenance of a device, the apparatus comprising:
the processing unit is used for splitting based on initial alarm information when the initial alarm information sent by the main alarm server is received to obtain a target alarm field;
the generating unit is used for writing the target alarm field into a preset alarm information table to generate target alarm information;
the determining unit is used for determining the alarm level of the standardized target alarm information;
and the execution unit is used for executing the operation corresponding to the alarm level when the alarm level of the target alarm information is determined.
8. The apparatus according to claim 7, wherein the processing unit that performs splitting processing based on the initial alarm information to obtain a target alarm field is specifically configured to: splitting the initial alarm information to obtain an initial alarm field; and performing information supplement on the initial alarm field to obtain a target alarm field.
9. The apparatus according to claim 7, wherein the determining unit is specifically configured to: standardizing the target alarm information to obtain standardized target alarm information; identifying the target alarm information, and determining the values corresponding to the target alarm information and the sizes of a first threshold, a second threshold, a third threshold and a fourth threshold, wherein the first threshold is smaller than the second threshold, the second threshold is smaller than the third threshold, and the third threshold is smaller than the fourth threshold; when the numerical value corresponding to the target alarm information is smaller than a first threshold value, determining that the alarm level of the target alarm information is zero level; when the numerical value corresponding to the target alarm information is determined to be larger than a first threshold value, determining the alarm level of the target alarm information to be two levels; when the numerical value corresponding to the target alarm information is determined to be larger than a second threshold value, determining the alarm level of the target alarm information to be three levels; when the numerical value corresponding to the target alarm information is determined to be larger than a third threshold value, determining that the alarm level of the target alarm information is four levels; and when the numerical value corresponding to the target alarm information is determined to be larger than a fourth threshold value, determining that the alarm level of the target alarm information is five levels.
10. The apparatus of claim 7, wherein the processing unit is further configured to: when it is determined that the master alarm server is abnormal, receive the initial alarm information sent by the slave alarm server.
CN202111128350.2A 2021-09-26 2021-09-26 Equipment operation and maintenance management method and device Pending CN113821412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111128350.2A CN113821412A (en) 2021-09-26 2021-09-26 Equipment operation and maintenance management method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111128350.2A CN113821412A (en) 2021-09-26 2021-09-26 Equipment operation and maintenance management method and device

Publications (1)

Publication Number Publication Date
CN113821412A true CN113821412A (en) 2021-12-21

Family

ID=78921323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111128350.2A Pending CN113821412A (en) 2021-09-26 2021-09-26 Equipment operation and maintenance management method and device

Country Status (1)

Country Link
CN (1) CN113821412A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104243236A (en) * 2014-09-17 2014-12-24 深圳供电局有限公司 Method, system and servers for analyzing monitoring system operation and maintenance alarm data
CN104753712A (en) * 2013-12-31 2015-07-01 中国移动通信集团公司 Alarming report method, alarming report node and alarming report system
CN108829558A (en) * 2018-05-22 2018-11-16 郑州云海信息技术有限公司 A kind of intelligent operation management method and system of data center's alarm
CN109194532A (en) * 2018-11-07 2019-01-11 广东电网有限责任公司 A kind of method for pushing and device of power grid warning information
CN109639465A (en) * 2018-11-27 2019-04-16 平安科技(深圳)有限公司 Warning information storage method and device based on cloud platform
WO2020215894A1 (en) * 2019-04-25 2020-10-29 深圳前海微众银行股份有限公司 Alarm method, device and system
CN112598205A (en) * 2019-09-17 2021-04-02 北京国双科技有限公司 Alarm information processing method and device, storage medium and electronic equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114840219A (en) * 2022-07-06 2022-08-02 湖南傲思软件股份有限公司 Distributed event processing system
CN114840219B (en) * 2022-07-06 2023-05-05 湖南傲思软件股份有限公司 Distributed event processing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination