CN110275795A - Alarm-based O&M method and device - Google Patents

Info

Publication number
CN110275795A
Authority
CN
China
Prior art keywords
server
information
treatment measures
alarm
warning information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910579323.3A
Other languages
Chinese (zh)
Inventor
卢道和
杨军
程志峰
胡仲臣
周佳振
李兴龙
汪晓雪
陈刚
罗海湾
李勋棋
周琪
郭英亚
朱嘉伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN201910579323.3A
Publication of CN110275795A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/20 Administration of product repair or maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/80 Database-specific techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention is applicable to the field of financial technology and discloses an alarm-based operation and maintenance (O&M) method and device. The method includes: receiving alarm information of a first server; determining, according to the alarm type of the alarm information, whether the first server is a faulty server; if so, generating a first treatment measure for the first server according to the alarm information; judging, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure satisfies a high-availability condition; and, if so, executing the first treatment measure on the first server. This technical solution provides an automated handling scheme for abnormalities of a financial service system while ensuring that the system keeps running normally.

Description

Alarm-based O&M method and device
Technical field
Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to an alarm-based O&M method and device.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). O&M technology is no exception; however, the security and real-time requirements of the financial and payment industries place higher demands on the technology.
In the prior art, when a monitoring system detects an abnormality in a financial service system, it generates alarm information and notifies the relevant staff, who then handle the financial service system manually according to the alarm information, for example by restarting or shutting down a server in the system. This approach is not only inefficient but may also affect the normal operation of the entire financial business.
Summary of the invention
Embodiments of the present invention provide an alarm-based O&M method and device, which provide an automated handling scheme for abnormalities of a financial service system while ensuring the normal operation of that system.
An alarm-based O&M method provided by an embodiment of the present invention includes:
receiving alarm information of a first server;
determining, according to the alarm type of the alarm information, whether the first server is a faulty server, and if so, generating a first treatment measure for the first server according to the alarm information;
judging, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure satisfies a high-availability condition, where the operation information of the system includes the operation information of each server in the system, the resource information of the system includes the hardware information of each server in the system, and the high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index; and
if the high-availability condition is satisfied, executing the first treatment measure on the first server.
In the above technical solution, after the alarm information of the first server is received, it is first judged whether the first server is a faulty server, and the first treatment measure is generated only if it is. This avoids unnecessary O&M handling of a server, which would affect the normal operation of the whole system. Further, after the first treatment measure is generated, it is judged against the operation information and resource information of the system to which the first server belongs to determine whether it satisfies the high-availability condition; only if it does is the first treatment measure executed on the first server. This safeguards the high availability of the system and further avoids the maintenance of a single server affecting the normal operation of the whole system. Using an automated handling scheme also improves fault-handling efficiency.
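The overall flow can be sketched in a few lines of Python. This is only an illustrative sketch: the `Alarm` fields, the function `handle_alarm` and the four callables passed to it are hypothetical names chosen for the example and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alarm:
    server_id: str   # identifier of the first server
    alarm_type: str  # e.g. "minor" or "high_risk" (labels assumed for illustration)
    fault_info: str  # fault information carried by the alarm


def handle_alarm(
    alarm: Alarm,
    is_faulty: Callable[[Alarm], bool],
    generate_measure: Callable[[Alarm], str],
    meets_ha: Callable[[str, str], bool],
    execute: Callable[[str, str], None],
) -> str:
    """Run the four steps of the method and report what happened."""
    # Step 1: only faulty servers get a treatment measure.
    if not is_faulty(alarm):
        return "skipped: not a faulty server"
    # Step 2: derive the first treatment measure from the alarm.
    measure = generate_measure(alarm)
    # Step 3: check the measure against the high-availability condition.
    if not meets_ha(alarm.server_id, measure):
        return "rejected: high-availability condition not met"
    # Step 4: execute the measure on the server.
    execute(alarm.server_id, measure)
    return f"executed: {measure}"
```

Passing the individual steps in as callables keeps the sketch independent of how fault determination, measure generation and the high-availability check are actually implemented.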
Optionally, determining, according to the alarm type of the alarm information, whether the first server is a faulty server includes:
when the alarm type of the alarm information is a minor alarm, judging whether the minor alarm is a first-time alarm; if so, closing the minor alarm, recording the minor alarm in a third database, and notifying staff so that the staff set a treatment measure for the minor alarm; otherwise, closing the minor alarm and executing the treatment measure of the minor alarm; and
when the alarm type of the alarm information is a high-risk alarm, determining whether the first server is a faulty server.
In the above technical solution, after receiving the alarm information, the alarm analysis module judges whether it is a minor alarm or a high-risk alarm. If it is a minor alarm, the module judges whether it is a first-time alarm or a historical alarm: a first-time alarm is one for which no treatment measure is yet stored in the measures database, and a historical alarm is one for which the measures database already stores a treatment measure. For a first-time alarm, O&M staff must be notified to handle it manually, and the treatment measure is automatically updated to the measures database (that is, the third database); in other words, the corresponding optimization item is reported automatically. For a historical alarm, the treatment measures in the measures database are combined to generate a corresponding treatment measure, the alarm is handled automatically and closed, and O&M staff are notified of the execution result. Non-first-time minor alarms are thus handled and closed automatically, which improves O&M efficiency.
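The branch for minor alarms might look like the following sketch. The dictionary standing in for the third database, the function name `triage_minor_alarm` and the returned status strings are assumptions made for illustration. The case where an alarm has been recorded but staff have not yet defined a measure is not covered by the text, so the behavior shown for it is also an assumption.

```python
from typing import Callable, Dict, Optional


def triage_minor_alarm(
    alarm_key: str,
    measures_db: Dict[str, Optional[str]],
    notify_staff: Callable[[str], None],
    execute: Callable[[str], None],
) -> str:
    """Close a minor alarm and either report it or run its stored measure."""
    if alarm_key not in measures_db:
        # First-time alarm: record it and ask staff to define a measure.
        measures_db[alarm_key] = None
        notify_staff(alarm_key)
        return "closed_and_reported"
    measure = measures_db[alarm_key]
    if measure is None:
        # Assumption: known alarm, but staff have not set a measure yet.
        notify_staff(alarm_key)
        return "closed_awaiting_measure"
    # Historical alarm: run the stored treatment measure automatically.
    execute(measure)
    return "closed_and_executed"
```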
Optionally, determining that the first server is not a faulty server includes:
when the alarm information of the first server indicates that the data processing information of the first server does not satisfy a second preset index, determining, according to the position information of the first server in the system to which it belongs, a second server located upstream of the first server; and if the second server is determined to be a faulty server, determining that the first server is not a faulty server; or
determining that the first server is not a faulty server includes:
obtaining the version information of the first server from a first database according to the identifier of the first server in the alarm information; and if it is determined that the alarm information was caused while the release status of the first server was changing, determining that the first server is not a faulty server, where the first database records the release status of each server in the system.
The above technical solution provides, from the perspective of ensuring the high availability of a distributed system, ways of determining that the first server is not a faulty server. This avoids executing treatment measures on such servers and thereby affecting the normal operation of the whole system.
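The two exclusion rules can be illustrated with a short sketch. The data shapes are assumptions: `upstream_of` maps a server to its upstream servers, `release_status` stands in for the first database, and the flag `processing_below_threshold` and the status value `"releasing"` are hypothetical labels.

```python
from typing import Dict, List, Set


def is_not_faulty(
    server_id: str,
    alarm: Dict[str, bool],
    upstream_of: Dict[str, List[str]],
    faulty_servers: Set[str],
    release_status: Dict[str, str],
) -> bool:
    """Return True if either exclusion rule says the server is not faulty."""
    # Rule 1: processing metric misses its index and an upstream server is faulty.
    if alarm.get("processing_below_threshold", False):
        if any(up in faulty_servers for up in upstream_of.get(server_id, [])):
            return True
    # Rule 2: the alarm was raised while the server's release status was changing.
    if release_status.get(server_id) == "releasing":
        return True
    return False
```

A result of `False` only means that neither rule applied; whether the server is then treated as faulty is decided by the rest of the analysis.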
Optionally, executing the first treatment measure on the first server includes:
calling, according to the first treatment measure, the script information corresponding to the first treatment measure from a second database, so as to execute the first treatment measure on the first server, where the second database records the script information corresponding to the treatment measures for each server in the system.
In the above technical solution, the script information corresponding to the first treatment measure is stored in the second database and is obtained from there in order to execute the first treatment measure. Because the script information is defined for the treatment measures of each individual server, it is targeted and can meet the needs of different servers.
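A per-server script lookup could be modeled as below. The dictionary keyed by `(server_id, measure)` is a stand-in for the second database, and the script paths are invented placeholders; the patent does not specify a storage layout.

```python
from typing import Dict, Tuple


def find_script(scripts_db: Dict[Tuple[str, str], str], server_id: str, measure: str) -> str:
    """Look up the script registered for a given server and treatment measure."""
    try:
        return scripts_db[(server_id, measure)]
    except KeyError:
        # No script registered for this server/measure pair.
        raise LookupError(f"no script for measure {measure!r} on server {server_id!r}") from None


# Hypothetical contents of the second database.
SCRIPTS = {
    ("srv-1", "restart"): "/opt/ops/srv-1/restart.sh",
    ("srv-2", "restart"): "/opt/ops/srv-2/restart_with_drain.sh",
}
```

Keying on the server as well as the measure reflects the point made above: the same measure may map to different scripts on different servers.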
Optionally, the first treatment measure indicates that the first server is to be isolated;
and executing the first treatment measure on the first server includes:
sending a first shutdown instruction to the message middleware corresponding to the system, where the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
sending a second shutdown instruction to the first server, instructing the first server to shut down the service instances running on the first server; or
sending a third shutdown instruction to the power interface of the first server, instructing the power interface to switch off so that the first server is powered off.
In the above technical solution, the first server can be isolated so that data in the system is no longer sent to it, which avoids data loss in the system or an impact on data processing in the system.
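The three isolation options differ mainly in their target and payload, which the following sketch makes explicit. The mode names, the dictionary layout of the instructions and the `"<server>/power"` target string are assumptions for illustration; the patent does not define a message format.

```python
from typing import Dict, Optional, Union


def build_isolation_instruction(
    mode: str, server_id: str, duration_s: Optional[int] = None
) -> Dict[str, Union[str, int]]:
    """Build one of the three shutdown instructions used to isolate a server."""
    if mode == "middleware":
        # First shutdown instruction: server identifier plus isolation duration.
        if duration_s is None or duration_s <= 0:
            raise ValueError("middleware isolation needs a positive duration")
        return {
            "target": "message_middleware",
            "action": "isolate",
            "server_id": server_id,
            "duration_s": duration_s,
        }
    if mode == "service":
        # Second shutdown instruction: stop the service instances on the server.
        return {"target": server_id, "action": "stop_service_instances"}
    if mode == "power":
        # Third shutdown instruction: switch off the server's power interface.
        return {"target": f"{server_id}/power", "action": "power_off"}
    raise ValueError(f"unknown isolation mode: {mode}")
```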
Optionally, generating the first treatment measure for the first server according to the alarm information includes:
obtaining, according to the identifier of the first server in the alarm information, the historical fault information of the first server and the corresponding treatment measures from a third database, where the third database stores the historical fault information of each server in the system and the corresponding treatment measures; and
generating the first treatment measure for the first server by combining the fault information in the alarm information with the historical fault information of the first server and the corresponding treatment measures.
In the above technical solution, the treatment measure for the current fault of the first server is generated by combining the server's current fault information with its historical fault information, so the measure can resolve the current fault more effectively.
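The patent does not say how current and historical information are combined. One simple possibility, shown here purely as an assumed heuristic, is to reuse the measure most often applied to the same fault in the server's history and to fall back to a default when there is no matching history. The function name and the `"notify_staff"` default are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def generate_measure(
    fault: str,
    history: List[Tuple[str, str]],
    default: str = "notify_staff",
) -> str:
    """Pick a treatment measure for `fault` from (fault, measure) history pairs."""
    # Count how often each measure was used for this same fault.
    used = Counter(measure for past_fault, measure in history if past_fault == fault)
    if not used:
        # No matching history: fall back to the default measure.
        return default
    return used.most_common(1)[0][0]
```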
Correspondingly, an embodiment of the present invention further provides an alarm-based O&M device, including:
a receiving unit, configured to receive alarm information of a first server; and
a processing unit, configured to: determine, according to the alarm type of the alarm information, whether the first server is a faulty server, and if so, generate a first treatment measure for the first server according to the alarm information; judge, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure satisfies a high-availability condition; and, if the high-availability condition is satisfied, execute the first treatment measure on the first server;
where the operation information of the system includes the operation information of each server in the system, the resource information of the system includes the hardware information of each server in the system, and the high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index.
Optionally, the processing unit is specifically configured to:
when the alarm type of the alarm information is a minor alarm, judge whether the minor alarm is a first-time alarm; if so, close the minor alarm, record the minor alarm in a third database, and notify staff so that the staff set a treatment measure for the minor alarm; otherwise, close the minor alarm and execute the treatment measure of the minor alarm; and
when the alarm type of the alarm information is a high-risk alarm, determine whether the first server is a faulty server.
Optionally, the processing unit is specifically configured to:
when the alarm information of the first server indicates that the data processing information of the first server does not satisfy a second preset index, determine, according to the position information of the first server in the system to which it belongs, a second server located upstream of the first server; and if the second server is determined to be a faulty server, determine that the first server is not a faulty server; or
the processing unit is specifically configured to:
obtain the version information of the first server from a first database according to the identifier of the first server in the alarm information; and if it is determined that the alarm information was caused while the release status of the first server was changing, determine that the first server is not a faulty server, where the first database records the release status of each server in the system.
Optionally, the processing unit is specifically configured to:
call, according to the first treatment measure, the script information corresponding to the first treatment measure from a second database, so as to execute the first treatment measure on the first server, where the second database records the script information corresponding to the treatment measures for each server in the system.
Optionally, the first treatment measure indicates that the first server is to be isolated;
and the processing unit is specifically configured to:
send a first shutdown instruction to the message middleware corresponding to the system, where the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
send a second shutdown instruction to the first server, instructing the first server to shut down the service instances running on the first server; or
send a third shutdown instruction to the power interface of the first server, instructing the power interface to switch off so that the first server is powered off.
Optionally, the processing unit is specifically configured to:
obtain, according to the identifier of the first server in the alarm information, the historical fault information of the first server and the corresponding treatment measures from a third database, where the third database stores the historical fault information of each server in the system and the corresponding treatment measures; and
generate the first treatment measure for the first server by combining the fault information in the alarm information with the historical fault information of the first server and the corresponding treatment measures.
Correspondingly, an embodiment of the present invention further provides a computing device, including:
a memory, configured to store program instructions; and
a processor, configured to call the program instructions stored in the memory and to execute the above alarm-based O&M method according to the obtained program.
Correspondingly, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to execute the above alarm-based O&M method.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Apparently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an automated processing system provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of an alarm-based O&M method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another alarm-based O&M method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an alarm-based O&M device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. Apparently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 shows, by way of example, a system architecture to which the alarm-based O&M method provided by an embodiment of the present invention applies. The system architecture may include financial service system 100, monitoring system 200, automated processing system 300 and database system 400.
Financial service system 100 is a system for executing financial business. It is suitable for distributed scenarios; that is, financial service system 100 may include multiple clusters, each of which can run multiple service instances.
Monitoring system 200 monitors the various indicators of financial service system 100 and may include a transaction monitoring system and a resource monitoring system. The transaction monitoring system monitors the access volume, request volume and the like of the transaction business of financial service system 100. The resource monitoring system comprehensively monitors the servers, operating systems, message middleware and applications in financial service system 100, for example the CPU, memory and network card of a server. An example of a transaction monitoring system is the IMS system, and an example of a resource monitoring system is the Falcon system.
Automated processing system 300 automatically handles the alarm information of monitoring system 200: after receiving alarm information sent by monitoring system 200, it can automatically identify the category of the alarm information and handle alarm information of different categories automatically.
Database system 400 can be divided into a measures database, a version database and a configuration management database. The measures database stores the historical fault information of each server in the system and the corresponding treatment measures, that is, it is configured with the treatment measures for the relevant abnormal scenarios; an example is an SOP (Standard Operating Procedure) library. The version database stores the release status of each server in the system, such as the current version information of a server and the fact that it is currently being upgraded. The configuration management database stores and manages the configuration information of the devices in the enterprise IT architecture; it is closely linked to all service support and service delivery processes, supports the operation of these processes and realizes the value of the configuration information, while relying on the related processes to guarantee the accuracy of its data. An example is a CMDB (Configuration Management Database).
In the embodiment of the present invention, automated processing system 300 may include alarm access module 310, alarm analysis module 320, high-availability engine module 330, general automation processing module 340, specific automation processing module 350, alarm information operation module 360 and processing information notification module 370, as shown in Fig. 2.
1. Alarm access module 310
Alarm access module 310 receives all alarm information in production. Specifically, it (1) receives alarm information about basic resources such as hosts, networks and databases, which may be sent to automated processing system 300 by the Falcon system; and (2) receives business-related alarm information generated by financial service system 100, which may be sent to automated processing system 300 by the IMS system and may include the memory status of a server, thread pool usage and so on.
2. Alarm analysis module 320
Alarm analysis module 320 analyzes and matches the input alarm information and provides an alarm judgment and a corresponding operation suggestion. Optionally, it matches the input alarm information against the relevant SOP and judges the impact level and handling method of the alarm by combining historical alarms with the release information of the current system.
The embodiment of the present invention can be executed in two steps: SOP library query and historical alarm matching. SOP library query: query the business system's fault pre-handling mechanism library, which is equivalent to the measures database, and match the corresponding fault handling steps. Historical alarm matching: after the SOP library query, analyze the historical alarm information of the business system, including the most recent alarm, alarms within the last seven days and the version status within the last three days, and provide a corresponding treatment measure in combination with the handling method recorded in the SOP library and the current health status (a comprehensive score covering the host, network, database, JVM and so on).
3. High-availability engine module 330
High-availability engine module 330 ensures the high availability of financial service system 100 during automatic fault handling in a distributed scenario. For example, during fault handling, several business systems may fail at the same time, or several instances in one system may fail. The automated handling of the fault must then guarantee the high availability of the system while handling the fault quickly, and with the assistance of high-availability engine module 330 it is executed in the optimal order of steps.
High-availability engine module 330 determines whether the treatment measure generated by alarm analysis module 320 is feasible by calculating in real time the number of instances in each cluster of each system, the approximate location of the fault point while the fault is occurring, and the state of the system's basic resources. For example, it can judge the state of network, database and host resources: when a host of the system fails, it judges whether the system would still meet the system health requirements if the instance were stopped right now, and thereby decides whether to isolate the instance.
In the embodiment of the present invention, high-availability engine module 330 can assess the operation information of financial service system 100 before and after a treatment measure is executed, such as throughput, the time taken to process business data and resource consumption, and thereby determine whether to execute the treatment measure.
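As a minimal illustration of such a check, the sketch below projects the throughput of a cluster after some instances are taken out of service and compares it with a minimum required value, which plays the role of the first preset index. Using throughput alone, assuming equal capacity per instance, and the parameter names are all simplifications made for this example.

```python
def meets_ha_condition(
    healthy_instances: int,
    instances_removed: int,
    throughput_per_instance: float,
    min_throughput: float,
) -> bool:
    """Check whether projected throughput after the measure stays at or above the minimum."""
    # Instances that would still be serving after the measure is executed.
    remaining = max(healthy_instances - instances_removed, 0)
    # Assumption: every instance contributes the same throughput.
    projected = remaining * throughput_per_instance
    return projected >= min_throughput
```

For example, isolating one of four instances that each handle 100 requests per second leaves 300, which passes a minimum of 250; isolating one of two leaves 100, which does not.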
4. General automation processing module 340
General automation processing module 340 handles faults automatically in general scenarios. In a unified development environment, restart, isolation, stop and version rollback are similar across most systems, so this module can execute these four types of operation uniformly for system abnormalities, which reduces the risk caused by inconsistent scripts and improves resolution efficiency. The module mainly targets basic-resource abnormalities, for example isolating the service instances of a faulty server when messages are congested, or running a health check on a system when its transaction volume is abnormal.
5. Specific automation processing module 350
Specific automation processing module 350 manages the automated handling scripts for certain special systems and for fault scenarios peculiar to particular kinds of server. For systems that are not unified and for peculiar fault scenarios, the corresponding automation scripts for checking and handling can be configured through this module.
It should be noted that both general automation processing module 340 and specific automation processing module 350 are provided with corresponding high-availability check code, which ensures that an automation script performs one comprehensive confirmation check of the system before it is executed.
6. Alarm information operation module 360
Alarm information operation module 360 provides unified management and data operation of all kinds of alarm information. At present, alarm data of all kinds is numerous and disorderly and contains many potential hidden dangers. This module mines daily alarm information and compares it with historical data using big-data analysis, then classifies and organizes the daily alarm information and provides it to staff, who update the corresponding SOP library. This facilitates subsequent automated O&M work and closes the loop of the whole automated O&M process.
7. Processing information notification module 370
Processing information notification module 370 notifies each O&M staff member of all kinds of alarm information and of the results of automated handling, so that staff stay up to date with the system status.
Based on foregoing description, Fig. 3 illustratively shows a kind of O&M side based on alarm provided in an embodiment of the present invention The process of method, the process can be executed by the O&M device based on alarm.
As shown in figure 3, the process specifically includes:
Step 301, the warning information of first server is received.
Some server in the first server, that is, financial service system, the warning information can be monitoring system and monitor And be sent to automated programming system.May include in the warning information of first server first server mark, first The fault message etc. of server.
Step 302, according to the alarm type of warning information, determine whether first server is failed server, if so, The first treatment measures for being directed to first server are generated according to warning information.
The alarm type of warning information is divided into minor alarm and high-risk alarm.When the alarm type of warning information is general accuses When alert, judge whether minor alarm is to alert for the first time, if so, closing minor alarm, and minor alarm is updated to third number According to library, i.e. action data library, staff is notified so that the treatment measures of minor alarm are arranged in staff;Otherwise, one is closed As alert, and execute the treatment measures of minor alarm;When the alarm type of warning information is high-risk alarm, it is determined that the first clothes Whether business device is failed server.
Herein, it can judge whether the first server is failed server according to the warning information of first server, lead to Default rule is crossed, non-faulting server is filtered out, specifically, when determining first server not is failed server, at least There can be following two situation:
Situation one: when the data processing of information of the warning information instruction first server of first server is unsatisfactory for second in advance If when index, according to location information of the first server in said system, determining the second clothes for being located at first server upstream Business device, however, it is determined that second server is failed server, it is determined that first server is not failed server.The situation is explained When some upstream server failure, to will affect the processing timeliness of downstream server, being closed according to the calling having between server The downstream server can't be determined as failed server, but first can do automatic processing to upstream server by connection relationship.
Situation two: according to the identifier of the first server in the warning information, the version information of the first server is obtained from a first database; if the warning information is determined to have been caused while the version state of the first server was changing, the first server is determined not to be a failed server. The first database, i.e. the version database, records the version state of each server in the system.
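The two filtering situations can be illustrated with a short sketch. The function name, the dictionary-based topology and version database, and the `"changing"` state label are assumptions introduced for the example; the source does not specify how position information or version state is represented.

```python
def is_non_failed(server_id, slow_processing, upstream_of, failed_servers, version_db):
    """Return True if the server should NOT be treated as a failed server.

    slow_processing: True when the alarm says the data processing
        information does not meet the second preset index.
    upstream_of: dict, server id -> list of directly upstream server ids
        (a stand-in for the position information in the system).
    failed_servers: set of server ids already judged to be failed.
    version_db: dict, server id -> version state (stand-in for the first
        database); "changing" is a hypothetical state label.
    """
    # Situation one: slow processing explained by a failed upstream server.
    if slow_processing and any(
        up in failed_servers for up in upstream_of.get(server_id, ())
    ):
        return True
    # Situation two: the alarm arrived while the version state was changing.
    if version_db.get(server_id) == "changing":
        return True
    return False
```

Only direct upstream servers are checked here; a real call graph might need a transitive search, which the source does not discuss.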
After the first server is determined to be a failed server, the first treatment measure for the first server can be generated according to the warning information. At this point, the fault information of the first server in the historical record and the treatment measure taken for each fault need to be considered; these can be queried from the third database, i.e. the measures database, according to the identifier of the first server. Specifically, according to the identifier of the first server in the warning information, the historical fault information of the first server and the corresponding treatment measures are obtained from the third database; the fault information in the warning information of the first server is then combined with that historical fault information and the corresponding treatment measures to generate the first treatment measure for the first server.
In a specific implementation, on receiving the warning information of the first server, the warning information may first be graded. The factors considered include alarm keywords, analysis of system business volume, delay, whether upstream and downstream systems have correlated alarms, historical alarm information, and so on. These factors are weighted to determine the alarm grade. Here, the historical alarm information may cover the number of alarms, alarm grades, handling methods, etc. of the first server over its whole history or within a preset historical period. The alarm grade is either minor alarm or high-risk alarm.
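A weighted grading of this kind might look like the sketch below. The source names the factors but gives no weights, score ranges, or threshold, so `WEIGHTS`, `HIGH_RISK_THRESHOLD`, and the convention that each factor score lies in [0, 1] are purely illustrative assumptions.

```python
# Hypothetical weights for the factors named in the text; they sum to 1.
WEIGHTS = {
    "keyword": 0.30,     # alarm keyword severity
    "traffic": 0.20,     # system business volume analysis
    "delay": 0.20,       # service delay
    "correlated": 0.15,  # correlated upstream/downstream alarms
    "history": 0.15,     # historical alarm information
}
HIGH_RISK_THRESHOLD = 0.6  # hypothetical cut-off


def grade_alarm(scores):
    """Combine per-factor scores (each assumed to be in [0, 1]) into a grade.

    Missing factors count as 0. Returns "high_risk" or "minor".
    """
    total = sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    return "high_risk" if total >= HIGH_RISK_THRESHOLD else "minor"
```

In practice the weights and threshold would be tuned against historical alarms; nothing in the source fixes their values.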
For a minor alarm, the alarm can be handled automatically according to the handling method in the third database; if the minor alarm is a first-time alarm, staff can be notified to handle it manually, and the treatment measure is then updated into the third database. For a high-risk alarm, it is further determined whether the alarm can be handled automatically; if so, an automated treatment measure, i.e. the first treatment measure, is generated; otherwise, staff must be notified to handle it manually, and the treatment measure is updated into the third database.
Step 303: in combination with the operation information and resource information of the system to which the first server belongs, judge whether the first treatment measure meets the high-availability condition.
Here, the operation information of the system includes the operation information of each server in the system, such as each server's business processing volume and service delay; the resource information of the system includes the hardware information of each server in the system, such as each server's CPU, memory, and connection relationships.
The high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index. For example, suppose the current business processing volume of the financial service system is a and a server fails. The relationship between a and the business processing volume b of the financial service system after that server has been handled is assessed first; for example, if b > a, or b is not less than a × 90%, the first treatment measure is determined to meet the high-availability condition.
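The example condition reduces to a single comparison, since b > a already implies b ≥ a × 90% for non-negative a. The sketch below assumes the projected volume b has been estimated elsewhere; the function name and the `ratio` parameter are hypothetical, with 0.9 taken from the 90% example in the text.

```python
def meets_high_availability(current_volume, projected_volume, ratio=0.9):
    """Check the example high-availability condition b >= a * ratio.

    current_volume: business processing volume a of the system now.
    projected_volume: estimated volume b after the measure is executed.
    ratio: fraction of a that must be preserved (0.9 in the text's example).
    """
    return projected_volume >= current_volume * ratio
```

For instance, with a = 1000, a projected b of 950 passes while a projected b of 800 does not, so the measure would be executed only in the first case.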
Step 304: if the high-availability condition is met, execute the first treatment measure on the first server.
The first treatment measure may be a general measure or a special measure. A general measure can be executed by a general automated processing module, which provides script information for general measures such as restart, isolation, stop, and version rollback. This script information is applicable to most servers; providing it reduces the risk caused by inconsistent scripts and improves resolution efficiency.
In one implementation, the first treatment measure may be set to isolate the first server. Specifically, the first server can be isolated, or the service instance running on the first server can be deactivated. If the first server is connected to other servers through message middleware, the isolation is done through the middleware. The message middleware is a server in the system used for transferring messages: it receives a message sent by one server and forwards it to the corresponding other servers, so it can control whether a sender can send messages and whether a receiver can receive them. When executing the first treatment measure, a first shutdown instruction can be sent to the message middleware. The first shutdown instruction includes the identifier of the first server and a first duration, and the message middleware isolates the first server for the first duration. For example, if the first duration in the first shutdown instruction is 5 min, the message middleware isolates the first server for 5 min and cancels the isolation after 5 min. If the first server is connected directly to other servers, the service instance on the first server can be closed directly: the service instance is logged in to through the first server, and a second shutdown instruction, which is used to close the service instance running on the first server, is sent to the first server. When the service instance on the first server cannot be logged in to, the power supply of the first server can be switched off directly. Illustratively, the first server is provided with a power interface, and a third shutdown instruction instructing the power interface to close can be sent directly to the power interface, so that the first server is powered off.
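The choice among the three shutdown instructions can be sketched as a simple fallback chain. The dictionary layout of an instruction, the field names, and the function name are assumptions for illustration; the source describes only what each instruction contains and when it is used.

```python
def choose_shutdown_instruction(server_id, via_middleware, can_log_in, duration_s=300):
    """Pick which shutdown instruction to send when isolating a server.

    via_middleware: True if the server reaches other servers through
        message middleware.
    can_log_in: True if the service instance on the server can be logged in to.
    duration_s: isolation time in seconds (300 s matches the 5 min example).
    Returns a dict describing the instruction (hypothetical format).
    """
    if via_middleware:
        # First shutdown instruction: the middleware isolates the server
        # for the given duration and then lifts the isolation.
        return {
            "target": "message_middleware",
            "command": "isolate",
            "server_id": server_id,
            "duration_s": duration_s,
        }
    if can_log_in:
        # Second shutdown instruction: close the service instance.
        return {"target": server_id, "command": "stop_service_instance"}
    # Third shutdown instruction: power off through the power interface.
    return {"target": server_id + "/power_interface", "command": "power_off"}
```

Powering off is deliberately the last resort here, mirroring the order in which the text introduces the three options.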
For a special measure, script information corresponding to a server can be set for the fault information of each server, and each server in the system is stored in the second database in correspondence with its script information. In a specific implementation, the script information corresponding to the first treatment measure is called from the second database according to the first treatment measure, so that the first treatment measure is executed on the first server.
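One way to picture the lookup of general and special scripts is shown below. The script file names, the dictionary standing in for the second database, and the rule that a server-specific script takes precedence over a general one are assumptions of this sketch; the source does not state a precedence.

```python
# Hypothetical general scripts offered by the general automated processing module.
GENERAL_SCRIPTS = {
    "restart": "restart.sh",
    "isolate": "isolate.sh",
    "stop": "stop.sh",
    "rollback": "rollback.sh",
}


def resolve_script(server_id, measure, script_db):
    """Find the script for a treatment measure on a given server.

    script_db: dict standing in for the second database,
        (server_id, measure) -> script name for special measures.
    Raises LookupError when no script is known for the measure.
    """
    special = script_db.get((server_id, measure))
    if special is not None:
        return special
    if measure in GENERAL_SCRIPTS:
        return GENERAL_SCRIPTS[measure]
    raise LookupError("no script for %r on %s" % (measure, server_id))
```

A missing script raises an error rather than falling back silently, which matches the text's position that unautomatable cases go to staff.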
To better explain the embodiments of the present invention, the alarm-based O&M process in a specific implementation scenario is described below, as shown in Fig. 4:
After receiving warning information, the alarm analysis module judges whether it is a minor alarm or a high-risk alarm. For a minor alarm, it then judges whether the alarm is a first-time alarm or a historical alarm: a first-time alarm is one for which the measures database stores no corresponding treatment measure, and a historical alarm is one for which the measures database already stores a corresponding treatment measure. For a first-time alarm, O&M personnel must be notified to handle it manually, the treatment measure is automatically updated into the measures database, and the corresponding optimization item is reported automatically. For a historical alarm, a corresponding treatment measure can be generated from the treatment measures in the measures database, the alarm is handled automatically and closed, and O&M personnel are notified of the execution result.
For a high-risk alarm, it is judged whether the alarm can be handled automatically, that is, whether the measures database holds a treatment measure corresponding to the high-risk alarm. If so, the high-availability engine further judges whether the treatment measure meets the high-availability condition; if it does, the automation script corresponding to the treatment measure is executed; otherwise, O&M personnel are notified to handle the alarm manually. When the automation script can be executed, the corresponding optimization item can be reported automatically after execution, and O&M personnel are notified of the execution result. When the automation script cannot be executed, O&M personnel can, after handling the alarm manually, complete the corresponding automation script information and configure the corresponding automatic optimization item.
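The high-risk branch of this scenario can be summarized in a few lines. The function name, the returned action strings, and the idea of passing the high-availability engine in as a callable are assumptions for the sketch; the measures database is again modelled as a plain dictionary.

```python
def handle_high_risk(alarm_key, measures_db, passes_ha):
    """Decide what to do with a high-risk alarm.

    alarm_key: key identifying the alarm in the measures database.
    measures_db: dict, alarm key -> stored treatment measure.
    passes_ha: callable taking a measure and returning True when the
        high-availability engine accepts it (hypothetical interface).
    Returns a string naming the action taken.
    """
    measure = measures_db.get(alarm_key)
    if measure is None:
        # No stored measure: the alarm cannot be handled automatically.
        return "notify_staff_manual"
    if passes_ha(measure):
        # The measure keeps the system highly available: run its script.
        return "run_script:" + measure
    # The measure would violate the high-availability condition.
    return "notify_staff_manual"
```

Both failure paths end with manual handling, after which staff can add or refine the automation script as the text describes.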
In the above technical solution, after the warning information of the first server is received, it is first judged whether the first server is a failed server, and the first treatment measure for the first server is generated only if it is. This avoids unnecessary O&M processing on a server, which would affect the normal operation of the whole system. Further, after the first treatment measure for the first server is generated, it is judged, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure meets the high-availability condition; if so, the first treatment measure is executed on the first server. This guarantees the high availability of the system and further avoids affecting the normal operation of the whole system because of O&M work on a single server. Using an automated processing scheme also improves fault-handling efficiency.
Based on the same inventive concept, Fig. 5 illustratively shows the structure of an alarm-based O&M device provided by an embodiment of the present invention; the device can execute the process of the alarm-based O&M method.
The device comprises:
a receiving unit 501, configured to receive the warning information of a first server;
a processing unit 502, configured to: determine, according to the alarm type of the warning information, whether the first server is a failed server, and if so, generate a first treatment measure for the first server according to the warning information; judge, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure meets a high-availability condition; and, if the high-availability condition is met, execute the first treatment measure on the first server;
wherein the operation information of the system includes the operation information of each server in the system; the resource information of the system includes the hardware information of each server in the system; and the high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index.
Optionally, the processing unit 502 is further configured to:
when the alarm type of the warning information is minor alarm, judge whether the minor alarm is a first-time alarm; if so, close the minor alarm, record the minor alarm in a third database, and notify staff so that the staff set a treatment measure for the minor alarm; otherwise, close the minor alarm and execute the treatment measure of the minor alarm;
when the alarm type of the warning information is high-risk alarm, determine whether the first server is a failed server.
Optionally, the processing unit 502 is specifically configured to:
when the warning information of the first server indicates that the data processing information of the first server does not meet a second preset index, determine, according to the position information of the first server in the system, a second server located upstream of the first server; and, if the second server is determined to be a failed server, determine that the first server is not a failed server; or
the processing unit 502 is specifically configured to:
obtain, according to the identifier of the first server in the warning information, the version information of the first server from a first database; and, if the warning information is determined to have been caused while the version state of the first server was changing, determine that the first server is not a failed server; wherein the first database records the version state of each server in the system.
Optionally, the processing unit 502 is specifically configured to:
call, according to the first treatment measure, the script information corresponding to the first treatment measure from a second database, so as to execute the first treatment measure on the first server; wherein the second database records the script information corresponding to the treatment measures for each server in the system.
Optionally, the first treatment measure is used to instruct isolation of the first server;
the processing unit 502 is specifically configured to:
send a first shutdown instruction to the message middleware corresponding to the system, wherein the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
send a second shutdown instruction to the first server, instructing the first server to close the service instance running on the first server; or
send a third shutdown instruction to the power interface of the first server, instructing the power interface to close so that the first server is powered off.
Optionally, the processing unit 502 is specifically configured to:
obtain, according to the identifier of the first server in the warning information, the historical fault information of the first server and the corresponding treatment measures from a third database, wherein the third database stores the historical fault information of each server in the system and the corresponding treatment measures;
generate the first treatment measure for the first server by combining the fault information in the warning information with the historical fault information of the first server and the corresponding treatment measures.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, comprising:
a memory, configured to store program instructions;
a processor, configured to call the program instructions stored in the memory and execute the above alarm-based O&M method according to the obtained program.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions; when a computer reads and executes the computer-readable instructions, the computer is caused to execute the above alarm-based O&M method.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn the basic inventive concept. The appended claims are therefore intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.

Claims (14)

1. An alarm-based O&M method, characterized by comprising:
receiving the warning information of a first server;
determining, according to the alarm type of the warning information, whether the first server is a failed server, and if so, generating a first treatment measure for the first server according to the warning information;
judging, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure meets a high-availability condition; wherein the operation information of the system includes the operation information of each server in the system; the resource information of the system includes the hardware information of each server in the system; and the high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index;
if the high-availability condition is met, executing the first treatment measure on the first server.
2. The method according to claim 1, characterized in that determining, according to the alarm type of the warning information, whether the first server is a failed server comprises:
when the alarm type of the warning information is minor alarm, judging whether the minor alarm is a first-time alarm; if so, closing the minor alarm, recording the minor alarm in a third database, and notifying staff so that the staff set a treatment measure for the minor alarm; otherwise, closing the minor alarm and executing the treatment measure of the minor alarm;
when the alarm type of the warning information is high-risk alarm, determining whether the first server is a failed server.
3. The method according to claim 2, characterized in that determining that the first server is not a failed server comprises:
when the warning information of the first server indicates that the data processing information of the first server does not meet a second preset index, determining, according to the position information of the first server in the system, a second server located upstream of the first server; and, if the second server is determined to be a failed server, determining that the first server is not a failed server; or
determining that the first server is not a failed server comprises:
obtaining, according to the identifier of the first server in the warning information, the version information of the first server from a first database; and, if the warning information is determined to have been caused while the version state of the first server was changing, determining that the first server is not a failed server; wherein the first database records the version state of each server in the system.
4. The method according to claim 1, characterized in that executing the first treatment measure on the first server comprises:
calling, according to the first treatment measure, the script information corresponding to the first treatment measure from a second database, so as to execute the first treatment measure on the first server; wherein the second database records the script information corresponding to the treatment measures for each server in the system.
5. The method according to claim 1, characterized in that the first treatment measure is used to instruct isolation of the first server;
executing the first treatment measure on the first server comprises:
sending a first shutdown instruction to the message middleware corresponding to the system, wherein the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
sending a second shutdown instruction to the first server, instructing the first server to close the service instance running on the first server; or
sending a third shutdown instruction to the power interface of the first server, instructing the power interface to close so that the first server is powered off.
6. The method according to any one of claims 1 to 5, characterized in that generating the first treatment measure for the first server according to the warning information comprises:
obtaining, according to the identifier of the first server in the warning information, the historical fault information of the first server and the corresponding treatment measures from a third database, wherein the third database stores the historical fault information of each server in the system and the corresponding treatment measures;
generating the first treatment measure for the first server by combining the fault information in the warning information with the historical fault information of the first server and the corresponding treatment measures.
7. An alarm-based O&M device, characterized by comprising:
a receiving unit, configured to receive the warning information of a first server;
a processing unit, configured to: determine, according to the alarm type of the warning information, whether the first server is a failed server, and if so, generate a first treatment measure for the first server according to the warning information; judge, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure meets a high-availability condition; and, if the high-availability condition is met, execute the first treatment measure on the first server;
wherein the operation information of the system includes the operation information of each server in the system; the resource information of the system includes the hardware information of each server in the system; and the high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index.
8. The device according to claim 7, characterized in that the processing unit is specifically configured to:
when the alarm type of the warning information is minor alarm, judge whether the minor alarm is a first-time alarm; if so, close the minor alarm, record the minor alarm in a third database, and notify staff so that the staff set a treatment measure for the minor alarm; otherwise, close the minor alarm and execute the treatment measure of the minor alarm;
when the alarm type of the warning information is high-risk alarm, determine whether the first server is a failed server.
9. The device according to claim 8, characterized in that the processing unit is specifically configured to:
when the warning information of the first server indicates that the data processing information of the first server does not meet a second preset index, determine, according to the position information of the first server in the system, a second server located upstream of the first server; and, if the second server is determined to be a failed server, determine that the first server is not a failed server; or
the processing unit is specifically configured to:
obtain, according to the identifier of the first server in the warning information, the version information of the first server from a first database; and, if the warning information is determined to have been caused while the version state of the first server was changing, determine that the first server is not a failed server; wherein the first database records the version state of each server in the system.
10. The device according to claim 7, characterized in that the processing unit is specifically configured to:
call, according to the first treatment measure, the script information corresponding to the first treatment measure from a second database, so as to execute the first treatment measure on the first server; wherein the second database records the script information corresponding to the treatment measures for each server in the system.
11. The device according to claim 7, characterized in that the first treatment measure is used to instruct isolation of the first server;
the processing unit is specifically configured to:
send a first shutdown instruction to the message middleware corresponding to the system, wherein the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
send a second shutdown instruction to the first server, instructing the first server to close the service instance running on the first server; or
send a third shutdown instruction to the power interface of the first server, instructing the power interface to close so that the first server is powered off.
12. The device according to any one of claims 7 to 11, characterized in that the processing unit is specifically configured to:
obtain, according to the identifier of the first server in the warning information, the historical fault information of the first server and the corresponding treatment measures from a third database, wherein the third database stores the historical fault information of each server in the system and the corresponding treatment measures;
generate the first treatment measure for the first server by combining the fault information in the warning information with the historical fault information of the first server and the corresponding treatment measures.
13. A computing device, characterized by comprising:
a memory, configured to store program instructions;
a processor, configured to call the program instructions stored in the memory and execute, according to the obtained program, the method according to any one of claims 1 to 6.
14. A computer-readable non-volatile storage medium, characterized by including computer-readable instructions; when a computer reads and executes the computer-readable instructions, the computer is caused to execute the method according to any one of claims 1 to 6.
CN201910579323.3A 2019-06-28 2019-06-28 A kind of O&M method and device based on alarm Pending CN110275795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910579323.3A CN110275795A (en) 2019-06-28 2019-06-28 A kind of O&M method and device based on alarm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910579323.3A CN110275795A (en) 2019-06-28 2019-06-28 A kind of O&M method and device based on alarm

Publications (1)

Publication Number Publication Date
CN110275795A true CN110275795A (en) 2019-09-24

Family

ID=67963730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910579323.3A Pending CN110275795A (en) 2019-06-28 2019-06-28 A kind of O&M method and device based on alarm

Country Status (1)

Country Link
CN (1) CN110275795A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851322A (en) * 2019-10-11 2020-02-28 平安科技(深圳)有限公司 Hardware equipment abnormity monitoring method, server and computer readable storage medium
CN112700343A (en) * 2019-10-23 2021-04-23 中国石油天然气股份有限公司 Operation monitoring method and system based on oil gas Internet of things
CN112700343B (en) * 2019-10-23 2024-07-30 中国石油天然气股份有限公司 Operation monitoring method and system based on oil-gas Internet of things
CN111008105A (en) * 2019-11-07 2020-04-14 泰康保险集团股份有限公司 Distributed system call relation visualization method and device
CN112036588A (en) * 2020-09-04 2020-12-04 中国平安财产保险股份有限公司 System operation and maintenance flow management and control method and device
CN112491625A (en) * 2020-11-30 2021-03-12 深圳前海微众银行股份有限公司 Operation and maintenance alarming method, device and equipment based on instant communication platform

Similar Documents

Publication Publication Date Title
CN110275795A (en) A kind of O&M method and device based on alarm
US20190378073A1 (en) Business-Aware Intelligent Incident and Change Management
US9558459B2 (en) Dynamic selection of actions in an information technology environment
CN111897671A (en) Failure recovery method, computer device, and storage medium
CN110166297A (en) O&M method, system, equipment and computer readable storage medium
US8572244B2 (en) Monitoring tool deployment module and method of operation
US12111738B2 (en) Managing data center failure events
CN107707415B (en) SaltStack-based automatic monitoring and warning method for server configuration
US20090201144A1 (en) Alarm management apparatus
CN111552556A (en) GPU cluster service management system and method
CN108353086A (en) Deployment assurance checks for monitoring industrial control systems
CN117992304A (en) Integrated intelligent operation and maintenance platform
CN112615737B (en) Method and system for automatically monitoring service system
CN109783310A (en) The Dynamic and Multi dimensional method for safety monitoring and its monitoring device of information technoloy equipment
CN117687874A (en) Monitoring method and device for operation and maintenance platform
CN117313012A (en) Fault management method, device, equipment and storage medium of service orchestration system
US8984122B2 (en) Monitoring tool auditing module and method of operation
CN111222928A (en) Method and system for monitoring enterprise standard invoicing
CN116149824A (en) Task re-running processing method, device, equipment and storage medium
US8560375B2 (en) Monitoring object system and method of operation
CN114418488B (en) Inventory information processing method, device and system
CN107797915B (en) Fault repairing method, device and system
CN112540771A (en) Automated operation and maintenance method, system, equipment and computer readable storage medium
CN110956456A (en) Money printing processing method, device and system
CN110569287B (en) Control method, system, electronic equipment and storage medium for product spot test

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination