CN110275795A - A kind of O&M method and device based on alarm - Google Patents
- Publication number
- CN110275795A (application number CN201910579323.3A)
- Authority
- CN
- China
- Prior art keywords
- server
- information
- treatment measures
- alarm
- warning information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/20—Administration of product repair or maintenance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/80—Database-specific techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Human Resources & Organizations (AREA)
- General Engineering & Computer Science (AREA)
- Entrepreneurship & Innovation (AREA)
- Tourism & Hospitality (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Operations Research (AREA)
- Marketing (AREA)
- Economics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present invention is applicable to the field of financial technology and discloses an alarm-based O&M method and device. The method includes: receiving alarm information of a first server; determining, according to the alarm type of the alarm information, whether the first server is a faulty server; if so, generating a first treatment measure for the first server according to the alarm information; judging, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure satisfies a high-availability condition; and if so, executing the first treatment measure on the first server. This technical solution provides an automated handling scheme for anomalies in a financial service system while ensuring that the system keeps running normally.
Description
Technical field
Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to an alarm-based O&M method and device.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology (Fintech). O&M technology is no exception; however, because the financial and payment industries demand security and real-time performance, they also place higher requirements on the technology.
In the prior art, when a monitoring system detects an anomaly in a financial service system, it generates alarm information and notifies the relevant staff, who then handle the financial service system manually according to the alarm information, for example by restarting or shutting down a server in the system. This approach is not only inefficient but may also affect the normal operation of the entire financial business.
Summary of the invention
Embodiments of the present invention provide an alarm-based O&M method and device, which offer an automated handling scheme for anomalies in a financial service system while ensuring that the system runs normally.
An alarm-based O&M method provided by an embodiment of the present invention includes:
receiving alarm information of a first server;
determining, according to the alarm type of the alarm information, whether the first server is a faulty server, and if so, generating a first treatment measure for the first server according to the alarm information;
judging, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure satisfies a high-availability condition, where the operation information of the system includes the operation information of each server in the system, the resource information of the system includes the hardware information of each server in the system, and the high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index; and
if the high-availability condition is satisfied, executing the first treatment measure on the first server.
In the above technical solution, after the alarm information of the first server is received, it is first judged whether the first server is a faulty server, and the first treatment measure is generated only if it is. This avoids performing unnecessary O&M handling on a server, which would otherwise affect the normal operation of the whole system. Further, after the first treatment measure is generated, it is judged against the operation information and resource information of the system to which the first server belongs to determine whether it satisfies the high-availability condition; only if it does is the measure executed on the first server. This safeguards the high availability of the system and further prevents maintenance work on a single server from affecting the whole system. Moreover, adopting an automated handling scheme improves troubleshooting efficiency.
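The overall flow described above can be sketched as a small pipeline. This is only an illustrative sketch, not the patented implementation: the `Alarm` fields, the `handle_alarm` function, and the four injected callables are hypothetical names chosen for this example, and the source does not prescribe any particular interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Alarm:
    """Hypothetical alarm record; field names are assumptions."""
    server_id: str
    alarm_type: str   # e.g. "general" or "high_risk" (labels assumed)
    fault_info: str


def handle_alarm(
    alarm: Alarm,
    is_faulty: Callable[[Alarm], bool],
    generate_measure: Callable[[Alarm], str],
    meets_ha_condition: Callable[[str, str], bool],
    execute: Callable[[str, str], None],
) -> Optional[str]:
    """Return the executed measure, or None if nothing was executed."""
    # Only a server judged to be faulty gets a treatment measure.
    if not is_faulty(alarm):
        return None
    measure = generate_measure(alarm)
    # The measure is executed only if the system stays highly available.
    if not meets_ha_condition(alarm.server_id, measure):
        return None
    execute(alarm.server_id, measure)
    return measure
```

Passing the four decisions in as callables keeps the sketch independent of how fault detection, measure generation, and the high-availability check are actually implemented.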
Optionally, determining, according to the alarm type of the alarm information, whether the first server is a faulty server includes:
when the alarm type of the alarm information is a general alarm, judging whether the general alarm is a first-time alarm; if so, closing the general alarm, updating the general alarm to a third database, and notifying staff so that the staff set a treatment measure for the general alarm; otherwise, closing the general alarm and executing the treatment measure of the general alarm;
when the alarm type of the alarm information is a high-risk alarm, determining whether the first server is a faulty server.
In the above technical solution, after receiving the alarm information, the alarm analysis module judges whether it is a general alarm or a high-risk alarm. For a general alarm, the module then judges whether it is a first-time alarm or a historical alarm: a first-time alarm is one for which no corresponding treatment measure is stored in the action database, whereas a historical alarm is one for which the action database already stores a corresponding treatment measure. For a first-time alarm, O&M personnel are notified to handle it manually, and the treatment measure is automatically updated to the action database (that is, the third database), which amounts to automatically reporting the corresponding optimization item. For a historical alarm, the treatment measures in the action database are combined to generate the corresponding treatment measure, the alarm is handled automatically and closed, and O&M personnel are notified of the execution result. Handling and closing non-first-time general alarms automatically improves O&M efficiency.
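The first-time versus historical distinction for general alarms can be illustrated with a dictionary standing in for the action database. This is a minimal sketch under assumptions: the function name `handle_general_alarm`, the alarm keys, and the convention that a missing or `None` entry means "no measure stored yet" are all hypothetical.

```python
from typing import Dict, List, Optional


def handle_general_alarm(
    alarm_key: str,
    action_db: Dict[str, Optional[str]],
) -> List[str]:
    """Return the ordered steps taken for one general alarm."""
    steps = ["close alarm"]
    measure = action_db.get(alarm_key)
    if measure is None:
        # First-time alarm: no measure is stored yet. Record the alarm
        # and hand it to staff, who will set the measure.
        action_db.setdefault(alarm_key, None)
        steps.append("notify staff to set a measure")
    else:
        # Historical alarm: a stored measure exists, so run it automatically.
        steps.append(f"execute measure: {measure}")
    return steps
```

Once staff store a measure for a key, later alarms with the same key follow the automatic branch.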
Optionally, determining that the first server is not a faulty server includes:
when the alarm information of the first server indicates that the data processing information of the first server does not satisfy a second preset index, determining, according to the position information of the first server in the system to which it belongs, a second server located upstream of the first server; and if the second server is determined to be a faulty server, determining that the first server is not a faulty server; or
determining that the first server is not a faulty server includes:
obtaining, according to the identifier of the first server in the alarm information, the version information of the first server from a first database; and if it is determined that the alarm information was caused while the release status of the first server was changing, determining that the first server is not a faulty server, where the first database records the release status of each server in the system.
In the above technical solution, ways of determining that the first server is not a faulty server are provided from the perspective of safeguarding the high availability of a distributed system. This avoids executing treatment measures on such non-faulty servers, which would otherwise affect the normal operation of the whole system.
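The two filtering rules can be sketched as a single predicate. The data structures here are assumptions made for illustration: `upstream_of` maps a server to its upstream server, `faulty_servers` is the set of servers already judged faulty, and `release_status` stands in for the first database, with the hypothetical value `"upgrading"` meaning that the release status is currently changing.

```python
from typing import Dict, Set


def is_faulty_server(
    server_id: str,
    slow_processing: bool,
    upstream_of: Dict[str, str],
    faulty_servers: Set[str],
    release_status: Dict[str, str],
) -> bool:
    """Return False when one of the two filtering rules applies."""
    # Rule 1: data processing misses its index because the upstream
    # server is itself faulty, so the downstream server is not blamed.
    if slow_processing:
        upstream = upstream_of.get(server_id)
        if upstream is not None and upstream in faulty_servers:
            return False
    # Rule 2: the alarm was raised while the server's release status
    # was changing (for example, during a version upgrade).
    if release_status.get(server_id) == "upgrading":
        return False
    return True
```

A server that passes both rules is treated as faulty in this sketch; a real system could apply further checks before reaching that conclusion.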
Optionally, executing the first treatment measure on the first server includes:
calling, according to the first treatment measure, the script information corresponding to the first treatment measure from a second database, so as to execute the first treatment measure on the first server, where the second database records the script information corresponding to the treatment measures of each server in the system.
In the above technical solution, the script information corresponding to the first treatment measure is stored in the second database and is obtained from it in order to execute the first treatment measure. Because this script information is determined per server, it is targeted and can meet the needs of different servers.
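A per-server script lookup of this kind can be sketched with a dictionary keyed by server and measure. The names `ScriptDb` and `run_measure` are hypothetical, and scripts are modeled as plain Python callables purely for illustration; the source does not say how script information is stored or invoked.

```python
from typing import Callable, Dict, Tuple

# Stand-in for the second database: (server_id, measure) -> script.
ScriptDb = Dict[Tuple[str, str], Callable[[str], str]]


def run_measure(server_id: str, measure: str, script_db: ScriptDb) -> str:
    """Look up the script for this server and measure, then run it."""
    script = script_db.get((server_id, measure))
    if script is None:
        # No script is registered for this server, so nothing is executed.
        raise LookupError(f"no script for {measure!r} on {server_id!r}")
    return script(server_id)
```

Keying on both the server and the measure reflects the point that the same measure may need a different script on different servers.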
Optionally, the first treatment measure indicates isolating the first server;
executing the first treatment measure on the first server includes:
sending a first shutdown instruction to the message middleware corresponding to the system, where the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
sending a second shutdown instruction to the first server, instructing the first server to shut down the service instances running on it; or
sending a third shutdown instruction to the power interface of the first server, instructing the power interface to switch off so that the first server is powered off.
In the above technical solution, the first server can be isolated, which prevents data in the system from being sent to the first server and avoids data loss or disruption of data processing in the system.
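The three isolation options differ in their target and in whether a duration is required. The sketch below only builds the instruction object for each option; the `IsolationMode` enum, the `ShutdownInstruction` fields, and the target labels are assumptions for illustration, and actually sending the instruction is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class IsolationMode(Enum):
    MIDDLEWARE = "middleware"          # first shutdown instruction
    STOP_INSTANCES = "stop_instances"  # second shutdown instruction
    POWER_OFF = "power_off"            # third shutdown instruction


@dataclass(frozen=True)
class ShutdownInstruction:
    target: str                        # who receives the instruction
    server_id: str
    duration_s: Optional[int] = None   # only used for middleware isolation


def build_instruction(
    mode: IsolationMode,
    server_id: str,
    duration_s: Optional[int] = None,
) -> ShutdownInstruction:
    """Build the instruction matching the chosen isolation option."""
    if mode is IsolationMode.MIDDLEWARE:
        # The middleware needs to know how long to keep the server isolated.
        if duration_s is None or duration_s <= 0:
            raise ValueError("middleware isolation needs a positive duration")
        return ShutdownInstruction("message_middleware", server_id, duration_s)
    if mode is IsolationMode.STOP_INSTANCES:
        return ShutdownInstruction("server", server_id)
    return ShutdownInstruction("power_interface", server_id)
```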
Optionally, generating, according to the alarm information, the first treatment measure for the first server includes:
obtaining, according to the identifier of the first server in the alarm information, the historical fault information of the first server and the corresponding treatment measures from a third database, where the third database stores the historical fault information and corresponding treatment measures of each server in the system; and
generating the first treatment measure for the first server by combining the fault information in the alarm information with the historical fault information and corresponding treatment measures of the first server.
In the above technical solution, the treatment measure for the current fault of the first server is generated by combining the server's current fault information with its historical fault information, so that the measure resolves the current fault more effectively.
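The source does not specify how current and historical fault information are combined. One simple possibility, shown here purely as an assumption, is to pick the measure most often used in the past for the same fault on that server. The function name `generate_measure` and the `(fault, measure)` history format are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def generate_measure(
    fault: str,
    history: List[Tuple[str, str]],
) -> Optional[str]:
    """Pick the measure used most often for this fault in the past.

    `history` holds (fault, measure) pairs for a single server, as they
    might be read from the third database. Returns None when the fault
    has never been seen, in which case manual handling would be needed.
    """
    counts = Counter(measure for past_fault, measure in history if past_fault == fault)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

A frequency count is only one heuristic; recency or past success rate could be weighed in as well.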
Correspondingly, an embodiment of the present invention further provides an alarm-based O&M device, including:
a receiving unit, configured to receive alarm information of a first server; and
a processing unit, configured to: determine, according to the alarm type of the alarm information, whether the first server is a faulty server, and if so, generate a first treatment measure for the first server according to the alarm information; judge, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measure satisfies a high-availability condition; and if the high-availability condition is satisfied, execute the first treatment measure on the first server;
where the operation information of the system includes the operation information of each server in the system; the resource information of the system includes the hardware information of each server in the system; and the high-availability condition is a preset condition indicating that, after the first treatment measure is executed on the first server, the operation information of the system is not lower than a first preset index.
Optionally, the processing unit is specifically configured to:
when the alarm type of the alarm information is a general alarm, judge whether the general alarm is a first-time alarm; if so, close the general alarm, update the general alarm to a third database, and notify staff so that the staff set a treatment measure for the general alarm; otherwise, close the general alarm and execute the treatment measure of the general alarm;
when the alarm type of the alarm information is a high-risk alarm, determine whether the first server is a faulty server.
Optionally, the processing unit is specifically configured to:
when the alarm information of the first server indicates that the data processing information of the first server does not satisfy a second preset index, determine, according to the position information of the first server in the system to which it belongs, a second server located upstream of the first server; and if the second server is determined to be a faulty server, determine that the first server is not a faulty server; or
the processing unit is specifically configured to:
obtain, according to the identifier of the first server in the alarm information, the version information of the first server from a first database; and if it is determined that the alarm information was caused while the release status of the first server was changing, determine that the first server is not a faulty server, where the first database records the release status of each server in the system.
Optionally, the processing unit is specifically configured to:
call, according to the first treatment measure, the script information corresponding to the first treatment measure from a second database, so as to execute the first treatment measure on the first server, where the second database records the script information corresponding to the treatment measures of each server in the system.
Optionally, the first treatment measure indicates isolating the first server;
the processing unit is specifically configured to:
send a first shutdown instruction to the message middleware corresponding to the system, where the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
send a second shutdown instruction to the first server, instructing the first server to shut down the service instances running on it; or
send a third shutdown instruction to the power interface of the first server, instructing the power interface to switch off so that the first server is powered off.
Optionally, the processing unit is specifically configured to:
obtain, according to the identifier of the first server in the alarm information, the historical fault information of the first server and the corresponding treatment measures from a third database, where the third database stores the historical fault information and corresponding treatment measures of each server in the system; and
generate the first treatment measure for the first server by combining the fault information in the alarm information with the historical fault information and corresponding treatment measures of the first server.
Correspondingly, an embodiment of the present invention further provides a computing device, including:
a memory, configured to store program instructions; and
a processor, configured to call the program instructions stored in the memory and execute the above alarm-based O&M method according to the obtained program.
Correspondingly, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions which, when read and executed by a computer, cause the computer to execute the above alarm-based O&M method.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a system architecture provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of an automated processing system provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of an alarm-based O&M method provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another alarm-based O&M method provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an alarm-based O&M device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 shows, by way of example, a system architecture to which the alarm-based O&M method provided by an embodiment of the present invention is applicable. The system architecture may include a financial service system 100, a monitoring system 200, an automated processing system 300, and a database system 400.
The financial service system 100 is a system for executing financial business. It is suited to distributed scenarios; that is, the financial service system 100 may include multiple clusters, each of which can run multiple service instances.
The monitoring system 200 can monitor the various indicators of the financial service system 100 and may include a transaction monitoring system and a resource monitoring system. The transaction monitoring system monitors the access volume, request volume, and so on of the transaction business of the financial service system 100. The resource monitoring system comprehensively monitors the servers, operating systems, message middleware, and applications in the financial service system 100, for example the CPU, memory, and network card of a server. An example of a transaction monitoring system is the IMS system; an example of a resource monitoring system is the Falcon system.
The automated processing system 300 automatically processes the alarm information of the monitoring system 200; that is, after receiving alarm information sent by the monitoring system 200, it can automatically identify the category of the alarm information and process alarm information of different categories automatically.
The database system 400 can be divided into an action database, a version database, and a configuration management database. The action database stores the historical fault information and corresponding treatment measures of each server in the system; that is, it is configured with treatment measures for the relevant abnormal scenarios. An example is an SOP (Standard Operating Procedure) library. The version database stores the release status of each server in the system, such as the current version information of a server and its release status (for example, that a version upgrade is in progress). The configuration management database stores and manages the configuration information of the equipment in the enterprise IT architecture. It is closely linked to all service support and service delivery processes: it supports the operation of these processes and brings out the value of the configuration information, while relying on the related processes to guarantee the accuracy of its data. An example is a CMDB (Configuration Management Database).
In an embodiment of the present invention, the automated processing system 300 may include an alarm access module 310, an alarm analysis module 320, a high-availability engine module 330, a general automation processing module 340, a specific automation processing module 350, an alarm information operation module 360, and a processing information notification module 370, as shown in Fig. 2.
1. Alarm access module 310
The alarm access module 310 receives all alarm information from the production environment. Specifically, it (1) receives alarm information about basic resources such as hosts, networks, and databases, which can be sent to the automated processing system 300 by the Falcon system; and (2) receives business-related alarm information generated by the financial service system 100, which can be sent to the automated processing system 300 through the IMS system and may include the memory status of a server, thread pool usage, and so on.
2. Alarm analysis module 320
The alarm analysis module 320 analyzes and matches the input alarm information and provides an alarm judgment and the corresponding operation suggestion. Optionally, the alarm analysis module 320 matches the input alarm information to the relevant SOP and combines historical alarms with the release information of the current system to judge the impact level of the alarm and the operation method.
In an embodiment of the present invention, this can be done in two steps: SOP library query and historical alarm matching. SOP library query: query the business system fault pre-processing mechanism library, which is equivalent to the action database, and match the corresponding fault handling steps. Historical alarm matching: after the SOP library query, analyze the historical alarm information of the business system, including the most recent alarm, the alarms in the last seven days, and the version status in the last three days, and give the corresponding treatment measure in combination with the handling method recorded in the SOP library and the current health status (a comprehensive score covering the host, network, database, JVM, and so on).
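The source mentions a comprehensive health score over components such as host, network, database, and JVM but gives no formula. A weighted average is one plausible reading, sketched here as an assumption; the function name `health_score`, the 0 to 100 scale, and the weights are all hypothetical.

```python
from typing import Dict


def health_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-component health scores.

    `scores` maps a component name (for example "host" or "jvm") to a
    score on an assumed 0-100 scale; `weights` must contain a weight
    for every component that has a score.
    """
    if not scores:
        raise ValueError("at least one component score is required")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(scores[name] * weights[name] for name in scores)
    return weighted / total_weight
```

Raising the weight of a component makes the overall score track that component more closely, which is one way to express that, say, host health matters more than JVM health for a given system.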
3. High-availability engine module 330
The high-availability engine module 330 guarantees high availability during automatic fault handling of the financial service system 100 in distributed scenarios. For example, during fault handling, multiple business systems may fail at the same time, or multiple instances in one system may fail. In that case, automatic fault handling must guarantee the high availability of the system while handling the fault quickly, and it is executed in the optimal order with the assistance of the high-availability engine module 330.
The high-availability engine module 330 determines whether the treatment measure generated by the alarm analysis module 320 is feasible by calculating in real time the number of instances in each cluster of each system, the approximate location of the fault point while the fault is occurring, and the basic resource status of the system. For example, the high-availability engine module 330 can assess the state of network, database, and host resources: when a host in the system fails, the module judges whether the system would still meet its health requirements if the instance were stopped now, and thereby decides whether to isolate the instance.
In an embodiment of the present invention, the high-availability engine module 330 can evaluate the operation information of the financial service system 100 before and after the treatment measure is executed, such as the throughput of business data, the time taken to process data, and resource consumption, and then determine whether to execute the treatment measure.
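A feasibility check of this kind can be illustrated with throughput as the only operation metric. This is a deliberately simplified sketch: `ClusterState`, `meets_ha_condition`, and the assumption that every healthy instance contributes the same throughput are hypothetical, and a real engine would also weigh latency and resource consumption.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of one cluster."""
    healthy_instances: int
    per_instance_throughput: float  # requests/s one healthy instance can serve


def meets_ha_condition(
    cluster: ClusterState,
    instances_to_stop: int,
    min_throughput: float,
) -> bool:
    """Check projected throughput after the measure against a preset index."""
    remaining = cluster.healthy_instances - instances_to_stop
    if remaining <= 0:
        # Stopping every instance can never keep the system available.
        return False
    projected = remaining * cluster.per_instance_throughput
    return projected >= min_throughput
```

Here `min_throughput` plays the role of the first preset index: the measure is allowed only if the projected post-measure value does not fall below it.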
4. General automation processing module 340
The general automation processing module 340 handles automatic fault processing in general scenarios. In a unified development environment, the restart, isolation, stop, and version rollback operations of most systems are similar, so the general automation processing module 340 can perform these four types of operations uniformly in response to system anomalies, which reduces the risk introduced by inconsistent scripts and improves resolution efficiency. This module mainly targets basic-resource anomalies, for example isolating the service instance of a faulty server when messages are congested, or running a health check on the system when the transaction volume is abnormal.
5. Specific automation processing module 350
The specific automation processing module 350 manages the automated processing scripts for some special systems and for the fault scenarios peculiar to each type of server. For the inspection and handling of systems that are not unified and of peculiar fault scenarios, the corresponding automation scripts can be configured through this module.
It should be noted that both the general automation processing module 340 and the specific automation processing module 350 are provided with corresponding high-availability check code, which ensures that an automation script performs one comprehensive confirmation check of the system before it is executed.
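Running a high-availability check before every automation script can be sketched with a decorator. The names `with_ha_check` and `HaCheckFailed` are hypothetical, and the check itself is passed in as a callable because the source does not describe what the check code inspects.

```python
import functools
from typing import Callable


class HaCheckFailed(RuntimeError):
    """Raised when the pre-execution high-availability check fails."""


def with_ha_check(check: Callable[[str], bool]):
    """Wrap a script so the check runs once before the script does."""
    def decorator(script: Callable[[str], str]) -> Callable[[str], str]:
        @functools.wraps(script)
        def wrapper(server_id: str) -> str:
            if not check(server_id):
                # Refuse to run the script rather than risk availability.
                raise HaCheckFailed(f"HA check failed for {server_id}")
            return script(server_id)
        return wrapper
    return decorator
```

With this shape, both general and server-specific scripts can share the same guard, for example `@with_ha_check(my_check)` placed above a `restart(server_id)` function.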
6. Alarm information operation module 360
The alarm information operation module 360 provides unified management and data operations for all kinds of alarm information. Specifically, alarm data at the present stage is varied and disorderly and contains all kinds of potential hazards. The alarm information operation module 360 mines daily alarm information, compares it with historical data using big data techniques, and classifies and organizes the daily alarm information for the staff, who update the corresponding SOP library. This facilitates subsequent automated maintenance work and closes the loop of the entire automated O&M process.
7. Processing information notification module 370
The processing information notification module 370 notifies the O&M staff of all kinds of alarm information and of the results of automatic processing, so that the staff stay up to date on the system status.
Based on the above description, Fig. 3 shows, by way of example, the flow of an alarm-based O&M method provided by an embodiment of the present invention. The flow can be executed by an alarm-based O&M device.
As shown in Fig. 3, the flow specifically includes:
Step 301: receive alarm information of a first server.
The first server is a server in the financial service system. The alarm information can be detected by the monitoring system and sent to the automated processing system. The alarm information of the first server may include the identifier of the first server, the fault information of the first server, and so on.
Step 302: according to the alarm type of the alarm information, determine whether the first server is a faulty server; if so, generate a first treatment measure for the first server according to the alarm information.
The alarm type of alarm information is either general alarm or high-risk alarm. When the alarm type is general alarm, it is judged whether the general alarm is a first-time alarm. If so, the general alarm is closed and updated to the third database (that is, the action database), and staff are notified so that they set a treatment measure for the general alarm; otherwise, the general alarm is closed and its treatment measure is executed. When the alarm type is high-risk alarm, it is determined whether the first server is a faulty server.
Here, whether the first server is a faulty server can be judged according to its alarm information, and non-faulty servers are filtered out by preset rules. Specifically, the first server can be determined not to be a faulty server in at least the following two cases:
Case 1: when the alarm information of the first server indicates that the data processing information of the first server does not satisfy the second preset index, a second server located upstream of the first server is determined according to the position information of the first server in the system to which it belongs; if the second server is determined to be a faulty server, the first server is determined not to be a faulty server. This case reflects that when an upstream server fails, the processing timeliness of downstream servers is affected; based on the calling relationships between servers, the downstream server is not determined to be a faulty server, and the upstream server is handled automatically first.
Case 2: according to the identifier of the first server in the alarm information, the version information of the first server is obtained from the first database; if it is determined that the alarm information was caused while the release status of the first server was changing, the first server is determined not to be a faulty server. The first database, that is, the version database, records the release status of each server in the system.
After the first server is determined to be a failed server, first treatment measures for the first server can be generated according to the warning information. In doing so, the failures of the first server in the historical record and the treatment measures taken for each failure need to be considered; these can be queried from a third database, i.e. the measure database, using the identifier of the first server. Specifically, according to the identifier of the first server in the warning information, the historical failure information of the first server and the corresponding treatment measures are obtained from the third database, and the failure information in the warning information of the first server is combined with that historical failure information and the corresponding treatment measures to generate the first treatment measures for the first server.
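The text does not specify how the current failure information and the historical record are combined. One possible reading, given purely as an assumption, is to reuse the measure that was most often applied to the same failure on the same server. In the sketch below, `history` is a hypothetical list of (failure, measure) pairs already fetched from the third database for the first server.

```python
from collections import Counter


def generate_first_measure(alarm_failure, history):
    """Pick the measure most often used for this failure in the server's history."""
    matching = [measure for failure, measure in history if failure == alarm_failure]
    if not matching:
        # No precedent in the measure database: this sketch leaves the case
        # to manual handling and returns None.
        return None
    return Counter(matching).most_common(1)[0][0]
```

A frequency vote is only one design choice; a weighting by recency or by success of the past measure would fit the description equally well.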
In a specific implementation, when the warning information of the first server is received, the warning information can first be graded. The factors considered together include alarm keywords, analysis of system business volume, delay, whether upstream and downstream systems have related alarms, historical alarm information, and so on. These factors are weighted to determine the alarm grade. Here, the historical alarm information may cover the number of alarms of the first server over its whole history or within a preset historical period, the alarm grades, the handling methods, and so on. The alarm grade can be either minor alarm or high-risk alarm.
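The weighted grading can be illustrated with a small Python sketch. The factor names follow the list above, but the weights, the normalisation of each factor to a score between 0 and 1, and the threshold are all assumptions chosen for illustration; the text gives no concrete values.

```python
# Hypothetical weights for the factors named in the description (they sum to 1).
WEIGHTS = {
    "keyword": 0.3,
    "business_volume": 0.2,
    "delay": 0.2,
    "related_alarms": 0.15,
    "history": 0.15,
}
HIGH_RISK_THRESHOLD = 0.6  # assumed cut-off between minor and high-risk


def grade_alarm(scores, weights=WEIGHTS, threshold=HIGH_RISK_THRESHOLD):
    """Combine per-factor scores in [0, 1] into a grade by weighted sum."""
    total = sum(weight * scores.get(name, 0.0) for name, weight in weights.items())
    return "high-risk" if total >= threshold else "minor"
```

Factors missing from `scores` contribute zero, so an alarm that only matches a keyword is graded as minor under these assumed weights.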
For a minor alarm, the alarm can be handled automatically according to the handling method in the third database; if the minor alarm is a first-time alarm, staff can be notified so that they handle it manually, and the treatment measures are updated to the third database. For a high-risk alarm, it can be further determined whether the alarm can be handled automatically; if so, automated treatment measures, i.e. the first treatment measures, are generated; otherwise, staff need to be notified so that they handle it manually, and the treatment measures are updated to the third database.
Step 303, judge whether the first treatment measures meet the high-availability condition, in combination with the operation information and resource information of the system to which the first server belongs.
Here, the operation information of the system includes the operation information of each server in the system, such as each server's business processing volume and service delay; the resource information of the system includes the hardware information of each server in the system, such as each server's CPU, memory and connection relationships.
The high-availability condition is a preset condition indicating that, after the first treatment measures are executed on the first server, the operation information of the system is not lower than a first preset index. For example, suppose the current business processing volume of a financial business system is a and a server in it fails. The relationship between a and the business processing volume b of the financial business system after that server has been handled is assessed first; for example, if b > a or b is not less than (a × 90%), the first treatment measures are determined to meet the high-availability condition.
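The numerical example can be written as a one-line check. In this sketch the function name and the `ratio` parameter are hypothetical; 0.9 mirrors the 90% figure in the example, and how the projected volume b is estimated is outside the sketch.

```python
def meets_high_availability(a, b, ratio=0.9):
    """a: current business processing volume; b: projected volume after the measure.

    For non-negative a, "b > a" already implies "b >= a * ratio" when ratio <= 1,
    so a single comparison covers both branches of the example.
    """
    return b >= a * ratio
```

For instance, with a = 1000, a projected b of 950 passes the check and a projected b of 800 does not.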
Step 304, if the high-availability condition is met, execute the first treatment measures on the first server.
The first treatment measures can be general measures or special measures. General measures can be executed by a general automated processing module, which holds script information for general measures such as restart, isolation, stop and version rollback. These scripts apply to most servers; providing them reduces the risk caused by inconsistent scripts and improves resolution efficiency.
In one implementation, the first treatment measures may be set to isolate the first server; specifically, the first server can be isolated, or the service instance running on the first server can be stopped. Consider first the case where the first server is connected to other servers through message middleware. The message middleware is a server that transmits messages within the system: it receives a message sent by a server and forwards it to the corresponding other servers, so it can control whether a sender can send messages and whether a recipient can receive them. When the first treatment measures are executed, a first shutdown instruction can be sent to the message middleware; the first shutdown instruction includes the identifier of the first server and a first duration, and the message middleware can isolate the first server for the first duration. For example, if the first duration in the first shutdown instruction is 5 min, the message middleware can isolate the first server for 5 min and cancel the isolation after 5 min. If the first server is connected to other servers directly, the service instance on the first server can be closed directly: the service instance can be logged into through the first server, and a second shutdown instruction, which is used to close the service instance running on the first server, is sent to the first server. When the service instance on the first server cannot be logged into, the power supply of the first server can be shut off directly. For example, the first server is provided with a power interface, and a third shutdown instruction can be sent directly to the power interface to instruct it to close, thereby powering off the first server.
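The choice among the three instructions can be sketched as a small decision function. This is an illustrative sketch only: the dictionary layout, the `power_interface` suffix and the parameter names are hypothetical, and no real middleware or power-management API is being described.

```python
def plan_isolation(server_id, via_middleware, can_log_in, duration_s=300):
    """Choose which instruction to send in order to isolate a server."""
    if via_middleware:
        # First instruction: ask the message middleware to isolate the server
        # for a fixed duration (300 s mirrors the 5 min example).
        return {"target": "message_middleware", "instruction": "first",
                "server_id": server_id, "duration_s": duration_s}
    if can_log_in:
        # Second instruction: close the service instance running on the server.
        return {"target": server_id, "instruction": "second",
                "action": "close_service_instance"}
    # Third instruction: fall back to powering the server off via its power interface.
    return {"target": f"{server_id}/power_interface", "instruction": "third",
            "action": "power_off"}
```

The ordering reflects the text: middleware isolation is temporary and reversible, closing the service instance requires a working login, and power-off is the last resort.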
For special measures, script information specific to a server can be set for the failure information of each server, and the correspondence between each server in the system and its script information is stored in a second database. In a specific implementation, the script information corresponding to the first treatment measures is called from the second database according to the first treatment measures, so that the first treatment measures are executed on the first server.
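The split between general and special measures can be sketched as a lookup. Everything concrete here is an assumption for illustration: the script paths are invented, the general module is modelled as a dictionary, and the second database is modelled as a dictionary keyed by (server identifier, measure).

```python
# Hypothetical scripts of the general automated processing module.
GENERAL_SCRIPTS = {
    "restart": "general/restart.sh",
    "isolate": "general/isolate.sh",
    "stop": "general/stop.sh",
    "rollback": "general/rollback.sh",
}


def resolve_script(server_id, measure, second_db):
    """Return the script for a measure, or None if no script is registered."""
    # General measures share one script across most servers.
    if measure in GENERAL_SCRIPTS:
        return GENERAL_SCRIPTS[measure]
    # Special measures are stored per server in the second database.
    return second_db.get((server_id, measure))
```

A `None` result corresponds to the case where no automation script exists yet and manual handling is needed.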
To better explain the embodiments of the present invention, the alarm-based O&M flow in a specific implementation scenario is described below, as shown in Fig. 4:
After receiving warning information, the alarm analysis module judges whether it is a minor alarm or a high-risk alarm. For a minor alarm, it then judges whether the alarm is a first-time alarm or a historical alarm: a first-time alarm is one for which the measure database stores no corresponding treatment measures, and a historical alarm is one for which the measure database already stores corresponding treatment measures. For a first-time alarm, O&M personnel need to be notified to handle it manually, the treatment measures are automatically updated to the measure database, and the corresponding optimization item is reported automatically. For a historical alarm, the corresponding treatment measures can be generated from the treatment measures in the measure database, the alarm is handled automatically and closed, and O&M personnel are notified of the execution result.
For a high-risk alarm, it is judged whether the alarm can be handled automatically, that is, whether the measure database holds treatment measures corresponding to the high-risk alarm. If so, the high-availability engine further judges whether the treatment measures meet the high-availability condition; if they do, the automation script corresponding to the treatment measures is executed; otherwise, O&M personnel are notified to handle the alarm manually. When the automation script can be executed, the corresponding optimization item can be reported automatically after execution, and O&M personnel are notified of the execution result. When the automation script cannot be executed, O&M personnel can, after handling the alarm manually, complete the corresponding automation script information and configure the corresponding automatic optimization option.
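The decision flow of Fig. 4 as described above can be condensed into a short Python sketch. The function name, the returned action labels, and the modelling of the measure database as a dictionary and of the high-availability engine as a callable are all assumptions made for illustration.

```python
def handle_alarm(grade, alarm_key, measure_db, passes_ha):
    """Return the action taken for an alarm.

    measure_db maps an alarm key to a stored treatment measure;
    passes_ha is a callable standing in for the high-availability engine.
    """
    measure = measure_db.get(alarm_key)
    if grade == "minor":
        if measure is None:
            # First-time alarm: no stored measure, so staff handle it manually.
            return "notify_staff_manual"
        # Historical alarm: handle automatically and close the alarm.
        return "auto_execute_and_close"
    # High-risk alarm: automate only if a measure exists and it passes the check.
    if measure is not None and passes_ha(measure):
        return "auto_execute_script"
    return "notify_staff_manual"
```

Reporting of optimization items and updating the measure database after manual handling are omitted from the sketch.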
In the above technical solution, after the warning information of the first server is received, it is first judged whether the first server is a failed server, and only if so are the first treatment measures for the first server generated. This avoids performing unnecessary O&M handling on a server and thereby affecting the normal operation of the whole system. Further, after the first treatment measures for the first server are generated, they are judged in combination with the operation information and resource information of the system to which the first server belongs, to determine whether they meet the high-availability condition; if so, the first treatment measures are executed on the first server. This safeguards the high availability of the system and further avoids the maintenance of one server affecting the normal operation of the whole system. Using an automated handling scheme also improves troubleshooting efficiency.
Based on the same inventive concept, Fig. 5 exemplarily shows the structure of an alarm-based O&M device provided by an embodiment of the present invention; the device can execute the flow of the alarm-based O&M method.
The device includes:
A receiving unit 501, configured to receive the warning information of a first server;
A processing unit 502, configured to: determine, according to the alarm type of the warning information, whether the first server is a failed server, and if so, generate first treatment measures for the first server according to the warning information; judge, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measures meet a high-availability condition; and, if the high-availability condition is met, execute the first treatment measures on the first server;
The operation information of the system includes the operation information of each server in the system; the resource information of the system includes the hardware information of each server in the system; the high-availability condition is a preset condition indicating that, after the first treatment measures are executed on the first server, the operation information of the system is not lower than a first preset index.
Optionally, the processing unit 502 is further configured to:
When the alarm type of the warning information is a minor alarm, judge whether the minor alarm is a first-time alarm; if so, close the minor alarm, update the minor alarm to the third database, and notify staff so that the staff set treatment measures for the minor alarm; otherwise, close the minor alarm and execute the treatment measures of the minor alarm;
When the alarm type of the warning information is a high-risk alarm, determine whether the first server is a failed server.
Optionally, the processing unit 502 is specifically configured to:
When the warning information of the first server indicates that the data processing information of the first server does not meet a second preset index, determine, according to the position of the first server in the system to which it belongs, a second server located upstream of the first server; if the second server is determined to be a failed server, determine that the first server is not a failed server; or
The processing unit 502 is specifically configured to:
According to the identifier of the first server in the warning information, obtain the version information of the first server from a first database; if it is determined that the warning information was caused while the release status of the first server was changing, determine that the first server is not a failed server; the first database records the release status of each server in the system.
Optionally, the processing unit 502 is specifically configured to:
According to the first treatment measures, call the script information corresponding to the first treatment measures from a second database, so as to execute the first treatment measures on the first server; the second database records the script information corresponding to the treatment measures of each server in the system.
Optionally, the first treatment measures are used to indicate isolating the first server;
The processing unit 502 is specifically configured to:
Send a first shutdown instruction to the message middleware corresponding to the system, where the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
Send a second shutdown instruction to the first server, instructing the first server to close the service instance running on the first server; or
Send a third shutdown instruction to the power interface of the first server, instructing the power interface to close so that the first server is powered off.
Optionally, the processing unit 502 is specifically configured to:
According to the identifier of the first server in the warning information, obtain the historical failure information of the first server and the corresponding treatment measures from a third database; the third database is used to store the historical failure information of each server in the system and the corresponding treatment measures;
Combine the failure information in the warning information with the historical failure information of the first server and the corresponding treatment measures to generate the first treatment measures for the first server.
Based on the same inventive concept, an embodiment of the present invention further provides a computing device, including:
A memory, configured to store program instructions;
A processor, configured to call the program instructions stored in the memory and execute the above alarm-based O&M method according to the obtained program.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable non-volatile storage medium including computer-readable instructions; when a computer reads and executes the computer-readable instructions, the computer is caused to execute the above alarm-based O&M method.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include them.
Claims (14)
1. An alarm-based O&M method, characterized by comprising:
Receiving warning information of a first server;
Determining, according to the alarm type of the warning information, whether the first server is a failed server, and if so, generating first treatment measures for the first server according to the warning information;
Judging, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measures meet a high-availability condition; the operation information of the system includes the operation information of each server in the system; the resource information of the system includes the hardware information of each server in the system; the high-availability condition is a preset condition indicating that, after the first treatment measures are executed on the first server, the operation information of the system is not lower than a first preset index;
If the high-availability condition is met, executing the first treatment measures on the first server.
2. The method according to claim 1, characterized in that determining, according to the alarm type of the warning information, whether the first server is a failed server comprises:
When the alarm type of the warning information is a minor alarm, judging whether the minor alarm is a first-time alarm; if so, closing the minor alarm, updating the minor alarm to a third database, and notifying staff so that the staff set treatment measures for the minor alarm; otherwise, closing the minor alarm and executing the treatment measures of the minor alarm;
When the alarm type of the warning information is a high-risk alarm, determining whether the first server is a failed server.
3. The method according to claim 2, characterized in that determining that the first server is not a failed server comprises:
When the warning information of the first server indicates that the data processing information of the first server does not meet a second preset index, determining, according to the position of the first server in the system to which it belongs, a second server located upstream of the first server; if the second server is determined to be a failed server, determining that the first server is not a failed server; or
Determining that the first server is not a failed server comprises:
According to the identifier of the first server in the warning information, obtaining the version information of the first server from a first database; if it is determined that the warning information was caused while the release status of the first server was changing, determining that the first server is not a failed server; the first database records the release status of each server in the system.
4. The method according to claim 1, characterized in that executing the first treatment measures on the first server comprises:
According to the first treatment measures, calling the script information corresponding to the first treatment measures from a second database, so as to execute the first treatment measures on the first server; the second database records the script information corresponding to the treatment measures of each server in the system.
5. The method according to claim 1, characterized in that the first treatment measures are used to indicate isolating the first server;
Executing the first treatment measures on the first server comprises:
Sending a first shutdown instruction to the message middleware corresponding to the system, where the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
Sending a second shutdown instruction to the first server, instructing the first server to close the service instance running on the first server; or
Sending a third shutdown instruction to the power interface of the first server, instructing the power interface to close so that the first server is powered off.
6. The method according to any one of claims 1 to 5, characterized in that generating the first treatment measures for the first server according to the warning information comprises:
According to the identifier of the first server in the warning information, obtaining the historical failure information of the first server and the corresponding treatment measures from a third database; the third database is used to store the historical failure information of each server in the system and the corresponding treatment measures;
Combining the failure information in the warning information with the historical failure information of the first server and the corresponding treatment measures to generate the first treatment measures for the first server.
7. An alarm-based O&M device, characterized by comprising:
A receiving unit, configured to receive warning information of a first server;
A processing unit, configured to: determine, according to the alarm type of the warning information, whether the first server is a failed server, and if so, generate first treatment measures for the first server according to the warning information; judge, in combination with the operation information and resource information of the system to which the first server belongs, whether the first treatment measures meet a high-availability condition; and, if the high-availability condition is met, execute the first treatment measures on the first server;
The operation information of the system includes the operation information of each server in the system; the resource information of the system includes the hardware information of each server in the system; the high-availability condition is a preset condition indicating that, after the first treatment measures are executed on the first server, the operation information of the system is not lower than a first preset index.
8. The device according to claim 7, characterized in that the processing unit is specifically configured to:
When the alarm type of the warning information is a minor alarm, judge whether the minor alarm is a first-time alarm; if so, close the minor alarm, update the minor alarm to a third database, and notify staff so that the staff set treatment measures for the minor alarm; otherwise, close the minor alarm and execute the treatment measures of the minor alarm;
When the alarm type of the warning information is a high-risk alarm, determine whether the first server is a failed server.
9. The device according to claim 8, characterized in that the processing unit is specifically configured to:
When the warning information of the first server indicates that the data processing information of the first server does not meet a second preset index, determine, according to the position of the first server in the system to which it belongs, a second server located upstream of the first server; if the second server is determined to be a failed server, determine that the first server is not a failed server; or
The processing unit is specifically configured to:
According to the identifier of the first server in the warning information, obtain the version information of the first server from a first database; if it is determined that the warning information was caused while the release status of the first server was changing, determine that the first server is not a failed server; the first database records the release status of each server in the system.
10. The device according to claim 7, characterized in that the processing unit is specifically configured to:
According to the first treatment measures, call the script information corresponding to the first treatment measures from a second database, so as to execute the first treatment measures on the first server; the second database records the script information corresponding to the treatment measures of each server in the system.
11. The device according to claim 7, characterized in that the first treatment measures are used to indicate isolating the first server;
The processing unit is specifically configured to:
Send a first shutdown instruction to the message middleware corresponding to the system, where the first shutdown instruction includes the identifier of the first server and a first duration, and instructs the message middleware to isolate the first server for the first duration; or
Send a second shutdown instruction to the first server, instructing the first server to close the service instance running on the first server; or
Send a third shutdown instruction to the power interface of the first server, instructing the power interface to close so that the first server is powered off.
12. The device according to any one of claims 7 to 11, characterized in that the processing unit is specifically configured to:
According to the identifier of the first server in the warning information, obtain the historical failure information of the first server and the corresponding treatment measures from a third database; the third database is used to store the historical failure information of each server in the system and the corresponding treatment measures;
Combine the failure information in the warning information with the historical failure information of the first server and the corresponding treatment measures to generate the first treatment measures for the first server.
13. A computing device, characterized by comprising:
A memory, configured to store program instructions;
A processor, configured to call the program instructions stored in the memory and execute, according to the obtained program, the method according to any one of claims 1 to 6.
14. A computer-readable non-volatile storage medium, characterized by including computer-readable instructions; when a computer reads and executes the computer-readable instructions, the computer is caused to execute the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910579323.3A CN110275795A (en) | 2019-06-28 | 2019-06-28 | A kind of O&M method and device based on alarm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910579323.3A CN110275795A (en) | 2019-06-28 | 2019-06-28 | A kind of O&M method and device based on alarm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110275795A (en) | 2019-09-24 |
Family
ID=67963730
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910579323.3A Pending CN110275795A (en) | 2019-06-28 | 2019-06-28 | A kind of O&M method and device based on alarm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110275795A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851322A (en) * | 2019-10-11 | 2020-02-28 | 平安科技(深圳)有限公司 | Hardware equipment abnormity monitoring method, server and computer readable storage medium |
CN111008105A (en) * | 2019-11-07 | 2020-04-14 | 泰康保险集团股份有限公司 | Distributed system call relation visualization method and device |
CN112036588A (en) * | 2020-09-04 | 2020-12-04 | 中国平安财产保险股份有限公司 | System operation and maintenance flow management and control method and device |
CN112491625A (en) * | 2020-11-30 | 2021-03-12 | 深圳前海微众银行股份有限公司 | Operation and maintenance alarming method, device and equipment based on instant communication platform |
CN112700343A (en) * | 2019-10-23 | 2021-04-23 | 中国石油天然气股份有限公司 | Operation monitoring method and system based on oil gas Internet of things |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110851322A (en) * | 2019-10-11 | 2020-02-28 | 平安科技(深圳)有限公司 | Hardware equipment abnormity monitoring method, server and computer readable storage medium |
CN112700343A (en) * | 2019-10-23 | 2021-04-23 | 中国石油天然气股份有限公司 | Operation monitoring method and system based on oil gas Internet of things |
CN112700343B (en) * | 2019-10-23 | 2024-07-30 | 中国石油天然气股份有限公司 | Operation monitoring method and system based on oil-gas Internet of things |
CN111008105A (en) * | 2019-11-07 | 2020-04-14 | 泰康保险集团股份有限公司 | Distributed system call relation visualization method and device |
CN112036588A (en) * | 2020-09-04 | 2020-12-04 | 中国平安财产保险股份有限公司 | System operation and maintenance flow management and control method and device |
CN112491625A (en) * | 2020-11-30 | 2021-03-12 | 深圳前海微众银行股份有限公司 | Operation and maintenance alarming method, device and equipment based on instant communication platform |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110275795A (en) | A kind of O&M method and device based on alarm | |
US20190378073A1 (en) | Business-Aware Intelligent Incident and Change Management | |
US9558459B2 (en) | Dynamic selection of actions in an information technology environment | |
CN111897671A (en) | Failure recovery method, computer device, and storage medium | |
CN110166297A (en) | O&M method, system, equipment and computer readable storage medium | |
US8572244B2 (en) | Monitoring tool deployment module and method of operation | |
US12111738B2 (en) | Managing data center failure events | |
CN107707415B (en) | SaltStack-based automatic monitoring and warning method for server configuration | |
US20090201144A1 (en) | Alarm management apparatus | |
CN111552556A (en) | GPU cluster service management system and method | |
CN108353086A (en) | Deployment assurance checks for monitoring industrial control systems | |
CN117992304A (en) | Integrated intelligent operation and maintenance platform | |
CN112615737B (en) | Method and system for automatically monitoring service system | |
CN109783310A (en) | The Dynamic and Multi dimensional method for safety monitoring and its monitoring device of information technoloy equipment | |
CN117687874A (en) | Monitoring method and device for operation and maintenance platform | |
CN117313012A (en) | Fault management method, device, equipment and storage medium of service orchestration system | |
US8984122B2 (en) | Monitoring tool auditing module and method of operation | |
CN111222928A (en) | Method and system for monitoring enterprise standard invoicing | |
CN116149824A (en) | Task re-running processing method, device, equipment and storage medium | |
US8560375B2 (en) | Monitoring object system and method of operation | |
CN114418488B (en) | Inventory information processing method, device and system | |
CN107797915B (en) | Fault repairing method, device and system | |
CN112540771A (en) | Automated operation and maintenance method, system, equipment and computer readable storage medium | |
CN110956456A (en) | Money printing processing method, device and system | |
CN110569287B (en) | Control method, system, electronic equipment and storage medium for product spot test |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||