CN104104537A - State-based service monitoring and recovery method and device - Google Patents

Publication number
CN104104537A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310129532.0A
Other languages
Chinese (zh)
Other versions
CN104104537B (en)
Inventor
沙永刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIMESCHINA BEIJING TECHNOLOGY CO LTD
Original Assignee
TIMESCHINA BEIJING TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIMESCHINA BEIJING TECHNOLOGY CO LTD
Priority to CN201310129532.0A
Publication of CN104104537A
Application granted
Publication of CN104104537B
Expired - Fee Related

Abstract

The invention discloses a state-based service monitoring and recovery method and device. The device comprises a strategy configuration unit, a communication analysis unit, an operation analysis unit, an output analysis unit, a resource analysis unit, a clearing unit, a recovery unit, a scheduling control unit and a protocol interaction unit. The strategy configuration unit configures the parameters for service monitoring and recovery; the communication analysis unit analyzes the service state; the operation analysis unit analyzes the running state; the output analysis unit analyzes the service output; the resource analysis unit analyzes the resources used by the service; the clearing unit stops the service without damage; the recovery unit restores the service; the scheduling control unit controls the steps and flow of the method; and the protocol interaction unit obtains the service monitoring configuration and strategy and provides the monitoring result. The invention provides accurate monitoring and automatic recovery for services that a computer runs in forms such as services, programs and applications, effectively improves continuity of operation and the timeliness and effectiveness of maintenance, and provides secure monitoring.

Description

A state-based service monitoring and recovery method and device
Technical field
The present invention relates to information service monitoring and recovery technology, and in particular to methods and techniques for operation monitoring, operation and maintenance (O&M), and ensuring the continuous operation of information service systems.
Background
As informatization deepens, information service systems have spread to every industry. They run without interruption, and downtime caused by system damage, delayed maintenance or improper maintenance has very serious consequences. Monitoring and O&M technology for information systems has developed in response. A key to the continuous operation of an information system is keeping its applications running continuously. The basic working principle is to monitor a service and, once it is found to be unable to provide normal service, to recover it. The effect generally required is that this runs automatically, on a dual-machine setup or on the local machine, without manual intervention.
The common requirements for each link are as follows: monitoring should be as accurate as possible; recovery must not cause secondary damage, must be effective, and should take as little time as possible. The ability to adapt to different applications must also be considered, as must avoiding serious impact on the business running in the system.
Summary of the invention
In view of this, the invention provides a state-based service monitoring and recovery device. All links of monitoring and recovery need to be deployed only once, which makes maintenance convenient and lets the device run intelligently and automatically. The host program end is the core component of the device. The device comprises:
A communication analysis unit which, for a service that a computer provides in forms such as a service, program or application through a TCP/IP communication port, analyzes the service state of its communication unit, its service response capability and the correctness of the service, the results serving as a basis for the other units;
An operation analysis unit which, for a service that a computer provides in forms such as a service, program or application, analyzes its running state and operating parameters, the results serving as a basis for the other units;
An output analysis unit which, for a service that a computer provides in forms such as a service, program or application, analyzes its regular and occasional output, the results serving as a basis for the other units;
A resource analysis unit which, for a service that a computer provides in forms such as a service, program or application, analyzes the running state of the software and hardware resources the service needs in order to run, the results serving as a basis for the other units;
A clearing unit which, when the service fails, is executed according to the results of the relevant units to stop the service without damage and release its resources;
A recovery unit which, when the service fails, is executed according to the results of the relevant units to restore the service;
A scheduling control unit which, according to the strategy, determines whether service monitoring is needed and starts or stops the work of the relevant units;
A protocol interaction unit which obtains the pre-configured configuration and strategy for monitoring the service, passes them to the relevant units, and returns the monitoring results to the components that use them;
A management center which comprises the non-core strategy configuration unit and a notification unit and serves as the input and output unit of the device.
Preferably, the working parameters of the strategy configuration unit include the device on which the service runs, the components of the service and their working order, the operating system category of the service, the software and hardware resources the service depends on, the communication port, the notification recipients, the custom development interface, the executable programs, and the time schedule for monitoring and recovery. These parameters are mainly collected by the unit according to instructions and do not need to be entered manually; only parameters that do not exist in the system are specified by the user.
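The parameter list above can be pictured as a single strategy object. The following is a minimal sketch, not the patented implementation: the class name `ServiceStrategy`, its field names and the helper `merge_parameters` are all hypothetical and chosen only to mirror the parameters named in the text, with automatically collected values taking precedence over user input.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceStrategy:
    """Hypothetical container for the working parameters listed in the text."""
    host: str                                  # device on which the service runs
    components: List[str]                      # service components, in working order
    os_type: str                               # operating system category
    communication_port: Optional[int] = None   # TCP/IP port, if the service has one
    software_resources: List[str] = field(default_factory=list)
    hardware_resources: List[str] = field(default_factory=list)
    notify: List[str] = field(default_factory=list)   # notification recipients
    clearing_program: Optional[str] = None     # executable used by the clearing unit
    recovery_program: Optional[str] = None     # executable used by the recovery unit


def merge_parameters(collected: Dict, user_supplied: Dict) -> Dict:
    """Prefer automatically collected values; keep user input only for
    parameters the system could not supply (missing or None)."""
    merged = dict(user_supplied)
    merged.update({k: v for k, v in collected.items() if v is not None})
    return merged


params = merge_parameters(
    collected={"host": "app01", "os_type": "Linux", "communication_port": None},
    user_supplied={"communication_port": 8080, "os_type": "entered-by-hand"},
)
strategy = ServiceStrategy(components=["database", "application"], **params)
```

Here the port comes from the user because collection returned nothing for it, while the collected operating system category overrides the manually entered one.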
Preferably, the execution interval between the clearing unit and the recovery unit has an important effect on the system. This parameter is adjustable; it is generally not less than 30 seconds and should not exceed 5 minutes.
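As an illustration of the adjustable interval, a configured value could simply be clamped to the range given in the text. This is a sketch under the assumption that out-of-range values are silently corrected rather than rejected; the function and constant names are hypothetical.

```python
MIN_INTERVAL_S = 30        # lower bound stated in the text (30 seconds)
MAX_INTERVAL_S = 5 * 60    # upper bound stated in the text (5 minutes)


def clamp_interval(requested_s: float) -> float:
    """Return the requested clearing-to-recovery interval, limited to
    the range [MIN_INTERVAL_S, MAX_INTERVAL_S]."""
    return max(MIN_INTERVAL_S, min(MAX_INTERVAL_S, requested_s))
```

A requested interval of 10 seconds would be raised to 30, and one of 15 minutes lowered to 300 seconds.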
Preferably, hardware resources generally include the disk array used by the service, in forms such as a file system or raw device; software resources generally include resources in forms such as NFS and WebService.
Preferably, the executable programs of the clearing unit and the recovery unit need to run in the same environment and under the same identity as the application, and are placed under signature protection: unauthorized modification triggers an alarm and automatic restoration, which ensures security during maintenance.
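The text does not say which signature scheme protects the clearing and recovery programs. The sketch below substitutes a plain SHA-256 digest comparison as a stand-in, purely to show the verify, alarm and restore sequence; `file_digest`, `verify_program` and `guard` are hypothetical names, and a real deployment would more likely use a keyed or asymmetric signature.

```python
import hashlib
import hmac
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """SHA-256 digest of the program file, as a hex string."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_program(path: Path, expected_digest: str) -> bool:
    """True when the program on disk still matches the recorded digest."""
    return hmac.compare_digest(file_digest(path), expected_digest)


def guard(path: Path, expected_digest: str, trusted_copy: bytes,
          alarm: Callable[[str], None]) -> bool:
    """Check the program before it is run. On mismatch, raise an alarm and
    restore the trusted copy. Returns True only if the file was intact."""
    if verify_program(path, expected_digest):
        return True
    alarm(f"unauthorized modification detected: {path}")
    path.write_bytes(trusted_copy)   # automatic restoration
    return False
```

Calling `guard` before every clearing or recovery run means a tampered script is never executed in its modified form.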
The present invention also provides a state-based application monitoring and recovery method. All links of application monitoring and recovery in O&M need to be deployed only once, which makes maintenance convenient and supervision intelligent and automatic. The method comprises:
For a service that a computer provides in forms such as a service, program or application through a TCP/IP communication port, analyzing the service state of its communication unit, its service response capability and the correctness of the service, the results serving as a basis for the other steps;
For a service that a computer provides in forms such as a service, program or application, analyzing its running state and operating parameters, the results serving as a basis for the other steps;
For a service that a computer provides in forms such as a service, program or application, analyzing its regular and occasional output, the results serving as a basis for the other steps;
For a service that a computer provides in forms such as a service, program or application, analyzing the running state of the software and hardware resources the service needs in order to run, the results serving as a basis for the other steps;
When the service fails, performing clearing according to the results of the relevant steps, so as to stop the service without damage and release its resources;
When the service fails, performing recovery according to the results of the relevant steps, so as to restore the service;
According to the strategy, determining whether service monitoring is needed, and starting or stopping the relevant work;
Obtaining the pre-configured configuration and strategy for monitoring the service, passing them to the relevant steps, and returning the monitoring results to the components that use them.
Preferably, the working parameters of the strategy configuration include the device on which the service runs, the components of the service and their working order, the operating system category of the service, the software and hardware resources the service depends on, the communication port, the notification recipients, the custom development interface, the executable programs, and the time schedule for monitoring and recovery. These parameters are mainly collected automatically and do not need to be entered manually; only parameters that do not exist in the system are specified by the user.
Preferably, the execution interval between clearing and recovery has an important effect on the system. This parameter is adjustable; it is generally not less than 30 seconds and should not exceed 5 minutes.
Preferably, hardware resources generally include the disk array used by the service, in forms such as a file system or raw device; software resources generally include resources in forms such as NFS and WebService.
Preferably, the execution of clearing and recovery needs to use the same running environment and identity as the application, and is placed under signature protection: unauthorized modification triggers an alarm and automatic restoration, which ensures security during maintenance.
The invention is based on strategy configuration technology and gives monitoring an object-oriented strategy, which greatly simplifies deployment and maintenance: once a strategy object has been created, the time needed for a secondary deployment is reduced by more than 90%. Deployment and maintenance also become much more flexible, since a monitoring strategy can be designed around the characteristics of the business itself. The units of the invention are highly integrated and work accurately and reliably. In actual tests the invention not only gave very satisfactory results; its signaling mechanism also ensured compatibility with the dual-machine recovery products of companies such as IBM, HP and Oracle (Sun) and with offline backup software.
Brief description of the drawings:
Fig. 1 shows the application environment of one embodiment of the present invention.
Fig. 2 is a logical structure diagram of the state-based application monitoring and recovery device of the present invention.
Fig. 3 is a schematic diagram of the user interface for strategy configuration and management of the present invention.
Embodiment:
Referring to Fig. 1, in an information system operation monitoring scenario, a host program end is usually installed on the host to carry out monitoring and to provide monitoring and recovery services. The monitoring device of the present invention, based on state-based application monitoring and recovery technology, is applied in the host program end and can be implemented in software. The device mainly comprises a communication analysis unit 11, an operation analysis unit 12, an output analysis unit 13, a resource analysis unit 14, a clearing unit 15, a recovery unit 16, a scheduling control unit 17, a protocol interaction unit 18 and a monitoring protection unit 19. The processing flow executed when the device runs is described below, taking a software implementation as an example.
Step 1: receive and load the strategy information sent by the configuration engine, the strategy information comprising the detailed technical parameters of the monitoring and recovery strategy. This step is performed by the protocol interaction unit 18.
First, all parameters of the monitoring strategy are entered at the strategy configuration end. The monitoring strategy parameters are sent to the protocol interaction unit over an encrypted protocol channel; the protocol interaction unit checks the strategy according to its processing logic and then injects the parameters into the state retrieval unit.
The strategy parameters are all the parameters the monitoring engine needs in order to work.
A strategy-based working engine allows monitoring and recovery to run without intervention or management by O&M personnel and to work automatically, flexibly and in real time, including deciding whether the monitoring rules automatically enter a sleep period. See Fig. 3.
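The automatic sleep period mentioned above can be illustrated with a small time-window test. This is only a sketch: the text does not describe how the sleep period is configured, so a daily window given by two clock times is an assumption, as is the function name `in_sleep_period`.

```python
from datetime import time


def in_sleep_period(now: time, start: time, end: time) -> bool:
    """True when `now` falls inside the daily window [start, end).
    Windows that cross midnight (start later than end) are supported;
    a window with start == end is treated as empty."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```

With a window from 22:00 to 06:00, a check at 23:30 falls inside the sleep period and one at 12:00 does not, so a scheduler could skip monitoring in the first case.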
Step 2: according to the injected monitoring parameters, perform scheduling control over monitoring and recovery and control the work of each working unit.
At run time, according to the working parameters received, the communication analysis unit 11 (optional) initiates a monitoring session, and the operation analysis unit 12 (optional), the resource analysis unit 14 (optional) and the output analysis unit 13 (optional) run continuously in order. Combining them selectively makes the comprehensive analysis accurate and reliable while accommodating the complexity of user environments.
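The selective, ordered execution of the optional analysis units might look like the following sketch. It assumes each unit can be reduced to a callable returning pass or fail and that a crashing check counts as a failure; the names `Analyzer` and `run_enabled_analyzers` are hypothetical and not taken from the patent.

```python
from typing import Callable, Dict, List, Tuple

Analyzer = Callable[[], bool]   # returns True when the check passes


def run_enabled_analyzers(
    analyzers: List[Tuple[str, Analyzer]],
    enabled: Dict[str, bool],
) -> Dict[str, bool]:
    """Run the analysis units that the strategy enables, in list order,
    and collect one pass/fail result per unit that actually ran."""
    results: Dict[str, bool] = {}
    for name, check in analyzers:
        if not enabled.get(name, False):
            continue                      # unit not selected by the strategy
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False         # a crashing check is treated as a failure
    return results
```

Units left out of `enabled` simply do not appear in the result, which keeps the later judgment step independent of how many units a given strategy selects.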
Because monitoring involves multiple protocols, different protocols need to be deployed in the communication analysis unit so that the analysis function is powerful and complete.
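The simplest communication check is whether the service's TCP port accepts connections. The sketch below shows only that basic probe, using the standard `socket` module; the registry `PROBES` is a hypothetical way to plug in further protocol-specific checks and is not described in the patent.

```python
import socket
from typing import Callable, Dict


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True when a TCP connection to (host, port) can be established
    within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical registry: one probe function per protocol name.
PROBES: Dict[str, Callable[..., bool]] = {"tcp": tcp_port_open}
```

A successful connection shows only that the port is listening; checking response capability and correctness, as the text requires, would need a protocol-aware probe registered alongside this one.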
Step 3: collect and process the analysis results of the above steps, then make a comprehensive analysis and judgment according to the requirements of the strategy. Signature verification is started first; then either the clearing unit is selectively started, or the protocol interaction unit is started directly to report the fault and the current status.
Step 4: repeat step 2 and collect and process the analysis results, then again make a comprehensive analysis and judgment according to the requirements of the strategy. Signature verification is started first; then either the recovery unit is selectively started, or the protocol interaction unit is started directly to report the fault and the current status.
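Steps 3 and 4 share the same shape: judge the collected results, verify the signature, then either act or only report. The sketch below makes two simplifying assumptions that the text does not state: any failed check counts as a fault, and a failed signature verification blocks the action. The names `Action` and `decide` are hypothetical.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    REPORT_ONLY = "report"    # hand the status to the protocol interaction unit
    CLEAR = "clear"           # start the clearing unit (step 3)
    RECOVER = "recover"       # start the recovery unit (step 4)


def decide(results: Dict[str, bool], signature_ok: bool, phase: str) -> Action:
    """Choose the next action. `phase` is "clear" for step 3 and
    "recover" for step 4."""
    if phase not in ("clear", "recover"):
        raise ValueError(f"unknown phase: {phase!r}")
    if all(results.values()):
        return Action.REPORT_ONLY     # no fault found: report status only
    if not signature_ok:
        return Action.REPORT_ONLY     # never run an unverified program
    return Action.CLEAR if phase == "clear" else Action.RECOVER
```

In both phases the outcome is reported through the protocol interaction unit; the returned action only determines whether the clearing or recovery program runs first.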
Furthermore, the diversity and complexity of services make it difficult to guarantee accurate monitoring. A comprehensive analysis is needed that covers system process state, the overall service state at the communication protocol level, software and hardware resources, configuration files, working files, and all running processes and services that the service depends on, together with their parameters. These analysis results serve as the working basis for each step of the clearing and recovery units and ensure that they are effective.
At present there is very little software that implements service monitoring and automatic recovery, and most of it consists of foreign products that implement dual-machine backup, whose specific working mechanisms are seldom disclosed. Their main drawbacks are high cost and the fact that each is a stand-alone product with no unified technology or O&M system, so they demand a high level of professional skill and are hard to maintain. By contrast, the device of the present invention is strategy-based and state-based and is implemented with a unified working mechanism linked to the configuration engine and the notification engine. It requires very little maintenance and management work, turns monitoring work into a service, and achieves the desired effect. It not only performs monitoring and recovery but also, through strategies, sends these changes and their handling to the people concerned as clear, detailed and graded notifications.
Implementing monitoring and recovery within a unified O&M system is extremely rare in information system O&M.
The present invention allows service monitoring and recovery to be unified with operation monitoring, network management, security alarms, ITIL O&M and the like in a single O&M system, giving organic management of the whole system, greatly raising the O&M level of the information system and reducing the O&M workload.
By turning strategy deployment into objects, the present invention achieves high adaptability. On a functionally safe and reliable basis it also provides a unified graphical interface that is compatible with UNIX, Linux and Windows operating systems, gives users a good operation and management experience, and reduces secondary deployment and maintenance time by more than 90%. The prior art, in contrast, generally suffers from manual work plus manual parameter tuning, cumbersome deployment, the absence of a notification mechanism and poor adaptability. The present invention eliminates these shortcomings of monitoring systems; together with the other features of products applying this patent, it lets users' O&M work largely match their needs, and the device can be deployed on all mainstream commercial operating systems. In a number of cases, monitoring and recovery have run without human intervention with good results, the applications have run smoothly, and continuous operating time has exceeded two years.
The above describes only preferred implementations of the present invention and is not intended to limit its scope of protection; any equivalent changes and modifications shall fall within the scope of protection of the present invention.

Claims (10)

1. A state-based service monitoring and recovery device, wherein all links of application monitoring and recovery in O&M need to be deployed only once, making maintenance convenient and supervision intelligent and automatic, the device comprising:
A communication analysis unit which, for a service that a computer provides in forms such as a service, program or application through a TCP/IP communication port, analyzes the service state of its communication unit, its service response capability and the correctness of the service, the results serving as a basis for the other units;
An operation analysis unit which, for a service that a computer provides in forms such as a service, program or application, analyzes its running state and operating parameters, the results serving as a basis for the other units;
An output analysis unit which, for a service that a computer provides in forms such as a service, program or application, analyzes its regular and occasional output, the results serving as a basis for the other units;
A resource analysis unit which, for a service that a computer provides in forms such as a service, program or application, analyzes the running state of the software and hardware resources the service needs in order to run, the results serving as a basis for the other units;
A clearing unit which, when the service fails, is executed according to the results of the relevant units to stop the service without damage and release its resources;
A recovery unit which, when the service fails, is executed according to the results of the relevant units to restore the service;
A scheduling control unit which, according to the strategy, determines whether service monitoring is needed and starts or stops the work of the relevant units;
A protocol interaction unit which obtains the pre-configured configuration and strategy for monitoring the service, passes them to the relevant units, and returns the monitoring results to the components that use them.
2. The device according to claim 1, wherein the working parameters of the strategy configuration unit include the device on which the service runs, the components of the service and their working order, the operating system category of the service, the software and hardware resources the service depends on, the communication port, the notification recipients, the custom development interface, the executable programs, and the time schedule for monitoring and recovery; these parameters are mainly collected by the unit according to instructions and do not need to be entered manually, and only parameters that do not exist in the system are specified by the user.
3. The device according to claim 1, wherein the execution interval between the clearing unit and the recovery unit has an important effect on the system; this parameter is adjustable, is generally not less than 30 seconds and should not exceed 5 minutes.
4. The device according to claim 1, wherein hardware resources generally include the disk array used by the service, in forms such as a file system or raw device, and software resources generally include resources in forms such as NFS and WebService.
5. The device according to claim 1, wherein the executable programs of the clearing unit and the recovery unit need to run in the same environment and under the same identity as the application, and are placed under signature protection, so that unauthorized modification triggers an alarm and automatic restoration, ensuring security during maintenance.
6. A state-based service monitoring and recovery method, wherein all links of application monitoring and recovery in O&M need to be deployed only once, making maintenance convenient and supervision intelligent and automatic, the method comprising:
For a service that a computer provides in forms such as a service, program or application through a TCP/IP communication port, analyzing the service state of its communication unit, its service response capability and the correctness of the service, the results serving as a basis for the other steps;
For a service that a computer provides in forms such as a service, program or application, analyzing its running state and operating parameters, the results serving as a basis for the other steps;
For a service that a computer provides in forms such as a service, program or application, analyzing its regular and occasional output, the results serving as a basis for the other steps;
For a service that a computer provides in forms such as a service, program or application, analyzing the running state of the software and hardware resources the service needs in order to run, the results serving as a basis for the other steps;
When the service fails, performing clearing according to the results of the relevant steps, so as to stop the service without damage and release its resources;
When the service fails, performing recovery according to the results of the relevant steps, so as to restore the service;
According to the strategy, determining whether service monitoring is needed, and starting or stopping the relevant work;
Obtaining the pre-configured configuration and strategy for monitoring the service, passing them to the relevant steps, and returning the monitoring results to the components that use them.
7. The method according to claim 6, wherein the working parameters of the strategy configuration include the device on which the service runs, the components of the service and their working order, the operating system category of the service, the software and hardware resources the service depends on, the communication port, the notification recipients, the custom development interface, the executable programs, and the time schedule for monitoring and recovery; these parameters are mainly collected automatically and do not need to be entered manually, and only parameters that do not exist in the system are specified by the user.
8. The method according to claim 6, wherein the execution interval between clearing and recovery has an important effect on the system; this parameter is adjustable, is generally not less than 30 seconds and should not exceed 5 minutes.
9. The method according to claim 6, wherein hardware resources generally include the disk array used by the service, in forms such as a file system or raw device, and software resources generally include resources in forms such as NFS and WebService.
10. The method according to claim 6, wherein the execution of clearing and recovery needs to use the same running environment and identity as the application, and is placed under signature protection, so that unauthorized modification triggers an alarm and automatic restoration, ensuring security during maintenance.
CN201310129532.0A 2013-04-15 2013-04-15 State-based service monitoring and recovery method and device Expired - Fee Related CN104104537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310129532.0A CN104104537B (en) 2013-04-15 2013-04-15 State-based service monitoring and recovery method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310129532.0A CN104104537B (en) 2013-04-15 2013-04-15 State-based service monitoring and recovery method and device

Publications (2)

Publication Number Publication Date
CN104104537A (en) 2014-10-15
CN104104537B (en) 2017-07-07

Family

ID=51672361

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310129532.0A Expired - Fee Related CN104104537B (en) 2013-04-15 2013-04-15 State-based service monitoring and recovery method and device

Country Status (1)

Country Link
CN (1) CN104104537B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060190243A1 (en) * 2005-02-24 2006-08-24 Sharon Barkai Method and apparatus for data management
CN101699824A (en) * 2009-11-16 2010-04-28 中兴通讯股份有限公司 Device and method for failure recovery
CN102143002A (en) * 2011-04-07 2011-08-03 中兴通讯股份有限公司 Method and system for backing up single-boards

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘继全: "信息系统运行安全综合管理监控平台的设计与实现", 《铁路计算机应用》 *

Also Published As

Publication number Publication date
CN104104537B (en) 2017-07-07

Similar Documents

Publication Publication Date Title
CN107241224B (en) Network risk monitoring method and system for transformer substation
CN104022904B (en) Distributed computer room information technoloy equipment management platform
CN101877618B (en) Monitoring method, server and system based on proxy-free mode
CN107632918B (en) Monitoring system and method for computing storage equipment
WO2016188100A1 (en) Information system fault scenario information collection method and system
CN103095492A (en) Data collection method and data collection device
CN101222742B (en) Alarm self-positioning and self-processing method and system for mobile communication network guard system
CA2564153A1 (en) Agent-less systems, methods and computer program products for managing a plurality of remotely located data storage systems
CN106201844A (en) A kind of log collecting method and device
US10270859B2 (en) Systems and methods for system-wide digital process bus fault recording
CN109802843A (en) A kind of network equipment monitoring system based on SNMP
US20120072556A1 (en) Method and System for Detecting Network Upgrades
CN104104537A (en) State-based service monitoring and recovery method and device
CN105045100A (en) Intelligent operation monitoring platform for management by use of mass data
CN112506154A (en) Internet of things monitoring system for domestic sewage treatment station
CN104104536B (en) A kind of concurrent poll monitoring method of self-regulation and device based on strategy
CN104104535A (en) Strategy-based unified monitoring and operation and maintenance method and device
CN103576673B (en) A kind of onboard replaceable unit detection system and detection method
Toueir et al. A goal-oriented approach for adaptive sla monitoring: a cloud provider case study
CN107526008A (en) Business electrical monitoring device and failure analysis methods
CN111913448A (en) Informationized intelligent control system
CN112565407A (en) Large-scale equipment remote cooperative operation and maintenance system based on industrial internet APP
CN103903107A (en) Intelligent real-time alarming method for energy management system
Cao et al. IT Operation and Maintenance Process improvement and design under virtualization environment
EP3751420B1 (en) Maintainable distributed fail-safe real-time computer system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170707