CN104104537A - State-based service monitoring and recovery method and device - Google Patents
- Publication number
- CN104104537A (application CN201310129532.0A)
- Authority
- CN
- China
- Prior art keywords
- service
- unit
- monitoring
- forms
- application
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a state-based service monitoring and recovery method and device. The device comprises a policy configuration unit, a communication analysis unit, an operation analysis unit, an output analysis unit, a resource analysis unit, a cleanup unit, a recovery unit, a scheduling control unit and a protocol interaction unit. The policy configuration unit configures the parameters for service monitoring and recovery; the communication analysis unit analyzes the service state; the operation analysis unit analyzes the running state; the output analysis unit analyzes the service output; the resource analysis unit analyzes the resources used by the service; the cleanup unit stops the service without damage; the recovery unit restores the service; the scheduling control unit controls the method steps and workflow; and the protocol interaction unit obtains the service monitoring configuration and policy and provides the monitoring result. The invention provides accurate monitoring and automatic recovery for services that a computer runs in the form of services, programs, applications and the like, effectively improves continuity of operation and the timeliness and effectiveness of maintenance, and provides secure monitoring.
Description
Technical field
The present invention relates to information service monitoring and recovery technology, and in particular to methods and techniques for operation monitoring, operation and maintenance (O&M), and continuous-operation assurance of information service systems.
Background technology
As informatization continues to deepen, information service systems have spread to every industry. They run without interruption, and downtime caused by system damage, by maintenance that is not carried out in time, or by improper maintenance has very serious consequences. Monitoring and O&M technologies for information systems have developed in response. A key to keeping an information system running continuously is ensuring that its applications run continuously. The basic working principle is to monitor a service and, once it is found to be unable to provide normal service, to recover it. The effect usually required is that this is carried out automatically on a dual-machine or single-machine setup, without manual intervention.
The common requirements for each link are as follows. For monitoring, the more accurate the result the better. For recovery, it must not cause secondary damage, it must be effective, and the shorter the recovery time the better. The ability to adapt to different applications must also be considered, as must the need to avoid seriously affecting the business running on the system.
Summary of the invention
In view of this, the invention provides a state-based service monitoring and recovery device. All monitoring and recovery links need to be deployed only once, which makes maintenance convenient and lets the device run intelligently and automatically. The host-side program is the core component of the device. The device comprises:
Communication analysis unit: for a service that a computer provides in the form of a service, program, application or the like through a TCP/IP communication port, analyzes the service state of its communication unit, its ability to respond, and the correctness of the service; the result serves as a basis for the other units;
Operation analysis unit: for a service that a computer provides in the form of a service, program, application or the like, analyzes its running state and running parameters; the result serves as a basis for the other units;
Output analysis unit: for a service that a computer provides in the form of a service, program, application or the like, analyzes its regular and occasional output; the result serves as a basis for the other units;
Resource analysis unit: for a service that a computer provides in the form of a service, program, application or the like, analyzes the running state of the software and hardware resources it needs in order to run; the result serves as a basis for the other units;
Cleanup unit: executed when the service fails, according to the results of the relevant units; stops the service without damage and releases its resources;
Recovery unit: executed when the service fails, according to the results of the relevant units; restores the service;
Scheduling control unit: according to the policy, determines whether the service needs to be monitored, and starts or stops the work of the relevant units;
Protocol interaction unit: obtains the pre-configured configuration and policy for monitoring the service, delivers them to the relevant units, and returns the monitoring result to the consuming component;
Management center: comprises the non-core policy configuration unit and notification unit, and is the input/output unit of the device.
Preferably, the working parameters of the policy configuration unit include the device hosting the service, the components of the service and their working order, the operating system type of the service, the software and hardware resources the service depends on, the time schedule for monitoring and recovery, communication ports, notification targets, custom development interfaces, executable programs and similar data. These parameters are mainly collected by the unit on instruction and need not be entered manually; only parameters that do not exist in the system are specified by the user.
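The parameter set described above can be pictured as a single policy object. The following is a minimal sketch only: the class name, field names and the `missing_parameters` helper are hypothetical and are not taken from the patent, which does not disclose a data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MonitoringPolicy:
    """Hypothetical container for the working parameters of one monitored service."""
    host: str                       # device hosting the service
    components: List[str]           # service components, listed in working order
    os_type: str                    # operating system type
    software_resources: List[str] = field(default_factory=list)
    hardware_resources: List[str] = field(default_factory=list)
    ports: List[int] = field(default_factory=list)
    notify: List[str] = field(default_factory=list)   # notification targets
    schedule: str = "always"        # time schedule for monitoring and recovery
    cleanup_program: str = ""       # executable used by the cleanup unit
    recovery_program: str = ""      # executable used by the recovery unit


def missing_parameters(policy: MonitoringPolicy) -> List[str]:
    """Names of parameters that were not collected and must be specified by the user."""
    return [name for name, value in vars(policy).items() if value in ("", [], None)]


policy = MonitoringPolicy(host="db01", components=["listener", "database"],
                          os_type="Linux", ports=[1521])
print(missing_parameters(policy))
```

In this sketch, anything the collector could not fill in is reported back so that only those fields need manual input, which mirrors the statement that only parameters absent from the system are specified by the user.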
Preferably, the interval between execution of the cleanup unit and execution of the recovery unit has a significant effect on the system. This parameter is adjustable; it is generally not less than 30 seconds and should not exceed 5 minutes.
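One way to enforce the stated bounds is to clamp any configured value into the 30 second to 5 minute range. This is a sketch under that assumption; the constant and function names are hypothetical, and the patent does not say whether out-of-range values are clamped or rejected.

```python
MIN_INTERVAL_S = 30    # lower bound stated in the text
MAX_INTERVAL_S = 300   # upper bound stated in the text (5 minutes)


def normalize_interval(seconds: float) -> float:
    """Clamp a configured cleanup-to-recovery interval into the allowed range."""
    return max(MIN_INTERVAL_S, min(MAX_INTERVAL_S, seconds))
```

Clamping keeps a misconfigured policy usable; rejecting the value with an error would be an equally reasonable design if silent correction is undesirable.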
Preferably, hardware resources generally include the disk arrays used by the service, and resources in the form of file systems or raw devices; software resources generally include resources in forms such as NFS and WebService.
Preferably, the executable programs of the cleanup and recovery units must run in the same runtime environment and under the same identity as the application, and are placed under signature protection; an unauthorized modification triggers an alarm and automatic restoration, which ensures security during maintenance.
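The signature protection described above could work roughly as follows. The patent does not name a signature scheme, so this sketch assumes HMAC-SHA256 over the program content with a key held by the monitoring device; `sign` and `verify_and_restore` are hypothetical names.

```python
import hashlib
import hmac
from typing import Tuple


def sign(content: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 signature of a program's content (assumed scheme)."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify_and_restore(current: bytes, trusted_copy: bytes,
                       expected_sig: str, key: bytes) -> Tuple[bytes, bool]:
    """Return (content to execute, alarm raised).

    If the current program no longer matches its signature, raise the alarm
    flag and fall back to the trusted copy, which must itself verify.
    """
    if hmac.compare_digest(sign(current, key), expected_sig):
        return current, False
    if not hmac.compare_digest(sign(trusted_copy, key), expected_sig):
        raise ValueError("trusted copy does not match the recorded signature")
    return trusted_copy, True
```

A caller would write the returned content back over the modified program and forward the alarm flag to the notification path before running cleanup or recovery.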
The present invention also provides a state-based application monitoring and recovery method. All links of O&M application monitoring and recovery need to be deployed only once, which makes maintenance convenient and supervision intelligent and automatic. The method comprises:
For a service that a computer provides in the form of a service, program, application or the like through a TCP/IP communication port, analyzing the service state of its communication unit, its ability to respond, and the correctness of the service, the result serving as a basis for the other units;
For a service that a computer provides in the form of a service, program, application or the like, analyzing its running state and running parameters, the result serving as a basis for the other units;
For a service that a computer provides in the form of a service, program, application or the like, analyzing its regular and occasional output, the result serving as a basis for the other units;
For a service that a computer provides in the form of a service, program, application or the like, analyzing the running state of the software and hardware resources it needs in order to run, the result serving as a basis for the other units;
When the service fails, performing cleanup according to the results of the relevant units, stopping the service without damage and releasing its resources;
When the service fails, performing recovery according to the results of the relevant units, restoring the service;
According to the policy, determining whether the service needs to be monitored, and starting or stopping the work of the relevant units;
Obtaining the pre-configured configuration and policy for monitoring the service, delivering them to the relevant units, and returning the monitoring result to the consuming component.
Preferably, the working parameters of the policy configuration include the device hosting the service, the components of the service and their working order, the operating system type of the service, the software and hardware resources the service depends on, the time schedule for monitoring and recovery, communication ports, notification targets, custom development interfaces, executable programs and similar data. These parameters are mainly collected automatically and need not be entered manually; only parameters that do not exist in the system are specified by the user.
Preferably, the interval between execution of cleanup and execution of recovery has a significant effect on the system. This parameter is adjustable; it is generally not less than 30 seconds and should not exceed 5 minutes.
Preferably, hardware resources generally include the disk arrays used by the service, and resources in the form of file systems or raw devices; software resources generally include resources in forms such as NFS and WebService.
Preferably, cleanup and recovery must be executed in the same runtime environment and under the same identity as the application, and are placed under signature protection; an unauthorized modification triggers an alarm and automatic restoration, which ensures security during maintenance.
The present invention is based on policy configuration technology and implements object-oriented policies for monitoring, which greatly simplifies deployment and maintenance; once a policy object has been created, secondary deployment time is reduced by more than 90%. The flexibility of deployment and maintenance is also greatly increased, and monitoring policies can be designed around the characteristics of the business itself. The units of the present invention are highly integrated and work accurately and reliably. In actual tests, not only were highly satisfactory results obtained, but the signalling mechanism also ensured compatibility with dual-machine recovery products and offline backup software from companies such as IBM, HP and Oracle (SUN).
Brief description of the drawings:
Fig. 1 shows the application environment of one embodiment of the present invention.
Fig. 2 is a logical structure diagram of the state-based application monitoring and recovery device of the present invention.
Fig. 3 is a schematic diagram of the user interface for policy configuration and management of the present invention.
Embodiment:
Referring to Fig. 1, in an information system operation monitoring scenario, a host-side program is usually installed on the host to carry out monitoring and to provide monitoring and recovery services. The monitoring device of the state-based application monitoring and recovery technology of the present invention is applied in the host-side program and can be implemented in software. The device mainly comprises a communication analysis unit 11, an operation analysis unit 12, an output analysis unit 13, a resource analysis unit 14, a cleanup unit 15, a recovery unit 16, a scheduling control unit 17, a protocol interaction unit 18 and a monitoring protection unit 19. The processing flow performed when the device runs is described below, taking a software implementation as an example.
Step 1: receive and load the policy information sent by the configuration engine; the policy information contains the detailed technical parameters of the monitoring and recovery policy. This step is performed by the protocol interaction unit 18.
First, all parameters of the monitoring policy are entered at the policy configuration end. The monitoring policy parameters are sent to the protocol interaction unit over an encrypted protocol channel; the protocol interaction unit checks the policy according to its processing logic and then injects the parameters into the status retrieval unit.
The policy parameters are all the parameters that the monitoring engine needs in order to work.
A policy-based work engine allows monitoring and recovery to run without the intervention and management of O&M personnel, working automatically, flexibly and in real time, including deciding automatically whether the monitoring rules enter a sleep period. Please refer to Fig. 3.
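The automatic sleep period mentioned above amounts to checking the current time against a window defined in the policy. The sketch below assumes a single daily window given by a start and end time; the function name and this representation are assumptions, since the patent does not describe how the schedule is encoded.

```python
from datetime import time


def in_sleep_period(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the daily sleep window [start, end).

    Windows that wrap past midnight (for example 22:00 to 06:00) are handled.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```

The scheduling control logic would skip the analysis units while this returns True and resume them afterwards, with no operator action.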
Step 2: according to the injected monitoring parameters, schedule the monitoring and recovery work and control the work of each working unit.
At run time, according to the working parameters received, the communication analysis unit 11 initiates a monitoring session (optional); the operation analysis unit 12 (optional), the resource analysis unit 14 (optional) and the output analysis unit 13 (optional) then run continuously in order. Combining them selectively keeps the comprehensive analysis accurate and reliable while adapting to the complexity of the user environment.
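The selective combination of optional analysis units can be sketched as an ordered list of checks, of which only those enabled by the policy are run. The function names and the rule that a service counts as healthy only when every enabled check passes are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple


def run_analysis(analyzers: List[Tuple[str, Callable[[], bool]]],
                 enabled: Set[str]) -> Dict[str, bool]:
    """Run the enabled analyzers in their listed order and collect the results."""
    results: Dict[str, bool] = {}
    for name, check in analyzers:
        if name in enabled:
            results[name] = check()
    return results


def service_healthy(results: Dict[str, bool]) -> bool:
    """Assumed rule: healthy only if at least one check ran and all checks passed."""
    return bool(results) and all(results.values())
```

For example, a policy that enables only the communication and resource checks would produce a two-entry result, and a single failing entry would mark the service as faulty.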
Because monitoring involves multiple protocols, the communication analysis unit needs to deploy different protocols so that its analysis function is powerful and complete.
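A simple way to support several protocols is a registry that maps a protocol name to a probe function. The sketch below registers only a plain TCP connect probe, using the standard `socket.create_connection` call; the registry layout and the names `PROBES`, `tcp_port_open` and `probe` are hypothetical, and real protocol-level checks would be added as further entries.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical registry: one probe function per supported protocol.
PROBES = {"tcp": tcp_port_open}


def probe(protocol: str, host: str, port: int) -> bool:
    """Dispatch to the probe registered for the given protocol."""
    return PROBES[protocol](host, port)
```

A successful connect only shows that the port accepts connections; checking response ability and correctness of the service, as the text requires, would need a protocol-specific exchange in each probe.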
Step 3: collect and process the analysis results of the above steps, then make a comprehensive analysis and judgment according to the policy requirements. Signature verification is started first; then either the cleanup unit is selectively started, or the protocol interaction unit is started directly to report the fault and the current status.
Step 4: repeat step 2, collect and process the analysis results, and again make a comprehensive analysis and judgment according to the policy requirements. Signature verification is started first; then either the recovery unit is selectively started, or the protocol interaction unit is started directly to report the fault and the current status.
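Steps 2 to 4 can be read as one monitoring cycle. The sketch below is an interpretation, not the patented control flow: all five callables are hypothetical stand-ins for the analysis, signature verification, cleanup, recovery and reporting units, and the final check after recovery is an added assumption.

```python
def monitoring_cycle(analyze, signature_ok, cleanup, recover, report) -> str:
    """Run one cycle: analyze, then clean up and recover if the service has failed."""
    # Step 2: comprehensive analysis; nothing more to do if the service is healthy.
    if analyze():
        return "healthy"
    # Step 3: verify the cleanup program first, then clean up or just report.
    if not signature_ok("cleanup"):
        report("fault: cleanup program failed signature check")
        return "reported"
    cleanup()
    # Step 4: repeat the analysis, verify the recovery program, then recover.
    if analyze():
        report("service healthy after cleanup")
        return "healthy"
    if not signature_ok("recovery"):
        report("fault: recovery program failed signature check")
        return "reported"
    recover()
    # Assumption: confirm the outcome with one more analysis pass.
    status = "recovered" if analyze() else "recovery_failed"
    report(status)
    return status
```

Passing the units in as callables keeps the sequencing separate from how each unit is implemented, so the same cycle can be driven by whichever units the policy enables.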
Furthermore, the diversity and complexity of services make accurate monitoring difficult to achieve. A comprehensive analysis is needed that covers the system process state, the combined service state over the communication protocols, software and hardware resources, configuration files, working files, and all processes, services and parameters that the service depends on. These analysis results are the working basis for each step of the cleanup and recovery units and ensure their effectiveness.
At present, very little software implements service monitoring and automatic recovery, and most of it consists of foreign products that implement dual-machine backup. Their specific working mechanisms are rarely disclosed. Their main drawbacks are high cost, and the fact that each is a standalone product with no unified technology or O&M system, which demands a high level of professional skill and makes maintenance difficult. By contrast, the device of the present invention, which is policy-based and state-based and works within a unified system mechanism linked to the configuration engine and the notification engine, requires very little maintenance and management work, delivers monitoring as a service, and achieves the desired effect. It not only performs monitoring and recovery but also processes these changes according to the policy and sends clear, detailed, classified notifications to the people following the service.
Implementing monitoring and recovery in information system O&M by way of a unified O&M system is extremely rare.
The present invention allows service monitoring and recovery to be unified with operation monitoring, network management, security alarms, ITIL-based O&M and the like in a single O&M system, achieving integrated management of the whole system, greatly raising the O&M level of the information system and reducing the O&M workload.
By turning policy deployment into objects, the present invention achieves high adaptability. On a functionally secure and reliable basis, it also provides a unified graphical interface that is compatible with UNIX, Linux and Windows operating systems and gives users a good O&M management experience, reducing secondary deployment and maintenance time by more than 90%. The prior art, by contrast, commonly suffers from manual deployment with manual parameter adjustment, cumbersome deployment, no notification mechanism, and poor adaptability. The present invention eliminates these shortcomings of monitoring systems; combined with the other features of the product of this patent, it lets users' O&M work essentially match their needs, and the device of the present invention can be deployed on all mainstream commercial operating systems. In several cases, monitoring and recovery have run without human intervention with satisfactory results and smooth operation, and continuous running time has exceeded 2 years.
The above describes only preferred implementations of the present invention and is not intended to limit its scope of protection; any equivalent variation or modification shall fall within the scope of protection of the present invention.
Claims (10)
1. A state-based service monitoring and recovery device, wherein all links of O&M application monitoring and recovery need to be deployed only once, providing convenient maintenance and intelligent, automatic supervision, the device comprising:
a communication analysis unit which, for a service that a computer provides in the form of a service, program, application or the like through a TCP/IP communication port, analyzes the service state of its communication unit, its ability to respond, and the correctness of the service, the result serving as a basis for the other units;
an operation analysis unit which, for a service that a computer provides in the form of a service, program, application or the like, analyzes its running state and running parameters, the result serving as a basis for the other units;
an output analysis unit which, for a service that a computer provides in the form of a service, program, application or the like, analyzes its regular and occasional output, the result serving as a basis for the other units;
a resource analysis unit which, for a service that a computer provides in the form of a service, program, application or the like, analyzes the running state of the software and hardware resources it needs in order to run, the result serving as a basis for the other units;
a cleanup unit which is executed when the service fails, according to the results of the relevant units, and which stops the service without damage and releases its resources;
a recovery unit which is executed when the service fails, according to the results of the relevant units, and which restores the service;
a scheduling control unit which, according to the policy, determines whether the service needs to be monitored, and starts or stops the work of the relevant units;
a protocol interaction unit which obtains the pre-configured configuration and policy for monitoring the service, delivers them to the relevant units, and returns the monitoring result to the consuming component.
2. The device according to claim 1, wherein the working parameters of the policy configuration unit include the device hosting the service, the components of the service and their working order, the operating system type of the service, the software and hardware resources the service depends on, the time schedule for monitoring and recovery, communication ports, notification targets, custom development interfaces, executable programs and similar data; the parameters are mainly collected by the unit on instruction and need not be entered manually, and only parameters that do not exist in the system are specified by the user.
3. The device according to claim 1, wherein the interval between execution of the cleanup unit and execution of the recovery unit has a significant effect on the system; this parameter is adjustable, is generally not less than 30 seconds and should not exceed 5 minutes.
4. The device according to claim 1, wherein hardware resources generally include the disk arrays used by the service, and resources in the form of file systems or raw devices, and software resources generally include resources in forms such as NFS and WebService.
5. The device according to claim 1, wherein the executable programs of the cleanup and recovery units must run in the same runtime environment and under the same identity as the application, and are placed under signature protection; an unauthorized modification triggers an alarm and automatic restoration, which ensures security during maintenance.
6. A state-based service monitoring and recovery method, wherein all links of O&M application monitoring and recovery need to be deployed only once, providing convenient maintenance and intelligent, automatic supervision, the method comprising:
for a service that a computer provides in the form of a service, program, application or the like through a TCP/IP communication port, analyzing the service state of its communication unit, its ability to respond, and the correctness of the service, the result serving as a basis for the other units;
for a service that a computer provides in the form of a service, program, application or the like, analyzing its running state and running parameters, the result serving as a basis for the other units;
for a service that a computer provides in the form of a service, program, application or the like, analyzing its regular and occasional output, the result serving as a basis for the other units;
for a service that a computer provides in the form of a service, program, application or the like, analyzing the running state of the software and hardware resources it needs in order to run, the result serving as a basis for the other units;
when the service fails, performing cleanup according to the results of the relevant units, stopping the service without damage and releasing its resources;
when the service fails, performing recovery according to the results of the relevant units, restoring the service;
according to the policy, determining whether the service needs to be monitored, and starting or stopping the work of the relevant units;
obtaining the pre-configured configuration and policy for monitoring the service, delivering them to the relevant units, and returning the monitoring result to the consuming component.
7. The method according to claim 6, wherein the working parameters of the policy configuration include the device hosting the service, the components of the service and their working order, the operating system type of the service, the software and hardware resources the service depends on, the time schedule for monitoring and recovery, communication ports, notification targets, custom development interfaces, executable programs and similar data; the parameters are mainly collected automatically and need not be entered manually, and only parameters that do not exist in the system are specified by the user.
8. The method according to claim 6, wherein the interval between execution of cleanup and execution of recovery has a significant effect on the system; this parameter is adjustable, is generally not less than 30 seconds and should not exceed 5 minutes.
9. The method according to claim 6, wherein hardware resources generally include the disk arrays used by the service, and resources in the form of file systems or raw devices, and software resources generally include resources in forms such as NFS and WebService.
10. The method according to claim 6, wherein cleanup and recovery must be executed in the same runtime environment and under the same identity as the application, and are placed under signature protection; an unauthorized modification triggers an alarm and automatic restoration, which ensures security during maintenance.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310129532.0A CN104104537B (en) | 2013-04-15 | 2013-04-15 | State-based service monitoring and recovery method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310129532.0A CN104104537B (en) | 2013-04-15 | 2013-04-15 | State-based service monitoring and recovery method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104104537A true CN104104537A (en) | 2014-10-15 |
CN104104537B CN104104537B (en) | 2017-07-07 |
Family
ID=51672361
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310129532.0A Expired - Fee Related CN104104537B (en) | 2013-04-15 | 2013-04-15 | State-based service monitoring and recovery method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104104537B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060190243A1 (en) * | 2005-02-24 | 2006-08-24 | Sharon Barkai | Method and apparatus for data management |
CN101699824A (en) * | 2009-11-16 | 2010-04-28 | 中兴通讯股份有限公司 | Device and method for failure recovery |
CN102143002A (en) * | 2011-04-07 | 2011-08-03 | 中兴通讯股份有限公司 | Method and system for backing up single-boards |
Non-Patent Citations (1)
Title |
---|
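刘继全 (Liu Jiquan): "Design and implementation of an integrated management and monitoring platform for information system operation security", 《铁路计算机应用》 (Railway Computer Application) * |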
Also Published As
Publication number | Publication date |
---|---|
CN104104537B (en) | 2017-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107241224B (en) | Network risk monitoring method and system for transformer substation | |
CN104022904B (en) | Distributed computer room information technoloy equipment management platform | |
CN101877618B (en) | Monitoring method, server and system based on proxy-free mode | |
CN107632918B (en) | Monitoring system and method for computing storage equipment | |
WO2016188100A1 (en) | Information system fault scenario information collection method and system | |
CN103095492A (en) | Data collection method and data collection device | |
CN101222742B (en) | Alarm self-positioning and self-processing method and system for mobile communication network guard system | |
CA2564153A1 (en) | Agent-less systems, methods and computer program products for managing a plurality of remotely located data storage systems | |
CN106201844A (en) | A kind of log collecting method and device | |
US10270859B2 (en) | Systems and methods for system-wide digital process bus fault recording | |
CN109802843A (en) | A kind of network equipment monitoring system based on SNMP | |
US20120072556A1 (en) | Method and System for Detecting Network Upgrades | |
CN104104537A (en) | State-based service monitoring and recovery method and device | |
CN105045100A (en) | Intelligent operation monitoring platform for management by use of mass data | |
CN112506154A (en) | Internet of things monitoring system for domestic sewage treatment station | |
CN104104536B (en) | A kind of concurrent poll monitoring method of self-regulation and device based on strategy | |
CN104104535A (en) | Strategy-based unified monitoring and operation and maintenance method and device | |
CN103576673B (en) | A kind of onboard replaceable unit detection system and detection method | |
Toueir et al. | A goal-oriented approach for adaptive sla monitoring: a cloud provider case study | |
CN107526008A (en) | Business electrical monitoring device and failure analysis methods | |
CN111913448A (en) | Informationized intelligent control system | |
CN112565407A (en) | Large-scale equipment remote cooperative operation and maintenance system based on industrial internet APP | |
CN103903107A (en) | Intelligent real-time alarming method for energy management system | |
Cao et al. | IT Operation and Maintenance Process improvement and design under virtualization environment | |
EP3751420B1 (en) | Maintainable distributed fail-safe real-time computer system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20170707 |