CN110971464A - Operation and maintenance automatic system suitable for disaster recovery center - Google Patents

Info

Publication number
CN110971464A
Authority
CN
China
Prior art keywords
management
data
alarm
maintenance
configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911258019.5A
Other languages
Chinese (zh)
Inventor
林佳能
苏志勇
林庆瑞
黄燕珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority to CN201911258019.5A
Publication of CN110971464A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H04L41/0876 Aspects of the degree of configuration automation

Abstract

The invention discloses an operation and maintenance automation system suitable for a disaster recovery center, comprising data storage, system functions, a front-end interface, and an external interface. For data storage, the operation data, performance data, and alarm data collected by third-party operation and maintenance automation systems are gathered through interfaces and entered into six resource libraries, and information that existing systems do not provide, including log data and information system performance data, is gathered through collection techniques, thereby providing underlying data support for the alarm center, configuration management, deployment management, and diagnostic analysis modules. By developing a set of automated alarm, deployment, configuration, and operation diagnosis tools suited to the realities of information operation and maintenance at the disaster recovery center, and by using these tools to carry out performance monitoring, reliability monitoring, automated deployment, and automated configuration of business applications, the system can effectively promote the future optimization of the "dispatching, operation, and inspection" system of the disaster recovery center.

Description

Operation and maintenance automation system suitable for a disaster recovery center
Technical Field
The invention relates to the technical field of IT operation and maintenance, in particular to an operation and maintenance automation system suitable for a disaster recovery center.
Background
The problems and current situation of disaster recovery centers are described in the following four aspects:
First, centralized alarm management
At present, a disaster recovery center maintains many devices and systems and, for reasons rooted in the construction of historical projects, faces numerous and complicated monitoring systems. For example, the machine room basic environment monitoring systems on the market cover power and environment monitoring, security monitoring, intelligent buildings, and machine room environment monitoring, while the integrated network management system monitors various servers, network security devices, storage devices, databases, middleware, and so on. Repeated alarms and alarms from multiple sources are prominent, alarm management is not consolidated, and this brings much inconvenience to the already complicated work of operation and maintenance personnel.
Therefore, a centralized operation and maintenance alarm module that integrates, formats, computes, and uniformly displays alarm data urgently needs to be built, so that alarm information can be sent to the responsible operation and maintenance person immediately through mobile interaction and other means. This avoids repeated, inaccurate, and untimely alarms, improves the work efficiency of operation and maintenance personnel, and shortens equipment fault handling time.
Second, automated application deployment
The national standard GB/T20988-2007, the specification for information system disaster recovery, divides disaster recovery systems into 7 levels according to tolerance to disasters, the time taken for system recovery, and the degree of data loss. Level 5 establishes, at a remote site, a standby system identical to the source application system and synchronizes data asynchronously; when a disaster occurs, the standby system takes over from the failed source system and continues working, but some data may be lost. Level 6 establishes, at a remote site, a standby system identical to the source application system and replicates data synchronously; when a disaster occurs, the standby system completely takes over from the failed source system, and zero data loss can be achieved.
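As a non-limiting illustration of the difference between the asynchronous and synchronous modes described above, the following minimal sketch models a source system and its standby as two in-memory lists. The class name `ReplicatedStore`, the `ship` method, and the record names are assumptions introduced for this example only; they do not come from the standard or from the invention.

```python
from collections import deque


class ReplicatedStore:
    """Toy source/standby pair; `synchronous` selects the replication mode."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []        # writes acknowledged by the source system
        self.standby = []        # writes that have reached the standby system
        self._pending = deque()  # writes queued for asynchronous shipping

    def write(self, record: str) -> None:
        self.primary.append(record)
        if self.synchronous:
            # Synchronous mode: the standby holds the record before the write returns.
            self.standby.append(record)
        else:
            # Asynchronous mode: the record is shipped to the standby later.
            self._pending.append(record)

    def ship(self, count: int) -> None:
        """Deliver up to `count` queued records to the standby."""
        for _ in range(min(count, len(self._pending))):
            self.standby.append(self._pending.popleft())

    def lost_on_disaster(self) -> list:
        """Records the standby would be missing if the source failed now."""
        return self.primary[len(self.standby):]


def simulate(synchronous: bool) -> list:
    store = ReplicatedStore(synchronous)
    for i in range(5):
        store.write(f"txn-{i}")
    store.ship(3)  # only part of the backlog is shipped before the disaster
    return store.lost_on_disaster()
```

In this sketch the asynchronous store loses whatever was still queued when the source fails, while the synchronous store loses nothing, at the cost of every write waiting for the standby.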
To meet the requirement that the standby information system completely replaces the source production system when a disaster occurs, application-level disaster recovery strategies and tools need to be developed, so that the requirements of national standard GB/T20988-.
Third, automated configuration management
Most current disaster recovery centers have no unified configuration management library or corresponding system support. Operation and inspection personnel handle configuration files manually and offline, which is extremely inefficient and depends heavily on the skill level of the individual; after a personnel change, a successor finds it difficult to repair faulty information equipment promptly. To solve these problems, automatic synchronization, rapid pushing, and switching of configuration information need to be realized for the information equipment of the disaster recovery center, such as application systems and routers. This includes versioned management of configuration information for routers, host systems, and application software; configuration comparison and replacement during upgrades or equipment failure; system restart under load or failure; and configuration rollback when an upgrade fails. Automated configuration can greatly reduce the workload of current maintenance personnel.
Fourth, automated fault diagnosis
Existing disaster recovery centers generally use monitoring systems such as network management systems to inspect software and hardware resources, and have accumulated a large amount of monitoring, fault, defect, hidden-danger, maintenance, log, and configuration data. However, these data are stored separately and no correlation analysis has been established among them. Fault analysis is largely performed after the fact, application performance analysis capability is insufficient, the ability to actively discover operational hidden dangers and to give early warning needs improvement, and traditional data analysis methods can hardly solve the problems that operation and maintenance currently faces. Predicting the likelihood of failure of an information communication system still depends on complicated calculations over manually collected monitoring data, so faults cannot be predicted and eliminated in a timely and effective manner, and the loss caused by large-scale operational anomalies arising from unpredicted faults is immeasurable. It is therefore necessary to use big data technology to mine log, operation, and other data in depth, extract hidden-danger information, and change passive fault handling into active intervention, so that problems are discovered, avoided, and handled early.
Disclosure of Invention
The invention aims to: in order to solve the problems in the background art, an operation and maintenance automation system suitable for a disaster recovery center is provided.
In order to achieve the purpose, the invention adopts the following technical scheme:
an operation and maintenance automation system suitable for a disaster recovery center comprises a data storage, a system function, a front-end interface and an external interface;
the data storage: operation data, performance data, and alarm data collected by third-party operation and maintenance automation systems are gathered through interfaces and entered into six resource libraries, and information not provided by existing systems, including log data and information system performance data, is gathered through collection techniques, providing underlying data support for the alarm center, configuration management, deployment management, and diagnostic analysis modules;
the system functions: all main functions provided by the system, including the alarm center, configuration management, deployment management, diagnostic analysis, log data collection, performance collection, and interface management modules;
the front-end interface comprises deployment medium management, configuration file management, a resource scheduling monitoring view, a diagnosis result report, an analysis report, and data synchronization result display;
the external interface comprises reserved external interfaces, namely an external data supply interface and an interface-calling interface.
As a further description of the above technical solution:
the log data collection and the performance data collection: collection techniques serve as a supplementary data source and provide basic data for the above functions.
As a further description of the above technical solution:
the alarm center comprises alarm index management, alarm resource management, alarm template management, alarm strategy management and alarm display management.
As a further description of the above technical solution:
the configuration management comprises configuration synchronization management, perception strategy management, configuration comparison management, configuration push management, patch acquisition management, client management and patch release management.
As a further description of the above technical solution:
the deployment management comprises the functions of synchronous management, high-availability setting, automatic deployment and disaster-tolerant scheduling strategy.
As a further description of the above technical solution:
the diagnostic analysis includes hardware inspection, reliability alarm, health assessment, and fault location.
As a further description of the above technical solution:
the six resource libraries comprise a knowledge base, a CMDB library, a software release library, an operation analysis library, a configuration management library and a log storage library.
As a further description of the above technical solution:
the front-end interface display comprises a desktop workbench and a mobile workbench, wherein the desktop workbench comprises an alarm desk, a diagnosis and analysis report, a deployment management view and a configuration management view, and the mobile workbench comprises the alarm desk and an early warning and analysis report.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
1. The invention supports intelligent operation and maintenance and proactive maintenance of information communication, raises the technical level and efficiency of information operation and maintenance, and ensures the operation quality and service quality of the information communication system.
2. The invention improves the automated operation and maintenance capability of information communication and simplifies the maintenance work of the service system. Through the research of the project, performance and reliability detection, fault analysis, and problem location are established for information equipment and application systems, improving the ability to discover the service state of the disaster recovery center in real time; dependence on IT experts for discovering system problems, diagnosing faults, and similar work is greatly reduced, the maintenance work of the service system is simplified, and operation and maintenance costs are further reduced.
3. In the invention, deployment and update are a very important part of operation and maintenance management. They can enhance the functions of the system, improve its disaster tolerance, let the system deliver its best performance, repair system vulnerabilities, strengthen system reliability and security, and prevent the system from being attacked or damaged by viruses or hackers. Different software systems generally adopt different upgrade and update methods; using different automated deployment and update strategies reduces the offline work of operation and maintenance personnel and makes the information generated during deployment and update traceable.
4. By realizing a unified automated configuration management platform, the invention frees operation and maintenance personnel from tedious daily configuration work while ensuring that configurations remain correct, secure, and compliant.
Drawings
Fig. 1 is an overall architecture diagram of an operation and maintenance automation system suitable for a disaster recovery center according to the present invention;
Fig. 2 is a service architecture diagram of an operation and maintenance automation system suitable for a disaster recovery center according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Referring to fig. 1-2, an operation and maintenance automation system suitable for a disaster recovery center is divided into data storage, system functions, front-end display and external interfaces, and the system construction strictly follows an informatization standard specification system, an informatization operation and maintenance support system and an information security protection system;
data storage: operation data, performance data, alarm data, and other information collected by third-party operation and maintenance automation systems are gathered through interfaces and entered into six resource libraries, and information not provided by existing systems, including log data and information system performance data, is gathered through collection techniques, providing underlying data support for modules such as the alarm center, configuration management, deployment management, and diagnostic analysis;
system functions: all main functions provided by the system, including the alarm center, the configuration management module, the deployment management module, the diagnostic analysis module, the log data collection module, the performance collection module, the interface management module, and the like;
front-end display: deployment medium management, configuration file management, resource scheduling monitoring view, diagnosis result report, analysis report, data synchronization result display, and the like;
log data and performance data collection: collection techniques serve as a supplementary data source and provide basic data for the above functions;
external interface: external interfaces are reserved, providing an external data supply interface and an interface-calling interface.
Further described below:
Alarm center
The alarm center handles alarms on the performance, configuration, and state of each software and hardware asset. When an anomaly or fault is found, it determines the cause, so that problems can be analyzed rapidly, discovered early, and alarmed on in advance. It supports customization of alarm templates and parameter indexes, automatically records the whole alarm flow, and automatically generates alarm and analysis-handling reports.
Operation and maintenance personnel can handle problematic alarms according to the current operating condition of the system and then close them. The system supports adding, modifying, deleting, and enabling index templates. Ordinary operation and maintenance personnel can add index templates, but a template can be enabled only after review by a manager, and only enabled templates can be selected by an extraction task. The threshold of each index in the template serves as the measurement standard during analysis, and an alarm is raised when a value reaches or exceeds its threshold. Each alarm extraction task is configured with a scheduling plan: the task can be triggered manually, can run automatically and periodically within a set time window, and can perform intensive inspection in special periods (such as spring and autumn inspections). Operators can also configure extraction rules according to daily work requirements, for example a rule that runs alarm analysis at 10 pm every day, and each task can be given its own rule.
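As a non-limiting illustration, the template review and threshold rules described above can be sketched as follows. The names `IndexTemplate`, `ExtractionTask`, `approve`, and the index names `cpu_pct` and `disk_pct` are assumptions made for this example; the scheduling plan is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class IndexTemplate:
    name: str
    thresholds: dict        # index name -> threshold value
    created_by: str
    enabled: bool = False   # a newly added template starts disabled

    def approve(self, reviewer_role: str) -> None:
        """Enable the template; in this sketch only a manager may do so."""
        if reviewer_role != "manager":
            raise PermissionError("only a manager can enable a template")
        self.enabled = True


@dataclass
class ExtractionTask:
    template: IndexTemplate

    def __post_init__(self):
        # Only enabled templates can be selected by an extraction task.
        if not self.template.enabled:
            raise ValueError("extraction tasks may only use enabled templates")

    def run(self, metrics: dict) -> list:
        """Return one alarm for every index at or above its threshold."""
        alarms = []
        for index, limit in self.template.thresholds.items():
            value = metrics.get(index)
            if value is not None and value >= limit:
                alarms.append({"index": index, "value": value, "threshold": limit})
        return alarms
```

A task built on an approved template with thresholds `{"cpu_pct": 90, "disk_pct": 85}` would raise a single alarm for the readings `{"cpu_pct": 90, "disk_pct": 40}`, because reaching the threshold already counts.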
Through analysis and statistics, the system can automatically generate the following alarm reports:
Alarm management report: on the basis of alarm analysis, current and historical alarms can be queried, counted, and analyzed, and information such as fault analysis reports is produced, providing analysis data for a thorough grasp of the operating condition of the system. Maintenance personnel use the report to check and handle alarms and faults and to summarize and report the operating condition of the system quickly; managers can also view data and charts on fault occurrence, handling, and trends as a data basis for decisions and assessment.
Log analysis report: event logs, user access logs, error logs, and the like generated by software and hardware such as network equipment, application servers, Web servers, database servers, operating systems, and application systems are classified and archived; the log types and log contents that can reflect system service failures are selected; and a correlation model between log data and both user behavior security and system reliability is established, providing a basis for user behavior security audits, abnormal behavior detection, system defect detection, and fault early warning, as follows:
(1) Illegal operations (such as data leakage and data tampering) are brought into the behavior security audit system, giving the ability to trace them after the fact and strengthening identification in advance and blocking and monitoring during the event.
(2) Certain legal but abnormal behaviors (such as logging in to a business system as an administrator outside working hours) are analyzed and extracted, guarding against intrusion by insiders or by non-technical methods such as social engineering.
(3) Potential software and hardware defects are summarized, and operation and maintenance personnel are prompted to handle possible faults in time, avoiding losses caused by service system downtime.
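As a non-limiting illustration of item (2), the following sketch flags successful administrator logins that fall outside working hours. The working hours (08:00 to 18:00, Monday to Friday), the event fields `time`, `role`, and `result`, and the function names are assumptions chosen for the example and are not specified by the invention.

```python
from datetime import datetime, time

WORK_START = time(8, 0)   # assumed start of working hours
WORK_END = time(18, 0)    # assumed end of working hours


def is_off_hours(ts: datetime) -> bool:
    """True on weekends or outside the assumed working hours."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START <= ts.time() < WORK_END)


def flag_abnormal_logins(events: list) -> list:
    """Return successful administrator logins that occurred off-hours."""
    flagged = []
    for event in events:
        ts = datetime.fromisoformat(event["time"])
        if (event["role"] == "admin"
                and event["result"] == "success"
                and is_off_hours(ts)):
            flagged.append(event)
    return flagged
```

Such a rule only marks candidates for review; whether a flagged login is actually malicious would still have to be judged by the audit process.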
Configuration management
The configuration management module includes the following seven functions:
1. Configuration synchronization management
Configuration synchronization management actively collects the configuration information of various resources and manages it in versions. Different expert templates are formed for the configuration information of different resource types according to the business application, so that devices of the same type can be configured in batches in different business scenarios, improving configuration efficiency.
This function mainly manages the configuration information of resources such as networks, hosts, storage, and platform software in a unified way, and includes: 1) synchronization script management; 2) expert template management; 3) configuration synchronization management; 4) configuration history management.
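As a non-limiting illustration of versioned management of collected configuration, the following sketch records a new version only when the collected text differs from the latest stored one. The class name `ConfigHistory`, its methods, and the use of a SHA-256 digest for change detection are assumptions made for this example.

```python
import hashlib


class ConfigHistory:
    """Keeps every distinct configuration of each resource as a numbered version."""

    def __init__(self):
        self._versions = {}  # resource id -> list of (digest, text)

    def sync(self, resource: str, text: str) -> int:
        """Record `text` if it differs from the latest version; return the version number."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        history = self._versions.setdefault(resource, [])
        if not history or history[-1][0] != digest:
            history.append((digest, text))
        return len(history)

    def get(self, resource: str, version: int) -> str:
        """Return the text stored under a 1-based version number."""
        return self._versions[resource][version - 1][1]
```

Repeated synchronization of an unchanged device therefore does not inflate the history, and any earlier version stays available for comparison or rollback.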
2. Perception policy management
Perception policy management allows a danger warning library to be configured for the environment according to the external factors of software and hardware resources and the basic environment. When the environment becomes unsafe or the system becomes unstable because of hardware faults in the external environment, excessive load on software and hardware resources, or similar factors, the tool automatically takes countermeasures according to the configured policy, so as to prevent disasters and keep the whole environment running safely and stably.
This function includes the following sub-functions: 1) danger warning configuration; 2) perception policy management; 3) policy script management.
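As a non-limiting illustration, a perception policy can be thought of as a condition on a monitored reading paired with the countermeasure to run when the condition holds. In the sketch below, the `Policy` fields, the metric names, and the action names are all hypothetical; the function only selects actions and does not execute any script.

```python
import operator
from dataclasses import dataclass

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Policy:
    metric: str    # monitored reading, e.g. "room_temp_c" (hypothetical name)
    op: str        # one of the keys of _OPS
    limit: float   # danger warning limit
    action: str    # name of the countermeasure script to run


def plan_countermeasures(policies: list, readings: dict) -> list:
    """Return the actions whose condition holds for the current readings."""
    actions = []
    for policy in policies:
        value = readings.get(policy.metric)
        if value is not None and _OPS[policy.op](value, policy.limit):
            actions.append(policy.action)
    return actions
```

Keeping the condition and the action in data rather than in code is one way to let operators maintain the danger warning library and the policy scripts separately.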
3. Configuration comparison management
Configuration comparison management checks the compliance of the basic information of the disaster recovery center. It verifies the compliance of information such as accounts, null passwords, weak passwords, IP addresses, and ports, compares the configuration information of each resource with the corresponding configuration information in the configuration management library, and raises an alarm when they do not match.
This function includes the following sub-functions: 1) compliance management; 2) compliance checking; 3) configuration information comparison.
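As a non-limiting illustration, the two checks described above (null or weak passwords, and comparison against the configuration management library) might be sketched as follows. The weak-password list, the function names, and the message formats are assumptions for this example; passwords appear in plain text only to keep the sketch short, whereas a real check would normally work on hashes or on the authentication system itself.

```python
WEAK_PASSWORDS = {"123456", "password", "admin"}  # illustrative list only


def check_accounts(accounts: dict) -> list:
    """Return findings for accounts with a null or weak password."""
    findings = []
    for user, password in sorted(accounts.items()):
        if not password:
            findings.append(f"{user}: null password")
        elif password.lower() in WEAK_PASSWORDS:
            findings.append(f"{user}: weak password")
    return findings


def compare_config(actual: dict, baseline: dict) -> list:
    """Return one alarm message per key that differs from the baseline."""
    alarms = []
    for key in sorted(set(actual) | set(baseline)):
        if actual.get(key) != baseline.get(key):
            alarms.append(
                f"{key}: expected {baseline.get(key)!r}, found {actual.get(key)!r}"
            )
    return alarms
```

Here `baseline` stands for the entry held in the configuration management library and `actual` for what was collected from the resource.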
4. Configuration push management
Overall business flow: configuration push management allows the configuration information of each software and hardware item to be edited online to form a version. The configured information is then pushed to the specified system through the push function, overwriting the original configuration information, and follow-up operations such as restarting the application system or hardware device are carried out according to the push policy. The whole configuration push process comprises policy configuration and configuration pushing.
This function includes the following sub-functions: 1) policy configuration; 2) configuration pushing.
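As a non-limiting illustration of the push flow, the sketch below overwrites the configuration of a target and then applies the follow-up action named by the push policy. `Target` is a stand-in object invented for this example; a real target would be a remote system reached over the network, and the names `PushPolicy`, `restart_after_push`, and `push_config` are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass
class PushPolicy:
    restart_after_push: bool = False  # follow-up action after the push


class Target:
    """Stand-in for a managed system or device."""

    def __init__(self, config: str):
        self.config = config
        self.restarts = 0

    def restart(self) -> None:
        self.restarts += 1


def push_config(target: Target, new_config: str, policy: PushPolicy) -> str:
    """Overwrite the target's configuration and return the previous one."""
    previous = target.config
    target.config = new_config   # the original configuration is overwritten
    if policy.restart_after_push:
        target.restart()         # follow-up operation defined by the push policy
    return previous
```

Returning the previous configuration is a design choice of this sketch; it lets the caller keep what was replaced in case a later rollback is needed.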
5. Patch collection management
Patch collection management is carried out by patch servers, which are divided into a parent server and child servers. The parent server (deployed on the external network) automatically downloads patches from the official vendor sites and provides them to all subordinate child servers or clients. In the whole patch management system, only the parent server obtains patches from the external network; the patches of all other components come from the parent server. The patch collection capability of the parent server therefore determines the patch update capability of the entire system. A child server (deployed on the internal information network) keeps its patch information synchronized with its superior server, and clients or next-level child servers can connect to it to download patches. Child servers reduce the patch distribution burden on the parent server and allow the network structure of the system to be extended freely.
This function includes the following sub-functions: 1) patch information collection; 2) automatic patch downloading; 3) patch storage management.
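As a non-limiting illustration of the parent and child server hierarchy, the following sketch lets only the server without an upstream pull from an external feed, while every other server copies missing patches from its superior. The class name `PatchServer`, the method names, and the patch identifiers used in any example call are hypothetical.

```python
class PatchServer:
    """A patch server; the parent has no upstream, child servers have one."""

    def __init__(self, name: str, upstream=None):
        self.name = name
        self.upstream = upstream
        self.patches = {}  # patch id -> payload

    def fetch_from_vendor(self, vendor_feed: dict) -> None:
        """Pull patches from the external network; allowed on the parent only."""
        if self.upstream is not None:
            raise RuntimeError("child servers obtain patches from their upstream only")
        self.patches.update(vendor_feed)

    def sync(self) -> list:
        """Copy patches missing locally from the upstream; return their ids."""
        if self.upstream is None:
            return []
        missing = sorted(set(self.upstream.patches) - set(self.patches))
        for patch_id in missing:
            self.patches[patch_id] = self.upstream.patches[patch_id]
        return missing
```

Because a child server can itself act as the upstream of another child server, the hierarchy can be extended by further levels without changing the parent.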
6. Client management
A client, that is, a specific host, database, or middleware instance in the data center, downloads the required patches and patch information from its superior server and manages its local patch information automatically. In practice, connecting a client to the server with the faster communication rate improves the patch distribution efficiency of the whole system.
Client management includes the following sub-functions: patch information updating and patch detection.
7. Patch release management
Patch release management is a controlled process, and active patch management is inseparable from risk assessment and emergency response. Specifically, the patch information of the server is synchronized, preconditions such as the client environment and the existing patches are verified, and the upgrade patch is then released. The process is actually initiated by the client: the newly added patch is downloaded and installed, and compatibility detection is then performed. When a problem is found, an alarm is raised and an administrator decides whether to roll back to the state before the patch upgrade, so that the stability of the system is guaranteed.
This function mainly comprises the following sub-functions: automatic patch downloading, automatic installation and rollback, compatibility detection, and patch release log management.
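As a non-limiting illustration, the release steps described above (precondition check, installation, compatibility detection, alarm, and an administrator's rollback decision) can be sketched in one function. The dictionary layout of `client` and `patch`, the two callbacks `compatible` and `approve_rollback`, and the returned status strings are assumptions made for this example.

```python
def release_patch(client: dict, patch: dict, compatible, approve_rollback) -> str:
    """Install `patch` on `client` and return the resulting status string."""
    installed = client["patches"]
    # Preconditions: matching environment, prerequisites present, not yet installed.
    if (patch["os"] != client["os"]
            or not set(patch["requires"]) <= set(installed)
            or patch["id"] in installed):
        return "skipped"
    snapshot = list(installed)        # state before the patch upgrade
    installed.append(patch["id"])     # download and install the new patch
    if compatible(client):            # compatibility detection
        return "installed"
    # A problem was found: raise an alarm and let the administrator decide.
    client.setdefault("alarms", []).append(
        f"{patch['id']} failed compatibility check"
    )
    if approve_rollback(patch["id"]):
        client["patches"] = snapshot  # roll back to the pre-upgrade state
        return "rolled_back"
    return "kept_with_alarm"
```

Passing the compatibility check and the rollback decision in as callbacks keeps the flow itself independent of how those two steps are actually carried out.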
Deployment management
Automated configuration first requires building a configuration library. The configuration library uniformly manages and maintains scripts, application installation media, configuration parameters, parameter files, license files, and so on for assets of different resource types, manufacturers, and models. It comprises a script library, an application library, a parameter library, and a file library, and supports uploading, downloading, displaying, and online editing of files. Application code is stored centrally on the SVN server, and other media are stored on FTP.
Configuration management mainly performs parameter tuning, state updates, application release, and similar operations on devices, operating systems, middleware, databases, application systems, and so on. It comprises parameter configuration, application release, and rollback management.
By developing an automated configuration management system that covers the whole operation and maintenance lifecycle of the data center service systems, the following functions need to be realized:
a) Application release: a management library for system source code and full or incremental release versions, similar to SVN version management, and a management library for configuration parameters are established, forming standardized templates for different application environments and software versions (determining the release environment, process, scheme, and tasks; starting release tasks; monitoring the release process; verification and testing; post-release processing; and deciding whether rollback is needed), so that full-process management of application system release and rollback is realized.
b) Centralized configuration management: automatic configuration discovery and real-time change detection are realized, and an incremental backup mechanism ensures the accuracy and integrity of the configuration information of the whole system. Dynamic configuration of devices, operating systems, middleware, databases, application systems, and the like is achieved through scripts. This includes tuning the operating system kernel, tuning the running state of middleware and databases, pushing system parameters, database connections, and other parameters of application systems, and automatically configuring the operating modes for different scenarios in special operating periods, such as major political meetings, power supply guarantee periods, and the summer demand peak.
c) Unified user rights management: a single entry point for logging in to the open platform systems is realized, with centralized authentication, unified management, and Single Sign On (SSO) for all server users of the open platform. User rights management across the whole open platform becomes more flexible and standardized.
d) Hardening and patch management: The patch installation status of managed systems is recorded centrally, vulnerabilities are checked against rules, and patches are installed automatically according to policy.
e) Strict compliance audit: The configuration information of managed devices is periodically compared against a preset configuration template, and a warning is raised for any non-compliant configuration.
f) Efficient change means: Change operations can be performed on managed devices remotely, in a virtualized and centralized manner, and automatic changes can be carried out according to set rules.
g) Scientific visual management: Based on the configuration information, an application-centric view of the relationships between servers is provided. The server management system can thereby assist planning, design, and implementation, for example by providing the current-state assessment that planning and design require, identifying server or application bottlenecks, and carrying out change and deployment work through the standardized server-management change process.
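The compliance audit item in the list above (periodic comparison of managed device configuration against a preset template, with a warning for each deviation) can be illustrated with a minimal Python sketch. The names `Finding` and `audit`, the setting keys, and the device names are hypothetical and do not come from the patent; the sketch only shows the comparison step, not scheduling or configuration discovery.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Finding:
    """One non-compliant setting on one device (hypothetical record type)."""
    device: str
    key: str
    expected: str
    actual: Optional[str]  # None means the setting is missing on the device


def audit(template: Dict[str, str],
          devices: Dict[str, Dict[str, str]]) -> List[Finding]:
    """Compare each device's configuration with the template and list deviations."""
    findings: List[Finding] = []
    for device in sorted(devices):
        config = devices[device]
        for key in sorted(template):
            expected = template[key]
            actual = config.get(key)
            if actual != expected:
                findings.append(Finding(device, key, expected, actual))
    return findings


if __name__ == "__main__":
    # Illustrative template and device configurations (assumed values).
    template = {"ntp_server": "10.0.0.1", "ssh_root_login": "no"}
    devices = {
        "app-01": {"ntp_server": "10.0.0.1", "ssh_root_login": "no"},
        "db-01": {"ntp_server": "10.0.0.9"},
    }
    for f in audit(template, devices):
        print(f"WARNING {f.device}: {f.key} expected {f.expected!r}, found {f.actual!r}")
```

In a real deployment the template and the device configurations would presumably come from the configuration library rather than from literals, and the audit would run on a schedule.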
Diagnostic analysis
Diagnostic analysis extracts the operation data, performance data, and user behavior data of monitored software and hardware resources in the production environment through acquisition and interface technologies. Indexes are extracted from the data through mining techniques, and a complete diagnostic model is established by means of big data analysis and various algorithms. The data generated by monitored resources is analyzed in real time through the model to judge the reliability and health of the current software and hardware resources and to locate faults accurately. In addition, the daily work of operation and inspection personnel is codified into the system as workflows, replacing manual operation to improve their efficiency, which strengthens the healthy and stable operation of the system and improves user satisfaction.
Hardware equipment inspection
A hardware configuration information compliance library, an index threshold library, a configuration library, and a knowledge base are established. The extracted index information undergoes compliance matching and threshold comparison, and indexes that are non-compliant or exceed their thresholds are classified and raise alarms, so that operation and inspection can be carried out rapidly.
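The threshold comparison and classified alarming described above can be sketched as follows. The threshold library contents, the metric names, and the two alarm levels are assumptions made for illustration; the patent does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical index threshold library: metric -> (warning level, critical level).
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "cpu_temp_c": (70.0, 85.0),
    "disk_used_pct": (80.0, 95.0),
}

Alarm = Tuple[str, str, float, str]  # (device, metric, value, level)


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning' or 'critical' for one index value."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def inspect_hardware(samples: Dict[str, Dict[str, float]]) -> List[Alarm]:
    """Compare extracted indexes with the threshold library and collect alarms."""
    alarms: List[Alarm] = []
    for device in sorted(samples):
        for metric in sorted(samples[device]):
            if metric not in THRESHOLDS:
                continue  # no threshold defined for this index
            value = samples[device][metric]
            level = classify(metric, value)
            if level != "ok":
                alarms.append((device, metric, value, level))
    return alarms


if __name__ == "__main__":
    readings = {
        "srv-1": {"cpu_temp_c": 90.0, "disk_used_pct": 50.0},
        "srv-2": {"disk_used_pct": 82.0},
    }
    for alarm in inspect_hardware(readings):
        print(alarm)
```

Grouping alarms by level is one simple way to read "classifying" here; the actual classification scheme used by the system is not stated in the text.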
Reliability warning
Performance data of the information system is collected by a monitoring technology based on Java bytecode. The collected performance index information is integrated with the operation index information of related resources, and a reliability model is built from the historical data of all indexes using a BP (back-propagation) neural network algorithm and big data technology. A real-time reliability index extraction task is then set up to extract the reliability indexes of resources in the production environment in real time, and reliability analysis is performed on the real-time data using the reliability model and big data technology, so as to predict the reliability of the monitored system over a future period.
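To make the modeling step concrete, the following is a minimal sketch of a BP (back-propagation) network with one hidden layer, written in plain Python. It is not the patent's model: the network size, the learning rate, the two input indexes (a normalized load and a normalized error rate), and the training samples are all invented for illustration.

```python
import math
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (normalized indexes, reliability in 0..1)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class BPNetwork:
    """One hidden layer, sigmoid activations, trained by gradient descent."""

    def __init__(self, n_in: int, n_hidden: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Each row holds n_in input weights followed by one bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _forward(self, x: Sequence[float]) -> Tuple[List[float], float]:
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1])
        return hidden, out

    def predict(self, x: Sequence[float]) -> float:
        return self._forward(x)[1]

    def train_step(self, x: Sequence[float], target: float, lr: float = 0.5) -> float:
        """One back-propagation update; returns the squared-error loss before it."""
        hidden, out = self._forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas are computed with the output weights before they change.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]
        return 0.5 * (out - target) ** 2


def train(net: BPNetwork, history: Sequence[Sample],
          epochs: int = 2000, lr: float = 0.5) -> List[float]:
    """Train on historical samples; returns the total loss of each epoch."""
    return [sum(net.train_step(x, y, lr) for x, y in history) for _ in range(epochs)]


if __name__ == "__main__":
    # Invented historical data: (load, error rate) -> observed reliability.
    history = [((0.1, 0.0), 0.95), ((0.3, 0.1), 0.85),
               ((0.6, 0.3), 0.60), ((0.9, 0.7), 0.20)]
    net = BPNetwork(n_in=2, n_hidden=3)
    losses = train(net, history)
    print("first/last epoch loss:", losses[0], losses[-1])
    print("predicted reliability for (0.7, 0.4):", net.predict((0.7, 0.4)))
```

A production model would normally use an established numerical library, far more historical data, and a held-out validation set; the sketch only shows how historical indexes can be fitted and then applied to a new real-time reading.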
Health assessment
Analyzing the relationship between logs and service system reliability means, as the name implies, using a mining technique to find the relationship between logs and service reliability; the technique used is an association rule algorithm. The relationship-analysis component first establishes a service system reliability model, and then mines the association between logs and service system reliability.
The service system reliability model is the technical basis for quantitative analysis of service system reliability. It measures software reliability by statistical or fuzzy methods from data related to service system reliability. This data mainly comprises the number of software failures within a specified time, the software reliability, the software failure rate, and the mean time between failures. Models currently fall into two categories: stochastic process models and non-stochastic process models. Stochastic process models mainly comprise Markov process models and non-homogeneous Poisson process models. The typical Markov process model is the J-M (Jelinski-Moranda) model, in which the variable of the failure function is the time to the i-th failure measured from the (i-1)-th failure, a random variable obeying a certain distribution. The typical non-homogeneous Poisson process model is the G-O (Goel-Okumoto) model. Its basic assumption is that the number of faults that can be detected at time t is proportional to the number of faults latent in the software at that time; the cumulative number of faults detected by time t is a function of time N(t) that, under certain conditions, obeys a non-homogeneous Poisson distribution, and the failure rate function contains the time variable. Non-stochastic process models mainly comprise models using Bayesian estimation, represented by the L-V (Littlewood-Verrall) model; fault seeding models, represented by the Mills model; input-domain-based models, represented by the Nelson model; and other reliability models such as non-parametric analysis methods and structural models. The Nelson model is one of the important software reliability models applied in the software validation stage.
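For reference, the two stochastic process models named above are commonly written as follows in the software reliability literature. The symbols are not defined in the patent text and are introduced here as assumptions: N_0 is the initial number of latent faults, \phi the per-fault hazard rate, t_i the interval between the (i-1)-th and i-th failures, a the expected total number of faults, and b the fault detection rate.

```latex
% J-M (Jelinski-Moranda): hazard rate between the (i-1)-th and i-th failures,
% with exponentially distributed inter-failure time t_i
\lambda_i = \phi \,\bigl(N_0 - (i-1)\bigr), \qquad
f(t_i) = \lambda_i \, e^{-\lambda_i t_i}

% G-O (Goel-Okumoto): mean value function and failure intensity
m(t) = a\,\bigl(1 - e^{-bt}\bigr), \qquad
\lambda(t) = \frac{dm(t)}{dt} = a\,b\,e^{-bt} = b\,\bigl(a - m(t)\bigr)

% Cumulative fault count N(t) under the non-homogeneous Poisson assumption
P\{N(t) = n\} = \frac{m(t)^{n}}{n!}\, e^{-m(t)}, \qquad n = 0, 1, 2, \dots
```

The last identity in the G-O line, \lambda(t) = b(a - m(t)), is the proportionality assumption stated in the text: the detection rate at time t is proportional to the expected number of faults still latent at that time.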
Event logs, user access logs, error logs, and similar records generated by software and hardware such as network equipment, application servers, Web servers, database servers, operating systems, and application systems are classified and archived. The types and contents of logs that can reflect service failures are screened by methods such as principal component analysis and orthogonal defect classification, so as to establish a correlation model between log data and system reliability. This provides a basis for further research and development of techniques for analyzing the relationship between logs and service system reliability.
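The association rule step mentioned in this section can be illustrated by computing support and confidence for rules of the form "log type observed in a time window implies service failure in that window". This is a minimal sketch under assumed conventions: the windowing of logs into sets, the `SERVICE_FAILURE` marker, and the log type names are hypothetical, and a full implementation would use a complete algorithm such as Apriori over multi-item rules.

```python
from typing import FrozenSet, List, Sequence, Tuple

Window = FrozenSet[str]  # log types (and failure marker) seen in one time window
FAILURE = "SERVICE_FAILURE"  # hypothetical marker for a failed window


def count(windows: Sequence[Window], items: FrozenSet[str]) -> int:
    """Number of windows that contain every item in `items`."""
    return sum(1 for w in windows if items <= w)


def rule_metrics(windows: Sequence[Window], log_type: str) -> Tuple[float, float]:
    """Support and confidence of the rule {log_type} -> {FAILURE}.

    Assumes `windows` is non-empty.
    """
    n_ante = count(windows, frozenset({log_type}))
    n_both = count(windows, frozenset({log_type, FAILURE}))
    support = n_both / len(windows)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


def rank_log_types(windows: Sequence[Window],
                   min_support: float = 0.2) -> List[Tuple[str, float, float]]:
    """Log types whose failure rule meets min_support, best confidence first."""
    log_types = sorted({item for w in windows for item in w} - {FAILURE})
    rules = []
    for log_type in log_types:
        support, confidence = rule_metrics(windows, log_type)
        if support >= min_support:
            rules.append((log_type, support, confidence))
    return sorted(rules, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    windows = [
        frozenset({"db_timeout", FAILURE}),
        frozenset({"db_timeout", "disk_warn", FAILURE}),
        frozenset({"disk_warn"}),
        frozenset({"login_ok"}),
        frozenset({"db_timeout"}),
    ]
    for log_type, support, confidence in rank_log_types(windows):
        print(f"{log_type}: support={support:.2f} confidence={confidence:.2f}")
```

Log types that rank highly are candidates for the screened set of logs that reflect service failures; the thresholds for support and confidence are tuning choices the text does not specify.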
Fault location
Information systems have developed from the original stand-alone mode into networked systems, and an intricate web of relationships has formed between devices and between systems. This poses great challenges for fault location: tracing a seemingly simple problem often reveals a fault risk involving the whole system, or the source of a problem turns out to be far removed from where the problem surfaced. Compared with solving a problem, finding its root cause is a more complex and lengthy process, which makes the routine inspection work of operation and inspection personnel very difficult. Establishing a mechanism for accurate fault location is therefore necessary, and it is also the core of precise operation and maintenance.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (8)

1. An operation and maintenance automation system suitable for a disaster recovery center, characterized by comprising data storage, system functions, a front-end interface and an external interface;
the data storage: operation data, performance data and alarm data collected in third-party operation and maintenance automation systems are gathered and entered into six resource libraries through interfaces; information not provided by existing systems, including log data and information system performance data, is collected through acquisition technology; and underlying data support is provided for the alarm center, configuration management, deployment management and diagnostic analysis modules;
the system functions: all main functions provided by the present system, comprising an alarm center, configuration management, deployment management, diagnostic analysis, log data acquisition, performance acquisition and an interface management module;
the front-end interface comprises deployment medium management, configuration file management, a resource scheduling monitoring view, a diagnosis result report, an analysis report and data synchronization result display;
the external interface comprises a reserved external interface, an external data interface and an interface calling interface.
2. The operation and maintenance automation system suitable for the disaster recovery center according to claim 1, wherein the log data acquisition and the performance data acquisition serve as acquisition means that, as a supplementary data source, provide basic data for the system functions.
3. The operation and maintenance automation system suitable for the disaster recovery center according to claim 1, wherein the alarm center comprises alarm index management, alarm resource management, alarm template management, alarm policy management and alarm display management.
4. The operation and maintenance automation system suitable for the disaster recovery center according to claim 1, wherein the configuration management comprises configuration synchronization management, perception policy management, configuration comparison management, configuration push management, patch collection management, client management and patch release management.
5. The operation and maintenance automation system suitable for the disaster recovery center according to claim 1, wherein the deployment management comprises synchronization management, high availability setting, automated deployment and disaster recovery scheduling policy functions.
6. The operation and maintenance automation system suitable for the disaster recovery center according to claim 1, wherein the diagnostic analysis comprises hardware inspection, reliability alarm, health assessment and fault location.
7. The operation and maintenance automation system suitable for the disaster recovery center according to claim 1, wherein the six resource libraries comprise a knowledge base, a CMDB, a software release library, an operation analysis library, a configuration management library and a log storage library.
8. The operation and maintenance automation system suitable for the disaster recovery center according to claim 1, wherein the front-end interface display comprises a desktop workbench and a mobile workbench, the desktop workbench comprises an alarm console, a diagnostic analysis report, a deployment management view and a configuration management view, and the mobile workbench comprises an alarm console and an early warning analysis report.
CN201911258019.5A 2019-12-10 2019-12-10 Operation and maintenance automatic system suitable for disaster recovery center Pending CN110971464A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911258019.5A CN110971464A (en) 2019-12-10 2019-12-10 Operation and maintenance automatic system suitable for disaster recovery center

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911258019.5A CN110971464A (en) 2019-12-10 2019-12-10 Operation and maintenance automatic system suitable for disaster recovery center

Publications (1)

Publication Number Publication Date
CN110971464A true CN110971464A (en) 2020-04-07

Family

ID=70033540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911258019.5A Pending CN110971464A (en) 2019-12-10 2019-12-10 Operation and maintenance automatic system suitable for disaster recovery center

Country Status (1)

Country Link
CN (1) CN110971464A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102983999A (en) * 2012-11-22 2013-03-20 安科智慧城市技术(中国)有限公司 Method and system for parameter configuration of monitoring platform system and device group
CN103019159A (en) * 2011-09-20 2013-04-03 朗德华信(北京)自控技术有限公司 Elevator equipment management and control system and method based on cloud computing
CN106330540A (en) * 2016-08-23 2017-01-11 成都聚美优品科技有限公司 Automatic operation and maintenance management method of internet
CN107046481A (en) * 2017-04-18 2017-08-15 国网福建省电力有限公司 A kind of information system integrated network management system comprehensive analysis platform
CN107977287A (en) * 2016-10-21 2018-05-01 中兴通讯股份有限公司 One kind is using disaster tolerance implementation method, apparatus and system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084244A (en) * 2020-09-02 2020-12-15 杭州数云信息技术有限公司 SSO and CMDB-based enterprise unified alarm management method
CN112398823A (en) * 2020-11-03 2021-02-23 内蒙古电力(集团)有限责任公司内蒙古电力科学研究院分公司 Network information safety early warning platform based on big data analysis
CN112612831A (en) * 2020-12-14 2021-04-06 南方电网数字电网研究院有限公司 Operation and maintenance flow management performance optimization method of early warning system
CN112612831B (en) * 2020-12-14 2023-10-17 南方电网数字电网研究院有限公司 Operation and maintenance flow management performance optimization method of early warning system
CN113434404A (en) * 2021-06-24 2021-09-24 北京同创永益科技发展有限公司 Automatic service verification method and device for verifying reliability of disaster recovery backup system
CN113434404B (en) * 2021-06-24 2024-03-19 北京同创永益科技发展有限公司 Automatic service verification method and device for verifying reliability of disaster recovery system
CN116431454A (en) * 2023-04-17 2023-07-14 石家庄卡尚科技有限公司 Big data computer performance control system and method
CN116431454B (en) * 2023-04-17 2023-11-14 云上遵义大数据有限公司 Big data computer performance control system and method

Similar Documents

Publication Publication Date Title
US10901727B2 (en) Monitoring code sensitivity to cause software build breaks during software project development
CN110971464A (en) Operation and maintenance automatic system suitable for disaster recovery center
US10310968B2 (en) Developing software project plans based on developer sensitivity ratings detected from monitoring developer error patterns
US10540502B1 (en) Software assurance for heterogeneous distributed computing systems
CN101321084A (en) Method and apparatus for generating configuration rules for computing entities within a computing environment using association rule mining
US8990372B2 (en) Operation managing device and operation management method
US9720999B2 (en) Meta-directory control and evaluation of events
EP2648104A1 (en) Dependability maintenance device, dependability maintenance system, malfunction supporting system, method for controlling dependability maintenance device, control program, computer readable recording medium recording control program
CN111078490A (en) Server safety guarantee method and system based on monitoring analysis of operating system
CN110088744B (en) Database maintenance method and system
CN110063042B (en) Database fault response method and terminal thereof
US11934855B2 (en) System and method to autonomously manage hybrid information technology (IT) infrastructure
CN111181775B (en) Integrated operation and maintenance management alarm method based on automatic host asset discovery
CN112733147A (en) Equipment safety management method and system
KR100496958B1 (en) System hindrance integration management method
CN116149824A (en) Task re-running processing method, device, equipment and storage medium
CN114500106A (en) Security management method, device, equipment and storage medium for server
CN115543377A (en) ERP system upgrading method based on artificial intelligence and ERP system
CN112817827A (en) Operation and maintenance method, device, server, equipment, system and medium
US10735246B2 (en) Monitoring an object to prevent an occurrence of an issue
KR102637540B1 (en) System for configuring cloud computing environment and automating opertation based on standard stack and intelligent operator
CN117914692A (en) Method, system and equipment for processing safety data of built-in data processing unit
CN116232914A (en) Network task processing method and device, electronic equipment and storage medium
CN117670033A (en) Security check method, system, electronic equipment and storage medium
KR20230037743A (en) Cloud-based smart factory platform service provision system and its method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200407
