CN117455458A - Automatic operation and maintenance method, device, equipment and storage medium - Google Patents

Automatic operation and maintenance method, device, equipment and storage medium

Info

Publication number
CN117455458A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311504260.8A
Other languages
Chinese (zh)
Inventor
曹明昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Merchants Bank Co Ltd
Original Assignee
China Merchants Bank Co Ltd
Application filed by China Merchants Bank Co Ltd
Priority to CN202311504260.8A
Publication of CN117455458A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/20 Administration of product repair or maintenance
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0793 Remedial or corrective actions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/02 Standardisation; Integration
    • H04L41/0246 Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols
    • H04L41/026 Exchanging or transporting network management information using the Internet; Embedding network management web servers in network elements; Web-services-based protocols using e-messaging for transporting management information, e.g. email, instant messaging or chat
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0677 Localisation of faults
    • H04L41/0681 Configuration of triggering conditions
    • H04L41/0686 Additional information in the notification, e.g. enhancement of specific meta-data
    • H04L41/16 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses an automated operation and maintenance method, device, equipment and storage medium, and belongs to the technical field of operation and maintenance management. The method comprises: acquiring an operation and maintenance index monitoring item of target equipment; if an abnormality is detected in the operation and maintenance index monitoring item, analyzing the abnormality through a preset operation and maintenance robot, and outputting an operation and maintenance solution to relevant personnel according to the analysis result, so that the relevant personnel can review whether to execute the operation and maintenance solution; and, if the relevant personnel agree to execute the operation and maintenance solution, performing automated operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution. In this way, the preset operation and maintenance robot analyzes the abnormality in the monitoring item and proposes a corresponding solution, and the target equipment is maintained automatically once the relevant personnel have approved the solution, which improves operation and maintenance efficiency.

Description

Automatic operation and maintenance method, device, equipment and storage medium
Technical Field
The present application relates to the technical field of operation and maintenance management, and in particular to an automated operation and maintenance method, device, equipment and storage medium.
Background
As enterprises rely more heavily on cloud platforms and implement more and more functions on them, the scale of these platforms keeps growing. The number of servers, the storage capacity, the network bandwidth and the other indexes that cloud platform operation and maintenance must manage therefore increase continuously, which poses great challenges to automated operation and maintenance.
As the scale of a cloud platform grows, the range of technologies involved in its operation and maintenance widens, and operation and maintenance staff must master more and more skills. Cloud platform operation and maintenance must also handle different application scenarios, which adds further complexity to the work.
Consequently, the operation and maintenance of current large-scale cloud platforms is complex and inefficient, and cannot meet the corresponding efficiency requirements.
Summary of the Application
The main purpose of the present application is to provide an automated operation and maintenance method, device, equipment and storage medium, aiming to solve the technical problem that the operation and maintenance of current large-scale cloud platforms is inefficient.
To achieve the above object, the present application provides an automated operation and maintenance method, which includes the following steps:
acquiring an operation and maintenance index monitoring item of target equipment;
if an abnormality is detected in the operation and maintenance index monitoring item, analyzing the abnormality through a preset operation and maintenance robot, and outputting an operation and maintenance solution to relevant personnel according to the analysis result, so that the relevant personnel can review whether to execute the operation and maintenance solution;
and, if the relevant personnel agree to execute the operation and maintenance solution, performing automated operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution.
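The three steps above can be pictured as a single monitoring pass with a human approval gate. The following is a minimal sketch only: the names `Solution`, `run_cycle`, `analyze`, `approve` and `execute` are hypothetical and do not appear in the application, and the anomaly test (value above a threshold) is an assumption borrowed from the optional threshold step described later.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Solution:
    fault: str   # what the robot concluded is wrong
    action: str  # what it proposes to do about it


def run_cycle(metrics: dict[str, float],
              thresholds: dict[str, float],
              analyze: Callable[[str, float], Solution],
              approve: Callable[[Solution], bool],
              execute: Callable[[Solution], None]) -> list[Solution]:
    """One pass of monitor -> analyze -> approve -> execute."""
    executed = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is None or value <= limit:
            continue                      # no abnormality for this monitoring item
        solution = analyze(name, value)   # robot analyzes the abnormality
        if approve(solution):             # relevant personnel review the solution
            execute(solution)             # robot performs the maintenance
            executed.append(solution)
    return executed
```

Nothing is executed unless `approve` returns true, which mirrors the requirement that the relevant personnel agree before the robot acts.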
Optionally, the step of, if an abnormality is detected in the operation and maintenance index monitoring item, analyzing the abnormality through a preset operation and maintenance robot and outputting an operation and maintenance solution to relevant personnel according to the analysis result includes:
if an abnormality is detected in the operation and maintenance index monitoring item, outputting alarm information to the preset operation and maintenance robot, extracting, through the operation and maintenance robot, abnormal characteristic information of the operation and maintenance index monitoring item corresponding to the alarm information, and analyzing the abnormal characteristic information to obtain the corresponding fault information;
and querying, through the operation and maintenance robot, an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database, and outputting the operation and maintenance solution to relevant personnel.
Optionally, after the step of querying, through the operation and maintenance robot, an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database and outputting the operation and maintenance solution to relevant personnel, the method further includes:
acquiring historical operation and maintenance data;
determining the fault content in the historical operation and maintenance data and the operation and maintenance solution used to resolve that fault content;
extracting features of the fault content, and generating a feature tag from the extracted features, wherein the feature tag, the fault content and the operation and maintenance solution have a mapping relationship;
and generating an operation and maintenance database from the feature tags, the fault content and the operation and maintenance solutions.
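As an illustration of the mapping described above, the sketch below builds an in-memory database keyed by feature tag. The application does not specify how features are extracted; the keyword-set extraction used here, and the names `Record`, `feature_tag` and `build_database`, are assumptions made purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    fault_content: str  # description of a historical fault
    solution: str       # the solution that resolved it


def feature_tag(fault_content: str) -> frozenset[str]:
    """Hypothetical feature extraction: the lower-cased keywords of the fault text."""
    stop_words = {"the", "a", "is", "on", "of"}
    return frozenset(w for w in fault_content.lower().split() if w not in stop_words)


def build_database(history: list[Record]) -> dict[frozenset[str], Record]:
    """Map each feature tag to its fault content and operation and maintenance solution."""
    return {feature_tag(r.fault_content): r for r in history}
```

A production system would more likely persist this mapping in a real database and use a richer feature extractor; the dictionary only shows the tag-to-fault-to-solution relationship.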
Optionally, the step of querying, through the operation and maintenance robot, an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database and outputting the operation and maintenance solution to relevant personnel includes:
querying, through the operation and maintenance robot, whether a feature tag matching the abnormal characteristic information exists in the preset operation and maintenance database;
if such a feature tag exists, determining the operation and maintenance solution corresponding to the fault information according to the feature tag, and outputting the operation and maintenance solution to relevant personnel.
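The tag lookup can be sketched as follows. The matching rule is not defined in the application; this example assumes a tag matches when all of its features are present in the abnormal characteristic information, and prefers the most specific (largest) matching tag. `find_solution` is a hypothetical name.

```python
from typing import Optional


def find_solution(database: dict[frozenset, str],
                  anomaly_features: set[str]) -> Optional[str]:
    """Return the solution of the most specific matching feature tag, or None."""
    best_tag = None
    for tag in database:
        # Assumed rule: a tag matches if every feature in it was observed.
        if tag <= anomaly_features and (best_tag is None or len(tag) > len(best_tag)):
            best_tag = tag
    return database[best_tag] if best_tag is not None else None
```

Returning `None` when no tag matches leaves room for the caller to escalate the alarm to a person instead of proposing an unrelated solution.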
Optionally, the step of querying, through the operation and maintenance robot, an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database and outputting the operation and maintenance solution to relevant personnel further includes:
querying, through the operation and maintenance robot, the operation and maintenance solution corresponding to the fault information in the preset operation and maintenance database, and analyzing the operation and maintenance solution;
predicting the impact of the operation and maintenance solution on the target equipment according to the result of the solution analysis;
and generating approval flows of different levels according to the impact, and outputting the operation and maintenance solution to the corresponding relevant personnel according to the approval flow.
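One way to read the impact-based approval step is as a lookup from predicted impact to an approver chain. The sketch below is illustrative only: the three impact levels, the action names, the role names and the rule that unknown actions receive the strictest review are all assumptions, not details from the application.

```python
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical predicted impact of each maintenance action on the target equipment.
ACTION_IMPACT = {
    "refresh_vm": Impact.LOW,
    "restart_vm": Impact.MEDIUM,
    "migrate_vm": Impact.HIGH,
}

# Hypothetical approver chains; a higher impact requires more sign-offs.
APPROVERS = {
    Impact.LOW: ["on_call_engineer"],
    Impact.MEDIUM: ["on_call_engineer", "team_lead"],
    Impact.HIGH: ["on_call_engineer", "team_lead", "platform_owner"],
}


def approval_flow(action: str) -> list[str]:
    """Return the ordered approvers for an action; unknown actions get the strictest flow."""
    impact = ACTION_IMPACT.get(action, Impact.HIGH)
    return APPROVERS[impact]
```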
Optionally, after the step of acquiring the operation and maintenance index monitoring item of the target equipment, the method further includes:
acquiring an index threshold corresponding to the operation and maintenance index monitoring item;
comparing the operation and maintenance index monitoring item with the index threshold;
if the operation and maintenance index monitoring item is greater than the corresponding index threshold, generating alarm information, and converting the alarm information into an operation and maintenance work order through a preset operation and maintenance robot, so that relevant personnel can confirm the current fault of the target equipment according to the operation and maintenance work order.
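The comparison and work-order step might look like the sketch below. The `WorkOrder` fields and the `check_items` function are hypothetical; the only behaviour taken from the text is that a work order is produced when a monitoring item is strictly greater than its threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class WorkOrder:
    device: str
    item: str
    value: float
    threshold: float
    created_at: str  # ISO 8601 timestamp of when the alarm was raised


def check_items(device: str,
                metrics: dict[str, float],
                thresholds: dict[str, float]) -> list[WorkOrder]:
    """Create one work order per monitoring item that exceeds its threshold."""
    orders = []
    for item, value in metrics.items():
        threshold = thresholds.get(item)
        if threshold is not None and value > threshold:
            now = datetime.now(timezone.utc).isoformat()
            orders.append(WorkOrder(device, item, value, threshold, now))
    return orders
```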
Optionally, the step of acquiring the index threshold corresponding to the operation and maintenance index monitoring item includes:
acquiring a historical average performance index of the target equipment in a preset running state;
and determining the index threshold corresponding to the operation and maintenance index monitoring item according to a preset weight, the historical average performance index and a preset calibration index, wherein the preset calibration index is a fixed-value index preset by relevant personnel.
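The application names the three inputs to the threshold (a preset weight, the historical average performance index and a preset calibration index) but does not give the formula that combines them. The sketch below assumes a simple convex combination, threshold = w * historical_average + (1 - w) * calibration; this formula and the function names are assumptions for illustration.

```python
from statistics import fmean


def historical_average(samples: list[float]) -> float:
    """Average of the index values recorded in the preset running state."""
    return fmean(samples)


def index_threshold(historical_avg: float, calibration: float, weight: float) -> float:
    """Assumed formula: weight * historical_avg + (1 - weight) * calibration."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * historical_avg + (1.0 - weight) * calibration
```

Under this assumption a weight of 0 reduces the threshold to the fixed calibration value set by the relevant personnel, while a weight of 1 makes it track the historical average exactly.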
In addition, to achieve the above object, the present application further provides an automated operation and maintenance device, including:
an acquisition module, configured to acquire an operation and maintenance index monitoring item of target equipment;
a judging module, configured to, if an abnormality is detected in the operation and maintenance index monitoring item, analyze the abnormality through a preset operation and maintenance robot and output an operation and maintenance solution to relevant personnel according to the analysis result, so that the relevant personnel can review whether to execute the operation and maintenance solution;
and an operation and maintenance module, configured to, if the relevant personnel agree to execute the operation and maintenance solution, perform automated operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution.
In addition, to achieve the above object, the present application further provides automated operation and maintenance equipment, including: a memory, a processor, and an automated operation and maintenance program stored on the memory and executable on the processor, the automated operation and maintenance program being configured to implement the steps of the automated operation and maintenance method described above.
In addition, to achieve the above object, the present application further provides a computer readable storage medium having stored thereon an automated operation and maintenance program which, when executed by a processor, implements the steps of the automated operation and maintenance method as described above.
The method of the present application acquires an operation and maintenance index monitoring item of target equipment; if an abnormality is detected in the operation and maintenance index monitoring item, analyzes the abnormality through a preset operation and maintenance robot and outputs an operation and maintenance solution to relevant personnel according to the analysis result, so that the relevant personnel can review whether to execute the operation and maintenance solution; and, if the relevant personnel agree to execute the operation and maintenance solution, performs automated operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution. In other words, the preset operation and maintenance robot analyzes the abnormality in the monitoring item and outputs the corresponding solution to the relevant personnel, and the target equipment is maintained automatically according to that solution once the relevant personnel have approved it, which improves operation and maintenance efficiency.
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of an automated operation and maintenance method of the present application;
FIG. 2 is a schematic diagram of a system architecture of an automated operation and maintenance system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a system architecture of a large-scale cloud platform management system according to an embodiment of the present application;
FIG. 4 is a flow chart of a second embodiment of the automated operation and maintenance method of the present application;
FIG. 5 is a schematic flow chart of the interaction between automated operation and maintenance and relevant personnel in an embodiment of the present application;
FIG. 6 is a flow chart of a third embodiment of an automated operation and maintenance method of the present application;
FIG. 7 is a schematic flow chart of an analysis process performed by the operation and maintenance robot in the embodiment of the present application;
FIG. 8 is a flow chart of an overall automated operation and maintenance scheme in an embodiment of the present application;
FIG. 9 is a block diagram of an embodiment of an automated operation and maintenance device of the present application;
FIG. 10 is a schematic structural diagram of the equipment in the hardware running environment according to an embodiment of the present application.
The realization of the objects, the functional characteristics and the advantages of the present application will be further described with reference to the embodiments and the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to FIG. 1, FIG. 1 is a schematic flow chart of a first embodiment of an automated operation and maintenance method of the present application.
In a first embodiment, the automated operation and maintenance method comprises the steps of:
S10, acquiring an operation and maintenance index monitoring item of the target equipment.
It should be noted that cloud platforms keep growing in scale, and the technologies involved in their operation and maintenance grow accordingly. As the scale increases, a failure of the virtualization layer requires a large investment of manpower and material resources to troubleshoot the underlying layers one by one, which is inefficient.
In addition, the cloud platform virtualization layer is tightly coupled with the underlying hardware, the network and other links. Once a fault occurs, the entire link must be checked, which is difficult. Current methods for repairing virtualization-layer faults still require a high degree of manual intervention and lack automation, so rapid response and efficient repair cannot be achieved.
Therefore, in this embodiment, automated operation and maintenance is required for the corresponding virtualization layer.
It should be further noted that the related art does include automated operation and maintenance means, but these have shortcomings. For example, they merely check the entire link associated with the cloud platform, and during the check some of the configured parameters or check items may deviate from what actually needs to be checked, so the effect of the automated operation and maintenance is mediocre.
In this embodiment, accurate check items should therefore be set, and the target equipment to be checked must be identified when setting them. For this purpose, this embodiment sets operation and maintenance index monitoring items, through which accurate monitoring can be achieved. As operation and maintenance means continue to advance, the scheme, and the content of the operation and maintenance index monitoring items, can be continuously optimized.
It can be understood that the target equipment in this embodiment refers to all virtual machines that currently need to be monitored, together with the equipment that carries them, such as Hyper-V hosts, S2D storage equipment and SDN network equipment.
An operation and maintenance index monitoring item is an index data item that is generated by the target equipment during operation, reflects the performance of the target equipment, and needs to be included in operation and maintenance monitoring. The operation and maintenance index monitoring items are divided into performance indexes and log data.
For example, the operation and maintenance index monitoring items include data on the CPU, memory, disk, network and so on.
To acquire the operation and maintenance index monitoring items of the target equipment, corresponding data monitoring equipment or monitoring software must be set up.
For example, in this embodiment, Operations Manager may be adopted, specifically System Center Operations Manager (SCOM). SCOM is enterprise-level monitoring and management software intended to help an enterprise monitor and manage its IT infrastructure comprehensively; through it, the data corresponding to the operation and maintenance index monitoring items can be obtained in real time.
SCOM can monitor various servers, applications and network equipment, including Windows Server, Linux, Unix, VMware and Hyper-V, and provides real-time operation and maintenance index monitoring items, event logs, alarms and other information, helping administrators discover and solve problems in time and improving the availability and stability of the system. Its functions include:
(1) Alarms: when a problem occurs in the system, SCOM can automatically trigger an alarm to notify an administrator to handle it in time, preventing the problem from escalating.
(2) Reporting: SCOM can generate various reports, including performance reports, event reports and alarm reports, to help administrators understand the behavior and trends of the system.
(3) Automation: SCOM can automatically execute various tasks, including automatic problem repair, automatic application deployment and automatic backup, reducing the workload of administrators.
(4) Integration: SCOM can be integrated with other Microsoft products, including System Center Configuration Manager and System Center Virtual Machine Manager, for more comprehensive management.
(5) Extensibility: SCOM supports various extensions, including custom monitoring, custom alarms and custom reports, to meet the needs of different enterprises.
To sum up, in this embodiment, for the core virtual machine management console, namely VMM, the interval after which a virtual machine interruption triggers an alarm can be customized by adjusting a time parameter. The specific monitoring alarm dimensions are listed in Table 1.
Table 1: Implemented monitoring alarm dimensions
S20, if an abnormality is detected in the operation and maintenance index monitoring item, analyzing the abnormality through a preset operation and maintenance robot, and outputting an operation and maintenance solution to relevant personnel according to the analysis result, so that the relevant personnel can review whether to execute the operation and maintenance solution.
It can be understood that when any data or index in the operation and maintenance index monitoring items exceeds its preset threshold, the monitoring item is determined to be abnormal; an alarm is then triggered automatically and the relevant personnel are contacted through the preset operation and maintenance robot for handling.
The operation and maintenance robot can automatically discover various faults, including virtual machine downtime, abnormal virtual machine states and virtual machine network faults, and provides detailed fault diagnosis information.
The operation and maintenance robot can send alarms and fault notifications to relevant personnel by mail, short message, instant messaging software and the like.
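The multi-channel notification described above can be sketched as a small dispatcher. The channel names and the `notify` function are hypothetical, and the actual senders (mail gateway, SMS gateway, chat API) are deliberately left as injected callables because the application does not name any specific service.

```python
from typing import Callable

# A sender takes (recipient, message); real senders would wrap a mail, SMS or chat service.
Sender = Callable[[str, str], None]


def notify(recipients: dict[str, list[str]],
           message: str,
           senders: dict[str, Sender]) -> list[tuple[str, str]]:
    """Send the message to each recipient on each preferred channel that is configured."""
    delivered = []
    for recipient, channels in recipients.items():
        for channel in channels:
            sender = senders.get(channel)
            if sender is None:
                continue  # channel not configured; skip it rather than fail
            sender(recipient, message)
            delivered.append((recipient, channel))
    return delivered
```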
When an alarm or fault notification is received, the relevant engineers can confirm it manually through the platform and view the detailed fault diagnosis. After confirming the fault, an engineer can execute the corresponding operation and maintenance recovery operations through the security commands provided by the present application, including restarting, migrating or refreshing the virtual machine. When the repair operation is completed, the platform automatically checks the relevant indexes to ensure that the fault has been fully resolved, and returns the health results to the relevant personnel.
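The confirm, recover and verify sequence in the paragraph above can be sketched as follows. The function name, the allow-list of actions and the shape of the health-check result are assumptions; the application only states that recovery runs through provided security commands after manual confirmation and that indexes are re-checked afterwards.

```python
from typing import Callable


def run_recovery(action: str,
                 confirmed: bool,
                 actions: dict[str, Callable[[], None]],
                 health_check: Callable[[], dict[str, bool]]) -> dict:
    """Run an allow-listed recovery action after confirmation, then verify health."""
    if not confirmed:
        return {"status": "skipped", "health": {}}
    if action not in actions:
        # Only commands on the allow-list may be executed.
        return {"status": "rejected", "health": {}}
    actions[action]()
    health = health_check()  # re-check the indexes after the repair
    status = "resolved" if all(health.values()) else "unresolved"
    return {"status": status, "health": health}
```

Restricting execution to a fixed dictionary of callables is one plausible reading of "security commands": the engineer chooses from vetted operations rather than sending arbitrary commands to the host.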
It can be understood that when the operation and maintenance robot analyzes the abnormality of the operation and maintenance index monitoring item, it can identify the underlying problem. For example, if the current virtual machine is down, its state needs to be updated by refreshing or restarting so that it can run normally again. In other words, while analyzing the abnormality, the operation and maintenance robot can also derive an operation and maintenance solution for it, and the analysis result and the operation and maintenance solution are sent together to the related personnel for handling.
After the related personnel receive the operation and maintenance solution and approve it through manual review, the operation and maintenance robot can implement it. This realizes interaction between the related personnel and the operation and maintenance robot, achieves timely and accurate operation and maintenance, and greatly improves operation and maintenance efficiency.
Specifically, this embodiment uses a preset operation and maintenance robot, which may be a chat robot. Connecting the chat robot with the SOCM realizes automatic docking of the full-link operation and maintenance system and real-time man-machine interaction.
Referring to fig. 2, the current automated operation and maintenance architecture is based on layer-by-layer management through Hyper-V, Failover Cluster, and SCVMM, and forms a highly reliable, high-performance cloud computing architecture for a pyramid-shaped large-scale cloud platform. The architecture separates storage, computing, and networking to achieve better resource utilization and higher availability. For storage, the cloud platform uses distributed storage to spread storage resources across multiple nodes, providing high availability and fault tolerance for data. For computing, the cloud platform uses Hyper-V virtualization to divide a physical server into multiple virtual machines, providing better resource utilization and greater flexibility. For networking, the cloud platform uses SDN to separate different network traffic into different network channels, providing better network performance and higher security.
It should be noted that, in this embodiment, the operation and maintenance solutions output by the operation and maintenance robot are mainly virtual machine fault repair solutions at the virtualization layer, distilled and summarized by related personnel from years of operation and maintenance experience. They are stored as a knowledge base and screened against the alarm characteristics; the solution matching the virtual machine fault type is selected and provided to the operation and maintenance engineer for a decision. Throughout the process, ChatOps only organizes data, recommends matches from knowledge base experience, suggests operations to execute, and so on; it does not autonomously execute any change action, which keeps every link of the system controllable. Manual interaction is the main driving core, so the operation and maintenance engineer can make the most reliable execution judgment. The final execution instruction generated in the process is converted into a formal change-system work order flow, in which different scopes of impact require approval by different approvers, ensuring strict process compliance.
In addition, managing the large-scale cloud platform requires a corresponding management system, whose architecture is shown in fig. 3. Specifically, the high-availability cloud architecture is a cloud computing system architecture based on the Hyper-V virtualization technology, and comprises functional modules such as instance management, OpenAPI, expert ability, service management, resource management, task scheduling, engine management, and tenant platform. The architecture uses a Server-Control-Agent technology to monitor and automatically manage all instances, so that management reaches each terminal node in real time. The OpenAPI technology supported by the architecture provides developers with a unified API, which facilitates application development and integration. Operation and maintenance engineers can access the cloud platform's monitoring services and repair methods through the OpenAPI interface, enabling quick deployment and execution of operation and maintenance operations. The architecture uses a knowledge-base-based expert ability technology, which can provide intelligent fault diagnosis and optimization suggestions to operation and maintenance managers (used in combination with ChatOps in this application). The architecture uses technologies such as service registration, discovery, routing, and load balancing to provide unified management and scheduling of the services on the cloud platform, giving operation and maintenance services high availability and high efficiency. The architecture uses resource pooling and dynamic allocation to provide unified management and scheduling of the resources on the cloud platform, so that resources are used efficiently and economically.
When a problem cannot be solved, its impact can be eliminated through quick isolation. The architecture uses distributed task scheduling to provide unified management and scheduling of the tasks on the cloud platform, so that tasks are executed efficiently and optimized. In addition, the architecture uses engine management monitoring, which provides multiple functions and keeps monitoring online around the clock, across multiple dimensions and without blind spots.
S30, if the related personnel agree to execute the operation and maintenance solution, carrying out automated operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution.
It can be appreciated that when the related personnel agree to execute the operation and maintenance solution, the operation and maintenance robot carries out the corresponding operation and maintenance solution, thereby performing automated operation and maintenance on the target device.
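The approval gate between S20 and S30 can be sketched as follows. This is only an illustration under stated assumptions: the names `Solution` and `run_if_approved` are hypothetical and do not come from this application, and the "execution" is a stand-in callback rather than a real repair.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Solution:
    """Hypothetical container for a recommended repair (not from the source)."""
    target: str
    action: str  # e.g. "refresh", "restart", "migrate"


def run_if_approved(solution: Solution,
                    approved: bool,
                    execute: Callable[[Solution], None]) -> bool:
    """Execute the solution only after a human reviewer has agreed to it."""
    if not approved:
        return False  # the robot never performs a change on its own
    execute(solution)
    return True


# Stand-in executor that just records which actions were run.
executed: List[str] = []
sol = Solution(target="vm-01", action="refresh")
rejected = run_if_approved(sol, approved=False, execute=lambda s: executed.append(s.action))
accepted = run_if_approved(sol, approved=True, execute=lambda s: executed.append(s.action))
```

Keeping the approval flag as an explicit argument makes it hard to reach the executor without a recorded human decision, which matches the requirement that the robot recommends but does not act autonomously.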
In addition, this embodiment also provides a visualization component, whose core idea is to make the post-repair health check readable and to present the repair results clearly.
For example, the list of abnormal VMs and their heartbeat states is obtained from metadata; after the information is summarized, the VM list object is returned and written to a CSV file as an archive document, and the states before and after the repair are compared to verify that the repair was effective.
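A minimal sketch of this archive-and-compare step, using only the Python standard library. The field names (`name`, `heartbeat`) and the state values (`"OK"`, `"Lost"`) are assumptions made for illustration; the application does not specify the metadata format.

```python
import csv
import io
from typing import Dict, List


def archive_vm_states(vms: List[Dict[str, str]]) -> str:
    """Serialize the VM list as CSV text suitable for archiving."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "heartbeat"])
    writer.writeheader()
    writer.writerows(vms)
    return buf.getvalue()


def repaired_vms(before: List[Dict[str, str]],
                 after: List[Dict[str, str]]) -> List[str]:
    """Return VMs whose heartbeat was abnormal before and is OK after repair."""
    after_state = {vm["name"]: vm["heartbeat"] for vm in after}
    return [vm["name"] for vm in before
            if vm["heartbeat"] != "OK" and after_state.get(vm["name"]) == "OK"]


# Illustrative snapshots taken before and after a repair.
before = [{"name": "vm-01", "heartbeat": "Lost"},
          {"name": "vm-02", "heartbeat": "Lost"}]
after = [{"name": "vm-01", "heartbeat": "OK"},
         {"name": "vm-02", "heartbeat": "Lost"}]
archive = archive_vm_states(before)
fixed = repaired_vms(before, after)
```

In practice the CSV text would be written to a file; an in-memory buffer is used here so the sketch has no side effects.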
Grafana is a popular open-source monitoring and data visualization tool. It supports a variety of data sources, including Prometheus, InfluxDB, and Elasticsearch, so different monitoring data sources can be conveniently integrated for monitoring and alerting. It also provides flexible alert rule configuration: alert rules can be set for different monitoring indexes and thresholds to monitor and alert on various abnormal conditions.
In this embodiment, the operation and maintenance index monitoring item of the target equipment is obtained; if the operation and maintenance index monitoring item is found to be abnormal, the abnormality is analyzed through a preset operation and maintenance robot, and an operation and maintenance solution is output to related personnel according to the analysis result so that the related personnel can review whether to execute it; and if the related personnel agree to execute the operation and maintenance solution, automated operation and maintenance is performed on the target equipment through the operation and maintenance robot according to the solution. That is, the preset operation and maintenance robot can analyze the abnormality in the operation and maintenance index monitoring item and output the corresponding operation and maintenance solution to the related personnel, and after their review and approval the target equipment is automatically operated and maintained according to the solution, which improves operation and maintenance efficiency.
As shown in fig. 4, a second embodiment of the automated operation and maintenance method of the present application is proposed based on the first embodiment, and in this embodiment, step S20 specifically includes:
S21, if the operation and maintenance index monitoring item is monitored to be abnormal, outputting alarm information to a preset operation and maintenance robot, extracting abnormal characteristic information of the operation and maintenance index monitoring item corresponding to the alarm information through the operation and maintenance robot, and analyzing and obtaining fault information corresponding to the abnormal characteristic information.
It can be understood that when an abnormality is detected in the operation and maintenance index monitoring item, the corresponding alarm information can be output to the preset operation and maintenance robot, and the operation and maintenance robot analyzes the abnormality to realize the automated operation and maintenance process.
The analysis process of the operation and maintenance robot mainly analyzes the content of the alarm information, so that the abnormal characteristic information of the alarm information can be extracted, and the abnormal characteristic information can comprise information such as data types of abnormal operation and maintenance index monitoring items, specific characterization values corresponding to the data and the like.
It can be understood that the preset operation and maintenance robot can analyze abnormal fault information existing in the current operation and maintenance index monitoring item according to the abnormal characteristic information, wherein the fault information refers to information such as specific fault types and fault conditions shown by the abnormal characteristic information, and the fault information and the abnormal characteristic information are correspondingly associated.
S22, inquiring an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database through the operation and maintenance robot, and outputting the operation and maintenance solution to related personnel.
It can be understood that the preset operation and maintenance database is a database of operation and maintenance data summarized by the cloud platform's operation and maintenance personnel from their own knowledge, including the operation and maintenance problems they have encountered in daily work. The database contains two main types of data, fault information and operation and maintenance solutions, stored in correspondence, so that an operation and maintenance solution can be found by matching against the database. After the operation and maintenance robot queries the corresponding operation and maintenance solution from the database, it can send the solution to the related personnel for review.
The operation and maintenance database can be regarded as a knowledge base, wherein the corresponding steps for executing the operation and maintenance solution are saved in the form of documents, and are mainly embodied in the form of software codes.
Specifically, the operation and maintenance database is generated as follows: historical operation and maintenance data are acquired; the fault content in the historical data and the operation and maintenance solution that resolved it are determined; features are extracted from the fault content and a feature label is generated from the extracted features, where the feature label, the fault content, and the operation and maintenance solution are mapped to one another; and finally the operation and maintenance database is generated from the feature labels, the fault content, and the operation and maintenance solutions.
It can be understood that the feature tag is the most simplified information tag for characterizing the abnormal feature, and the feature tag, the fault content and the operation and maintenance solution have mapping relations, so that a corresponding operation and maintenance database can be generated according to the three data.
It can be understood that after the operation and maintenance database with the corresponding feature labels is generated, whether the feature labels matched with the abnormal feature information exist or not can be inquired in the operation and maintenance database according to the abnormal feature information as an index item, if so, an operation and maintenance solution corresponding to the fault information is determined according to the feature labels, and the operation and maintenance solution is output to related personnel.
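The label-based lookup can be sketched as a dictionary keyed by feature label. Everything concrete here is assumed for illustration: the label strings, the example entries, and the names `KnowledgeEntry` and `find_solution` are hypothetical and not taken from this application.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class KnowledgeEntry:
    """One record mapping a feature label to fault content and a solution."""
    fault_content: str
    solution: str


# Feature label -> (fault content, solution). Entries are illustrative only.
KNOWLEDGE_BASE: Dict[str, KnowledgeEntry] = {
    "vm_heartbeat_lost": KnowledgeEntry(
        fault_content="virtual machine down",
        solution="refresh virtual machine",
    ),
    "vm_network_unreachable": KnowledgeEntry(
        fault_content="virtual machine network fault",
        solution="migrate virtual machine",
    ),
}


def find_solution(feature_label: str) -> Optional[KnowledgeEntry]:
    """Return the matching entry, or None when no feature label matches."""
    return KNOWLEDGE_BASE.get(feature_label)


hit = find_solution("vm_heartbeat_lost")
miss = find_solution("disk_full")
```

A keyed lookup is constant-time on average, which is one way to read the claim that label matching reduces query time compared with scanning fault descriptions.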
It can be understood that matching by feature label reduces the time required for the query, which effectively shortens the total time the operation and maintenance robot needs to analyze the abnormality and output the operation and maintenance solution, further improving operation and maintenance efficiency.
In addition, after the operation and maintenance robot queries the operation and maintenance solution corresponding to the fault information in the preset operation and maintenance database, the solution may be analyzed, the impact of the solution on the target equipment predicted from the analysis result, approval flows of different levels generated according to the impact, and the solution output to the corresponding related personnel according to the approval flow.
It can be understood that the influence condition refers to the influence condition on the user after the operation condition corresponding to the target equipment is changed by carrying out automatic operation and maintenance on the target equipment through the operation and maintenance solution.
For example, suppose the analysis shows that the target device is down. Restarting the target device would restore normal operation, but a restart is a change action that affects the user: if data temporarily held before the restart has not actually been written to the local disk, the user is seriously affected. If a refresh is used instead, the downed target device is refreshed in an attempt to restore normal operation; a refresh has no substantial effect on the target device and causes no irreversible loss to the user's work.
Therefore, before an operation and maintenance solution is executed, the impact that each candidate solution is expected to have is analyzed, and approval flows of different levels are generated according to the severity of the impact. The higher the degree of impact, the higher the management authority of the related personnel to whom the approval flow is assigned; conversely, the lower the degree of impact, the lower the management authority required. The related personnel can be classified according to their authority level or job title, for example into product owners and technical experts. Approval may be given by either of the two parties, or a joint approval flow requiring approval by both may be set, so as to avoid losses.
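One possible reading of this tiered approval is sketched below. The two impact levels and the assignment of roles to them are assumptions: the text only states that higher impact requires higher authority and that joint approval can be configured, so the mapping here is hypothetical.

```python
from typing import Dict, List

# Hypothetical mapping from impact level to required approver roles.
# Assumption: low impact needs one approver, high impact needs joint approval.
APPROVERS_BY_IMPACT: Dict[str, List[str]] = {
    "low": ["technical expert"],
    "high": ["product owner", "technical expert"],
}


def approvers_for(impact: str) -> List[str]:
    """Return the roles that must approve a solution with the given impact."""
    try:
        return list(APPROVERS_BY_IMPACT[impact])
    except KeyError:
        raise ValueError(f"unknown impact level: {impact!r}") from None


def is_approved(impact: str, approvals: List[str]) -> bool:
    """True only when every required role appears in the recorded approvals."""
    return all(role in approvals for role in approvers_for(impact))


low_ok = is_approved("low", ["technical expert"])
high_partial = is_approved("high", ["technical expert"])
high_joint = is_approved("high", ["technical expert", "product owner"])
```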
Furthermore, it should be noted that, referring to fig. 5, an operation and maintenance solution itself also carries a certain risk. For example, if the solution consists of a single change operation, such as refreshing the current state of the target device, the change has little effect on the user. If instead the solution combines several change operations, it may have multiple steps involving different kinds of operations, such as data updates, state updates, log data pulling, and automatic repair; because of their high complexity, such operations may pose unpredictable risks to the user. A corresponding risk level should therefore be set for each operation and maintenance solution: when the solution involves one change operation the level is low, and when it involves multiple change operations the level is high. That is, operation and maintenance solutions are classified into normal changes and high-risk changes.
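The classification rule described above (one change operation is a normal change, several are a high-risk change) is simple enough to state directly. The function name and the returned strings are hypothetical; only the one-versus-many rule comes from the text.

```python
from typing import Sequence


def risk_level(change_operations: Sequence[str]) -> str:
    """Classify a solution by how many change operations it contains."""
    if not change_operations:
        raise ValueError("a solution needs at least one change operation")
    # A single change is treated as a normal change; a combination of
    # several changes is treated as high risk because of its complexity.
    return "normal" if len(change_operations) == 1 else "high-risk"


single = risk_level(["refresh state"])
combined = risk_level(["update data", "update state", "pull logs", "auto repair"])
```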
In addition, emergency change conditions can be set according to actual needs, and related personnel are prompted to carry out emergency handling when those conditions are met. Pre-approved change items can also be set according to actual needs, so that the approval stage by related personnel is skipped and operation and maintenance efficiency is improved.
Furthermore, the operation and maintenance result of the change operations in the final operation and maintenance solution is checked against expectations: if the result meets expectations, the corresponding work order can be closed; if not, a fault flow must be started, that is, the fault must be resolved through manual intervention.
The flow sequence corresponding to the above description may refer to fig. 5.
According to the method, if the operation and maintenance index monitoring item is monitored to be abnormal, alarm information is output to a preset operation and maintenance robot, abnormal characteristic information of the operation and maintenance index monitoring item corresponding to the alarm information is extracted through the operation and maintenance robot, and fault information corresponding to the abnormal characteristic information is obtained through analysis; in a preset operation and maintenance database, inquiring an operation and maintenance solution corresponding to the fault information through the operation and maintenance robot, and outputting the operation and maintenance solution to related personnel, namely, analyzing and outputting abnormal characteristic information corresponding to alarm information, and matching the operation and maintenance solution corresponding to the abnormal characteristic information from the preset operation and maintenance database, so that the overall efficiency of operation and maintenance is improved.
As shown in fig. 6, a third embodiment of the automated operation and maintenance method of the present application is provided based on the first embodiment and the second embodiment, and in this embodiment, the method further includes:
S101, acquiring an index threshold corresponding to the operation and maintenance index monitoring item.
It is understood that the index threshold is used as a threshold for evaluating whether an abnormality exists in the operation and maintenance index monitoring item.
According to the above embodiment, the operation and maintenance index monitoring item includes a plurality of data, different data correspond to different thresholds, and the size of the index threshold corresponding to each data is different.
Wherein the index threshold may be determined by a corresponding technician or expert.
In addition, the index threshold can be combined with the history data of the corresponding operation and maintenance index monitoring items, and the currently required index threshold is calculated so as to cater to the conditions of the operation and maintenance index monitoring items corresponding to target equipment at different time points, thereby improving the accuracy of judging the abnormal conditions of the operation and maintenance index monitoring items.
Specifically, a historical average performance index of the target equipment in a preset running state is obtained, and an index threshold corresponding to the operation and maintenance index monitoring item is determined according to a preset weight, the historical average performance index and a preset calibration index, wherein the preset calibration index is a fixed value index preset by related personnel.
It is understood that the preset operation state refers to a state in which the target device is in normal operation.
It can be understood that the historical average performance index is the average of the historical performance indexes of the target equipment in the normal running state. For example, the performance indexes represented by the operation and maintenance index monitoring items on the previous day, on the day after, and at other times are collected so that the historical performance indexes cover one month or another duration, and the historical performance indexes over that period are then averaged to obtain the historical average performance index. The historical average performance index reflects how the performance index fluctuates, for example by determining its peak values, and can be used to determine whether the corresponding operation and maintenance index monitoring item is abnormal.
Further, the preset weight, the historical average performance index and the preset calibration index can be integrated, and the index threshold corresponding to the corresponding operation and maintenance index monitoring item can be determined.
The preset calibration index is a fixed value manually preset by related personnel; it is used to balance the continuously changing historical average performance index and to offset excessive changes in it.
The preset weight is a weight value manually preset by related personnel. Weights can be assigned to the historical average performance index and the preset calibration index, and the two are then combined to obtain the index threshold.
Specifically, the preset weight is determined according to the scale of the cloud platform: the larger the cloud platform, the larger the weight of the historical average performance index and the smaller the weight of the preset calibration index. After the preset weights are assigned to the historical average performance index and the preset calibration index, their sum is calculated and an average value is obtained; if the difference between the average value and the preset calibration index is smaller than a preset threshold, the average value is used as the index threshold.
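The calculation can be sketched as a weighted average with a sanity check against the calibration index. Two points are assumptions rather than statements from the text: the two weights are taken to sum to 1 (so the weighted sum is itself the average), and when the deviation check fails the sketch falls back to the calibration index, since the text does not say what happens in that case. All names are hypothetical.

```python
def index_threshold(hist_avg: float,
                    calibration: float,
                    hist_weight: float,
                    max_deviation: float) -> float:
    """Blend the historical average with the fixed calibration index.

    hist_weight is the weight of the historical average; the calibration
    index receives (1 - hist_weight). Larger platforms would use a larger
    hist_weight.
    """
    if not 0.0 <= hist_weight <= 1.0:
        raise ValueError("hist_weight must be between 0 and 1")
    weighted = hist_weight * hist_avg + (1.0 - hist_weight) * calibration
    if abs(weighted - calibration) < max_deviation:
        return weighted
    # Assumption: fall back to the calibration index when the blended
    # value drifts too far from it (not specified in the source).
    return calibration


accepted = index_threshold(hist_avg=70.0, calibration=80.0,
                           hist_weight=0.5, max_deviation=10.0)
rejected = index_threshold(hist_avg=20.0, calibration=80.0,
                           hist_weight=0.5, max_deviation=10.0)
```

With these inputs the first call blends 70 and 80 into 75, which is within 10 of the calibration index and is accepted; the second blends 20 and 80 into 50, which is too far away, so the calibration index is used instead.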
It should be noted that combining the preset calibration index with the historical average performance index yields an index threshold adapted to the current scenario. At the same time, the preset calibration index serves as the standard for judging whether the calculated index threshold is reasonable, so that the finally selected index threshold suits the scale of the current cloud platform and can reasonably detect whether the current operation and maintenance index monitoring item is abnormal.
S102, comparing the operation and maintenance index monitoring item with the index threshold.
It can be understood that after the index threshold is obtained, the operation and maintenance index monitoring item and the corresponding index threshold can be compared and analyzed, and when the operation and maintenance index monitoring item is greater than the index threshold corresponding to the operation and maintenance index monitoring item, the operation and maintenance index monitoring item is determined to be abnormal.
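The comparison in S102 can be sketched as a per-item check, where each monitored value is compared with its own threshold. The metric names and values below are invented for illustration, and `abnormal_items` is a hypothetical helper.

```python
from typing import Dict, List


def abnormal_items(metrics: Dict[str, float],
                   thresholds: Dict[str, float]) -> List[str]:
    """Return the names of monitoring items that exceed their own threshold."""
    return [name for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]]


# Illustrative readings and thresholds (not taken from the source).
metrics = {"cpu_percent": 95.0, "memory_percent": 40.0}
thresholds = {"cpu_percent": 90.0, "memory_percent": 85.0}
flagged = abnormal_items(metrics, thresholds)
```

Items without a configured threshold are ignored here rather than flagged; that is a design choice of the sketch, not something the text prescribes.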
S103, if the operation and maintenance index monitoring item is larger than the corresponding index threshold, generating alarm information, and converting the alarm information into an operation and maintenance work order through a preset operation and maintenance robot so that related personnel can confirm the current fault of the target equipment according to the operation and maintenance work order.
It can be understood that when the operation and maintenance index monitoring item is abnormal, corresponding alarm information can be generated, and the alarm information is converted into a corresponding operation and maintenance work order through a preset operation and maintenance robot, so that related personnel can confirm the faults of the current target equipment according to the operation and maintenance work order.
Specifically, referring to fig. 7, the core function of the ChatOps module adopted in this embodiment is to convert alarms from the operation and maintenance monitoring center into operation and maintenance work orders, retrieve solutions from the knowledge base, and notify associated engineers for collaborative handling. The specific implementation process is as follows:
Converting the alert into a work order: when the monitoring center detects an abnormality, an alarm is automatically triggered and the alarm information is sent to the ChatOps back-end server. ChatOps converts the alarm information into an operation and maintenance work order and sends the work order information to the relevant engineers.
Retrieving a solution from the knowledge base: when an engineer receives the work order, the engineer can search the knowledge base through ChatOps to find a relevant solution. If a solution is found, the engineer can perform the corresponding operation directly in ChatOps to solve the problem.
Notifying associated engineers for collaborative handling: if the engineer cannot solve the problem, or needs help from other engineers, the relevant engineers can be notified in ChatOps. ChatOps automatically sends the notification and adds the relevant engineers to the work order's collaborative handling list.
Collaborative handling: when multiple engineers take part, ChatOps tracks the state and progress of the work order in real time so that the problem is solved promptly.
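The four steps above can be sketched with a small work order record. This is an in-memory illustration only: the `WorkOrder` fields, the function names, and the sample knowledge base are all hypothetical, and no real ChatOps or messaging API is involved.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class WorkOrder:
    """Hypothetical work order created from one alarm."""
    alert: str
    assignee: str
    solution: Optional[str] = None
    co_handlers: List[str] = field(default_factory=list)
    status: str = "open"


def alert_to_work_order(alert: str, assignee: str,
                        knowledge_base: Dict[str, str]) -> WorkOrder:
    """Convert an alarm into a work order and attach a solution if one is known."""
    return WorkOrder(alert=alert, assignee=assignee,
                     solution=knowledge_base.get(alert))


def add_co_handler(order: WorkOrder, engineer: str) -> None:
    """Add an engineer to the collaborative handling list, without duplicates."""
    if engineer != order.assignee and engineer not in order.co_handlers:
        order.co_handlers.append(engineer)


# Illustrative knowledge base keyed by alarm type.
kb = {"vm_heartbeat_lost": "refresh virtual machine"}
order = alert_to_work_order("vm_heartbeat_lost", "engineer-a", kb)
add_co_handler(order, "engineer-b")
add_co_handler(order, "engineer-b")  # repeated notification is ignored
unknown = alert_to_work_order("disk_full", "engineer-a", kb)
```

A work order whose `solution` is `None` corresponds to the case where the knowledge base has no match and other engineers need to be brought in.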
In summary, referring to fig. 8, this embodiment also provides an overall process of automatic fault monitoring and automated operation and maintenance. The process relies mainly on the alarms of the cloud platform. When a virtual-machine-level alarm is generated, the corresponding automated operation and maintenance process is invoked. When any other alarm is generated, conventional manual handling is required: the alarm is placed in a conventional alarm queue to wait for manual monitoring and response, and in this process the alarm-related logs, scope of impact, fault cause location, repair execution, and so on must be associated manually.
When the automated operation and maintenance process is executed, the alarm is placed in an automatic-processing alarm queue and the alarm information is output to the operation and maintenance robot. The robot analyzes the abnormality of the corresponding operation and maintenance index monitoring item, accesses the corresponding information collected by the service monitoring system, analyzes abnormalities such as an abnormal state, network interruption, or task failure, and derives the corresponding operation and maintenance solution. Chat content can then be generated, and the operation and maintenance solution and the analysis result are sent to related personnel such as operation and maintenance engineers. After an operation and maintenance engineer confirms, the automatic repair operation is executed. If execution succeeds, a health check is performed, and the conversation and related content are archived and closed. If execution fails, a rollback is required, the failure is logged, and the case waits for manual handling.
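The routing between the two alarm queues and the follow-up after a repair attempt can be sketched as follows. The alarm dictionary layout, the `"virtual_machine"` level value, and the step names are assumptions made for illustration; the application describes the flow only at the level of fig. 8.

```python
from collections import deque
from typing import Deque, Dict, List

# Two in-memory queues standing in for the automatic and conventional queues.
auto_queue: Deque[Dict[str, str]] = deque()
manual_queue: Deque[Dict[str, str]] = deque()


def route_alarm(alarm: Dict[str, str]) -> str:
    """Send VM-level alarms to automatic processing, everything else to manual."""
    if alarm.get("level") == "virtual_machine":
        auto_queue.append(alarm)
        return "automatic"
    manual_queue.append(alarm)
    return "manual"


def post_execution_steps(succeeded: bool) -> List[str]:
    """List the follow-up steps after an approved repair has been attempted."""
    if succeeded:
        return ["health check", "archive conversation", "close"]
    return ["rollback", "log failure", "wait for manual handling"]


first = route_alarm({"level": "virtual_machine", "name": "vm-01 heartbeat lost"})
second = route_alarm({"level": "storage", "name": "disk latency high"})
on_success = post_execution_steps(True)
on_failure = post_execution_steps(False)
```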
The embodiment obtains the index threshold corresponding to the operation and maintenance index monitoring item; comparing the operation and maintenance index monitoring item with the index threshold; if the operation and maintenance index monitoring item is larger than the corresponding index threshold value, generating alarm information, and converting the alarm information into an operation and maintenance work order through a preset operation and maintenance robot, so that related personnel can confirm the faults existing in the target equipment according to the operation and maintenance work order, and accordingly whether the current operation and maintenance index monitoring item is abnormal or not can be accurately judged through the index threshold value, a corresponding operation and maintenance work order can be generated, and the related personnel can repair the corresponding abnormality through the operation and maintenance work order in time.
In addition, an embodiment of the present application further provides an automated operation and maintenance device. Referring to fig. 9, the automated operation and maintenance device includes:
an acquisition module 10, configured to acquire the operation and maintenance index monitoring item of the target equipment;
a judging module 20, configured to, if an abnormality is detected in the operation and maintenance index monitoring item, analyze the abnormality through a preset operation and maintenance robot and output an operation and maintenance solution to the related personnel according to the analysis result, so that the related personnel can review whether to execute the operation and maintenance solution;
and an operation and maintenance module 30, configured to perform automated operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution when the related personnel agree to execute it.
In this embodiment, the operation and maintenance index monitoring item of the target equipment is acquired. If an abnormality is detected in the monitoring item, the preset operation and maintenance robot analyzes the abnormality and outputs an operation and maintenance solution to the related personnel according to the analysis result, so that they can review whether to execute it. If the related personnel agree, the operation and maintenance robot performs automated operation and maintenance on the target equipment according to the solution. In other words, the robot analyzes the abnormality and proposes a solution, and the target equipment is maintained automatically once the related personnel approve, which improves operation and maintenance efficiency.
It should be noted that each module in the above apparatus may be used to implement each step in the above method, and achieve a corresponding technical effect, which is not described herein again.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a device in the hardware operating environment according to an embodiment of the present application.
As shown in fig. 10, the apparatus may include a processor 1001 (such as a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication among these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 does not limit the apparatus, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 10, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an automated operation and maintenance program.
In the device shown in fig. 10, the network interface 1004 is mainly used for data communication with an external network; the user interface 1003 is mainly used for receiving an input instruction of a user; the device invokes, via the processor 1001, an automated operation and maintenance program stored in the memory 1005 and performs the following operations:
acquiring an operation and maintenance index monitoring item of target equipment;
if an abnormality is detected in the operation and maintenance index monitoring item, analyzing the abnormality through a preset operation and maintenance robot, and outputting an operation and maintenance solution to the related personnel according to the analysis result, so that the related personnel can review whether to execute the operation and maintenance solution;
and under the condition that the related personnel agree to execute the operation and maintenance solution, carrying out automatic operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution.
Further, the processor 1001 may call an automated operation and maintenance program stored in the memory 1005, and further perform the following operations:
if the operation and maintenance index monitoring item is monitored to be abnormal, outputting alarm information to a preset operation and maintenance robot, extracting abnormal characteristic information of the operation and maintenance index monitoring item corresponding to the alarm information through the operation and maintenance robot, and analyzing and obtaining fault information corresponding to the abnormal characteristic information;
and querying an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database through the operation and maintenance robot, and outputting the operation and maintenance solution to related personnel.
Further, the processor 1001 may call an automated operation and maintenance program stored in the memory 1005, and further perform the following operations:
acquiring historical operation and maintenance data;
determining corresponding fault content in the historical operation and maintenance data and an operation and maintenance solution for solving the fault content;
extracting the characteristics of the fault content, and generating a characteristic label according to the extracted characteristics, wherein the characteristic label, the fault content and the operation and maintenance solution have a mapping relation;
and generating an operation and maintenance database according to the feature labels, the fault content and the operation and maintenance solution.
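The database construction steps above can be illustrated with a short sketch. It assumes that historical operation and maintenance data arrive as (fault content, solution) pairs and that a feature tag is simply the set of known keywords found in the fault text; the keyword list, the `Entry` type, and both function names are hypothetical, since the description does not specify how features are extracted.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Entry:
    fault: str
    solution: str


# Hypothetical keyword vocabulary used as the "features" of a fault description.
KEYWORDS = ("cpu", "memory", "disk", "network", "timeout")


def feature_tag(fault: str) -> FrozenSet[str]:
    """Derive a feature tag as the set of known keywords found in the text."""
    text = fault.lower()
    return frozenset(k for k in KEYWORDS if k in text)


def build_database(
    history: Iterable[Tuple[str, str]],
) -> Dict[FrozenSet[str], List[Entry]]:
    """Map each feature tag to the fault/solution pairs that produced it."""
    db: Dict[FrozenSet[str], List[Entry]] = {}
    for fault, solution in history:
        db.setdefault(feature_tag(fault), []).append(Entry(fault, solution))
    return db
```

Keying the dictionary by the tag preserves the mapping among feature tag, fault content, and solution that the steps require.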
Further, the processor 1001 may call an automated operation and maintenance program stored in the memory 1005, and further perform the following operations:
querying, through the operation and maintenance robot, whether a feature tag matching the abnormal feature information exists in a preset operation and maintenance database;
if such a feature tag exists, determining an operation and maintenance solution corresponding to the fault information according to the feature tag, and outputting the operation and maintenance solution to the related personnel.
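A minimal sketch of this lookup follows, assuming the database is an in-memory dictionary keyed by feature tag. The tag strings, the sample entries, and the `find_solution` name are hypothetical, and returning `None` when no tag matches is an assumption, because the description only covers the case where a match exists.

```python
from typing import Dict, Optional, Tuple

# Hypothetical database: feature tag -> (fault information, solution).
OPS_DB: Dict[str, Tuple[str, str]] = {
    "disk_full": ("disk usage above capacity", "purge rotated logs"),
    "net_down": ("network interruption", "restart network interface"),
}


def find_solution(
    abnormal_feature: str,
    db: Dict[str, Tuple[str, str]] = OPS_DB,
) -> Optional[str]:
    """Return the solution for a matching feature tag, or None if none matches."""
    entry = db.get(abnormal_feature)
    if entry is None:
        return None  # assumption: the no-match case is left to the caller
    _fault, solution = entry
    return solution
```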
Further, the processor 1001 may call an automated operation and maintenance program stored in the memory 1005, and further perform the following operations:
in a preset operation and maintenance database, inquiring an operation and maintenance solution corresponding to the fault information through the operation and maintenance robot, and carrying out solution analysis on the operation and maintenance solution;
predicting the influence condition of the operation and maintenance solution on the target equipment according to the result of the scheme analysis;
and generating approval flows of different levels according to the influence conditions, and outputting the operation and maintenance solution to corresponding related personnel according to the approval flows.
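One way to picture the tiered approval described above is to map a predicted impact level to a chain of approvers. The sketch below is only illustrative: the two inputs used to estimate impact, the cut-off of ten hosts, and the approver roles are all assumptions and do not come from the description.

```python
from enum import IntEnum
from typing import Dict, List


class Impact(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical mapping from predicted impact to the people who must sign off.
APPROVAL_CHAIN: Dict[Impact, List[str]] = {
    Impact.LOW: ["on-call engineer"],
    Impact.MEDIUM: ["on-call engineer", "team lead"],
    Impact.HIGH: ["on-call engineer", "team lead", "operations manager"],
}


def predict_impact(requires_restart: bool, affected_hosts: int) -> Impact:
    """Toy impact estimate from two assumed attributes of a solution."""
    if requires_restart and affected_hosts > 10:
        return Impact.HIGH
    if requires_restart or affected_hosts > 10:
        return Impact.MEDIUM
    return Impact.LOW


def approval_flow(requires_restart: bool, affected_hosts: int) -> List[str]:
    """Return the approvers that the solution must be sent to, in order."""
    return APPROVAL_CHAIN[predict_impact(requires_restart, affected_hosts)]
```

A higher predicted impact lengthens the chain, so low-risk fixes can be confirmed quickly while disruptive ones reach more senior reviewers.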
Further, the processor 1001 may call an automated operation and maintenance program stored in the memory 1005, and further perform the following operations:
acquiring an index threshold corresponding to the operation and maintenance index monitoring item;
comparing the operation and maintenance index monitoring item with the index threshold;
if the operation and maintenance index monitoring item is larger than the corresponding index threshold value, generating alarm information, and converting the alarm information into an operation and maintenance work order through a preset operation and maintenance robot, so that related personnel can confirm the current fault of the target equipment according to the operation and maintenance work order.
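The comparison and work order steps can be sketched as follows. The `WorkOrder` fields and the `check_item` function are hypothetical names chosen for illustration; only the rule that a value larger than its threshold triggers an alarm comes from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkOrder:
    device: str
    item: str
    value: float
    threshold: float
    summary: str


def check_item(
    device: str, item: str, value: float, threshold: float
) -> Optional[WorkOrder]:
    """Return a work order when the value exceeds its threshold, else None."""
    if value <= threshold:
        return None  # within limits: no alarm, no work order
    summary = f"{item} on {device} is {value:g}, above threshold {threshold:g}"
    return WorkOrder(device, item, value, threshold, summary)
```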
Further, the processor 1001 may call an automated operation and maintenance program stored in the memory 1005, and further perform the following operations:
acquiring a historical average performance index of the target equipment in a preset running state;
and determining an index threshold corresponding to the operation and maintenance index monitoring item according to a preset weight, the historical average performance index and a preset calibration index, wherein the preset calibration index is a fixed value index preset by related personnel.
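This passage names the three inputs to the threshold but not how they are combined. As a hedged illustration only, the sketch below assumes a weighted average in which the preset weight balances the historical average performance index against the preset calibration index; both the formula and the `index_threshold` name are assumptions.

```python
def index_threshold(historical_avg: float, calibration: float, weight: float) -> float:
    """Blend the historical average with the fixed calibration index.

    Assumed formula: weight * historical_avg + (1 - weight) * calibration.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * historical_avg + (1.0 - weight) * calibration
```

Under this assumption, a weight near 1 makes the threshold follow the observed behaviour of the equipment, while a weight near 0 keeps it close to the fixed value set by the related personnel.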
As in the method embodiment, the preset operation and maintenance robot analyzes the abnormality in the operation and maintenance index monitoring item of the target equipment and outputs the corresponding operation and maintenance solution to the related personnel; once they approve, the robot performs automated operation and maintenance on the target equipment according to the solution, which improves operation and maintenance efficiency.
In addition, an embodiment of the present application further proposes a computer readable storage medium, on which an automated operation and maintenance program is stored, the automated operation and maintenance program implementing the following operations when executed by a processor:
acquiring an operation and maintenance index monitoring item of target equipment;
if an abnormality is detected in the operation and maintenance index monitoring item, analyzing the abnormality through a preset operation and maintenance robot, and outputting an operation and maintenance solution to the related personnel according to the analysis result, so that the related personnel can review whether to execute the operation and maintenance solution;
and under the condition that the related personnel agree to execute the operation and maintenance solution, carrying out automatic operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution.
As in the method embodiment, the preset operation and maintenance robot analyzes the abnormality in the operation and maintenance index monitoring item of the target equipment and outputs the corresponding operation and maintenance solution to the related personnel; once they approve, the robot performs automated operation and maintenance on the target equipment according to the solution, which improves operation and maintenance efficiency.
It should be noted that, when the program stored on the computer readable storage medium is executed by the processor, each step of the above method may also be implemented and the corresponding technical effects achieved, which are not described again here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for description and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (10)

1. An automated operation and maintenance method, characterized in that the automated operation and maintenance method comprises the following steps:
acquiring an operation and maintenance index monitoring item of target equipment;
if the operation and maintenance index monitoring item is monitored to be abnormal, analyzing the abnormality of the operation and maintenance index monitoring item through a preset operation and maintenance robot, and outputting an operation and maintenance solution to related personnel according to an analysis result so as to allow the related personnel to examine whether the operation and maintenance solution is executed;
and under the condition that the related personnel agree to execute the operation and maintenance solution, carrying out automatic operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution.
2. The automated operation and maintenance method according to claim 1, wherein if it is monitored that the operation and maintenance index monitoring item is abnormal, the step of analyzing the abnormality of the operation and maintenance index monitoring item by a preset operation and maintenance robot and outputting an operation and maintenance solution to a related person according to the result of the analysis comprises:
if the operation and maintenance index monitoring item is monitored to be abnormal, outputting alarm information to a preset operation and maintenance robot, extracting abnormal characteristic information of the operation and maintenance index monitoring item corresponding to the alarm information through the operation and maintenance robot, and analyzing and obtaining fault information corresponding to the abnormal characteristic information;
and querying an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database through the operation and maintenance robot, and outputting the operation and maintenance solution to related personnel.
3. The automated operation and maintenance method according to claim 2, wherein after the steps of querying an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database by the operation and maintenance robot and outputting the operation and maintenance solution to a related person, the method further comprises:
acquiring historical operation and maintenance data;
determining corresponding fault content in the historical operation and maintenance data and an operation and maintenance solution for solving the fault content;
extracting the characteristics of the fault content, and generating a characteristic label according to the extracted characteristics, wherein the characteristic label, the fault content and the operation and maintenance solution have a mapping relation;
and generating an operation and maintenance database according to the feature labels, the fault content and the operation and maintenance solution.
4. The automated operation and maintenance method according to claim 3, wherein the step of querying an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database through the operation and maintenance robot and outputting the operation and maintenance solution to a related person comprises:
querying, through the operation and maintenance robot, whether a feature tag matching the abnormal feature information exists in a preset operation and maintenance database;
if such a feature tag exists, determining an operation and maintenance solution corresponding to the fault information according to the feature tag, and outputting the operation and maintenance solution to related personnel.
5. The automated operation and maintenance method according to claim 2, wherein the steps of querying an operation and maintenance solution corresponding to the fault information in a preset operation and maintenance database by the operation and maintenance robot and outputting the operation and maintenance solution to a related person, further comprise:
in a preset operation and maintenance database, inquiring an operation and maintenance solution corresponding to the fault information through the operation and maintenance robot, and carrying out solution analysis on the operation and maintenance solution;
predicting the influence condition of the operation and maintenance solution on the target equipment according to the result of the scheme analysis;
and generating approval flows of different levels according to the influence conditions, and outputting the operation and maintenance solution to corresponding related personnel according to the approval flows.
6. The automated operation and maintenance method according to claim 1, wherein after the step of obtaining the operation and maintenance index monitoring item of the target device, the method further comprises:
acquiring an index threshold corresponding to the operation and maintenance index monitoring item;
comparing the operation and maintenance index monitoring item with the index threshold;
if the operation and maintenance index monitoring item is larger than the corresponding index threshold value, generating alarm information, and converting the alarm information into an operation and maintenance work order through a preset operation and maintenance robot, so that related personnel can confirm the current fault of the target equipment according to the operation and maintenance work order.
7. The automated operation and maintenance method of claim 6, wherein the step of obtaining the index threshold corresponding to the operation and maintenance index monitoring item comprises:
acquiring a historical average performance index of the target equipment in a preset running state;
and determining an index threshold corresponding to the operation and maintenance index monitoring item according to a preset weight, the historical average performance index and a preset calibration index, wherein the preset calibration index is a fixed value index preset by related personnel.
8. An automated operation and maintenance device, characterized in that the automated operation and maintenance device comprises:
the acquisition module is used for acquiring the operation and maintenance index monitoring item of the target equipment;
the judging module is used for analyzing the abnormality of the operation and maintenance index monitoring item through a preset operation and maintenance robot if the abnormality of the operation and maintenance index monitoring item is monitored, and outputting an operation and maintenance solution to related personnel according to the analysis result so as to allow the related personnel to examine whether the operation and maintenance solution is executed or not;
and the operation and maintenance module is used for carrying out automatic operation and maintenance on the target equipment through the operation and maintenance robot according to the operation and maintenance solution under the condition that the related personnel agree to execute the operation and maintenance solution.
9. An automated operation and maintenance device, characterized in that the automated operation and maintenance device comprises: a memory, a processor, and an automated operation and maintenance program stored on the memory and executable on the processor, the automated operation and maintenance program configured to implement the steps of the automated operation and maintenance method of any one of claims 1 to 7.
10. A storage medium, wherein a program for realizing the automated operation and maintenance method is stored on the storage medium, and the program for realizing the automated operation and maintenance method is executed by a processor to realize the steps of the automated operation and maintenance method according to any one of claims 1 to 7.
CN202311504260.8A 2023-11-10 2023-11-10 Automatic operation and maintenance method, device, equipment and storage medium Pending CN117455458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311504260.8A CN117455458A (en) 2023-11-10 2023-11-10 Automatic operation and maintenance method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311504260.8A CN117455458A (en) 2023-11-10 2023-11-10 Automatic operation and maintenance method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117455458A true CN117455458A (en) 2024-01-26

Family

ID=89590836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311504260.8A Pending CN117455458A (en) 2023-11-10 2023-11-10 Automatic operation and maintenance method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117455458A (en)

Similar Documents

Publication Publication Date Title
US9275172B2 (en) Systems and methods for analyzing performance of virtual environments
US9548886B2 (en) Help desk ticket tracking integration with root cause analysis
US9413597B2 (en) Method and system for providing aggregated network alarms
US10536348B2 (en) Operational micro-services design, development, deployment
CN107534570A (en) Virtualize network function monitoring
CN110581773A (en) automatic service monitoring and alarm management system
KR20180068002A (en) Cloud infra real time analysis system based on big date and the providing method thereof
CN111124830B (en) Micro-service monitoring method and device
US20140023185A1 (en) Characterizing Time-Bounded Incident Management Systems
US10372572B1 (en) Prediction model testing framework
US11743237B2 (en) Utilizing machine learning models to determine customer care actions for telecommunications network providers
CN115860729A (en) IT operation and maintenance integrated management system
CN107566172B (en) Active management method and system based on storage system
US20220182851A1 (en) Communication Method and Apparatus for Plurality of Administrative Domains
US20210184925A1 (en) Model-driven technique for virtual network function rehoming for service chains
US20210263718A1 (en) Generating predictive metrics for virtualized deployments
CN103326880B (en) Genesys calling system high availability cloud computing monitoring system and method
CN117455458A (en) Automatic operation and maintenance method, device, equipment and storage medium
CN115080363A (en) System capacity evaluation method and device based on service log
CN112416719B (en) Monitoring processing method, system, equipment and storage medium for database container
Mormul et al. Dear: Distributed evaluation of alerting rules
CN108123821B (en) Data analysis method and device
CN104883273A (en) Method and system for processing service influence model in virtualized service management platform
KR20200063343A (en) System and method for managing operaiton in trust reality viewpointing networking infrastucture
CN109766238A (en) Operation platform method for monitoring performance, device and relevant device based on session number

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination