CN111865673A - Automatic fault management method, device and system - Google Patents

Automatic fault management method, device and system

Publication number
CN111865673A
Authority
CN
China
Prior art keywords
fault
troubleshooting
recovery plan
work order
executing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010652478.8A
Other languages
Chinese (zh)
Inventor
杨微
何俊敏
易玉凤
马兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yanxi Software Information Technology Co ltd
Original Assignee
Shanghai Yanxi Software Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yanxi Software Information Technology Co ltd
Priority to CN202010652478.8A
Publication of CN111865673A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0677 Localisation of faults
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0876 Aspects of the degree of configuration automation
    • H04L41/0886 Fully automatic configuration

Abstract

The invention discloses an automatic fault management method, device and system. The method first identifies a system fault from received fault prompt information and triggers a fault work order of the corresponding dimension. It then generates a plurality of troubleshooting tasks in that dimension according to the fault work order and executes them in parallel to locate the fault point. Finally, it searches a preset recovery plan matching relation for a recovery plan matching the fault point and executes that plan to repair the system fault.

Description

Automatic fault management method, device and system
Technical Field
The invention relates to the technical field of information system operation and maintenance, in particular to an automatic fault management method, device and system.
Background
Online fault management is particularly important in the daily operation and maintenance of an IT system, and it tests both technical skill and timeliness.
In the conventional online fault management process at the present stage, the full chain from fault identification to fault recovery takes too long, and if the root cause of the fault cannot be found and repaired in one pass within a short time, the total fault time risks multiplying. Service interruption caused by a system fault is often unacceptable to an enterprise: it may mean the loss of a large number of orders or customers, and in extreme cases may cause adverse social effects.
Therefore, a method for quickly and accurately finding and processing faults is needed.
Disclosure of Invention
In order to solve the technical problems, the invention provides an automatic fault management method, device and system, which realize full automation when processing system faults and improve the accuracy and timeliness of fault processing.
The technical scheme provided by the invention is as follows:
In a first aspect, an automated fault management method is provided, the method at least comprising the following steps:
identifying a system fault according to received fault prompt information and triggering a fault work order of the corresponding dimension;
generating a plurality of troubleshooting tasks in the corresponding dimension according to the fault work order, and executing the troubleshooting tasks in parallel to locate a fault point;
and searching a preset recovery plan matching relation for a recovery plan matching the fault point, and executing the recovery plan to repair the system fault.
In some preferred embodiments, the identifying a system fault according to the received fault prompt information and triggering a fault work order of a corresponding dimension includes the following sub-steps:
receiving at least one fault prompt message of multidimensional monitoring alarm information or manual alarm information;
and generating a fault work order with corresponding dimensionality according to the multi-dimensionality alarm information and the manual alarm information based on a pre-constructed fault work order triggering model.
In some preferred embodiments, when the fault prompting information is manual alarm information, before generating a fault work order of a corresponding dimension according to the multidimensional alarm information and the manual alarm information based on a pre-constructed fault work order trigger model, the method further includes:
processing the received manual alarm information with a natural language processing algorithm to obtain a semantic analysis result;
searching a pre-constructed fault classification table for an alarm dimension matching the semantic analysis result and, if one exists, marking the manual alarm information with that alarm dimension;
if not, marking the manual alarm information with the general alarm dimension.
In some preferred embodiments, generating a plurality of troubleshooting tasks in the corresponding dimension according to the fault work order, and executing the troubleshooting tasks in parallel to locate the fault point, includes the following sub-steps:
based on a preset troubleshooting task association relation, searching the corresponding dimension for a plurality of troubleshooting tasks matching the fault work order according to the fault information of the fault work order;
executing the plurality of troubleshooting tasks in parallel and obtaining a corresponding number of troubleshooting results;
and obtaining the fault point from the troubleshooting results based on a preset association relation between troubleshooting results and fault points.
In some preferred embodiments, searching the preset recovery plan matching relation for a recovery plan matching the fault point, and executing the recovery plan to repair the system fault, includes the following sub-steps:
based on the preset recovery plan matching relation, searching for a plurality of recovery plans matching the fault point, together with their priority order;
executing the optimal recovery plan to repair the system fault and obtaining a repair result;
and judging the repair result; if the fault is not repaired, executing the next-best recovery plan, and so on until the fault is repaired, and then clearing the fault work order.
In some preferred embodiments, after the searching for the recovery plan matching the failure point in the preset recovery plan matching relationship, and executing the recovery plan to repair the system failure, the method further includes:
acquiring a recovery plan adopted by repairing a fault point;
judging whether the recovery plan is the optimal recovery plan matched with the fault point in the recovery plan matching relationship;
if not, optimizing the recovery plan matching relationship.
In some preferred embodiments, after generating the plurality of troubleshooting tasks in the corresponding dimension according to the fault work order and executing them in parallel to locate the fault point, the method further includes pushing a substitute plan corresponding to the fault work order to accessing users, specifically comprising the following sub-steps:
searching a preset fault substitute plan relation for a plurality of substitute plans matching the fault work order;
and pushing the substitute plan information to the user side when a user accesses a link related to the fault work order.
In a second aspect, an automated fault management apparatus is provided, the apparatus comprising at least:
the fault work order triggering module is used for identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
The fault point positioning module is used for generating a plurality of troubleshooting tasks in corresponding dimensionalities according to the fault work order and executing the troubleshooting tasks in parallel to position fault points;
and the fault repairing module is used for searching a recovery plan matched with the fault point in a preset recovery plan matching relation and executing the recovery plan to repair the system fault.
In some preferred embodiments, the fault work order triggering module includes:
the receiving unit is used for receiving at least one kind of fault prompt information among multi-dimensional monitoring alarm information and manual alarm information;
and the generating unit is used for generating a fault work order of the corresponding dimension according to the multi-dimensional monitoring alarm information and the manual alarm information based on a pre-constructed fault work order triggering model.
In a third aspect, there is provided a computer system comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
identifying a system fault according to received fault prompt information and triggering a fault work order of the corresponding dimension;
generating a plurality of troubleshooting tasks in the corresponding dimension according to the fault work order, and executing the troubleshooting tasks in parallel to locate a fault point;
and searching a preset recovery plan matching relation for a recovery plan matching the fault point, and executing the recovery plan to repair the system fault.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides an automatic fault management method, device and system. The method first identifies a system fault from received fault prompt information and triggers a fault work order of the corresponding dimension, then generates a plurality of troubleshooting tasks in that dimension according to the fault work order and executes them in parallel to locate the fault point, and finally searches a preset recovery plan matching relation for a recovery plan matching the fault point and executes it to repair the system fault. The method automates the whole fault handling process and avoids the influence of human subjective factors, so that fault identification, location and repair are more accurate and efficient. Parallel troubleshooting further shortens troubleshooting time, makes troubleshooting more comprehensive and improves fault management efficiency;
Furthermore, the fault prompt information comprises at least one of multi-dimensional monitoring alarm information and manual alarm information. Faults are divided into different dimensions according to fault type, and each dimension is monitored separately to trigger fault work orders, which greatly improves the sensitivity of fault identification, shortens fault reporting and handling time, avoids the risk of the fault time multiplying, and improves user experience. In addition, the fault type is known as soon as a fault work order is triggered, so only the corresponding dimension needs to be checked; this narrows the troubleshooting scope and improves the accuracy and timeliness of fault location. Combining multi-dimensional monitoring alarm information with manual alarm information for fault identification covers all system faults and avoids omissions;
Furthermore, after the fault work order is triggered, a preset substitute plan corresponding to the fault work order is pushed to accessing users. While waiting for the system to recover, users can still achieve the system function by executing the substitute plan, which effectively avoids service interruption and the unnecessary pressure on services and traffic caused by users repeating operations, and improves the user experience.
The solution of the present application need only achieve any one of the above technical effects.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of an automated fault management method according to a first embodiment of the present invention;
FIG. 2 is a block diagram of an automated fault management apparatus according to a second embodiment of the present invention;
FIG. 3 is a diagram of a computer system architecture in a third embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the current online fault management process, fault handling, location and repair all require some participation by handling personnel and rely on their experience. The process therefore depends heavily on people, is strongly affected by subjective judgment, and handles faults with poor timeliness. This embodiment therefore provides an automatic fault management method, device and system that can effectively overcome these problems.
The following describes the automated fault management method, apparatus and system with reference to the embodiments and accompanying drawings 1-3.
Embodiment one
Referring to fig. 1, the present embodiment provides an automated fault management method, which at least includes the following steps:
S1, identifying a system fault according to received fault prompt information and triggering a fault work order of the corresponding dimension.
Specifically, S1 includes at least the following sub-steps:
S11, receiving at least one kind of fault prompt information among multi-dimensional monitoring alarm information and manual alarm information;
S12, generating a fault work order of the corresponding dimension according to the multi-dimensional monitoring alarm information and the manual alarm information based on a pre-constructed fault work order triggering model.
The fault prompt information comprises at least one of multi-dimensional monitoring alarm information triggered by an alarm platform and manual alarm information.
In this embodiment, faults are divided into different dimensions according to fault type. The alarm platform monitors the fault dimensions with a high probability of being triggered and sends monitoring alarm information for the corresponding dimension. When a user triggers a fault outside the dimensions covered by the alarm platform, the system fault is reported by generating manual alarm information.
Illustratively, when the fault management method is applied to a logistics management system, the alarm platform monitors three dimensions: service layer running data, link layer running data and background running data. When the running data of one of these dimensions exceeds its alarm threshold, an alarm is triggered and service alarm information (STP), link monitoring alarm information (TRO) or background monitoring alarm information (AOPS) is generated; these constitute the multi-dimensional monitoring alarm information. When running data in a dimension other than these three is abnormal, a manual report is made to generate manual alarm information.
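The threshold check in the logistics example can be sketched as follows. This is only an illustration under stated assumptions: the alarm codes STP, TRO and AOPS come from the text, while the dimension keys, the `MonitoringAlarm` type, the `check_running_data` function and every threshold value are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass

# Alarm codes are taken from the text; the threshold values are invented for illustration.
ALARM_CODES = {"service": "STP", "link": "TRO", "background": "AOPS"}
THRESHOLDS = {"service": 0.05, "link": 800.0, "background": 0.90}  # hypothetical limits


@dataclass(frozen=True)
class MonitoringAlarm:
    dimension: str  # monitored dimension that raised the alarm
    code: str       # STP, TRO or AOPS
    value: float    # reading that exceeded the threshold


def check_running_data(readings: dict[str, float]) -> list[MonitoringAlarm]:
    """Return one alarm per monitored dimension whose reading exceeds its threshold."""
    alarms = []
    for dimension, value in readings.items():
        limit = THRESHOLDS.get(dimension)
        # Dimensions the platform does not monitor are left to manual reporting.
        if limit is not None and value > limit:
            alarms.append(MonitoringAlarm(dimension, ALARM_CODES[dimension], value))
    return alarms


alarms = check_running_data({"service": 0.12, "link": 350.0, "background": 0.95})
print([a.code for a in alarms])  # ['STP', 'AOPS']
```

Readings for dimensions outside the three monitored ones produce no alarm here, which mirrors the text's fallback to manual alarm information.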
Faults are divided into different dimensions according to fault type, and each dimension is monitored separately to trigger fault work orders, which greatly improves the sensitivity of fault identification, shortens fault reporting and handling time, avoids the risk of the fault time multiplying, and improves user experience. In addition, the fault type is known as soon as a fault work order is triggered, so only the corresponding dimension needs to be checked; this narrows the troubleshooting scope and improves the accuracy and timeliness of fault location. Combining multi-dimensional monitoring alarm information with manual alarm information for fault identification covers all system faults and avoids omissions.
Manual alarm information is a fault report submitted to the service desk by individual users. It contains only the user's general text description of the fault and must be processed further before it can trigger a work order of the corresponding dimension. Therefore, when the fault prompt information is manual alarm information, the method further includes, before step S12:
Sa1, processing the received manual alarm information with a natural language processing (NLP) algorithm to obtain a semantic analysis result;
Sa2, searching a pre-constructed fault classification table for an alarm dimension matching the semantic analysis result; if one exists, marking the manual alarm information with that alarm dimension; if not, marking it with the general alarm dimension.
Semantic analysis with natural language processing (NLP) algorithms is a common and mature technical means in the art and is not described further in this embodiment.
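Steps Sa1 and Sa2 can be sketched as a lookup with a fallback. The text does not specify the NLP algorithm, so the sketch swaps in plain keyword tokenization as a stand-in; the table contents and the names `analyze_semantics` and `mark_alarm_dimension` are hypothetical.

```python
from __future__ import annotations

import re

# Hypothetical fault classification table: keyword -> alarm dimension.
FAULT_CLASSIFICATION_TABLE = {
    "order": "service",
    "payment": "service",
    "timeout": "link",
    "database": "background",
}
GENERAL_DIMENSION = "general"


def analyze_semantics(report: str) -> set[str]:
    """Stand-in for the NLP step (Sa1): lowercase the text and extract word tokens."""
    return set(re.findall(r"[a-z]+", report.lower()))


def mark_alarm_dimension(report: str) -> str:
    """Sa2: look the analysis result up in the table, else use the general dimension."""
    tokens = analyze_semantics(report)
    for keyword, dimension in FAULT_CLASSIFICATION_TABLE.items():
        if keyword in tokens:
            return dimension
    return GENERAL_DIMENSION


print(mark_alarm_dimension("Requests keep hitting a timeout"))  # link
print(mark_alarm_dimension("The screen looks strange"))         # general
```

A report that matches nothing in the table still gets a dimension, so a manual report always yields a work order.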
Before receiving the fault prompt information, the method further includes constructing the fault work order triggering model in advance. The model takes a binary decision tree as its basis and is trained on past fault data samples, with the fault prompt information and its dimension as input and the link node related to the content of the fault prompt information as output.
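How a binary decision tree maps an alarm to a link node can be sketched as below. The tree here is written by hand purely to show the traversal; the text describes a model trained on past fault samples, and the tests, link node names and the `TreeNode` and `predict_link_node` helpers are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TreeNode:
    """Internal nodes carry a yes/no test; leaves carry the link node for the work order."""
    test: Optional[Callable[[dict], bool]] = None
    yes: Optional["TreeNode"] = None
    no: Optional["TreeNode"] = None
    link_node: Optional[str] = None


def predict_link_node(node: TreeNode, alarm: dict) -> str:
    """Walk from the root to a leaf, branching on each node's test."""
    while node.link_node is None:
        node = node.yes if node.test(alarm) else node.no
    return node.link_node


# Hand-written tree with hypothetical tests and link nodes; a real model would be trained.
tree = TreeNode(
    test=lambda a: a["dimension"] == "service",
    yes=TreeNode(
        test=lambda a: "payment" in a["message"],
        yes=TreeNode(link_node="payment-gateway"),
        no=TreeNode(link_node="order-service"),
    ),
    no=TreeNode(link_node="general-triage"),
)

alarm = {"dimension": "service", "message": "payment callback failed"}
work_order = {"dimension": alarm["dimension"], "link_node": predict_link_node(tree, alarm)}
print(work_order)  # {'dimension': 'service', 'link_node': 'payment-gateway'}
```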
After the fault work order is triggered, fault handling usually takes some time, during which users cannot use the affected functions. As a preferred embodiment, the method therefore further includes Sb: pushing a substitute plan corresponding to the fault work order to accessing users, specifically comprising the following sub-steps:
Sb1, searching a preset fault substitute plan relation for a plurality of substitute plans matching the fault work order;
Sb2, pushing the substitute plan information to the user side when a user accesses a link related to the fault work order.
Accordingly, the method further includes pre-constructing the fault substitute plan relation that associates fault work orders with substitute plans. It is constructed as follows: historical fault scenarios and the system functions each fault affected are reviewed, a substitute plan is preset for each affected system function, and an association relation between faults and substitute plans is formed. When a user triggers the fault, the substitute plan is automatically sent to that user.
For example, in the fault scenario where the WeChat scan-code ordering function is abnormal, the affected service is bulk order placement. For this scenario, the substitute plan found in the pre-constructed fault substitute plan relation is to place the order with a payment instrument, and the substitute plan suggestion is sent to the user once the substitute plan is confirmed to be available.
When a fault has never occurred before and no corresponding substitute plan exists in the fault substitute plan relation, the fault and its substitute plan are added to the relation after a substitute plan has been received and effectively executed.
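The substitute plan relation and its update rule can be sketched as a small mapping. The scenario key, the `plans_for_access` and `record_effective_plan` functions and the availability check are hypothetical; only the WeChat ordering example and its payment-instrument substitute come from the text.

```python
from __future__ import annotations

from collections import defaultdict

# Hypothetical relation: fault scenario -> substitute plans for the affected function.
substitute_plans: dict[str, list[str]] = defaultdict(list)
substitute_plans["wechat_scan_order_abnormal"].append("place the order with a payment instrument")


def plans_for_access(fault_scenario: str, is_available=lambda plan: True) -> list[str]:
    """Sb1/Sb2: return the available substitute plans to push when a user opens an affected link."""
    return [p for p in substitute_plans.get(fault_scenario, []) if is_available(p)]


def record_effective_plan(fault_scenario: str, plan: str) -> None:
    """Add a plan for a previously unseen fault once it has been executed effectively."""
    if plan not in substitute_plans[fault_scenario]:
        substitute_plans[fault_scenario].append(plan)


print(plans_for_access("wechat_scan_order_abnormal"))
```

The `is_available` parameter reflects the text's condition that a suggestion is only sent once the substitute plan is confirmed to be usable.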
Thus, after the fault work order is triggered, a preset substitute plan corresponding to the fault work order is pushed to accessing users. While waiting for the system to recover, users can still achieve the system function by executing the substitute plan, which effectively avoids service interruption and the unnecessary pressure on services and traffic caused by users repeating operations, and improves the user experience.
S2, generating a plurality of troubleshooting tasks in the corresponding dimension according to the fault work order, and executing the troubleshooting tasks in parallel to locate the fault point. Specifically, S2 includes at least the following sub-steps:
S21, based on a preset troubleshooting task association relation, searching the corresponding dimension for a plurality of troubleshooting tasks matching the fault work order according to the fault information of the fault work order.
S22, executing the plurality of troubleshooting tasks in parallel and obtaining a corresponding number of troubleshooting results.
Executing a troubleshooting task specifically comprises the following sub-steps:
S221, acquiring the node indexes associated with each sub-dimension in the dimension under investigation;
S222, judging whether each node index is within its preset threshold range and, if not, reporting a fault in that sub-dimension.
S23, obtaining the fault point from the troubleshooting results based on a preset association relation between troubleshooting results and fault points.
In this embodiment, each fault dimension includes a plurality of sub-dimensions representing different fault positions. After a fault work order belonging to a certain dimension is triggered, all sub-dimensions in that dimension need to be checked so that the fault point is located comprehensively and as quickly as possible. Parallel execution means that all troubleshooting tasks are completed within a certain time threshold. Compared with conventional serial execution, it effectively shortens troubleshooting time, makes troubleshooting more comprehensive and improves fault management efficiency.
Taking fault management of the logistics management system as an example, the sub-dimensions of the troubleshooting dimension include storage, application, DBA, development, machine room, data center network, campus network and so on. The troubleshooting tasks of all sub-dimensions are executed in parallel and the corresponding troubleshooting results are obtained. The troubleshooting mainly judges whether the running states of the application container, the middleware, the database and so on are normal.
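Steps S22, S221 and S222 can be sketched with a thread pool. The sub-dimension names follow the logistics example, but the node indexes, their values and ranges, the timeout and the `troubleshoot` and `run_in_parallel` functions are hypothetical.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor

# Hypothetical node indexes per sub-dimension, each stored as (value, low, high).
NODE_INDEXES = {
    "storage": {"disk_usage": (0.97, 0.0, 0.90)},
    "application": {"container_restarts": (0, 0, 3)},
    "DBA": {"active_connections": (120, 0, 500)},
    "data center network": {"packet_loss": (0.001, 0.0, 0.01)},
}


def troubleshoot(sub_dimension: str) -> tuple[str, list[str]]:
    """S221/S222: read the sub-dimension's node indexes and list those outside their range."""
    out_of_range = [
        name
        for name, (value, low, high) in NODE_INDEXES[sub_dimension].items()
        if not low <= value <= high
    ]
    return sub_dimension, out_of_range


def run_in_parallel(sub_dimensions: list[str], timeout_s: float = 30.0) -> dict[str, list[str]]:
    """S22: run every troubleshooting task concurrently and gather one result per task."""
    with ThreadPoolExecutor(max_workers=max(1, len(sub_dimensions))) as pool:
        # The timeout models the time threshold within which all tasks must finish.
        return dict(pool.map(troubleshoot, sub_dimensions, timeout=timeout_s))


results = run_in_parallel(list(NODE_INDEXES))
print({k: v for k, v in results.items() if v})  # {'storage': ['disk_usage']}
```

Threads suit this sketch because real checks would mostly wait on network or database calls; a process pool or an async client would be equally valid choices.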
Likewise, the method further includes pre-constructing the association relation between fault work orders and troubleshooting tasks and the association relation between troubleshooting results and fault points. Each of the two relations can be built manually or by machine learning from past fault data samples, and can be implemented with ordinary classification labels or mapping relations; this is a mature technical means and is not described further.
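The mapping-relation variant mentioned above can be sketched for step S23 as a dictionary lookup. The keys, fault point names and the `locate_fault_points` function are hypothetical, and a learned classifier could replace the dictionary.

```python
from __future__ import annotations

# Hypothetical association: (sub-dimension, abnormal node index) -> fault point.
RESULT_TO_FAULT_POINT = {
    ("storage", "disk_usage"): "storage volume full",
    ("DBA", "active_connections"): "database connection pool exhausted",
}


def locate_fault_points(results: dict[str, list[str]]) -> list[str]:
    """S23: translate troubleshooting results into fault points via the preset association."""
    fault_points = []
    for sub_dimension, abnormal_indexes in results.items():
        for index in abnormal_indexes:
            point = RESULT_TO_FAULT_POINT.get((sub_dimension, index))
            if point is not None and point not in fault_points:
                fault_points.append(point)
    return fault_points


print(locate_fault_points({"storage": ["disk_usage"], "application": []}))
```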
S3, searching a preset recovery plan matching relation for a recovery plan matching the fault point, and executing the recovery plan to repair the system fault.
Specifically, S3 includes at least the following sub-steps:
S31, based on the preset recovery plan matching relation, searching for a plurality of recovery plans matching the fault point, together with their priority order;
S32, executing the optimal recovery plan to repair the system fault and obtaining a repair result;
S33, judging the repair result; if the fault is not repaired, executing the next-best recovery plan, and so on until the fault is repaired, and then clearing the fault work order.
Illustratively, recovery plans include, but are not limited to, a system restart or a version rollback. Considering that the fault has usually existed only for a short time, the rollback is usually to the version of 24 hours earlier (24H rollback).
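The priority-ordered fallback in S31 to S33 can be sketched as a loop over ranked plans. The restart and 24H rollback plans come from the text, but their stub outcomes, the fault point name, the `RECOVERY_PLANS` table and the `repair` function are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Optional


def restart_system() -> bool:
    return False  # stub: pretend the restart did not clear the fault


def roll_back_24h() -> bool:
    return True  # stub: pretend the rollback repaired the fault


# Hypothetical matching relation: fault point -> recovery plans, best first.
RECOVERY_PLANS: dict[str, list[tuple[str, Callable[[], bool]]]] = {
    "order-service": [("restart", restart_system), ("rollback_24h", roll_back_24h)],
}


def repair(fault_point: str) -> Optional[str]:
    """S31-S33: run plans in priority order and return the name of the one that worked."""
    for name, action in RECOVERY_PLANS.get(fault_point, []):
        if action():
            return name  # repaired: the caller can now clear the fault work order
    return None  # every plan failed, or no plan matched the fault point


print(repair("order-service"))  # rollback_24h
```

Returning the name of the plan that succeeded keeps the information needed for a later review of the matching relation.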
Similarly, after the fault handling, the method further includes:
S4, reviewing and optimizing the fault handling process, S4 comprising the following sub-steps:
S41, acquiring the recovery plan that was used to repair the fault point;
S42, judging whether that recovery plan is the optimal recovery plan matched with the fault point in the recovery plan matching relation; if not, optimizing the recovery plan matching relation.
Thus, each time a system fault occurs and is repaired, it serves as a new sample for optimizing the recovery plan matching relation.
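One simple way to realize S41 and S42 is to promote the plan that actually repaired the fault to the top of the ranking. The text does not say how the matching relation is optimized, so this promotion rule and the `optimize_matching` function are assumptions made for illustration.

```python
from __future__ import annotations


def optimize_matching(matching: dict[str, list[str]], fault_point: str, used_plan: str) -> bool:
    """S41/S42: if the plan that fixed the fault was not ranked first, promote it.

    Returns True when the matching relation was changed.
    """
    plans = matching.setdefault(fault_point, [])
    if plans and plans[0] == used_plan:
        return False  # the optimal plan was already the one that worked
    if used_plan in plans:
        plans.remove(used_plan)
    plans.insert(0, used_plan)
    return True


matching = {"order-service": ["restart", "rollback_24h"]}
changed = optimize_matching(matching, "order-service", "rollback_24h")
print(changed, matching)  # True {'order-service': ['rollback_24h', 'restart']}
```

A production system would more likely update success statistics per plan and re-rank on those, rather than reorder after a single incident.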
The automatic fault management method described in this embodiment automates the whole fault handling process and avoids the influence of human subjective factors, so that fault identification, location and repair are more accurate and efficient. Parallel troubleshooting further shortens troubleshooting time, makes troubleshooting more comprehensive and improves fault management efficiency.
Embodiment two
In order to implement the method for automated fault management in the first embodiment, this embodiment provides a corresponding automated fault management apparatus, as shown in fig. 2, the apparatus at least includes:
the fault work order triggering module is used for identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
The fault point positioning module is used for generating a plurality of troubleshooting tasks in corresponding dimensionalities according to the fault work order and executing the troubleshooting tasks in parallel to position fault points;
and the fault repairing module is used for searching a recovery plan matched with the fault point in a preset recovery plan matching relation and executing the recovery plan to repair the system fault.
The fault work order triggering module includes:
the receiving unit is used for receiving at least one kind of fault prompt information among multi-dimensional monitoring alarm information and manual alarm information;
and the generating unit is used for generating a fault work order of the corresponding dimension according to the multi-dimensional monitoring alarm information and the manual alarm information based on a pre-constructed fault work order triggering model.
The fault work order triggering module further includes:
the semantic analysis unit is used for processing the received manual alarm information with a natural language processing algorithm to obtain a semantic analysis result;
the first searching unit is used for searching a pre-constructed fault classification table for an alarm dimension matching the semantic analysis result; if one exists, marking the manual alarm information with that alarm dimension; if not, marking it with the general alarm dimension.
The fault point positioning module comprises:
the second searching unit is used for searching a plurality of troubleshooting tasks matched with the troubleshooting work order in corresponding dimensionality according to the fault information of the troubleshooting work order based on the preset troubleshooting task incidence relation;
the execution unit is used for executing the plurality of troubleshooting tasks in parallel and obtaining troubleshooting results with corresponding quantity;
and the fault point unit is used for acquiring a fault point according to the fault troubleshooting result based on the preset correlation between the troubleshooting result and the fault point.
The fault repair module includes:
the third searching unit is used for searching, based on a preset recovery plan matching relationship and according to the fault point, a plurality of recovery plans matched with the fault point together with their priority ranking;
the repair unit is used for executing the optimal recovery plan to repair the system fault and obtain a repair result;
and the first judgment unit is used for judging the repair result; if the fault is not repaired, the next-best recovery plan is executed, and so on until the fault is repaired, after which the fault work order is closed.
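The priority-ordered fallback performed by the fault repair module can be sketched as a simple loop. The representation below is an assumption for illustration: each recovery plan is a `(priority, name, action)` tuple in which a lower number means a higher priority and `action` returns `True` when the repair succeeded. The function name `repair` and the `status` field are hypothetical.

```python
from __future__ import annotations

from typing import Callable, Optional

# (priority, plan name, action); lower priority number = tried earlier.
RecoveryPlan = tuple[int, str, Callable[[], bool]]


def repair(
    fault_point: str,
    matching: dict[str, list[RecoveryPlan]],
    order: dict,
) -> Optional[str]:
    """Try the matched recovery plans in priority order until one succeeds.

    Returns the name of the plan that repaired the fault and closes the
    work order, or returns None and leaves the order open if none worked.
    """
    plans = sorted(matching.get(fault_point, []), key=lambda plan: plan[0])
    for _, name, action in plans:
        if action():
            order["status"] = "closed"
            return name
    return None
```

For example, with a "restart" plan at priority 1 that fails and a "fail over" plan at priority 2 that succeeds, the function returns "fail over" and marks the order closed. Returning the plan name makes it available to a later review step.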
The device also includes:
the review and optimization module is used for reviewing and optimizing the fault handling process, and comprises:
the acquisition unit is used for acquiring the recovery plan adopted to repair the fault point;
the second judging unit is used for judging whether that recovery plan is the optimal recovery plan matched with the fault point in the recovery plan matching relationship; if not, the recovery plan matching relationship is optimized.
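The application does not say how the matching relationship is optimized. One possible reading, shown here purely as an assumption, is to promote the plan that actually repaired the fault to the top of the ranking for that fault point. The function name `optimize_matching` and the list-of-names representation are hypothetical.

```python
from __future__ import annotations


def optimize_matching(
    matching: dict[str, list[str]],
    fault_point: str,
    used_plan: str,
) -> bool:
    """Promote used_plan to first place for fault_point when it was not
    already the top-ranked plan. Returns True if the ranking changed."""
    plans = matching.setdefault(fault_point, [])
    if plans and plans[0] == used_plan:
        return False  # The optimal plan was used; nothing to optimize.
    if used_plan in plans:
        plans.remove(used_plan)
    plans.insert(0, used_plan)
    return True
```

Other strategies, such as adjusting a success-rate score per plan, would fit the description equally well; immediate promotion is simply the shortest rule to show.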
The device also includes: the alternative plan module, used for pushing an alternative plan corresponding to the fault work order to accessing users, which comprises:
the fourth searching unit is used for searching a plurality of alternative plans matched with the fault work order in a preset fault alternative plan relation;
and the pushing unit is used for pushing the alternative plan information to the user side when a user accesses a link related to the fault work order.
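A minimal sketch of the alternative plan module follows. The work order identifier `WO-1`, the link paths, the message texts, and both lookup tables are invented for illustration; the description only states that alternative plans are looked up per fault work order and pushed when a user accesses a related link.

```python
from __future__ import annotations

# Assumed fault alternative plan relation: work order id -> alternative plans.
ALTERNATIVE_PLANS = {
    "WO-1": ["Use the mobile app", "Retry after 10 minutes"],
}

# Assumed mapping of work orders to the links affected by the fault.
LINKS_BY_WORK_ORDER = {
    "WO-1": {"/checkout", "/pay"},
}


def alternatives_for_request(path: str, open_orders: list[str]) -> list[str]:
    """Collect the alternative plan messages to push to a user who
    accesses `path` while the given fault work orders are still open."""
    messages: list[str] = []
    for order_id in open_orders:
        if path in LINKS_BY_WORK_ORDER.get(order_id, set()):
            messages.extend(ALTERNATIVE_PLANS.get(order_id, []))
    return messages
```

A request for `/pay` while `WO-1` is open yields both messages, while a request for an unaffected link yields an empty list, so unaffected users see nothing.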
It should be noted that the division into functional modules described above, for the case in which the automatic fault management device of the foregoing embodiment triggers the automatic fault management service of the first embodiment, is only an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the automatic fault management device provided in the above embodiment and the automatic fault management method provided in the first embodiment belong to the same concept, that is, the device is based on the method; its specific implementation process is described in the method embodiment and is not repeated here.
EXAMPLE III
Corresponding to the above method and apparatus, a third embodiment of the present application provides a computer system, including:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
identifying system faults and triggering fault work orders according to received fault prompt information, wherein the fault prompt information at least comprises one of multidimensional monitoring alarm information and manual alarm information;
generating parallel troubleshooting tasks according to the fault work order, respectively pushing the troubleshooting tasks to corresponding troubleshooting personnel, and positioning fault points according to received troubleshooting results corresponding to the troubleshooting tasks;
searching a recovery plan matched with the fault point in a preset recovery plan matching relation, and pushing the recovery plans to fault processing personnel after the recovery plans are sorted according to priorities;
and receiving and executing a recovery plan selected by the fault handling personnel to repair the system fault.
FIG. 3 illustrates an architecture of a computer system that may include, in particular, a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, video display adapter 1511, disk drive 1512, input/output interface 1513, network interface 1514, and memory 1520 may be communicatively coupled via a communication bus 1530.
The processor 1510 may be implemented by using a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solution provided by the present application.
The memory 1520 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500, and a Basic Input Output System (BIOS) for controlling low-level operations of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, an icon font processing system 1525, and the like can also be stored. The icon font processing system 1525 may be an application program that implements the operations of the foregoing steps in this embodiment of the application. In summary, when the technical solution provided by the present application is implemented by software or firmware, the relevant program code is stored in the memory 1520 and called for execution by the processor 1510.
The input/output interface 1513 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component in the device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The network interface 1514 is used to connect a communication module (not shown) to enable the device to communicatively interact with other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 1530 includes a path to transfer information between the various components of the device, such as the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520.
In addition, the computer system 1500 may also obtain information of specific extraction conditions from the virtual resource object extraction condition information database 1541 for performing condition judgment, and the like.
It should be noted that although the above devices only show the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, etc., in a specific implementation, the devices may also include other components necessary for proper operation. Furthermore, it will be understood by those skilled in the art that the apparatus described above may also include only the components necessary to implement the solution of the present application, and not necessarily all of the components shown in the figures.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a cloud server, or a network device) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The above-described system embodiments are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement them without inventive effort.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. An automated fault management method, comprising the steps of:
identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
generating a plurality of troubleshooting tasks in corresponding dimensionalities according to the fault work order, and executing the troubleshooting tasks in parallel to position fault points;
and searching a recovery plan matched with the fault point in a preset recovery plan matching relation, and executing the recovery plan to repair the system fault.
2. The method according to claim 1, wherein the identifying system faults and triggering fault work orders of corresponding dimensions according to the received fault prompt information comprises the following sub-steps:
receiving at least one fault prompt message of multidimensional monitoring alarm information or manual alarm information;
and generating a fault work order with corresponding dimensionality according to the multi-dimensionality alarm information and the manual alarm information based on a pre-constructed fault work order triggering model.
3. The method according to claim 2, wherein when the fault prompt information is manual alarm information, before generating the fault work order of the corresponding dimension according to the multidimensional alarm information and the manual alarm information based on the pre-constructed fault work order triggering model, the method further comprises:
processing the received manual alarm information by adopting a natural language processing algorithm to obtain a semantic analysis result;
searching whether an alarm dimension matched with the semantic analysis result exists in a pre-constructed fault classification table, and if so, marking a corresponding alarm dimension for the manual alarm information;
if not, marking the general alarm dimension.
4. The method according to claim 1, wherein the generating of the plurality of troubleshooting tasks in the corresponding dimension according to the fault work order and the parallel execution of the plurality of troubleshooting tasks to locate the fault point comprise the following sub-steps:
based on a preset troubleshooting task association relationship, searching a plurality of troubleshooting tasks matched with the fault work order in the corresponding dimension according to the fault information of the fault work order;
executing the plurality of troubleshooting tasks in parallel and obtaining a corresponding number of troubleshooting results;
and acquiring a fault point according to the troubleshooting results based on the preset correlation between troubleshooting results and fault points.
5. The method according to claim 1, wherein a recovery plan matching the fault point is found in a preset recovery plan matching relationship, and the recovery plan is executed to repair the system fault, comprising the following sub-steps:
based on a preset recovery plan matching relation, searching a plurality of recovery plans matched with the fault point and corresponding priority sequence according to the fault point;
executing an optimal recovery plan to repair the system fault and obtain a repair result;
and judging the repair result; if the fault is not repaired, executing the next-best recovery plan until the fault is repaired, and then closing the fault work order.
6. The method of claim 5, wherein after searching for a recovery plan matching the fault point in the preset recovery plan matching relationship and executing the recovery plan to repair the system fault, the method further comprises: reviewing and optimizing the fault handling process, comprising the following sub-steps:
acquiring a recovery plan adopted by repairing a fault point;
judging whether the recovery plan is the optimal recovery plan matched with the fault point in the recovery plan matching relationship;
if not, optimizing the recovery plan matching relationship.
7. The method according to any one of claims 1 to 6, wherein after the generating of the plurality of troubleshooting tasks in the corresponding dimension according to the fault work order and the parallel execution of the plurality of troubleshooting tasks to locate the fault point, the method further comprises: pushing an alternative plan corresponding to the fault work order to accessing users, specifically comprising the following sub-steps:
searching a plurality of alternative plans matched with the fault work order in a preset fault alternative plan relation;
and pushing the alternative plan information to the user side when a user accesses a link related to the fault work order.
8. An automated fault management apparatus, characterized in that the apparatus comprises at least:
the fault work order triggering module is used for identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
the fault point positioning module is used for generating a plurality of troubleshooting tasks in corresponding dimensionalities according to the fault work order and executing the troubleshooting tasks in parallel to position fault points;
and the fault repairing module is used for searching a recovery plan matched with the fault point in a preset recovery plan matching relation and executing the recovery plan to repair the system fault.
9. The apparatus of claim 8, wherein the fault work order triggering module comprises:
the receiving unit is used for receiving at least one fault prompt message of multi-dimensional monitoring alarm messages or manual alarm messages;
and the generating unit is used for generating the fault work order with corresponding dimensionality according to the multi-dimensionality alarm information and the manual alarm information based on a pre-constructed fault work order trigger model.
10. A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform operations comprising:
identifying system faults according to the received fault prompt information and triggering fault work orders with corresponding dimensions;
generating a plurality of troubleshooting tasks in corresponding dimensionalities according to the fault work order, and executing the troubleshooting tasks in parallel to position fault points;
and searching a recovery plan matched with the fault point in a preset recovery plan matching relation, and executing the recovery plan to repair the system fault.
CN202010652478.8A 2020-07-08 2020-07-08 Automatic fault management method, device and system Pending CN111865673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010652478.8A CN111865673A (en) 2020-07-08 2020-07-08 Automatic fault management method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010652478.8A CN111865673A (en) 2020-07-08 2020-07-08 Automatic fault management method, device and system

Publications (1)

Publication Number Publication Date
CN111865673A true CN111865673A (en) 2020-10-30

Family

ID=73153706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010652478.8A Pending CN111865673A (en) 2020-07-08 2020-07-08 Automatic fault management method, device and system

Country Status (1)

Country Link
CN (1) CN111865673A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112804105A (en) * 2021-01-19 2021-05-14 上海七牛信息技术有限公司 Method and system for rapidly repairing terminal communication fault in RTC network
CN114915541A (en) * 2022-04-08 2022-08-16 北京快乐茄信息技术有限公司 System fault elimination method and device, electronic equipment and storage medium
CN115421950A (en) * 2022-08-25 2022-12-02 广东博成网络科技有限公司 Automatic system operation and maintenance management method and system based on machine learning

Citations (11)

Publication number Priority date Publication date Assignee Title
CN103905240A (en) * 2012-12-28 2014-07-02 中国电信股份有限公司 Method and system for active network service fault reminding and processing
CN106960274A (en) * 2017-03-01 2017-07-18 武汉烽火技术服务有限公司 A kind of fault ticket processing system and method
CN107196804A (en) * 2017-06-01 2017-09-22 国网山东省电力公司信息通信公司 Power system terminal communication access network Centralized Alarm Monitoring system and method
CN108631298A (en) * 2018-03-30 2018-10-09 国电南瑞科技股份有限公司 A kind of unit style distribution network restoration strategy-generating method based on load balancing distribution
CN108989132A (en) * 2018-08-24 2018-12-11 深圳前海微众银行股份有限公司 Fault warning processing method, system and computer readable storage medium
CN109783257A (en) * 2019-01-29 2019-05-21 清华大学 Selection method of replacing and system towards batch Web service Passive fault-tolerant control
CN109993550A (en) * 2019-04-17 2019-07-09 连云港杰瑞电子有限公司 After-sale service system and method based on wechat small routine and smart allocation algorithm
CN110635954A (en) * 2019-10-21 2019-12-31 中国民航信息网络股份有限公司 Method and system for processing network fault of data center
CN110727531A (en) * 2019-09-18 2020-01-24 上海麦克风文化传媒有限公司 Fault prediction and processing method and system for online system
CN110796343A (en) * 2019-10-10 2020-02-14 深圳中集智能科技有限公司 Intelligent dispatching method, device and system
CN110930075A (en) * 2019-12-16 2020-03-27 云南电网有限责任公司信息中心 Power equipment fault positioning method and device, computer equipment and storage medium

Non-Patent Citations (1)

Title
"Artificial Intelligence" (《人工智能》), Chengdu: Sichuan Science and Technology Press *

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN112804105A (en) * 2021-01-19 2021-05-14 上海七牛信息技术有限公司 Method and system for rapidly repairing terminal communication fault in RTC network
CN112804105B (en) * 2021-01-19 2023-07-11 上海七牛信息技术有限公司 Method and system for rapidly repairing terminal communication faults in RTC network
CN114915541A (en) * 2022-04-08 2022-08-16 北京快乐茄信息技术有限公司 System fault elimination method and device, electronic equipment and storage medium
CN114915541B (en) * 2022-04-08 2023-03-10 北京快乐茄信息技术有限公司 System fault elimination method and device, electronic equipment and storage medium
CN115421950A (en) * 2022-08-25 2022-12-02 广东博成网络科技有限公司 Automatic system operation and maintenance management method and system based on machine learning
CN115421950B (en) * 2022-08-25 2024-01-23 广东博成网络科技有限公司 Automatic system operation and maintenance management method and system based on machine learning

Similar Documents

Publication Publication Date Title
CN110704231A (en) Fault processing method and device
CN111865673A (en) Automatic fault management method, device and system
CN110928772A (en) Test method and device
JP2011076161A (en) Incident management system
US20230129123A1 (en) Monitoring and Management System for Automatically Generating an Issue Prediction for a Trouble Ticket
US9489379B1 (en) Predicting data unavailability and data loss events in large database systems
CN113360722B (en) Fault root cause positioning method and system based on multidimensional data map
CN113672427A (en) Exception handling method, device, equipment and medium based on RPA and AI
CN116107589B (en) Automatic compiling method, device and equipment of software codes and storage medium
CN111835566A (en) System fault management method, device and system
CN114693116A (en) Method and device for detecting code review validity and electronic equipment
CN112068979B (en) Service fault determination method and device
CN113010339A (en) Method and device for automatically processing fault in online transaction test
CN113934595A (en) Data analysis method and system, storage medium and electronic terminal
CN114169776A (en) Task processing method and device
CN117149501B (en) Problem repair system and method
CN112650796A (en) Automatic application data collection and storage management system
CN111352818A (en) Application program performance analysis method and device, storage medium and electronic equipment
CN109885505A (en) A kind of method of fault location, system and associated component
US11822578B2 (en) Matching machine generated data entries to pattern clusters
CN116431872B (en) Observable system and service observing method based on observable system
US20240054509A1 (en) Intelligent shelfware prediction and system adoption assistant
CN116628108A (en) Index tracing method, device, storage medium and equipment
CN114693115A (en) Method and device for detecting code review validity and electronic equipment
CN116974801A (en) Transaction link abnormality analysis method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201030)